
PaLM-E: An Embodied Multimodal Language Model

Danny Driess Robotics at Google, TU Berlin, Fei Xia Robotics at Google, Mehdi S. M. Sajjadi Google Research, Corey Lynch Robotics at Google, Aakanksha Chowdhery Google Research, Brian Ichter Robotics at Google, Ayzaan Wahid Robotics at Google, Jonathan Tompson Robotics at Google, Quan Vuong Robotics at Google, Tianhe Yu Robotics at Google, Wenlong Huang Robotics at Google, Yevgen Chebotar Robotics at Google, Pierre Sermanet Robotics at Google, Daniel Duckworth Google Research, Sergey Levine Robotics at Google, Vincent Vanhoucke Robotics at Google, Karol Hausman Robotics at Google, Marc Toussaint TU Berlin, Klaus Greff Google Research, Andy Zeng Robotics at Google, Igor Mordatch Google Research, Pete Florence Robotics at Google (2023)

Paper Information
arXiv ID
2303.03378
Venue
International Conference on Machine Learning
Domain
artificial intelligence, robotics, multimodal learning
SOTA Claim
Yes
Reproducibility
7/10

Abstract

Abstract not available.

Summary

The paper presents PaLM-E, an embodied multimodal language model that integrates visual and physical sensor modalities for reasoning in robotic tasks. It addresses the grounding limitations of large language models (LLMs) by injecting continuous observations, such as images and state estimates, into the LLM's embedding space, enabling grounded decision-making. The model demonstrates zero-shot multimodal reasoning and OCR-free math reasoning, performs tasks across several robotic manipulation domains, and achieves state-of-the-art results on benchmarks such as OK-VQA without task-specific fine-tuning. It also benefits from positive transfer, generalizing across tasks and environments with limited data.
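
The injection mechanism described above can be sketched roughly as follows: continuous observation features are passed through a learned projection into the language model's token-embedding space and interleaved with ordinary text-token embeddings before being fed to the decoder. This is a minimal PyTorch sketch, not the authors' implementation; the dimensions, module names, and placeholder-token convention are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultimodalPrefix(nn.Module):
    """Rough sketch of the idea described in the summary: project continuous
    observations (e.g. image features) into the LLM's token-embedding space
    and interleave them with text-token embeddings. All sizes are illustrative."""

    def __init__(self, obs_dim=1024, lm_embed_dim=4096):
        super().__init__()
        # Assumed affine projection from the observation encoder's feature
        # space into the language model's embedding space.
        self.project = nn.Linear(obs_dim, lm_embed_dim)

    def forward(self, text_embeds, obs_features, placeholder_positions):
        # text_embeds: (seq_len, lm_embed_dim) embeddings of the text tokens
        # obs_features: (num_obs, obs_dim) continuous observation features
        # placeholder_positions: sequence indices whose placeholder tokens are
        # replaced by the projected observations
        mixed = text_embeds.clone()
        mixed[placeholder_positions] = self.project(obs_features)
        return mixed  # multimodal prefix consumed by the decoder-only LM

# Hypothetical usage with random tensors:
prefix = MultimodalPrefix()
text_embeds = torch.randn(12, 4096)      # 12 text-token embeddings
obs_features = torch.randn(2, 1024)      # 2 image/sensor feature vectors
mixed = prefix(text_embeds, obs_features, torch.tensor([3, 7]))
print(mixed.shape)  # torch.Size([12, 4096])
```

In the paper, such a multimodal prefix conditions the language model, whose text output is interpreted either as an answer (e.g. for VQA) or as a sequence of decisions executed by a robot policy.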

Methods

This paper employs the following methods:

  • Transformer
  • ViT (see the patch-embedding sketch after this list)
  • Object Scene Representation Transformer
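
Of the methods listed, ViT encoders (the ViT-4B and ViT-22B models below) provide the visual features that are projected into the language embedding space. The following is a generic ViT-style patch-embedding sketch for orientation only; the patch size, channel count, and embedding width are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Generic ViT-style patch embedding: cut the image into fixed-size
    patches and apply one shared linear projection to each patch.
    Simplified sketch with illustrative sizes; the PaLM-E encoders are
    far larger and include the full transformer stack on top."""

    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A strided convolution implements "split into patches + shared
        # linear layer" in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        # images: (batch, 3, H, W), with H and W divisible by patch_size
        x = self.proj(images)                 # (batch, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)

# Hypothetical usage:
patch_tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patch_tokens.shape)  # torch.Size([1, 196, 768])
```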

Models Used

  • PaLM-E-562B
  • ViT-4B
  • ViT-22B

Datasets

The following datasets were used in this research:

  • OK-VQA
  • VQAv2
  • COCO
  • Language-Table

Evaluation Metrics

  • Accuracy
  • F1-score

Results

  • PaLM-E-562B achieves state-of-the-art performance on OK-VQA
  • Demonstrates one-shot and zero-shot generalization
  • Exhibits high data-efficiency in robotics tasks

Limitations

The authors identified the following limitations:

  • Current state-of-the-art visual-language models do not effectively address embodied reasoning problems
  • Limited training data may constrain performance on certain tasks

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

embodied language model, multimodal perception, robot planning, vision-language models, grounded reasoning

Papers Using Similar Methods

External Resources