
PaLM-E: An Embodied Multimodal Language Model

Danny Driess Robotics at Google, TU Berlin, Fei Xia Robotics at Google, Mehdi S. M. Sajjadi Google Research, Corey Lynch Robotics at Google, Aakanksha Chowdhery Google Research, Brian Ichter Robotics at Google, Ayzaan Wahid Robotics at Google, Jonathan Tompson Robotics at Google, Quan Vuong Robotics at Google, Tianhe Yu Robotics at Google, Wenlong Huang Robotics at Google, Yevgen Chebotar Robotics at Google, Pierre Sermanet Robotics at Google, Daniel Duckworth Google Research, Sergey Levine Robotics at Google, Vincent Vanhoucke Robotics at Google, Karol Hausman Robotics at Google, Marc Toussaint TU Berlin, Klaus Greff Google Research, Andy Zeng Robotics at Google, Igor Mordatch Google Research, Pete Florence Robotics at Google (2023)

Paper Information
arXiv ID
2303.03378
Venue
International Conference on Machine Learning
Domain
artificial intelligence, robotics, multimodal learning
SOTA Claim
Yes
Reproducibility
7/10

Abstract

Abstract not available.

Summary

The paper presents PaLM-E, an embodied multimodal language model that integrates visual and physical sensor modalities for reasoning in robotic tasks. It addresses the grounding limitations of large language models (LLMs) by injecting continuous observations, such as images and state estimates, into the LLM's embedding space, enabling grounded decision-making. The model demonstrates zero-shot multimodal reasoning and OCR-free math reasoning, performs tasks across several robotic manipulation domains, and achieves state-of-the-art results on benchmarks such as OK-VQA without task-specific fine-tuning. It also benefits from positive transfer, generalizing across tasks and environments with limited data.
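
The injection mechanism described above can be sketched roughly as follows: continuous observation features are passed through a learned projection into the language model's token-embedding space and interleaved with ordinary text-token embeddings before being fed to the decoder. This is a minimal PyTorch sketch, not the authors' implementation; the dimensions, module names, and placeholder-token convention are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultimodalPrefix(nn.Module):
    """Rough sketch of the idea described in the summary: project continuous
    observations (e.g. image features) into the LLM's token-embedding space
    and interleave them with text-token embeddings. All sizes are illustrative."""

    def __init__(self, obs_dim=1024, lm_embed_dim=4096):
        super().__init__()
        # Assumed affine projection from the observation encoder's feature
        # space into the language model's embedding space.
        self.project = nn.Linear(obs_dim, lm_embed_dim)

    def forward(self, text_embeds, obs_features, placeholder_positions):
        # text_embeds: (seq_len, lm_embed_dim) embeddings of the text tokens
        # obs_features: (num_obs, obs_dim) continuous observation features
        # placeholder_positions: sequence indices whose placeholder tokens are
        # replaced by the projected observations
        mixed = text_embeds.clone()
        mixed[placeholder_positions] = self.project(obs_features)
        return mixed  # multimodal prefix consumed by the decoder-only LM

# Hypothetical usage with random tensors:
prefix = MultimodalPrefix()
text_embeds = torch.randn(12, 4096)      # 12 text-token embeddings
obs_features = torch.randn(2, 1024)      # 2 image/sensor feature vectors
mixed = prefix(text_embeds, obs_features, torch.tensor([3, 7]))
print(mixed.shape)  # torch.Size([12, 4096])
```

In the paper, such a multimodal prefix conditions the language model, whose text output is interpreted either as an answer (e.g. for VQA) or as a sequence of decisions executed by a robot policy.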

Methods

This paper employs the following methods:

  • Transformer
  • ViT (see the patch-embedding sketch after this list)
  • Object Scene Representation Transformer
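
Of the methods listed, ViT encoders (the ViT-4B and ViT-22B models below) provide the visual features that are projected into the language embedding space. The following is a generic ViT-style patch-embedding sketch for orientation only; the patch size, channel count, and embedding width are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Generic ViT-style patch embedding: cut the image into fixed-size
    patches and apply one shared linear projection to each patch.
    Simplified sketch with illustrative sizes; the PaLM-E encoders are
    far larger and include the full transformer stack on top."""

    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A strided convolution implements "split into patches + shared
        # linear layer" in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        # images: (batch, 3, H, W), with H and W divisible by patch_size
        x = self.proj(images)                 # (batch, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)

# Hypothetical usage:
patch_tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patch_tokens.shape)  # torch.Size([1, 196, 768])
```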

Models Used

  • PaLM-E-562B
  • ViT-4B
  • ViT-22B

Datasets

The following datasets were used in this research:

  • OK-VQA
  • VQAv2
  • COCO
  • Language-Table

Evaluation Metrics

  • Accuracy
  • F1-score

Results

  • PaLM-E-562B achieves state-of-the-art performance on OK-VQA
  • Demonstrates one-shot and zero-shot generalization
  • Exhibits high data-efficiency in robotics tasks

Limitations

The authors identified the following limitations:

  • Current state-of-the-art visual-language models do not effectively address embodied reasoning problems
  • Limited training data may constrain performance on certain tasks

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

embodied language model, multimodal perception, robot planning, vision-language models, grounded reasoning

Papers Using Similar Methods

External Resources