Danny Driess Robotics at Google and TU Berlin, Fei Xia Robotics at Google, Mehdi S. M. Sajjadi Google Research, Corey Lynch Robotics at Google, Aakanksha Chowdhery Google Research, Brian Ichter Robotics at Google, Ayzaan Wahid Robotics at Google, Jonathan Tompson Robotics at Google, Quan Vuong Robotics at Google, Tianhe Yu Robotics at Google, Wenlong Huang Robotics at Google, Yevgen Chebotar Robotics at Google, Pierre Sermanet Robotics at Google, Daniel Duckworth Google Research, Sergey Levine Robotics at Google, Vincent Vanhoucke Robotics at Google, Karol Hausman Robotics at Google, Marc Toussaint TU Berlin, Klaus Greff Google Research, Andy Zeng Robotics at Google, Igor Mordatch Google Research, Pete Florence Robotics at Google (2023)
The paper presents PaLM-E, an embodied multimodal language model designed to integrate visual and physical sensor modalities for improved reasoning in robotic tasks. It addresses the grounding limitations of large language models (LLMs) by injecting continuous observations, such as images, directly into the LLM's embedding space, enabling grounded decision-making. The model demonstrates zero-shot multimodal reasoning and OCR-free math reasoning, performs tasks across several robotic manipulation domains, and achieves state-of-the-art results on benchmarks such as OK-VQA without task-specific fine-tuning. It also exhibits positive transfer, generalizing across tasks and environments from limited data.
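To make the core mechanism concrete, the sketch below (not taken from the paper; the toy dimensions, the random stand-in for the visual encoder, and the helper names `encode_image` and `build_multimodal_prefix` are all assumptions for illustration) shows how continuous observations can be projected into a language model's token-embedding space and interleaved with text-token embeddings to form a multimodal input sequence.

```python
import numpy as np

# Toy dimensions (hypothetical; not the paper's actual model sizes).
IMG_FEAT_DIM = 256     # width of the visual encoder's output features
LLM_EMBED_DIM = 512    # token-embedding width of the language model
VOCAB_SIZE = 1000

rng = np.random.default_rng(0)

# Stand-ins for learned parameters: a token-embedding table for the LLM
# and an affine projection mapping visual features into that same space.
token_embedding = rng.normal(size=(VOCAB_SIZE, LLM_EMBED_DIM))
proj_W = rng.normal(size=(IMG_FEAT_DIM, LLM_EMBED_DIM)) * 0.02
proj_b = np.zeros(LLM_EMBED_DIM)


def encode_image(image: np.ndarray, num_patches: int = 16) -> np.ndarray:
    """Placeholder visual encoder: one feature vector per image patch.
    A real system would use a pretrained vision model here."""
    return rng.normal(size=(num_patches, IMG_FEAT_DIM))


def build_multimodal_prefix(text_token_ids: list[int],
                            image: np.ndarray,
                            image_slot: int) -> np.ndarray:
    """Splice projected image embeddings into the text-token embeddings.
    `image_slot` is the text position where the image tokens are inserted."""
    text_emb = token_embedding[np.array(text_token_ids)]   # (T, D)
    img_emb = encode_image(image) @ proj_W + proj_b        # (P, D)
    # The combined sequence is consumed by the LLM like ordinary tokens.
    return np.concatenate(
        [text_emb[:image_slot], img_emb, text_emb[image_slot:]], axis=0
    )


if __name__ == "__main__":
    fake_image = np.zeros((224, 224, 3))
    prefix = build_multimodal_prefix([5, 17, 42, 9], fake_image, image_slot=2)
    print(prefix.shape)  # (4 text tokens + 16 image tokens, LLM_EMBED_DIM)
```

In the paper's formulation, the projected observation embeddings are produced by trained encoders and are treated by the decoder-only LLM exactly like word-token embeddings, which is what lets the language model's reasoning be grounded in perception.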
This paper employs the following methods: injection of continuous sensor observations, such as images, into the embedding space of a pretrained large language model, combined with joint training across vision-language and robotic manipulation tasks to enable transfer.
The following datasets were used in this research: visual question answering benchmarks, including OK-VQA, together with data from multiple robotic manipulation domains.
The authors identified the following limitations: