
MM-LLMs: Recent Advances in MultiModal Large Language Models

Duzhen Zhang (Tencent AI Lab, China), Yahan Yu (Kyoto University, Japan), Jiahua Dong (Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates), Chenxing Li (Tencent AI Lab, China), Dan Su (Tencent AI Lab, China), Chenhui Chu (Kyoto University, Japan), Dong Yu (Tencent AI Lab, USA) (2024)

Paper Information
  • arXiv ID: 2401.13601
  • Venue: Annual Meeting of the Association for Computational Linguistics
  • Domain: Artificial Intelligence, Machine Learning, Natural Language Processing, Computer Vision
  • SOTA Claim: Yes
  • Reproducibility: 6/10

Abstract

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.

Summary

The paper provides a comprehensive survey of recent advances in MultiModal Large Language Models (MM-LLMs). It outlines how these models handle multi-modal inputs and outputs while harnessing the reasoning and decision-making abilities of large language models (LLMs). The authors categorize 126 MM-LLMs by their design and training methodologies and review their performance on mainstream benchmarks. The paper emphasizes reusing pre-trained foundation models from different modalities to mitigate computational costs. Additionally, it discusses the general model architecture and the training pipeline, which comprises MM pre-training (MM PT) and MM instruction tuning (MM IT) stages. The paper concludes by suggesting future directions for research in this domain, including enhancing generalization capabilities and building more challenging benchmarks.
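
In the survey's general formulation, an MM-LLM couples one or more pre-trained modality encoders to an LLM backbone through a trainable input projector (and, for models that generate non-text outputs, an output projector plus modality generator), with most pre-trained parameters kept frozen. The sketch below is an illustrative, simplified rendering of that image-to-text path in PyTorch, not the implementation of any specific model from the paper; the module names, dimensions, and the HF-style `inputs_embeds` interface are assumptions.

```python
import torch
import torch.nn as nn


class MMInputPath(nn.Module):
    """Illustrative sketch of an MM-LLM input path: frozen vision encoder ->
    trainable input projector -> frozen LLM backbone. All names, dimensions,
    and interfaces are placeholder assumptions."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a ViT/CLIP image encoder (kept frozen)
        self.llm = llm                        # decoder-only LLM accepting inputs_embeds (assumed HF-style)
        # Trainable connector mapping visual features into the LLM token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Freeze the pre-trained components; during MM PT only the projector is updated.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, images, text_embeds):
        # Visual patch features, assumed shape (batch, num_patches, vision_dim).
        with torch.no_grad():
            vis_feats = self.vision_encoder(images)
        # Project into the LLM embedding space and prepend to the text embeddings.
        vis_tokens = self.projector(vis_feats)
        inputs_embeds = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In this paradigm, MM PT typically trains only the projector (sometimes together with lightweight adapters such as LoRA on the LLM), and MM IT then fine-tunes on instruction-formatted multi-modal data.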

Models Used

  • GPT-4
  • Gemini
  • BLIP-2
  • LLaVA
  • MiniGPT-4
  • OpenFlamingo
  • VideoChat
  • SpeechGPT
  • AudioPaLM
  • DRESS
  • Qwen-VL
  • InstructBLIP
  • LLaVA-1.5
  • MobileVLM
  • NExT-GPT
  • CoDi-2
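
Several of the listed models (e.g., BLIP-2, InstructBLIP, LLaVA-1.5) have publicly released checkpoints. As a hedged usage sketch, not drawn from the paper itself, the following assumes the Hugging Face `transformers` BLIP-2 integration and the `Salesforce/blip2-opt-2.7b` checkpoint; the image URL is only a placeholder.

```python
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumes the transformers BLIP-2 classes; checkpoint and image URL are illustrative.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any RGB image works
image = Image.open(requests.get(url, stream=True).raw)

# Visual question answering: the image is encoded, projected through the Q-Former
# connector, and consumed by the frozen OPT language model.
inputs = processor(images=image,
                   text="Question: how many cats are there? Answer:",
                   return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```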

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • None specified

Limitations

The authors identified the following limitations:

  • Some recent advances may not be fully captured in this survey.
  • Page limit constraints restrict the depth of the discussion on each model.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

multimodal large language models (MM-LLMs), model architecture, training pipeline, benchmark, generative models, multi-modal tasks, vision-language understanding, instruction tuning

External Resources