Du-Zhen Zhang (Tencent AI Lab, China), Yahan Yu (Kyoto University, Japan), Jiahua Dong (Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates), Chenxing Li (Tencent AI Lab, China), Dan Su (Tencent AI Lab, China), Chenhui Chu (Kyoto University, Japan), Dong Yu (Tencent AI Lab, USA) (2024)
The paper provides a comprehensive survey of recent advances in MultiModal Large Language Models (MM-LLMs), models that handle multi-modal inputs and outputs while harnessing the reasoning capabilities of large language models (LLMs). The authors categorize 126 MM-LLMs by their design and training methodologies and review their performance on mainstream benchmarks. The paper emphasizes leveraging pre-trained foundation models from different modalities to reduce computational cost. It also discusses the general model architecture and the training pipeline, which consists of stages such as multimodal pre-training and multimodal instruction tuning (MM IT). The paper concludes by suggesting future research directions, including strengthening generalization capabilities and building more challenging benchmarks.
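As a rough illustration of the general architecture such surveys describe (a pre-trained modality encoder, a trainable connector that projects modality features into the LLM's input space, and an LLM backbone), the sketch below wires these pieces together in PyTorch. The module names, dimensions, and the simple linear connector are illustrative assumptions for this summary, not components taken from the paper or from any specific MM-LLM.

```python
# Minimal sketch of a generic MM-LLM: frozen modality encoder -> connector -> LLM.
# All sizes and module choices are illustrative assumptions.
import torch
import torch.nn as nn


class LinearConnector(nn.Module):
    """Projects modality-encoder features into the LLM's embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)


class ToyMMLLM(nn.Module):
    """Toy composition: one image token is prepended to the text token sequence."""

    def __init__(self, enc_dim: int = 512, llm_dim: int = 768, vocab: int = 32000):
        super().__init__()
        # Stand-ins for a pre-trained vision encoder and an LLM backbone.
        self.vision_encoder = nn.Linear(3 * 224 * 224, enc_dim)
        self.connector = LinearConnector(enc_dim, llm_dim)
        self.token_embed = nn.Embedding(vocab, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab)

        # Keep the pre-trained encoder frozen; in the typical recipe the LLM
        # backbone is also frozen or tuned with parameter-efficient methods,
        # so that mainly the connector is trained.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False

    def forward(self, image: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        img_feat = self.vision_encoder(image.flatten(1))         # (B, enc_dim)
        img_tok = self.connector(img_feat).unsqueeze(1)          # (B, 1, llm_dim)
        txt_tok = self.token_embed(input_ids)                    # (B, T, llm_dim)
        hidden = self.llm(torch.cat([img_tok, txt_tok], dim=1))  # (B, 1+T, llm_dim)
        return self.lm_head(hidden)                              # next-token logits


if __name__ == "__main__":
    model = ToyMMLLM()
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 17, 32000])
```

In practice the connector ranges from a simple linear projection to query-based modules, and training usually proceeds in the two stages mentioned above: multimodal pre-training to align modality features with the LLM, followed by multimodal instruction tuning.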