Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang (2024)
The paper studies how to build performant Multimodal Large Language Models (MLLMs), focusing on the architectural components and data choices that drive their performance. Through comprehensive ablations of the image encoder, the vision-language connector, and the pre-training data, the authors distill several design lessons; a key finding is that a careful mix of training data types (image-caption, interleaved image-text, and text-only) is crucial for state-of-the-art few-shot results. Scaling up this recipe yields MM1, a family of MLLMs spanning dense models of up to 30 billion parameters and mixture-of-experts variants (64 billion parameters), which achieve competitive results on established benchmarks and, thanks to large-scale pre-training, exhibit improved in-context learning and multi-image reasoning. The paper details the empirical setup, evaluates the impact of individual design decisions, and documents the final training recipe, laying a foundation for future work on multimodal models.
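The architecture described above (a pretrained image encoder feeding a vision-language connector that produces tokens for a language model) can be illustrated with a minimal sketch. This is not the authors' MM1 code: the pooling-based connector, the module sizes, and the tiny self-attention stack standing in for the LLM are all assumptions chosen only to show how the three components fit together.

```python
# Illustrative sketch only, not the MM1 implementation. All names and sizes are
# assumptions; the connector here is a simple pool-and-project variant.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Maps image-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)  # reduce patch count
        self.proj = nn.Linear(vision_dim, llm_dim)           # match LLM width

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_image_tokens, llm_dim)


class ToyMLLM(nn.Module):
    """Minimal multimodal LM: image tokens are spliced into the text sequence."""

    def __init__(self, vision_dim=512, llm_dim=768, vocab_size=32000):
        super().__init__()
        # Per-patch linear embed stands in for a pretrained ViT image encoder.
        self.image_encoder = nn.Linear(3 * 16 * 16, vision_dim)
        self.connector = VisionLanguageConnector(vision_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Tiny self-attention stack standing in for a decoder-only LLM
        # (no causal mask here, purely for brevity of the sketch).
        block = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(block, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patches: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        img_feats = self.image_encoder(patches)   # (B, P, vision_dim)
        img_tokens = self.connector(img_feats)    # (B, K, llm_dim)
        txt_tokens = self.text_embed(text_ids)    # (B, T, llm_dim)
        sequence = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(sequence))   # next-token logits


model = ToyMLLM()
patches = torch.randn(2, 196, 3 * 16 * 16)   # 2 images, 196 flattened 16x16 patches
text_ids = torch.randint(0, 32000, (2, 32))  # 2 text sequences of 32 tokens
logits = model(patches, text_ids)
print(logits.shape)  # torch.Size([2, 96, 32000])
```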
This paper employs the following methods: large-scale ablation studies over the image encoder, vision-language connector, and pre-training data mixture; multimodal pre-training scaled to both dense and mixture-of-experts model variants; and few-shot evaluation on established multimodal benchmarks.
The following categories of pre-training data were used in this research: image-caption pairs, interleaved image-text documents, and text-only data.
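As a rough illustration of how such a data mix might be realized in a training loop, here is a minimal sampling sketch. The mixing weights and record formats are assumptions for demonstration, not the paper's reported ratios.

```python
# Illustrative sketch only: weights and record formats are hypothetical. It shows
# the idea of interleaving the three pre-training data categories named above
# into a single training stream.
import random
from typing import Iterator


def data_stream(name: str) -> Iterator[dict]:
    """Stand-in for an infinite shard reader over one data category."""
    i = 0
    while True:
        yield {"source": name, "example_id": i}
        i += 1


streams = {
    "image_caption": data_stream("image_caption"),
    "interleaved_image_text": data_stream("interleaved_image_text"),
    "text_only": data_stream("text_only"),
}
# Hypothetical mixing weights; a real recipe would tune these via ablations.
weights = {"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}


def mixed_batch(batch_size: int) -> list:
    """Draw a batch whose composition follows the mixing weights in expectation."""
    names = random.choices(list(weights), weights=list(weights.values()), k=batch_size)
    return [next(streams[name]) for name in names]


print([ex["source"] for ex in mixed_batch(8)])
```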
The authors identified the following limitations: