
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang (2024)

Paper Information

  • arXiv ID: 2403.09611
  • Venue: arXiv.org
  • Domain: Not specified
  • SOTA Claim: Yes

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published multimodal pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
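A central lesson in the abstract is that pre-training draws on a weighted mix of image-caption, interleaved image-text, and text-only data. As a minimal sketch of what such mixing can look like in practice, the snippet below samples batches from three source types with fixed weights; the weights, source names, and loader interface are illustrative assumptions and do not reproduce the paper's actual recipe.

```python
import random

# Hypothetical mixing weights over the three pre-training data types the
# abstract refers to; these values are placeholders, not MM1's recipe.
MIX_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def mixed_batches(loaders: dict, num_batches: int, seed: int = 0):
    """Yield (source, batch) pairs drawn from per-source iterators in mixture proportion.

    `loaders` maps each source name to an iterator of ready-made batches; how those
    batches are built (tokenization, image preprocessing) is outside this sketch.
    """
    rng = random.Random(seed)
    sources, weights = zip(*MIX_WEIGHTS.items())
    for _ in range(num_batches):
        source = rng.choices(sources, weights=weights, k=1)[0]
        yield source, next(loaders[source])
```

Given three batch iterators, calling `mixed_batches({"image_caption": caps, "interleaved_image_text": docs, "text_only": texts}, 1000)` interleaves the streams at roughly the chosen ratio.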

Summary

The paper discusses the construction of Multimodal Large Language Models (MLLMs), focusing on the architectural components and data choices that drive their performance. Through comprehensive ablation studies on the image encoder, the vision-language connector, and the pre-training data, the authors distill several design lessons. Key findings highlight the importance of mixing training data types (image-caption, interleaved image-text, and text-only) for state-of-the-art few-shot results. The study presents MM1, a family of MLLMs ranging from dense models (up to 30 billion parameters) to mixture-of-experts variants (up to 64 billion parameters), which achieve competitive results on established benchmarks while showcasing improved in-context learning and multi-image reasoning capabilities after pre-training. The paper outlines the empirical setup, evaluates the impact of individual design decisions, and documents a final training recipe, ultimately setting a foundation for future work on multimodal models.
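The ablations the summary describes revolve around three components: an image encoder, a vision-language connector that maps image features into the LLM's embedding space, and the decoder-only LLM itself. The sketch below composes these pieces at a high level; the class names, pooling strategy, and tensor shapes are illustrative assumptions rather than MM1's implementation. Keeping the connector deliberately simple mirrors the paper's finding that its design matters far less than image resolution and the number of image tokens.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Maps image-encoder patch features to a fixed number of LLM-space tokens."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int):
        super().__init__()
        self.num_image_tokens = num_image_tokens
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim); pool patches down to
        # the desired image-token count, then project into the LLM dimension.
        b, n, d = patch_feats.shape
        assert n % self.num_image_tokens == 0, "patch count must divide evenly"
        pooled = patch_feats.view(b, self.num_image_tokens, n // self.num_image_tokens, d).mean(dim=2)
        return self.proj(pooled)  # (batch, num_image_tokens, llm_dim)

class ToyMLLM(nn.Module):
    """Image encoder -> connector -> decoder-only LLM, composed as in the summary."""

    def __init__(self, image_encoder: nn.Module, connector: VisionLanguageConnector, llm: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.connector = connector
        self.llm = llm

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the projected image tokens to the text embeddings and let the
        # LLM attend over the combined sequence.
        image_tokens = self.connector(self.image_encoder(images))
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```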

Methods

This paper employs the following methods:

  • Ablation Study
  • Mixture-of-Experts (MoE); a minimal routing sketch follows this list
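
To make the MoE entry above concrete, the block below sketches a generic sparsely-activated feed-forward layer with top-k token routing: a learned gate scores the experts per token, only the top-k experts are evaluated, and their outputs are combined by the gate weights. The expert count, k value, and gating details are generic illustrations of the technique, not MM1's MoE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sparsely-activated feed-forward layer with top-k token routing (generic sketch)."""

    def __init__(self, dim: int, hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); flatten tokens so each one is routed independently.
        tokens = x.reshape(-1, x.shape[-1])
        gate_probs = F.softmax(self.gate(tokens), dim=-1)         # (tokens, experts)
        topk_probs, topk_idx = gate_probs.topk(self.k, dim=-1)    # (tokens, k)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    # Weight each routed token's expert output by its gate probability.
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

For example, `TopKMoELayer(dim=512, hidden=2048)(torch.randn(2, 16, 512))` returns a tensor of the same shape, with each token processed by its two highest-scoring experts.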

Models Used

  • MM1-30B

Datasets

The following datasets were used in this research:

  • COCO 2014
  • CC3M
  • CC12M
  • HQIPT-204M
  • COYO
  • DFN-5B
  • VeCap-300M

Evaluation Metrics

  • 0-shot
  • 4-shot
  • 8-shot (prompt assembly for these few-shot settings is sketched after this list)
  • Accuracy
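
Because the reported numbers cover 0-, 4-, and 8-shot settings, the sketch below shows one straightforward way a k-shot multimodal prompt can be assembled from labeled exemplars. The exemplar format and the textual template are assumptions for illustration; the paper's actual evaluation prompts are not reproduced here.

```python
def build_k_shot_prompt(exemplars, query, k: int) -> list:
    """Assemble an interleaved image-text prompt with k in-context examples.

    `exemplars` is assumed to be a list of (image, question, answer) triples and
    `query` an (image, question) pair; k = 0 yields a zero-shot prompt.
    """
    prompt = []
    for image, question, answer in exemplars[:k]:
        prompt += [image, f"Question: {question}\nAnswer: {answer}\n"]
    query_image, query_question = query
    prompt += [query_image, f"Question: {query_question}\nAnswer:"]
    return prompt
```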

Results

  • State-of-the-art few-shot performance across benchmarks
  • Enhanced in-context learning
  • Competitive performance on 12 established multimodal benchmarks after supervised fine-tuning

Limitations

The authors identified the following limitations:

  • Limited insight into the specific architectural design of the vision-language (VL) connector
  • High resource requirements for training large models

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
