
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter⋆, Dhruti Shah⋆, Xianzhi Du⋆, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang (2024)

Paper Information

  • arXiv ID: 2403.09611
  • Venue: arXiv.org
  • Domain: Not specified
  • SOTA Claim: Yes

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published multimodal pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
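A central lesson in the abstract is that pre-training draws on a weighted mix of image-caption, interleaved image-text, and text-only data. As a minimal sketch of what such mixing can look like in practice, the snippet below samples batches from three source types with fixed weights; the weights, source names, and loader interface are illustrative assumptions and do not reproduce the paper's actual recipe.

```python
import random

# Hypothetical mixing weights over the three pre-training data types the
# abstract refers to; these values are placeholders, not MM1's recipe.
MIX_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def mixed_batches(loaders: dict, num_batches: int, seed: int = 0):
    """Yield (source, batch) pairs drawn from per-source iterators in mixture proportion.

    `loaders` maps each source name to an iterator of ready-made batches; how those
    batches are built (tokenization, image preprocessing) is outside this sketch.
    """
    rng = random.Random(seed)
    sources, weights = zip(*MIX_WEIGHTS.items())
    for _ in range(num_batches):
        source = rng.choices(sources, weights=weights, k=1)[0]
        yield source, next(loaders[source])
```

Given three batch iterators, calling `mixed_batches({"image_caption": caps, "interleaved_image_text": docs, "text_only": texts}, 1000)` interleaves the streams at roughly the chosen ratio.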

Summary

The paper discusses the construction of Multimodal Large Language Models (MLLMs), focusing on the architectural components and data choices that drive their performance. Through comprehensive ablation studies on the image encoder, the vision-language connector, and the pre-training data, the authors distill several design lessons. Key findings highlight the importance of mixing training data types (image-caption, interleaved image-text, and text-only) for state-of-the-art few-shot results. The study presents MM1, a family of MLLMs ranging from dense models (up to 30 billion parameters) to mixture-of-experts variants (up to 64 billion parameters), which achieve competitive results on established benchmarks while showcasing improved in-context learning and multi-image reasoning capabilities after pre-training. The paper outlines the empirical setup, evaluates the impact of individual design decisions, and documents a final training recipe, ultimately setting a foundation for future work on multimodal models.
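The ablations the summary describes revolve around three components: an image encoder, a vision-language connector that maps image features into the LLM's embedding space, and the decoder-only LLM itself. The sketch below composes these pieces at a high level; the class names, pooling strategy, and tensor shapes are illustrative assumptions rather than MM1's implementation. Keeping the connector deliberately simple mirrors the paper's finding that its design matters far less than image resolution and the number of image tokens.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Maps image-encoder patch features to a fixed number of LLM-space tokens."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int):
        super().__init__()
        self.num_image_tokens = num_image_tokens
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim); pool patches down to
        # the desired image-token count, then project into the LLM dimension.
        b, n, d = patch_feats.shape
        assert n % self.num_image_tokens == 0, "patch count must divide evenly"
        pooled = patch_feats.view(b, self.num_image_tokens, n // self.num_image_tokens, d).mean(dim=2)
        return self.proj(pooled)  # (batch, num_image_tokens, llm_dim)

class ToyMLLM(nn.Module):
    """Image encoder -> connector -> decoder-only LLM, composed as in the summary."""

    def __init__(self, image_encoder: nn.Module, connector: VisionLanguageConnector, llm: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.connector = connector
        self.llm = llm

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the projected image tokens to the text embeddings and let the
        # LLM attend over the combined sequence.
        image_tokens = self.connector(self.image_encoder(images))
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```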

Methods

This paper employs the following methods:

  • Ablation Study
  • Mixture-of-Experts (MoE); a minimal routing sketch follows this list
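
To make the MoE entry above concrete, the block below sketches a generic sparsely-activated feed-forward layer with top-k token routing: a learned gate scores the experts per token, only the top-k experts are evaluated, and their outputs are combined by the gate weights. The expert count, k value, and gating details are generic illustrations of the technique, not MM1's MoE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sparsely-activated feed-forward layer with top-k token routing (generic sketch)."""

    def __init__(self, dim: int, hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); flatten tokens so each one is routed independently.
        tokens = x.reshape(-1, x.shape[-1])
        gate_probs = F.softmax(self.gate(tokens), dim=-1)         # (tokens, experts)
        topk_probs, topk_idx = gate_probs.topk(self.k, dim=-1)    # (tokens, k)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    # Weight each routed token's expert output by its gate probability.
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

For example, `TopKMoELayer(dim=512, hidden=2048)(torch.randn(2, 16, 512))` returns a tensor of the same shape, with each token processed by its two highest-scoring experts.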

Models Used

  • MM1-30B

Datasets

The following datasets were used in this research:

  • COCO 2014
  • CC3M
  • CC12M
  • HQIPT-204M
  • COYO
  • DFN-5B
  • VeCap-300M

Evaluation Metrics

  • 0-shot
  • 4-shot
  • 8-shot (prompt assembly for these few-shot settings is sketched after this list)
  • Accuracy
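
Because the reported numbers cover 0-, 4-, and 8-shot settings, the sketch below shows one straightforward way a k-shot multimodal prompt can be assembled from labeled exemplars. The exemplar format and the textual template are assumptions for illustration; the paper's actual evaluation prompts are not reproduced here.

```python
def build_k_shot_prompt(exemplars, query, k: int) -> list:
    """Assemble an interleaved image-text prompt with k in-context examples.

    `exemplars` is assumed to be a list of (image, question, answer) triples and
    `query` an (image, question) pair; k = 0 yields a zero-shot prompt.
    """
    prompt = []
    for image, question, answer in exemplars[:k]:
        prompt += [image, f"Question: {question}\nAnswer: {answer}\n"]
    query_image, query_question = query
    prompt += [query_image, f"Question: {query_question}\nAnswer:"]
    return prompt
```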

Results

  • State-of-the-art few-shot performance across benchmarks
  • Enhanced in-context learning
  • Competitive performance on 12 established multimodal benchmarks after supervised fine-tuning

Limitations

The authors identified the following limitations:

  • Limited insight into the specific architectural design of the vision-language (VL) connector
  • High resource requirements for training large models

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
