
Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team, FAIR at Meta (2024)

Paper Information
arXiv ID
2405.09818
Venue
arXiv.org
Domain
artificial intelligence, machine learning, multimodal modeling
SOTA Claim
Yes
Reproducibility
8/10

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
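As a rough illustration of the early-fusion, token-based setup described in the abstract, the sketch below interleaves text tokens and discrete image tokens into a single sequence over one shared vocabulary. The tokenizers, vocabulary sizes, and sentinel ids are hypothetical stand-ins, not the paper's actual tokenizers.

```python
# Illustrative sketch of early-fusion token interleaving (not the authors' code).
# Text and images are both mapped to discrete tokens in one shared vocabulary,
# so a single autoregressive transformer can model interleaved documents.
# All sizes, sentinel ids, and helpers below are assumptions for illustration.

TEXT_VOCAB = 65_536          # hypothetical text (BPE) vocabulary size
IMAGE_CODEBOOK = 8_192       # hypothetical VQ image codebook size
BOI = TEXT_VOCAB + IMAGE_CODEBOOK   # hypothetical begin-of-image sentinel
EOI = BOI + 1                       # hypothetical end-of-image sentinel


def tokenize_text(text: str) -> list[int]:
    """Stand-in for a real BPE tokenizer: ids in [0, TEXT_VOCAB)."""
    return [hash(word) % TEXT_VOCAB for word in text.split()]


def tokenize_image(image_codes: list[int]) -> list[int]:
    """Stand-in for a discrete image tokenizer: codebook ids shifted past the text range."""
    return [BOI] + [TEXT_VOCAB + c for c in image_codes] + [EOI]


def build_mixed_modal_sequence(segments: list[tuple[str, object]]) -> list[int]:
    """Flatten interleaved (modality, payload) segments into one token sequence."""
    tokens: list[int] = []
    for modality, payload in segments:
        tokens += tokenize_text(payload) if modality == "text" else tokenize_image(payload)
    return tokens


# Example document: a caption, a toy 4-code "image", then follow-up text.
doc = [
    ("text", "A photo of a red fox"),
    ("image", [17, 409, 5, 3021]),
    ("text", "The same animal described in words."),
]
print(build_mixed_modal_sequence(doc))
```

Because every segment ends up as ordinary token ids in one flat sequence, the same model can condition on and emit images and text in any order.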

Summary

Chameleon is a family of early-fusion, token-based mixed-modal foundation models capable of understanding and generating images and text in an interleaved fashion. The paper outlines a stable training methodology, architectural modifications, and evaluation results across a wide range of tasks, including visual question answering, image captioning, text generation, and long-form mixed-modal generation. Chameleon achieves state-of-the-art performance in image captioning and surpasses Llama-2 on text-only tasks, while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro. Trained on a large corpus of interleaved text and image tokens, the model matches or exceeds much larger models, including Gemini Pro and GPT-4V, in human evaluations of long-form mixed-modal generation; the paper also details architectural techniques that address optimization instabilities and reports alignment and safety testing.
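Among the stability-oriented changes the paper describes for this setting is query-key normalization inside attention. The following is a minimal re-implementation sketch of that idea in PyTorch, not the released architecture; dimensions and module layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Causal self-attention with LayerNorm applied to queries and keys per head.
    Illustrative sketch only; toy sizes, not the paper's configuration."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)   # bound logit growth before the softmax
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, s, -1))


attn = QKNormAttention()
print(attn(torch.randn(1, 8, 256)).shape)  # torch.Size([1, 8, 256])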

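```

Normalizing queries and keys keeps attention logits bounded, a common remedy for softmax divergence in large-scale training runs.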
Methods

This paper employs the following methods (a toy combination of them is sketched after the list):

  • Transformer
  • Early-fusion
  • Token-based representation
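
Putting these pieces together, here is a toy decoder-only transformer over a unified text-plus-image vocabulary; every position is treated identically regardless of modality, which is the essence of the early-fusion setting. Sizes are illustrative (the vocabulary matches the hypothetical tokenizer sketch above) and the paper's stability modifications are omitted.

```python
import torch
import torch.nn as nn


class EarlyFusionLM(nn.Module):
    """Toy decoder-only language model over a single text+image vocabulary.
    Every position is an ordinary token regardless of modality (early fusion)."""

    def __init__(self, vocab_size: int = 73_730, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 2, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        causal = torch.triu(torch.full((seq, seq), float("-inf"), device=ids.device), diagonal=1)
        x = self.blocks(x, mask=causal)   # one attention stack over text and image tokens alike
        return self.lm_head(x)            # next-token logits, whether the next token is text or image


model = EarlyFusionLM()
print(model(torch.randint(0, 73_730, (1, 16))).shape)  # torch.Size([1, 16, 73730])
```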

Models Used

  • Chameleon
  • Llama-2
  • Mixtral 8x7B
  • Gemini-Pro
  • Gemini
  • GPT-4V

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • None specified

Results

  • State-of-the-art performance in image captioning
  • Outperforms Llama-2 in text-only tasks
  • Competitive performance with Mixtral 8x7B and Gemini-Pro
  • High preference rates in human evaluations

Limitations

The authors identified the following limitations:

  • Limited evaluation scope with human annotations
  • Challenges in OCR-related tasks
  • Notable inter-model tie rates during human evaluations

Technical Requirements

  • Number of GPUs: Not specified
  • GPU Type: NVIDIA A100 80 GB

Keywords

multimodal models, early-fusion, transformer, image and text reasoning, mixed-modal generation
