
Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team, FAIR at Meta (2024)

Paper Information
arXiv ID
2405.09818
Venue
arXiv.org
Domain
artificial intelligence, machine learning, multimodal modeling
SOTA Claim
Yes
Reproducibility
8/10

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
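As a rough illustration of the early-fusion, token-based setup described in the abstract, the sketch below interleaves text tokens and discrete image tokens into a single sequence over one shared vocabulary. The tokenizers, vocabulary sizes, and sentinel ids are hypothetical stand-ins, not the paper's actual tokenizers.

```python
# Illustrative sketch of early-fusion token interleaving (not the authors' code).
# Text and images are both mapped to discrete tokens in one shared vocabulary,
# so a single autoregressive transformer can model interleaved documents.
# All sizes, sentinel ids, and helpers below are assumptions for illustration.

TEXT_VOCAB = 65_536          # hypothetical text (BPE) vocabulary size
IMAGE_CODEBOOK = 8_192       # hypothetical VQ image codebook size
BOI = TEXT_VOCAB + IMAGE_CODEBOOK   # hypothetical begin-of-image sentinel
EOI = BOI + 1                       # hypothetical end-of-image sentinel


def tokenize_text(text: str) -> list[int]:
    """Stand-in for a real BPE tokenizer: ids in [0, TEXT_VOCAB)."""
    return [hash(word) % TEXT_VOCAB for word in text.split()]


def tokenize_image(image_codes: list[int]) -> list[int]:
    """Stand-in for a discrete image tokenizer: codebook ids shifted past the text range."""
    return [BOI] + [TEXT_VOCAB + c for c in image_codes] + [EOI]


def build_mixed_modal_sequence(segments: list[tuple[str, object]]) -> list[int]:
    """Flatten interleaved (modality, payload) segments into one token sequence."""
    tokens: list[int] = []
    for modality, payload in segments:
        tokens += tokenize_text(payload) if modality == "text" else tokenize_image(payload)
    return tokens


# Example document: a caption, a toy 4-code "image", then follow-up text.
doc = [
    ("text", "A photo of a red fox"),
    ("image", [17, 409, 5, 3021]),
    ("text", "The same animal described in words."),
]
print(build_mixed_modal_sequence(doc))
```

Because every segment ends up as ordinary token ids in one flat sequence, the same model can condition on and emit images and text in any order.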

Summary

Chameleon is a family of early-fusion, token-based mixed-modal foundation models capable of understanding and generating images and text in an interleaved fashion. The paper outlines a stable training methodology, architectural modifications, and evaluation results across a wide range of tasks, including visual question answering, image captioning, text generation, and long-form mixed-modal generation. Chameleon achieves state-of-the-art performance in image captioning and surpasses Llama-2 on text-only tasks, while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro. Trained on a large corpus of interleaved text and image tokens, the model matches or exceeds much larger models, including Gemini Pro and GPT-4V, in human evaluations of long-form mixed-modal generation; the paper also details architectural techniques that address optimization instabilities and reports alignment and safety testing.
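Among the stability-oriented changes the paper describes for this setting is query-key normalization inside attention. The following is a minimal re-implementation sketch of that idea in PyTorch, not the released architecture; dimensions and module layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Causal self-attention with LayerNorm applied to queries and keys per head.
    Illustrative sketch only; toy sizes, not the paper's configuration."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)   # bound logit growth before the softmax
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, s, -1))


attn = QKNormAttention()
print(attn(torch.randn(1, 8, 256)).shape)  # torch.Size([1, 8, 256])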

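```

Normalizing queries and keys keeps attention logits bounded, a common remedy for softmax divergence in large-scale training runs.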
Methods

This paper employs the following methods (a toy combination of them is sketched after the list):

  • Transformer
  • Early-fusion
  • Token-based representation
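
Putting these pieces together, here is a toy decoder-only transformer over a unified text-plus-image vocabulary; every position is treated identically regardless of modality, which is the essence of the early-fusion setting. Sizes are illustrative (the vocabulary matches the hypothetical tokenizer sketch above) and the paper's stability modifications are omitted.

```python
import torch
import torch.nn as nn


class EarlyFusionLM(nn.Module):
    """Toy decoder-only language model over a single text+image vocabulary.
    Every position is an ordinary token regardless of modality (early fusion)."""

    def __init__(self, vocab_size: int = 73_730, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 2, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        causal = torch.triu(torch.full((seq, seq), float("-inf"), device=ids.device), diagonal=1)
        x = self.blocks(x, mask=causal)   # one attention stack over text and image tokens alike
        return self.lm_head(x)            # next-token logits, whether the next token is text or image


model = EarlyFusionLM()
print(model(torch.randint(0, 73_730, (1, 16))).shape)  # torch.Size([1, 16, 73730])
```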

Models Used

  • Chameleon
  • Llama-2
  • Mixtral 8x7B
  • Gemini-Pro
  • Gemini
  • GPT-4V

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • None specified

Results

  • State-of-the-art performance in image captioning
  • Outperforms Llama-2 in text-only tasks
  • Competitive performance with Mixtral 8x7B and Gemini-Pro
  • High preference rates in human evaluations

Limitations

The authors identified the following limitations:

  • Limited evaluation scope with human annotations
  • Challenges in OCR-related tasks
  • Notable inter-model tie rates during human evaluations

Technical Requirements

  • Number of GPUs: Not specified
  • GPU Type: NVIDIA A100 80 GB

Keywords

multimodal models, early-fusion, transformer, image and text reasoning, mixed-modal generation
