Domain
artificial intelligence, machine learning, multimodal modeling
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks; it outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
Chameleon is a new family of early-fusion, token-based mixed-modal foundation models capable of understanding and generating images and text in an interleaved fashion. The paper outlines its stable training methodology, architectural innovations, and evaluation results across a wide range of tasks, including visual question answering, image captioning, text generation, and long-form mixed-modal generation. Chameleon achieves state-of-the-art performance in image captioning and surpasses Llama-2 in text-only tasks, while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro. Trained on a large corpus of interleaved text and image tokens, the model performs strongly at generating mixed-modal content, introduces architectural techniques that address optimization challenges in the early-fusion setting, and demonstrates strong alignment and safety in human evaluations.
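As a rough illustration of the early-fusion setup described above (not the paper's actual code), the sketch below shows how interleaved image and text content can be flattened into a single token sequence for one decoder-only transformer. The tokenizer stand-ins, vocabulary sizes, and grid size are hypothetical; the key idea is that image-token ids are offset into the same shared vocabulary as text tokens.

```python
# Minimal sketch of early-fusion tokenization; the tokenizers and sizes
# below are hypothetical placeholders, not Chameleon's implementation.

TEXT_VOCAB_SIZE = 65_536   # assumed text vocabulary size
IMAGE_VOCAB_SIZE = 8_192   # assumed codebook size of a discrete image tokenizer


def tokenize_text(text: str) -> list[int]:
    """Stand-in for a BPE text tokenizer; returns ids in [0, TEXT_VOCAB_SIZE)."""
    return [hash(tok) % TEXT_VOCAB_SIZE for tok in text.split()]


def tokenize_image(image) -> list[int]:
    """Stand-in for a discrete image tokenizer (e.g. a VQ model) that maps an
    image to a fixed-length grid of codebook ids in [0, IMAGE_VOCAB_SIZE)."""
    return [0] * 1024  # e.g. a 32x32 grid of codes (assumed length)


def build_mixed_modal_sequence(segments) -> list[int]:
    """Flatten interleaved (modality, content) segments into one id sequence.

    Image ids are offset by TEXT_VOCAB_SIZE so text and image tokens live in
    a single shared vocabulary, and one transformer models both modalities.
    """
    ids: list[int] = []
    for modality, content in segments:
        if modality == "text":
            ids.extend(tokenize_text(content))
        elif modality == "image":
            ids.extend(TEXT_VOCAB_SIZE + i for i in tokenize_image(content))
    return ids


# Usage: text, then an image, then more text -- one flat sequence throughout.
seq = build_mixed_modal_sequence([
    ("text", "A photo of a chameleon:"),
    ("image", object()),
    ("text", "Note the independently rotating eyes."),
])
print(len(seq))  # text ids plus 1024 offset image ids
```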
This paper employs the following methods:
- Transformer architecture
- Early fusion of modalities
- Token-based image and text representation
- Chameleon (the proposed model family)
Chameleon is compared against the following models:
- Llama-2
- Mixtral 8x7B
- Gemini-Pro
- GPT-4V
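One stability technique the Chameleon paper reports for the early-fusion transformer is QK-Norm, which applies LayerNorm to queries and keys before the attention dot product to bound logit growth. Below is a minimal PyTorch sketch of self-attention with QK-Norm, written as an independent illustration under assumed dimensions, not as the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Self-attention with QK-Norm: LayerNorm on queries and keys before the
    dot product, limiting attention-logit growth. Illustrative sketch only."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, head_dim).
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # QK-Norm: normalize queries and keys per head before attention.
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.proj(out)
```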
The following key results are reported:
- State-of-the-art performance in image captioning
- Outperforms Llama-2 in text-only tasks
- Competitive performance with Mixtral 8x7B and Gemini-Pro
- High preference rates in human evaluations
The authors identified the following limitations:
- Limited scope of the human-annotated evaluations
- Challenges in OCR-related tasks
- Notable inter-model tie rates during human evaluations
- Number of GPUs: not reported
- GPU Type: NVIDIA A100 80 GB
multimodal models
early-fusion
transformer
image and text reasoning
mixed-modal generation