
Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed (2024)

Paper Information
arXiv ID: 2401.04088
Venue: arXiv.org
Domain: Natural language processing
SOTA Claim: Yes
Code: None specified
Reproducibility: 7/10

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model finetuned to follow instructions, Mixtral 8x7B-Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Summary

This paper introduces Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model that builds on the architecture of Mistral 7B. Each layer comprises 8 feedforward blocks (experts), and a router network selects two of them to process each token, combining their outputs. This mechanism gives each token access to 47B parameters while activating only 13B during inference. Trained with a context size of 32k tokens, Mixtral is reported to outperform or match Llama 2 70B and GPT-3.5 across the evaluated benchmarks, with particularly large gains in mathematics, code generation, and multilingual tasks. Additionally, Mixtral 8x7B-Instruct, a fine-tuned version of the base model, demonstrates improved instruction-following capabilities and outperforms several leading models on human evaluation benchmarks. Both versions are released under the Apache 2.0 license for use in academia and industry.
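
The top-2 routing described above can be illustrated with a short, self-contained sketch. This is a minimal illustration under simplifying assumptions: the dimensions, module names, and the dense per-expert loop are illustrative only and do not reproduce Mixtral's SwiGLU experts or its optimized sparse execution.

```python
# Minimal sketch of top-2 sparse Mixture-of-Experts routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.gate(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the two chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: 16 tokens of width 512; each token only activates 2 of the 8 experts.
y = SparseMoELayer()(torch.randn(16, 512))
print(y.shape)  # torch.Size([16, 512])
```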

Methods

This paper employs the following methods:

  • Sparse Mixture of Experts (SMoE)
  • Direct Preference Optimization (DPO), used to fine-tune the instruct model (see the sketch after this list)
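
The paper reports that Mixtral 8x7B-Instruct was trained with supervised fine-tuning followed by DPO on a paired-feedback dataset. Below is a generic sketch of the DPO objective (Rafailov et al., 2023), not Mistral's training code; the variable names and the beta value are illustrative assumptions.

```python
# Generic sketch of the DPO objective used for preference fine-tuning.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are per-example sequence log-probabilities, shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin by which the policy prefers the chosen response
    # over the rejected one, relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


# Toy check: a policy that already prefers the chosen responses gets a lower loss.
better = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                  torch.tensor([-6.0]), torch.tensor([-6.0]))
worse = dpo_loss(torch.tensor([-9.0]), torch.tensor([-5.0]),
                 torch.tensor([-6.0]), torch.tensor([-6.0]))
print(better.item() < worse.item())  # True
```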

Models Used

  • Mixtral 8x7B
  • Mixtral 8x7B-Instruct
  • Llama 2 70B
  • GPT-3.5
  • Claude-2.1
  • Gemini Pro

Datasets

The following datasets were used in this research:

  • Hellaswag
  • Winogrande
  • PIQA
  • SIQA
  • OpenbookQA
  • ARC-Easy
  • ARC-Challenge
  • CommonsenseQA
  • NaturalQuestions
  • TriviaQA
  • BoolQ
  • QuAC
  • GSM8K
  • MATH
  • HumanEval
  • MBPP
  • MMLU
  • BBH
  • AGI Eval
  • BBQ
  • BOLD
  • The Pile

Evaluation Metrics

  • MMLU
  • maj@4 (majority vote over 4 sampled answers)
  • maj@8 (majority vote over 8 sampled answers)
  • pass@1 (fraction of problems solved by a single sampled completion; see the estimator sketch after this list)
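
For readers unfamiliar with these metric names: maj@k takes the most frequent final answer among k sampled generations, and pass@k estimates the probability that at least one of k sampled programs passes the unit tests (pass@1 is the k = 1 case). The helpers below sketch the standard unbiased pass@k estimator from Chen et al. (2021) and a simple majority vote; they are background for the metrics, not code from the paper.

```python
# Background sketch for the metric names above; not code from the paper.
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(sampled_answers, reference):
    """maj@k: majority vote over k sampled final answers (as in maj@4 / maj@8)."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == reference


print(pass_at_k(n=10, c=3, k=1))           # 0.3 (pass@1 is just the solve rate)
print(maj_at_k(["42", "41", "42"], "42"))  # True
```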

Results

  • Mixtral 8x7B outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks, with the largest gains on mathematics, code generation, and multilingual tasks.
  • Mixtral 8x7B-Instruct surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat on human evaluation benchmarks.
  • Mixtral achieves 100% retrieval accuracy on the passkey retrieval task, regardless of context length or the position of the passkey in the sequence.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

mixture of experts, sparse models, language modeling, multilingual, long context
