
Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed (2024)

Paper Information
arXiv ID: 2401.04088
Venue: arXiv.org
Domain: Natural language processing
SOTA Claim: Yes
Code: None specified
Reproducibility: 7/10

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model finetuned to follow instructions, Mixtral 8x7B-Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Summary

This paper introduces Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model that builds on the architecture of Mistral 7B. Each layer comprises 8 feedforward blocks (experts), and a router network selects two of them to process each token, combining their outputs. This mechanism gives each token access to 47B parameters while activating only 13B during inference. Trained with a context size of 32k tokens, Mixtral is reported to outperform or match Llama 2 70B and GPT-3.5 across the evaluated benchmarks, with particularly large gains in mathematics, code generation, and multilingual tasks. Additionally, Mixtral 8x7B-Instruct, a fine-tuned version of the base model, demonstrates improved instruction-following capabilities and outperforms several leading models on human evaluation benchmarks. Both versions are released under the Apache 2.0 license for use in academia and industry.
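
The top-2 routing described above can be illustrated with a short, self-contained sketch. This is a minimal illustration under simplifying assumptions: the dimensions, module names, and the dense per-expert loop are illustrative only and do not reproduce Mixtral's SwiGLU experts or its optimized sparse execution.

```python
# Minimal sketch of top-2 sparse Mixture-of-Experts routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.gate(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the two chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: 16 tokens of width 512; each token only activates 2 of the 8 experts.
y = SparseMoELayer()(torch.randn(16, 512))
print(y.shape)  # torch.Size([16, 512])
```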

Methods

This paper employs the following methods:

  • Sparse Mixture of Experts (SMoE)
  • Direct Preference Optimization (DPO), used to fine-tune the instruct model (see the sketch after this list)
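
The paper reports that Mixtral 8x7B-Instruct was trained with supervised fine-tuning followed by DPO on a paired-feedback dataset. Below is a generic sketch of the DPO objective (Rafailov et al., 2023), not Mistral's training code; the variable names and the beta value are illustrative assumptions.

```python
# Generic sketch of the DPO objective used for preference fine-tuning.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are per-example sequence log-probabilities, shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin by which the policy prefers the chosen response
    # over the rejected one, relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


# Toy check: a policy that already prefers the chosen responses gets a lower loss.
better = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                  torch.tensor([-6.0]), torch.tensor([-6.0]))
worse = dpo_loss(torch.tensor([-9.0]), torch.tensor([-5.0]),
                 torch.tensor([-6.0]), torch.tensor([-6.0]))
print(better.item() < worse.item())  # True
```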

Models Used

  • Mixtral 8x7B
  • Mixtral 8x7B-Instruct
  • Llama 2 70B
  • GPT-3.5
  • Claude-2.1
  • Gemini Pro

Datasets

The following datasets were used in this research:

  • Hellaswag
  • Winogrande
  • PIQA
  • SIQA
  • OpenbookQA
  • ARC-Easy
  • ARC-Challenge
  • CommonsenseQA
  • NaturalQuestions
  • TriviaQA
  • BoolQ
  • QuAC
  • GSM8K
  • MATH
  • HumanEval
  • MBPP
  • MMLU
  • BBH
  • AGI Eval
  • BBQ
  • BOLD
  • The Pile

Evaluation Metrics

  • MMLU
  • maj@4 (majority vote over 4 sampled answers)
  • maj@8 (majority vote over 8 sampled answers)
  • pass@1 (fraction of problems solved by a single sampled completion; see the estimator sketch after this list)
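
For readers unfamiliar with these metric names: maj@k takes the most frequent final answer among k sampled generations, and pass@k estimates the probability that at least one of k sampled programs passes the unit tests (pass@1 is the k = 1 case). The helpers below sketch the standard unbiased pass@k estimator from Chen et al. (2021) and a simple majority vote; they are background for the metrics, not code from the paper.

```python
# Background sketch for the metric names above; not code from the paper.
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(sampled_answers, reference):
    """maj@k: majority vote over k sampled final answers (as in maj@4 / maj@8)."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == reference


print(pass_at_k(n=10, c=3, k=1))           # 0.3 (pass@1 is just the solve rate)
print(maj_at_k(["42", "41", "42"], "42"))  # True
```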

Results

  • Mixtral 8x7B outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks, with the largest gains on mathematics, code generation, and multilingual tasks.
  • Mixtral 8x7B-Instruct surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat on human evaluation benchmarks.
  • Mixtral achieves 100% retrieval accuracy on the passkey retrieval task, regardless of context length or the position of the passkey in the sequence.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

mixture of experts, sparse models, language modeling, multilingual, long context
