Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed (2024)
This paper introduces Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model that builds on the architecture of Mistral 7B. Each layer comprises 8 feedforward blocks (experts), and a router network selects two of them to process each token. This mechanism lets Mixtral use 47B parameters while activating only 13B per token during inference. Trained with a context size of 32k tokens, Mixtral is reported to match or outperform Llama 2 70B and GPT-3.5 across various benchmarks, excelling in particular at mathematics, code generation, and multilingual tasks. Additionally, Mixtral 8x7B Instruct, a version of the base model fine-tuned to follow instructions, outperforms several leading models on human evaluation benchmarks. Both versions of Mixtral are released under the Apache 2.0 license for accessible use in academia and industry.
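As a rough illustration of the top-2 routing described above, the sketch below shows a minimal sparse MoE layer: a router scores 8 experts per token, the two highest-scoring experts are applied, and their outputs are combined with softmax-normalized weights. The class name, dimensions, and the simple SiLU feedforward experts are illustrative assumptions for this sketch, not the authors' implementation (Mixtral uses SwiGLU experts and an optimized routing kernel).

```python
# Minimal sketch of top-2 sparse MoE routing (assumed names/dimensions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, hidden_dim=128, ffn_dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for each token.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Placeholder experts: simple SiLU MLPs standing in for Mixtral's SwiGLU blocks.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.SiLU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (tokens, hidden_dim)
        logits = self.router(x)                              # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 4 example tokens through the layer.
layer = Top2MoELayer()
tokens = torch.randn(4, 128)
print(layer(tokens).shape)  # torch.Size([4, 128])
```

Because only 2 of the 8 experts run per token, the layer stores parameters for all experts but spends compute on roughly a quarter of them, which is the source of the 47B total versus 13B active parameter gap noted above.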
This paper employs the following methods:
The following datasets were used in this research: