Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed (2023)
The paper introduces Mistral 7B, a 7-billion-parameter language model designed for strong performance and efficiency, outperforming Llama 2 13B (the best open model of that size at release) across evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. The model uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle long sequences at lower cost, yielding higher throughput. A fine-tuned variant, Mistral 7B-Instruct, surpasses Llama 2 13B-Chat in both human and automated evaluations. The paper also details architectural choices such as SWA and the rolling buffer cache that improve memory efficiency and processing speed. Through extensive benchmarking, Mistral 7B demonstrates effectiveness across a range of tasks, including commonsense reasoning and math, and the paper introduces methodologies for safety and content moderation, such as system-prompt guardrails.
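To make the sliding window attention and rolling buffer cache ideas concrete, below is a minimal Python sketch of a fixed-size circular key-value cache. The names (`RollingKVCache`, `window_size`, `append`, `attend`) are illustrative assumptions for this sketch, not identifiers from the paper or Mistral's released code.

```python
import numpy as np

# Minimal sketch of a rolling (circular) key-value cache for sliding window
# attention. Class and parameter names (RollingKVCache, window_size) are
# illustrative assumptions, not identifiers from the Mistral 7B codebase.
class RollingKVCache:
    def __init__(self, window_size: int, head_dim: int):
        self.window_size = window_size
        # Fixed-size buffers: memory does not grow with sequence length.
        self.keys = np.zeros((window_size, head_dim))
        self.values = np.zeros((window_size, head_dim))
        self.pos = 0  # total number of tokens seen so far

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Write into slot (pos mod window_size), overwriting the entry from
        # window_size tokens ago -- tokens outside the window are evicted.
        slot = self.pos % self.window_size
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Attend only over the min(pos, window_size) cached tokens.
        n = min(self.pos, self.window_size)
        scores = self.keys[:n] @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values[:n]

# Usage: each new token attends to at most window_size predecessors, so the
# per-layer cache stays at a constant size even for very long sequences.
cache = RollingKVCache(window_size=4, head_dim=8)
for _ in range(10):
    k = v = q = np.random.randn(8)
    cache.append(k, v)
    out = cache.attend(q)
```

Because the buffer only ever holds the last `window_size` tokens, cache memory is bounded by the window rather than the sequence length, which is the mechanism behind the memory savings and higher throughput described above.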
This paper employs the following methods:
- Grouped-Query Attention (GQA)
- Sliding Window Attention (SWA)
- Rolling buffer cache
- Instruction fine-tuning (Mistral 7B-Instruct)
- System-prompt guardrails for safety and content moderation
The following datasets were used in this research:
The authors identified the following limitations: