← ML Research Wiki / 2310.06825

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed (2023)

Paper Information
arXiv ID
2310.06825
Venue
arXiv.org
Domain
Natural Language Processing
SOTA Claim
Yes
Code
Reproducibility
8/10

Abstract

We introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses the Llama 2 13B – Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.

Summary

The paper introduces Mistral 7B, a 7-billion-parameter language model designed for superior performance and efficiency, outperforming Llama 2 (the best open 13B model) in various benchmarks, particularly in reasoning, mathematics, and code generation. The model utilizes grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to manage long sequences, resulting in higher throughput and reduced costs. Mistral 7B is also fine-tuned as Mistral 7B-Instruct, demonstrating superior performance in human and automated evaluations. The paper discusses architectural innovations like SWA and the rolling buffer cache that enhance memory efficiency and processing speed. Through thorough benchmarking, Mistral 7B proves its effectiveness across a range of tasks, including commonsense reasoning and math, and introduces methodologies for safety and content moderation in AI models.
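The rolling buffer cache mentioned above exploits the fact that sliding window attention never looks back more than `window` tokens, so the key/value cache can be a fixed-size ring buffer in which timestep `i` overwrites slot `i % window`, keeping memory O(window) regardless of sequence length. A minimal NumPy sketch (class name and layout are illustrative, not the paper's implementation):

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache: timestep i is written to slot i % window,
    so memory stays O(window) no matter how long the sequence grows."""

    def __init__(self, window: int, d: int):
        self.window = window
        self.keys = np.zeros((window, d))
        self.values = np.zeros((window, d))
        self.t = 0  # number of tokens seen so far

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.t % self.window  # overwrite the oldest entry
        self.keys[slot] = k
        self.values[slot] = v
        self.t += 1

    def current(self):
        """Return the cached (keys, values) in temporal order."""
        n = min(self.t, self.window)
        start = self.t % self.window if self.t > self.window else 0
        idx = [(start + i) % self.window for i in range(n)]
        return self.keys[idx], self.values[idx]
```

After five tokens with `window=3`, only tokens 2, 3, and 4 remain in the cache, which is exactly the set sliding window attention can still attend to.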

Methods

This paper employs the following methods:

  • Grouped-Query Attention (GQA)
  • Sliding Window Attention (SWA)
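The two methods above can be sketched together in a few lines of NumPy: GQA shares each key/value head across a group of query heads (shrinking the KV cache), and SWA masks each position so it attends only to the previous `window` tokens. This is an illustrative sketch with hypothetical function names and a tiny window, not the paper's implementation:

```python
import numpy as np

def swa_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: position i may attend to j iff
    i - window < j <= i, so the attention span is at most `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def grouped_query_attention(q, k, v, n_kv_heads: int, window: int = 3):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = np.where(swa_mask(seq, window), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n_heads, seq, d)
```

With stacked layers, information still propagates beyond the window: after L layers, a token can indirectly depend on roughly L × window earlier positions.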

Models Used

  • Mistral 7B
  • Llama 1
  • Llama 2
  • Code-Llama 7B
  • Mistral 7B-Instruct

Datasets

The following datasets were used in this research:

  • HellaSwag
  • Winogrande
  • PIQA
  • SIQA
  • OpenBookQA
  • ARC-Easy
  • ARC-Challenge
  • CommonsenseQA
  • NaturalQuestions
  • TriviaQA
  • BoolQ
  • QuAC
  • GSM8K
  • MATH
  • HumanEval
  • MBPP
  • MMLU
  • BBH
  • AGIEval

Evaluation Metrics

  • maj@8 (GSM8K)
  • maj@4 (MATH)
  • MT-Bench
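The maj@k metrics above denote majority voting: the model is sampled k times per problem and the most frequent final answer is taken as the prediction. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def maj_at_k(answers: list[str]) -> str:
    """Majority voting over k sampled answers: return the most frequent
    answer; ties go to the answer that appeared first."""
    return Counter(answers).most_common(1)[0][0]

# e.g. maj@4 over four sampled final answers to one math problem:
print(maj_at_k(["42", "41", "42", "40"]))  # prints 42
```

Voting over multiple samples rewards answers the model reaches consistently, which typically lifts accuracy over a single greedy sample on math benchmarks.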

Results

  • Mistral 7B outperforms Llama 2 13B on all evaluated benchmarks
  • Superior performance in mathematics and code generation compared to Llama 1 34B
  • Mistral 7B-Instruct surpasses Llama 2 13B on MT-Bench and human evaluations

Limitations

The authors identified the following limitations:

  • Performance on knowledge benchmarks is only comparable to Llama 2 13B, likely because Mistral 7B's smaller parameter count limits how much knowledge it can store.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large language models, Attention mechanisms, Inference efficiency, Instruction tuning, Content moderation

Papers Using Similar Methods

External Resources