← ML Research Wiki / 2310.06825

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed (2023)

Paper Information
arXiv ID
2310.06825
Venue
arXiv.org
Domain
Natural Language Processing
SOTA Claim
Yes
Code
Reproducibility
8/10

Abstract

We introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses the Llama 2 13B – Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.

Summary

The paper introduces Mistral 7B, a 7-billion-parameter language model designed for superior performance and efficiency, outperforming Llama 2 (the best open 13B model) in various benchmarks, particularly in reasoning, mathematics, and code generation. The model utilizes grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to manage long sequences, resulting in higher throughput and reduced costs. Mistral 7B is also fine-tuned as Mistral 7B-Instruct, demonstrating superior performance in human and automated evaluations. The paper discusses architectural innovations like SWA and the rolling buffer cache that enhance memory efficiency and processing speed. Through thorough benchmarking, Mistral 7B proves its effectiveness across a range of tasks, including commonsense reasoning and math, and introduces methodologies for safety and content moderation in AI models.
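The rolling buffer cache mentioned above exploits the fact that sliding window attention never looks back more than `window` tokens, so the key/value cache can be a fixed-size ring buffer in which timestep `i` overwrites slot `i % window`, keeping memory O(window) regardless of sequence length. A minimal NumPy sketch (class name and layout are illustrative, not the paper's implementation):

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache: timestep i is written to slot i % window,
    so memory stays O(window) no matter how long the sequence grows."""

    def __init__(self, window: int, d: int):
        self.window = window
        self.keys = np.zeros((window, d))
        self.values = np.zeros((window, d))
        self.t = 0  # number of tokens seen so far

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.t % self.window  # overwrite the oldest entry
        self.keys[slot] = k
        self.values[slot] = v
        self.t += 1

    def current(self):
        """Return the cached (keys, values) in temporal order."""
        n = min(self.t, self.window)
        start = self.t % self.window if self.t > self.window else 0
        idx = [(start + i) % self.window for i in range(n)]
        return self.keys[idx], self.values[idx]
```

After five tokens with `window=3`, only tokens 2, 3, and 4 remain in the cache, which is exactly the set sliding window attention can still attend to.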

Methods

This paper employs the following methods:

  • Grouped-Query Attention (GQA)
  • Sliding Window Attention (SWA)
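The two methods above can be sketched together in a few lines of NumPy: GQA shares each key/value head across a group of query heads (shrinking the KV cache), and SWA masks each position so it attends only to the previous `window` tokens. This is an illustrative sketch with hypothetical function names and a tiny window, not the paper's implementation:

```python
import numpy as np

def swa_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: position i may attend to j iff
    i - window < j <= i, so the attention span is at most `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def grouped_query_attention(q, k, v, n_kv_heads: int, window: int = 3):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = np.where(swa_mask(seq, window), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n_heads, seq, d)
```

With stacked layers, information still propagates beyond the window: after L layers, a token can indirectly depend on roughly L × window earlier positions.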

Models Used

  • Mistral 7B
  • Llama 1
  • Llama 2
  • Code-Llama 7B
  • Mistral 7B-Instruct

Datasets

The following datasets were used in this research:

  • HellaSwag
  • Winogrande
  • PIQA
  • SIQA
  • OpenBookQA
  • ARC-Easy
  • ARC-Challenge
  • CommonsenseQA
  • NaturalQuestions
  • TriviaQA
  • BoolQ
  • QuAC
  • GSM8K
  • MATH
  • HumanEval
  • MBPP
  • MMLU
  • BBH
  • AGIEval

Evaluation Metrics

  • maj@8 (GSM8K)
  • maj@4 (MATH)
  • MT-Bench
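The maj@k metrics above denote majority voting: the model is sampled k times per problem and the most frequent final answer is taken as the prediction. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def maj_at_k(answers: list[str]) -> str:
    """Majority voting over k sampled answers: return the most frequent
    answer; ties go to the answer that appeared first."""
    return Counter(answers).most_common(1)[0][0]

# e.g. maj@4 over four sampled final answers to one math problem:
print(maj_at_k(["42", "41", "42", "40"]))  # prints 42
```

Voting over multiple samples rewards answers the model reaches consistently, which typically lifts accuracy over a single greedy sample on math benchmarks.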

Results

  • Mistral 7B outperforms Llama 2 13B on all evaluated benchmarks
  • Superior performance in mathematics and code generation compared to Llama 1 34B
  • Mistral 7B-Instruct surpasses Llama 2 13B on MT-Bench and human evaluations

Limitations

The authors identified the following limitations:

  • Performance on knowledge benchmarks is only comparable to Llama 2 13B, likely because Mistral 7B's smaller parameter count limits how much knowledge it can store.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large language models, Attention mechanisms, Inference efficiency, Instruction tuning, Content moderation

Papers Using Similar Methods

External Resources