
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

(2024)

Paper Information

arXiv ID: 2405.04434
Venue: arXiv.org
Domain: Natural Language Processing
SOTA Claim: Yes
Code: https://github.com/deepseek-ai/DeepSeek-V2
Reproducibility: 8/10

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.

Summary

DeepSeek-V2 is a Mixture-of-Experts (MoE) language model designed for economical training and efficient inference, with 236B total parameters of which 21B are activated per token. It supports a context length of 128K tokens and adopts two architectural innovations: Multi-head Latent Attention (MLA), which compresses the Key-Value (KV) cache into a latent vector, and DeepSeekMoE, which enables economical training through sparse computation. Compared to its predecessor, DeepSeek 67B, it achieves stronger performance while saving 42.5% in training costs, reducing the KV cache by 93.3%, and boosting maximum generation throughput to 5.76 times. The architecture redesigns both the attention modules and the Feed-Forward Networks (FFNs) within the Transformer framework, promoting expert specialization and reducing communication overheads. DeepSeek-V2 is pretrained on a multi-source corpus of 8.1T tokens and then undergoes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Even with only 21B activated parameters, the model and its chat versions achieve top-tier performance among open-source models, particularly on English and Chinese benchmarks.
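
To make the MLA description above concrete, here is a minimal PyTorch-style sketch. The class name, dimension defaults, and the omission of the decoupled RoPE key and causal mask are choices made here for illustration, not the released DeepSeek-V2 implementation; the point is only that keys and values are reconstructed from a small per-token latent, and that latent is all that needs to be cached.

```python
import torch
import torch.nn as nn


class LatentKVAttention(nn.Module):
    """Minimal sketch of the MLA idea: compress K/V into a small per-token latent.

    Only the latent (`c_kv`) is cached during generation; full keys and values
    are reconstructed from it on the fly. Dimensions are illustrative, and the
    decoupled RoPE key and causal masking of the real model are omitted.
    """

    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress to latent
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.w_down_kv(x)                     # (b, t, d_latent): the only new cached state
        if latent_cache is not None:                 # prepend latents cached from earlier steps
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        k = self.w_up_k(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.w_o(out), c_kv                   # return c_kv to use as the KV cache
```

With this layout, each generated token adds only `d_latent` values per layer to the cache instead of `2 * n_heads * d_head`, which is where the large KV-cache savings reported in the abstract come from.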

Methods

This paper employs the following methods:

  • Multi-head Latent Attention (MLA)
  • DeepSeekMoE
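
DeepSeekMoE, as described in the DeepSeekMoE line of work, segments experts finely and keeps a few shared experts always active while a router selects a small top-k subset of routed experts per token. Below is a minimal sketch of such a layer; the expert counts, hidden sizes, and the dense gating used for readability are assumptions made here, and the training-time load-balancing objectives are omitted.

```python
import torch
import torch.nn as nn


class DeepSeekMoELayer(nn.Module):
    """Minimal sketch of a DeepSeekMoE-style FFN layer.

    A few shared experts process every token, while a router picks top-k of
    many small routed experts per token, so only a fraction of the layer's
    parameters is activated. Sizes are illustrative; load-balancing losses
    used during training are omitted.
    """

    def __init__(self, d_model=512, d_ff=256, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)     # shared experts: always active
        scores = torch.softmax(self.router(x), dim=-1)  # (n_tokens, n_routed)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, top_idx, top_w)
        # For clarity every routed expert is evaluated and masked by the gate;
        # a real MoE kernel dispatches each token only to its selected experts.
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)  # (n_tokens, n_routed, d_model)
        return shared_out + (gate.unsqueeze(-1) * routed_out).sum(dim=1)
```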

Models Used

  • DeepSeek-V2
  • DeepSeek 67B

Datasets

The following datasets were used in this research:

  • Multi-source pretraining corpus (8.1T tokens)

Evaluation Metrics

  • Generation throughput
  • MMLU accuracy (top-tier among open-source models)
  • AlpacaEval 2.0 length-controlled win rate (38.9)
  • MT-Bench overall score (8.97)
  • AlignBench overall score (7.91)

Results

  • Achieved top-tier performance among open-source models with only 21B activated parameters
  • Saved 42.5% of training costs compared with DeepSeek 67B
  • Reduced the KV cache by 93.3% and boosted maximum generation throughput to 5.76 times that of DeepSeek 67B
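
The KV-cache reduction listed above is a per-token bookkeeping comparison: caching one small latent (plus a decoupled RoPE key) per layer instead of full per-head keys and values. The short calculation below shows how a reduction of that order arises; the layer, head, and width numbers are placeholders chosen for illustration, not the two models' published configurations.

```python
# Back-of-envelope KV-cache comparison. All numbers below are illustrative
# placeholders, not the exact DeepSeek 67B / DeepSeek-V2 configurations.
def mha_cache_per_token(n_layers, n_heads, d_head, bytes_per_elem=2):
    # Standard multi-head attention caches a full key and value per head per layer.
    return n_layers * n_heads * d_head * 2 * bytes_per_elem

def mla_cache_per_token(n_layers, d_latent, d_rope, bytes_per_elem=2):
    # MLA caches one compressed KV latent plus a small decoupled RoPE key per layer.
    return n_layers * (d_latent + d_rope) * bytes_per_elem

baseline = mha_cache_per_token(n_layers=60, n_heads=64, d_head=128)
mla = mla_cache_per_token(n_layers=60, d_latent=512, d_rope=64)
print(f"per-token KV cache: {baseline} B vs {mla} B "
      f"({1 - mla / baseline:.1%} smaller)")   # a reduction of the same order as 93.3%
```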

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA H800

Keywords

Large Language Models, Mixture-of-Experts, MLA, DeepSeekMoE, Transformer, Long context
