Domain
Natural Language Processing
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.
DeepSeek-V2 is an advanced Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference, with 236B total parameters of which 21B are activated for each token. It supports a context length of 128K tokens and features innovative architectures: Multi-head Latent Attention (MLA), which reduces the KV cache, and DeepSeekMoE, which enables economical training. Compared to its predecessor, DeepSeek 67B, it demonstrates stronger performance while saving 42.5% in training costs and achieving 5.76 times higher maximum generation throughput. The architecture optimizes the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework, promoting expert specialization and reducing communication overheads. DeepSeek-V2 is pretrained on a multi-source corpus of 8.1T tokens and then undergoes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to improve performance across various benchmarks. The model achieves top-tier performance among open-source models, especially on English and Chinese benchmarks, and its efficiency is evidenced by substantial gains in both training and inference.
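To make the KV-cache compression concrete, below is a minimal sketch of the low-rank compression idea behind MLA: instead of caching full per-head keys and values, each token caches a single small latent vector that is up-projected into keys and values at attention time. All dimensions, weight names, and the module layout are illustrative assumptions rather than DeepSeek-V2's actual implementation, and MLA's decoupled rotary-position branch is omitted.

```python
# Illustrative sketch of low-rank KV compression (MLA-style), not the paper's exact design.
# Dimension and weight names (d_latent, w_dkv, w_uk, w_uv) are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedKVAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        # Down-projection: one small latent per token is all that needs to be cached.
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections recover per-head keys/values from the cached latent at attention time.
        self.w_uk = nn.Linear(d_latent, d_model, bias=False)
        self.w_uv = nn.Linear(d_latent, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.w_dkv(x)                       # (b, t, d_latent): the only cached state
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)
        k = self.w_uk(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # causal masking omitted for brevity
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), latent                 # return the latent as the new KV cache
```

During decoding only the per-token latent (plus, in the real model, a small decoupled positional key) needs to stay in memory, which is why the cached state per token can shrink so dramatically compared with caching full multi-head keys and values.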
This paper employs the following methods:
- Multi-head Latent Attention (MLA)
- DeepSeekMoE (a sketch follows this list)
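As a companion sketch, the following illustrates a DeepSeekMoE-style sparse FFN: a few always-active shared experts plus many fine-grained routed experts, of which only a small top-k subset runs per token. Expert counts, hidden sizes, top_k, and the gating details are illustrative placeholders rather than DeepSeek-V2's configuration; auxiliary load-balancing losses and device-limited routing are omitted.

```python
# Illustrative sketch of a shared-plus-routed sparse MoE FFN (DeepSeekMoE-style).
# All sizes below are placeholders, not DeepSeek-V2's published configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=256, n_shared=2, n_routed=16, top_k=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])   # always active
        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])   # sparsely active
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                      # x: (n_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)          # shared experts see every token
        scores = F.softmax(self.gate(x), dim=-1)             # routing affinities
        top_w, top_idx = scores.topk(self.top_k, dim=-1)     # keep only top_k experts per token
        routed_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                mask = top_idx[:, slot] == e_id
                if mask.any():                                # run each expert only on its tokens
                    routed_out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return shared_out + routed_out
```

Because only the shared experts and the top_k routed experts execute for a given token, the activated parameters per token are a small fraction of the total parameter count, which is the mechanism behind the 21B-activated / 236B-total split described above.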
The paper reports the following results:
- Top-tier performance on MMLU among open-source models, with only 21B activated parameters
- 38.9 length-controlled win rate on AlpacaEval 2.0
- 8.97 overall score on MT-Bench
- 7.91 overall score on AlignBench
- 42.5% savings in training costs compared to DeepSeek 67B
- 93.3% reduction in KV cache and a maximum generation throughput boosted to 5.76 times that of DeepSeek 67B
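To put the 93.3% KV-cache figure in perspective, a back-of-the-envelope calculation with the standard per-token KV-cache formula is sketched below. The layer count, KV-head count, and head dimension are placeholder values, not DeepSeek-V2's published configuration; only the formula and the 93.3% figure are taken as given.

```python
# Back-of-the-envelope: what a 93.3% KV-cache reduction means per token.
def mha_kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    # Standard MHA/GQA cache: one K and one V vector per KV head, per layer (fp16/bf16 = 2 bytes).
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

baseline = mha_kv_bytes_per_token(n_layers=60, n_kv_heads=64, d_head=128)  # placeholder dims
reduced = baseline * (1 - 0.933)           # a 93.3% reduction leaves 6.7% of the cache
print(f"baseline: {baseline / 1024:.0f} KiB/token, reduced: {reduced / 1024:.1f} KiB/token")
print(f"compression factor: {baseline / reduced:.1f}x")    # roughly 15x smaller per token
```

A roughly 15x smaller cache per token lets far more concurrent sequences (or much longer contexts) fit in GPU memory, which is what drives the reported gain in maximum generation throughput.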
The following hardware was reported:
- Number of GPUs: 8
- GPU Type: NVIDIA H800
Keywords:
- Large Language Models
- Mixture-of-Experts
- MLA
- DeepSeekMoE
- Transformer
- Long context