
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

(2024)

Paper Information

arXiv ID: 2405.04434
Venue: arXiv.org
Domain: Natural Language Processing
SOTA Claim: Yes
Code: https://github.com/deepseek-ai/DeepSeek-V2
Reproducibility: 8/10

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V2.

Summary

DeepSeek-V2 is a Mixture-of-Experts (MoE) language model designed for economical training and efficient inference, with 236B total parameters of which 21B are activated per token. It supports a context length of 128K tokens and adopts two architectural innovations: Multi-head Latent Attention (MLA), which compresses the Key-Value (KV) cache into a latent vector, and DeepSeekMoE, which enables economical training through sparse computation. Compared to its predecessor, DeepSeek 67B, it achieves stronger performance while saving 42.5% in training costs, reducing the KV cache by 93.3%, and boosting maximum generation throughput to 5.76 times. The architecture redesigns both the attention modules and the Feed-Forward Networks (FFNs) within the Transformer framework, promoting expert specialization and reducing communication overheads. DeepSeek-V2 is pretrained on a multi-source corpus of 8.1T tokens and then undergoes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Even with only 21B activated parameters, the model and its chat versions achieve top-tier performance among open-source models, particularly on English and Chinese benchmarks.
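
To make the MLA description above concrete, here is a minimal PyTorch-style sketch. The class name, dimension defaults, and the omission of the decoupled RoPE key and causal mask are choices made here for illustration, not the released DeepSeek-V2 implementation; the point is only that keys and values are reconstructed from a small per-token latent, and that latent is all that needs to be cached.

```python
import torch
import torch.nn as nn


class LatentKVAttention(nn.Module):
    """Minimal sketch of the MLA idea: compress K/V into a small per-token latent.

    Only the latent (`c_kv`) is cached during generation; full keys and values
    are reconstructed from it on the fly. Dimensions are illustrative, and the
    decoupled RoPE key and causal masking of the real model are omitted.
    """

    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress to latent
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.w_down_kv(x)                     # (b, t, d_latent): the only new cached state
        if latent_cache is not None:                 # prepend latents cached from earlier steps
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        k = self.w_up_k(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.w_o(out), c_kv                   # return c_kv to use as the KV cache
```

With this layout, each generated token adds only `d_latent` values per layer to the cache instead of `2 * n_heads * d_head`, which is where the large KV-cache savings reported in the abstract come from.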

Methods

This paper employs the following methods:

  • Multi-head Latent Attention (MLA)
  • DeepSeekMoE
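
DeepSeekMoE, as described in the DeepSeekMoE line of work, segments experts finely and keeps a few shared experts always active while a router selects a small top-k subset of routed experts per token. Below is a minimal sketch of such a layer; the expert counts, hidden sizes, and the dense gating used for readability are assumptions made here, and the training-time load-balancing objectives are omitted.

```python
import torch
import torch.nn as nn


class DeepSeekMoELayer(nn.Module):
    """Minimal sketch of a DeepSeekMoE-style FFN layer.

    A few shared experts process every token, while a router picks top-k of
    many small routed experts per token, so only a fraction of the layer's
    parameters is activated. Sizes are illustrative; load-balancing losses
    used during training are omitted.
    """

    def __init__(self, d_model=512, d_ff=256, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)     # shared experts: always active
        scores = torch.softmax(self.router(x), dim=-1)  # (n_tokens, n_routed)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, top_idx, top_w)
        # For clarity every routed expert is evaluated and masked by the gate;
        # a real MoE kernel dispatches each token only to its selected experts.
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)  # (n_tokens, n_routed, d_model)
        return shared_out + (gate.unsqueeze(-1) * routed_out).sum(dim=1)
```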

Models Used

  • DeepSeek-V2
  • DeepSeek 67B

Datasets

The following datasets were used in this research:

  • Multi-source pretraining corpus (8.1T tokens)

Evaluation Metrics

  • Generation throughput
  • MMLU accuracy (top-tier among open-source models)
  • AlpacaEval 2.0 length-controlled win rate (38.9)
  • MT-Bench overall score (8.97)
  • AlignBench overall score (7.91)

Results

  • Achieved top-tier performance among open-source models with only 21B activated parameters
  • Saved 42.5% of training costs compared with DeepSeek 67B
  • Reduced the KV cache by 93.3% and boosted maximum generation throughput to 5.76 times that of DeepSeek 67B
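
The KV-cache reduction listed above is a per-token bookkeeping comparison: caching one small latent (plus a decoupled RoPE key) per layer instead of full per-head keys and values. The short calculation below shows how a reduction of that order arises; the layer, head, and width numbers are placeholders chosen for illustration, not the two models' published configurations.

```python
# Back-of-envelope KV-cache comparison. All numbers below are illustrative
# placeholders, not the exact DeepSeek 67B / DeepSeek-V2 configurations.
def mha_cache_per_token(n_layers, n_heads, d_head, bytes_per_elem=2):
    # Standard multi-head attention caches a full key and value per head per layer.
    return n_layers * n_heads * d_head * 2 * bytes_per_elem

def mla_cache_per_token(n_layers, d_latent, d_rope, bytes_per_elem=2):
    # MLA caches one compressed KV latent plus a small decoupled RoPE key per layer.
    return n_layers * (d_latent + d_rope) * bytes_per_elem

baseline = mha_cache_per_token(n_layers=60, n_heads=64, d_head=128)
mla = mla_cache_per_token(n_layers=60, d_latent=512, d_rope=64)
print(f"per-token KV cache: {baseline} B vs {mla} B "
      f"({1 - mla / baseline:.1%} smaller)")   # a reduction of the same order as 93.3%
```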

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA H800

Keywords

Large Language Models, Mixture-of-Experts, MLA, DeepSeekMoE, Transformer, Long context
