Domain
natural language processing
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3× bigger. We release all our models to the community.
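The abstract notes that the 2B and 9B models are trained with knowledge distillation rather than plain next-token prediction. Below is a minimal NumPy sketch of that idea, assuming a simple temperature-free formulation: the student is trained against the teacher's soft per-token distribution instead of a one-hot target. The function names and shapes are illustrative and are not taken from the Gemma 2 training code.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft distribution.

    Both arrays have shape (sequence_length, vocab_size). The one-hot
    next-token target of standard language modelling is replaced by the
    teacher's per-token probabilities, giving a richer signal per token.
    """
    teacher_probs = softmax(teacher_logits)
    student_log_probs = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))

# Toy usage with random logits for an 8-token sequence and a 32-word vocabulary.
rng = np.random.default_rng(0)
print(distillation_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```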
This paper introduces Gemma 2, an updated family of open language models from Google DeepMind that uses knowledge distillation and several Transformer architecture modifications to improve performance at a practical size. The models range from 2 billion to 27 billion parameters and are designed to be competitive with substantially larger models. The main focus is on the training recipe, including richer training objectives via knowledge distillation, which improves data efficiency, and on architecture changes such as interleaved local-global attention and Grouped-Query Attention. The paper also covers extensive evaluation across automated benchmarks and human studies, as well as safety measures against misuse. The results show improvements across evaluation domains, confirming Gemma 2's advances over its predecessors and its competitive standing against other large language models.
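Since the summary highlights interleaved local-global attention, the sketch below illustrates the general mechanism, assuming a strict every-other-layer alternation and a configurable sliding window: local layers restrict each token to a recent window of positions, while global layers attend over the whole prefix. The window size and layer schedule here are illustrative defaults, not necessarily Gemma 2's exact configuration.

```python
import numpy as np

def local_mask(seq_len, window):
    """Causal sliding-window mask: each token sees at most `window` recent tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_mask(seq_len):
    """Standard causal mask: each token sees the entire prefix."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def interleaved_masks(num_layers, seq_len, window=4096):
    """Alternate local (sliding-window) and global attention layer by layer."""
    return [
        local_mask(seq_len, window) if layer % 2 == 0 else global_mask(seq_len)
        for layer in range(num_layers)
    ]

# Example: 4 layers over a 6-token sequence with a toy window of 3 tokens.
masks = interleaved_masks(num_layers=4, seq_len=6, window=3)
print(masks[0].astype(int))  # local layer
print(masks[1].astype(int))  # global layer
```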
This paper employs the following methods:
- Transformer
- Knowledge Distillation
- Grouped-Query Attention (see the sketch after this list)
- Interleaved Local-Global Attention
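For Grouped-Query Attention, referenced in the list above, the following NumPy sketch shows the core mechanism: several query heads share a single key/value head, which shrinks the key/value cache relative to full multi-head attention. The head counts, shapes, and masking details are assumptions for illustration and do not reflect Gemma 2's actual head configuration.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where groups of query heads share one key/value head.

    q: (num_q_heads, seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim), with num_q_heads % num_kv_heads == 0.
    """
    num_q_heads, seq_len, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group_size = num_q_heads // num_kv_heads
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                        # KV head shared within this group
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        scores = np.where(future, -np.inf, scores)  # mask out future positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v[kv])
    return np.stack(outputs)                        # (num_q_heads, seq_len, head_dim)

# Toy usage: 8 query heads sharing 2 key/value heads.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
print(grouped_query_attention(q, k, v).shape)
```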
The paper reports the following results:
- Best performance among models of comparable size
- Competitive with models that are 2-3× larger
- Significant benchmark improvements over previous Gemma versions
The authors identified the following limitations:
- Evaluations cannot cover all applications and scenarios
- Further research is needed on factuality, robustness, reasoning, and alignment
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
- language models
- open models
- transformer modifications
- knowledge distillation
- safety and responsibility