Domain
natural language processing
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3× bigger. We release all our models to the community.
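The abstract notes that the 2B and 9B models are trained with knowledge distillation rather than plain next-token prediction. Below is a minimal NumPy sketch of that idea, assuming a simple temperature-free formulation: the student is trained against the teacher's soft per-token distribution instead of a one-hot target. The function names and shapes are illustrative and are not taken from the Gemma 2 training code.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft distribution.

    Both arrays have shape (sequence_length, vocab_size). The one-hot
    next-token target of standard language modelling is replaced by the
    teacher's per-token probabilities, giving a richer signal per token.
    """
    teacher_probs = softmax(teacher_logits)
    student_log_probs = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))

# Toy usage with random logits for an 8-token sequence and a 32-word vocabulary.
rng = np.random.default_rng(0)
print(distillation_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```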
This paper introduces Gemma 2, an updated family of open language models from Google DeepMind that uses knowledge distillation and several Transformer architecture modifications to improve performance at a practical size. The models range from 2 billion to 27 billion parameters and are designed to be competitive with substantially larger models. The main focus is on the training recipe, including richer training objectives via knowledge distillation, which improves data efficiency, and on architecture changes such as interleaved local-global attention and Grouped-Query Attention. The paper also covers extensive evaluation across automated benchmarks and human studies, as well as safety measures against misuse. The results show improvements across evaluation domains, confirming Gemma 2's advances over its predecessors and its competitive standing against other large language models.
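Since the summary highlights interleaved local-global attention, the sketch below illustrates the general mechanism, assuming a strict every-other-layer alternation and a configurable sliding window: local layers restrict each token to a recent window of positions, while global layers attend over the whole prefix. The window size and layer schedule here are illustrative defaults, not necessarily Gemma 2's exact configuration.

```python
import numpy as np

def local_mask(seq_len, window):
    """Causal sliding-window mask: each token sees at most `window` recent tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_mask(seq_len):
    """Standard causal mask: each token sees the entire prefix."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def interleaved_masks(num_layers, seq_len, window=4096):
    """Alternate local (sliding-window) and global attention layer by layer."""
    return [
        local_mask(seq_len, window) if layer % 2 == 0 else global_mask(seq_len)
        for layer in range(num_layers)
    ]

# Example: 4 layers over a 6-token sequence with a toy window of 3 tokens.
masks = interleaved_masks(num_layers=4, seq_len=6, window=3)
print(masks[0].astype(int))  # local layer
print(masks[1].astype(int))  # global layer
```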
This paper employs the following methods:
- Transformer
- Knowledge Distillation
- Grouped-Query Attention (see the sketch after this list)
- Interleaved Local-Global Attention
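For Grouped-Query Attention, referenced in the list above, the following NumPy sketch shows the core mechanism: several query heads share a single key/value head, which shrinks the key/value cache relative to full multi-head attention. The head counts, shapes, and masking details are assumptions for illustration and do not reflect Gemma 2's actual head configuration.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where groups of query heads share one key/value head.

    q: (num_q_heads, seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim), with num_q_heads % num_kv_heads == 0.
    """
    num_q_heads, seq_len, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group_size = num_q_heads // num_kv_heads
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                        # KV head shared within this group
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        scores = np.where(future, -np.inf, scores)  # mask out future positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v[kv])
    return np.stack(outputs)                        # (num_q_heads, seq_len, head_dim)

# Toy usage: 8 query heads sharing 2 key/value heads.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
print(grouped_query_attention(q, k, v).shape)
```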
The paper reports the following results:
- Best performance among models of comparable size
- Competitive with models that are 2-3× larger
- Significant benchmark improvements over previous Gemma versions
The authors identified the following limitations:
- Evaluations cannot cover all applications and scenarios
- Further research is needed on factuality, robustness, reasoning, and alignment
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
- language models
- open models
- transformer modifications
- knowledge distillation
- safety and responsibility