
Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avshalom Manevich, Nir Ratner, Noam Rozen, Erez Schwartz, Mor Zusman, Yoav Shoham (2024)

Paper Information

arXiv ID: 2403.19887
Venue: arXiv.org
Domain: Natural Language Processing
SOTA Claim: Yes
Reproducibility: 7/10

Abstract

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license. Model: https://huggingface.co/ai21labs/Jamba-v0.1
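
As a concrete illustration of the interleaving described above, the sketch below enumerates a Jamba-style layer schedule using the configuration the paper reports for Jamba-v0.1 (8-layer blocks, a 1:7 attention-to-Mamba ratio, and an MoE MLP in every second layer with 16 experts and top-2 routing). This is a minimal sketch, not the authors' code; the exact placement of the attention and MoE layers within each block is an assumption made for illustration.

```python
# Illustrative sketch (not the authors' code) of a Jamba-style layer schedule,
# based on the configuration reported in the paper: 8-layer Jamba blocks with a
# 1:7 attention-to-Mamba ratio, and an MoE MLP in every second layer.
# The positions of the attention layer and the MoE layers within a block are
# assumptions made for illustration.

def jamba_layer_schedule(num_blocks: int = 4,
                         layers_per_block: int = 8,
                         moe_every: int = 2) -> list:
    schedule = []
    for i in range(num_blocks * layers_per_block):
        # One attention layer per 8-layer block; the other seven are Mamba layers.
        mixer = "attention" if i % layers_per_block == layers_per_block // 2 else "mamba"
        # Every second layer swaps its dense MLP for an MoE MLP.
        mlp = "moe" if i % moe_every == 1 else "mlp"
        schedule.append(f"{mixer} + {mlp}")
    return schedule

for i, layer in enumerate(jamba_layer_schedule()):
    print(f"layer {i:2d}: {layer}")
```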

Summary

This paper presents Jamba, a large language model based on a hybrid Transformer-Mamba architecture. Jamba interleaves Transformer and Mamba layers and applies mixture-of-experts (MoE) to some of them, increasing model capacity while keeping the number of active parameters manageable. The model offers high throughput and a small memory footprint, and supports a context length of up to 256K tokens, the longest among production-grade publicly available models at the time of release. The authors conduct extensive ablation experiments to analyze design choices and their impact across various benchmarks, showing that the hybrid architecture outperforms pure Transformer and pure Mamba models of comparable size. Jamba is released under a permissive open license to foster further research and experimentation with hybrid models.
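
The trade-off between total capacity and active parameters is easiest to see in code: with top-K routing, all experts contribute parameters to the model, but only K of them run for any given token. Below is a minimal top-2 MoE MLP sketch in PyTorch; the layer dimensions are illustrative and this is not the paper's implementation (Jamba reportedly uses 16 experts with top-2 routing in its MoE layers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-K routed MoE MLP: only K of the N expert MLPs run per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

# 16 expert MLPs are stored, but each token only activates 2 of them.
moe = TopKMoE()
tokens = torch.randn(10, 512)
print(moe(tokens).shape)   # torch.Size([10, 512])
```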

Methods

This paper employs the following methods:

  • Transformer
  • Mamba
  • Mixture-of-Experts (MoE)

Models Used

  • Jamba
  • Mixtral-8x7B
  • Llama-2 70B

Datasets

The following datasets were used in this research:

  • HellaSwag
  • WinoGrande
  • ARC-E
  • ARC-Challenge
  • PIQA
  • BoolQ
  • QuAC
  • GSM8K
  • HumanEval
  • Natural Questions
  • TruthfulQA
  • MMLU
  • BBH
  • NarrativeQA
  • LongFQA
  • CUAD
  • Banking77
  • TREC-Fine
  • NLU Intent

Evaluation Metrics

  • F1
  • True F1

Results

  • Fits in a single 80GB GPU in the implemented configuration
  • High throughput and small memory footprint compared to vanilla Transformers (see the KV-cache sketch below)
  • State-of-the-art performance on standard language model benchmarks
  • Strong long-context performance for up to 256K tokens
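
As flagged in the list above, a rough back-of-envelope calculation shows why having attention in only one of every eight layers shrinks the KV cache so much at long contexts. The dimensions below (32 layers, 8 KV heads of size 128, 16-bit cache) are illustrative assumptions, not Jamba's published hyperparameters.

```python
def kv_cache_gib(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_el=2):
    """Approximate KV-cache size: one K and one V tensor per attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 2**30

SEQ = 256 * 1024  # 256K-token context

# All-attention baseline: every one of 32 layers keeps a KV cache.
full = kv_cache_gib(SEQ, n_attn_layers=32, n_kv_heads=8, head_dim=128)
# 1:7 hybrid: only 4 of the 32 layers are attention layers.
hybrid = kv_cache_gib(SEQ, n_attn_layers=4, n_kv_heads=8, head_dim=128)

print(f"all-attention KV cache: {full:.0f} GiB")   # 32 GiB
print(f"1:7 hybrid KV cache:    {hybrid:.0f} GiB")  #  4 GiB
```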

Limitations

The authors identified the following limitations:

  • The released model was not aligned or instruction-tuned
  • Evaluations cover only the pretrained base model; no additional adaptation was applied

Technical Requirements

  • Number of GPUs: 1
  • GPU Type: NVIDIA A100 80GB

Keywords

Transformer, Mamba, Mixture-of-Experts, Long Contexts, Language Models

Papers Using Similar Methods

External Resources

  • Model weights: https://huggingface.co/ai21labs/Jamba-v0.1
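
A minimal loading sketch for the released checkpoint, assuming a recent Hugging Face transformers release with Jamba support (older versions may require trust_remote_code=True), the accelerate package, and sufficient GPU memory; the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 16-bit weights; the MoE model has ~52B total parameters
    device_map="auto",           # requires `accelerate`; shards weights across available GPUs
)

prompt = "A hybrid Transformer-Mamba language model can"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```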