Venue
International Conference on Machine Learning
Domain
machine learning, natural language processing
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer. However, such approaches typically underperform training with full-rank weights in both the pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics; further, they may require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies. Code is provided in the link.
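To make the core idea concrete, the sketch below applies gradient low-rank projection to a single weight matrix: the gradient is projected onto a low-rank subspace obtained from its SVD, Adam-style moments are kept in that smaller subspace, and the normalized update is projected back to full rank before being applied. This is a minimal illustration under assumed toy dimensions, rank, learning rate, and refresh interval (`update_proj_gap`), not the authors' released implementation.

```python
# Minimal sketch of gradient low-rank projection for one weight matrix.
# Dimensions, rank, and hyperparameters are illustrative assumptions.
import torch

torch.manual_seed(0)

m, n = 1024, 4096        # toy weight matrix shape
rank = 128               # projection rank r << min(m, n)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
update_proj_gap = 200    # recompute the subspace every T steps (assumed value)

W = torch.randn(m, n) * 0.02
# Adam-style moments live in the r x n projected space rather than the full
# m x n space, which is where the optimizer-state memory saving comes from.
M = torch.zeros(rank, n)
V = torch.zeros(rank, n)
P = None                 # current projection matrix (m x r)

def toy_grad(W):
    """Stand-in gradient; a real run would use autograd on the training loss."""
    return torch.randn_like(W)

for step in range(1, 1001):
    G = toy_grad(W)

    # Periodically refresh the low-rank subspace from the current gradient's SVD.
    if P is None or step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        P = U[:, :rank]                      # m x r orthonormal basis

    R = P.T @ G                              # project the gradient: r x n

    # Adam update carried out entirely in the low-rank subspace.
    M = beta1 * M + (1 - beta1) * R
    V = beta2 * V + (1 - beta2) * R**2
    m_hat = M / (1 - beta1**step)
    v_hat = V / (1 - beta2**step)
    N = m_hat / (v_hat.sqrt() + eps)

    # Project the normalized update back to full rank and apply it.
    W -= lr * (P @ N)
```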
The paper presents GaLore, a training strategy for large language models (LLMs) that addresses memory challenges during training via Gradient Low-Rank Projection. The method maintains full-parameter learning while being more memory-efficient than traditional low-rank adaptation methods such as LoRA, reducing optimizer states by up to 65.5% without compromising performance on either pre-training or fine-tuning tasks. The authors demonstrate the effectiveness of GaLore through experiments on LLaMA and RoBERTa models, showing results comparable or superior to existing methods. Notably, GaLore makes it feasible to train a 7B model on consumer-grade GPUs with limited memory, paving the way for efficient model training on low-memory hardware.
This paper employs the following methods:
- Gradient Low-Rank Projection (GaLore)
The following datasets were used in this research:
- C4
- GLUE
The following metrics were used for evaluation:
- Perplexity
- F1-score
- Exact Match
- GLUE score
The paper reports the following key results:
- Reduces memory usage by up to 65.5% in optimizer states
- Enables training of a 7B model on consumer GPUs with 24GB memory
- Outperforms LoRA on fine-tuning RoBERTa on GLUE tasks, scoring 85.89 compared to LoRA's 85.61
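For intuition on where the optimizer-state savings come from, the back-of-envelope sketch below compares full-rank Adam states (two moment tensors per weight matrix) with projected states (two rank-r moments plus one projection matrix). The layer shapes, rank, and element size are assumptions chosen for illustration; the script is not meant to reproduce the paper's exact 65.5% figure.

```python
# Rough estimate of optimizer-state memory with and without gradient
# low-rank projection. Shapes and rank are illustrative assumptions.

def adam_state_bytes(shapes, bytes_per_elem=4):
    """Full-rank Adam keeps two m x n moment tensors per weight matrix."""
    return sum(2 * m * n * bytes_per_elem for m, n in shapes)

def projected_state_bytes(shapes, rank, bytes_per_elem=4):
    """Projected Adam keeps two r x n moments plus one m x r projection per matrix."""
    total = 0
    for m, n in shapes:
        r = min(rank, m, n)
        total += (2 * r * n + m * r) * bytes_per_elem
    return total

# A few transformer-style weight shapes (assumed, roughly LLaMA-like widths).
shapes = [(4096, 4096)] * 4 + [(4096, 11008), (11008, 4096)]
full = adam_state_bytes(shapes)
low = projected_state_bytes(shapes, rank=512)
print(f"full-rank Adam states : {full / 2**20:.1f} MiB")
print(f"projected Adam states : {low / 2**20:.1f} MiB")
print(f"reduction             : {100 * (1 - low / full):.1f}%")
```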
The authors identified the following limitations:
- The impact of subspace switching frequency on convergence needs further exploration.
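One way to probe this open question is a small ablation over the subspace refresh interval on a toy objective, as sketched below. The least-squares problem, dimensions, rank, and gap values are all assumptions made for illustration; behavior on a toy problem does not settle the question at LLM scale.

```python
# Toy ablation of subspace switching frequency: optimize a random least-squares
# objective with projected Adam while refreshing the projection every T steps.
import torch

def run(update_proj_gap, steps=500, m=512, n=256, rank=32, lr=1e-2, seed=0):
    torch.manual_seed(seed)
    A = torch.randn(m, n)
    target = torch.randn(m, n)
    W = torch.zeros(n, n)                      # weight being trained
    Mom = torch.zeros(rank, n)                 # Adam moments in the r x n subspace
    Var = torch.zeros(rank, n)
    P = None
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for step in range(1, steps + 1):
        G = A.T @ (A @ W - target) / m         # gradient of 0.5*||A W - target||^2 / m
        if P is None or step % update_proj_gap == 0:
            U, _, _ = torch.linalg.svd(G, full_matrices=False)
            P = U[:, :rank]                    # n x r basis from the current gradient
        R = P.T @ G                            # projected gradient, r x n
        Mom = beta1 * Mom + (1 - beta1) * R
        Var = beta2 * Var + (1 - beta2) * R**2
        N = (Mom / (1 - beta1**step)) / ((Var / (1 - beta2**step)).sqrt() + eps)
        W -= lr * (P @ N)                      # project the update back to full rank
    return 0.5 * ((A @ W - target) ** 2).mean().item()

for gap in (50, 200, 1000):
    print(f"update_proj_gap={gap:4d}  final loss={run(gap):.4f}")
```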
Experiments were run on the following hardware:
- Number of GPUs: 64
- GPU Type: NVIDIA A100
Keywords
LLM training, memory efficiency, gradient low-rank projection, GaLore, LoRA, optimizer memory reduction