Venue
International Conference on Machine Learning
Domain
machine learning, natural language processing
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer. However, such approaches typically underperform training with full-rank weights in both the pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics; further, they may require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies. Code is provided in the link.
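To make the core idea concrete, the sketch below applies gradient low-rank projection to a single weight matrix: the gradient is projected onto a low-rank subspace obtained from its SVD, Adam-style moments are kept in that smaller subspace, and the normalized update is projected back to full rank before being applied. This is a minimal illustration under assumed toy dimensions, rank, learning rate, and refresh interval (`update_proj_gap`), not the authors' released implementation.

```python
# Minimal sketch of gradient low-rank projection for one weight matrix.
# Dimensions, rank, and hyperparameters are illustrative assumptions.
import torch

torch.manual_seed(0)

m, n = 1024, 4096        # toy weight matrix shape
rank = 128               # projection rank r << min(m, n)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
update_proj_gap = 200    # recompute the subspace every T steps (assumed value)

W = torch.randn(m, n) * 0.02
# Adam-style moments live in the r x n projected space rather than the full
# m x n space, which is where the optimizer-state memory saving comes from.
M = torch.zeros(rank, n)
V = torch.zeros(rank, n)
P = None                 # current projection matrix (m x r)

def toy_grad(W):
    """Stand-in gradient; a real run would use autograd on the training loss."""
    return torch.randn_like(W)

for step in range(1, 1001):
    G = toy_grad(W)

    # Periodically refresh the low-rank subspace from the current gradient's SVD.
    if P is None or step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        P = U[:, :rank]                      # m x r orthonormal basis

    R = P.T @ G                              # project the gradient: r x n

    # Adam update carried out entirely in the low-rank subspace.
    M = beta1 * M + (1 - beta1) * R
    V = beta2 * V + (1 - beta2) * R**2
    m_hat = M / (1 - beta1**step)
    v_hat = V / (1 - beta2**step)
    N = m_hat / (v_hat.sqrt() + eps)

    # Project the normalized update back to full rank and apply it.
    W -= lr * (P @ N)
```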
The paper presents GaLore, a training strategy for large language models (LLMs) that addresses memory challenges during training via Gradient Low-Rank Projection. The method maintains full-parameter learning while being more memory-efficient than traditional low-rank adaptation methods such as LoRA, reducing optimizer states by up to 65.5% without compromising performance on either pre-training or fine-tuning tasks. The authors demonstrate the effectiveness of GaLore through experiments on LLaMA and RoBERTa models, showing results comparable or superior to existing methods. Notably, GaLore makes it feasible to train a 7B model on consumer-grade GPUs with limited memory, paving the way for efficient model training on low-memory hardware.
This paper employs the following methods:
- Gradient Low-Rank Projection (GaLore)
The following datasets were used in this research:
- C4
- GLUE
The following metrics were used for evaluation:
- Perplexity
- F1-score
- Exact Match
- GLUE score
The paper reports the following key results:
- Reduces memory usage by up to 65.5% in optimizer states
- Enables training of a 7B model on consumer GPUs with 24GB memory
- Outperforms LoRA on fine-tuning RoBERTa on GLUE tasks, scoring 85.89 compared to LoRA's 85.61
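For intuition on where the optimizer-state savings come from, the back-of-envelope sketch below compares full-rank Adam states (two moment tensors per weight matrix) with projected states (two rank-r moments plus one projection matrix). The layer shapes, rank, and element size are assumptions chosen for illustration; the script is not meant to reproduce the paper's exact 65.5% figure.

```python
# Rough estimate of optimizer-state memory with and without gradient
# low-rank projection. Shapes and rank are illustrative assumptions.

def adam_state_bytes(shapes, bytes_per_elem=4):
    """Full-rank Adam keeps two m x n moment tensors per weight matrix."""
    return sum(2 * m * n * bytes_per_elem for m, n in shapes)

def projected_state_bytes(shapes, rank, bytes_per_elem=4):
    """Projected Adam keeps two r x n moments plus one m x r projection per matrix."""
    total = 0
    for m, n in shapes:
        r = min(rank, m, n)
        total += (2 * r * n + m * r) * bytes_per_elem
    return total

# A few transformer-style weight shapes (assumed, roughly LLaMA-like widths).
shapes = [(4096, 4096)] * 4 + [(4096, 11008), (11008, 4096)]
full = adam_state_bytes(shapes)
low = projected_state_bytes(shapes, rank=512)
print(f"full-rank Adam states : {full / 2**20:.1f} MiB")
print(f"projected Adam states : {low / 2**20:.1f} MiB")
print(f"reduction             : {100 * (1 - low / full):.1f}%")
```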
The authors identified the following limitations:
- The impact of subspace switching frequency on convergence needs further exploration.
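One way to probe this open question is a small ablation over the subspace refresh interval on a toy objective, as sketched below. The least-squares problem, dimensions, rank, and gap values are all assumptions made for illustration; behavior on a toy problem does not settle the question at LLM scale.

```python
# Toy ablation of subspace switching frequency: optimize a random least-squares
# objective with projected Adam while refreshing the projection every T steps.
import torch

def run(update_proj_gap, steps=500, m=512, n=256, rank=32, lr=1e-2, seed=0):
    torch.manual_seed(seed)
    A = torch.randn(m, n)
    target = torch.randn(m, n)
    W = torch.zeros(n, n)                      # weight being trained
    Mom = torch.zeros(rank, n)                 # Adam moments in the r x n subspace
    Var = torch.zeros(rank, n)
    P = None
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for step in range(1, steps + 1):
        G = A.T @ (A @ W - target) / m         # gradient of 0.5*||A W - target||^2 / m
        if P is None or step % update_proj_gap == 0:
            U, _, _ = torch.linalg.svd(G, full_matrices=False)
            P = U[:, :rank]                    # n x r basis from the current gradient
        R = P.T @ G                            # projected gradient, r x n
        Mom = beta1 * Mom + (1 - beta1) * R
        Var = beta2 * Var + (1 - beta2) * R**2
        N = (Mom / (1 - beta1**step)) / ((Var / (1 - beta2**step)).sqrt() + eps)
        W -= lr * (P @ N)                      # project the update back to full rank
    return 0.5 * ((A @ W - target) ** 2).mean().item()

for gap in (50, 200, 1000):
    print(f"update_proj_gap={gap:4d}  final loss={run(gap):.4f}")
```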
Experiments were run on the following hardware:
- Number of GPUs: 64
- GPU Type: NVIDIA A100
Keywords
LLM training, memory efficiency, gradient low-rank projection, GaLore, LoRA, optimizer memory reduction