QLORA: Efficient Finetuning of Quantized LLMs

Tim Dettmers (University of Washington), Artidoro Pagnoni (University of Washington), Ari Holtzman (University of Washington), Luke Zettlemoyer (University of Washington) (2023)

Paper Information

arXiv ID

2305.14314

Venue

Neural Information Processing Systems

Domain

natural language processing

SOTA Claim

Yes

Code

Available

Reproducibility

8/10

Contents

Abstract
Methods
Datasets
Results
Limitations
Related Work
External Resources

Abstract

We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLORA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLORA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLORA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

Summary

QLORA is an efficient finetuning method for quantized large language models (LLMs) that makes it possible to finetune a 65B parameter model on a single 48GB GPU. Gradients are backpropagated through a frozen, 4-bit quantized pretrained model into Low Rank Adapters (LoRA), and memory use is reduced further by three innovations: the 4-bit NormalFloat (NF4) data type, Double Quantization of the quantization constants, and Paged Optimizers that absorb memory spikes. The resulting Guanaco model family outperforms all previously openly released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level after 24 hours of finetuning on a single GPU. The authors finetune more than 1,000 models across 8 instruction datasets, several model types, and several model scales, and find that a small, high-quality dataset can yield state-of-the-art results even with smaller models than the previous state of the art. They analyze instruction-following and chatbot performance using both human and GPT-4 evaluations, conclude that GPT-4 evaluation is a cheap and reasonable alternative to human evaluation, and caution that current chatbot benchmarks are not trustworthy measures of chatbot performance. All models and code are released, including CUDA kernels for 4-bit training.
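The saving from Double Quantization can be checked with a little arithmetic. The sketch below assumes a configuration that this page does not state: one 32-bit constant per block of 64 weights, then a second level that stores those constants in 8 bits with one 32-bit constant per block of 256 of them. The function names are hypothetical.

```python
def constant_overhead_bits(weight_block=64, const_bits=32):
    """Bits per parameter spent on one scaling constant per weight block."""
    return const_bits / weight_block


def double_quant_overhead_bits(weight_block=64, first_const_bits=8,
                               second_block=256, second_const_bits=32):
    """Overhead when the first-level constants are themselves quantized in blocks."""
    first_level = first_const_bits / weight_block
    second_level = second_const_bits / (weight_block * second_block)
    return first_level + second_level


plain = constant_overhead_bits()        # single-level constants
double = double_quant_overhead_bits()   # constants quantized a second time
saved_bits = plain - double             # saving in bits per parameter

# Saving in gigabytes for a 65B parameter model (assumed size, as in the abstract).
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9

print(f"{plain:.3f} -> {double:.3f} bits/param, saving {saved_bits:.3f}")
print(f"about {saved_gb_65b:.2f} GB saved at 65B parameters")
```

Under these assumed block sizes the overhead drops from 0.5 to roughly 0.127 bits per parameter, which amounts to about 3 GB for a 65B parameter model.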

Methods

This paper employs the following methods:

QLORA
Low Rank Adapters (LoRA)
4-bit NormalFloat (NF4)
Double Quantization
Paged Optimizers
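To show how a frozen 4-bit base model and trainable adapters fit together, here is a minimal NumPy sketch. It is not the authors' implementation: the codebook is a simplified set of normal quantiles rather than the exact NF4 table, the block size of 64 is an assumption, and all function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def make_codebook(levels=16):
    """Normal-quantile codebook scaled to [-1, 1] (simplified, not the exact NF4 table)."""
    probs = np.linspace(0.0, 1.0, levels + 2)[1:-1]  # drop 0 and 1 (infinite quantiles)
    quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])
    return quantiles / np.abs(quantiles).max()


def quantize_blockwise(w, codebook, block=64):
    """Return 4-bit codes and one absmax scale per block of `block` weights."""
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales)  # avoid dividing by zero
    normed = blocks / scales
    # The nearest codebook entry for every weight becomes its 4-bit code.
    codes = np.abs(normed[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, scales


def dequantize_blockwise(codes, scales, codebook, shape):
    """Look up each code and rescale by its block's absmax constant."""
    return (codebook[codes] * scales).reshape(shape)


def qlora_forward(x, codes, scales, codebook, w_shape, lora_a, lora_b, scaling):
    """Frozen dequantized base projection plus a trainable low-rank update."""
    w = dequantize_blockwise(codes, scales, codebook, w_shape)  # never updated
    return x @ w + scaling * (x @ lora_a @ lora_b)  # only lora_a, lora_b would train


rng = np.random.default_rng(0)
d_in, d_out, rank = 128, 64, 8
w = rng.normal(0.0, 0.02, size=(d_in, d_out))
codebook = make_codebook()
codes, scales = quantize_blockwise(w, codebook)
w_hat = dequantize_blockwise(codes, scales, codebook, w.shape)

lora_a = rng.normal(0.0, 0.01, size=(d_in, rank))
lora_b = np.zeros((rank, d_out))  # zero init: the adapter starts as a no-op
x = rng.normal(size=(4, d_in))
y = qlora_forward(x, codes, scales, codebook, w.shape, lora_a, lora_b, scaling=2.0)
```

In this sketch only `lora_a` and `lora_b` would receive gradient updates; the codes and scales stay fixed, which is what keeps the base model's memory footprint at roughly 4 bits per weight.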

Models Used

Guanaco
LLaMA
T5

Datasets

The following datasets were used in this research:

Vicuna
OASST1
FLAN v2

Evaluation Metrics

MMLU
Elo
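For readers unfamiliar with Elo, the sketch below applies the standard rating update to one pairwise comparison between two chatbots. The starting rating of 1000, the K-factor of 32, and the model names are assumptions chosen for illustration; this page does not state the settings used.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one match; score_a is 1 (win), 0.5 (tie) or 0 (loss)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


# Hypothetical models that both start at the same rating.
ratings = {"model_x": 1000.0, "model_y": 1000.0}

# A judge (human or GPT-4) prefers model_x's response in one comparison.
ratings["model_x"], ratings["model_y"] = update_elo(
    ratings["model_x"], ratings["model_y"], score_a=1.0
)
print(ratings)
```

Because the result depends on the order in which matches are processed, tournament-style evaluations typically average ratings over many random orderings of the comparisons.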

Results

Guanaco outperforms all previously openly released models on the Vicuna benchmark
Guanaco reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark after 24 hours of finetuning on a single GPU

Limitations

The authors identified the following limitations:

Not established whether QLORA matches full 16-bit finetuning performance at the 33B and 65B scales
Lack of comprehensive evaluation across multiple benchmarks

Technical Requirements

Number of GPUs: 1
GPU Type: 48GB GPU
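A rough calculation shows why 4-bit storage matters for the single 48GB GPU requirement. The sketch counts base model weights only; adapters, activations, optimizer state, and quantization constants are deliberately left out, and the helper name is hypothetical.

```python
def weight_gb(num_params, bits_per_param):
    """Storage for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 65e9                      # 65B parameter model
fp16_gb = weight_gb(params, 16)    # 16-bit weights
nf4_gb = weight_gb(params, 4)      # 4-bit quantized weights

gpu_gb = 48
print(f"16-bit weights: {fp16_gb:.1f} GB, fits on {gpu_gb} GB GPU: {fp16_gb < gpu_gb}")
print(f" 4-bit weights: {nf4_gb:.1f} GB, fits on {gpu_gb} GB GPU: {nf4_gb < gpu_gb}")
```

The 16-bit weights alone exceed the GPU's memory, while the 4-bit weights leave headroom for the adapters and the training state.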

Keywords

quantization LLMs finetuning Low Rank Adapters efficient training

External Resources

Funding: Not specified
References: 72
Influential Citations: 254