
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, Chelsea Finn (Stanford University; ‡ CZ Biohub) (2023)

Paper Information

arXiv ID

2305.18290

Venue

Neural Information Processing Systems

Domain

Natural Language Processing

SOTA Claim

Yes

Reproducibility

7/10

Contents

Abstract
Methods
Datasets
Results
Limitations
Related Work
External Resources

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Summary

This paper presents Direct Preference Optimization (DPO), an approach to fine-tuning large language models (LMs) on human preferences without the reinforcement learning stage of reinforcement learning from human feedback (RLHF). Instead of fitting a separate reward model and then optimizing against it, DPO treats the policy itself as an implicit reward model and trains it directly on preference pairs with a binary cross-entropy loss. This removes the need to sample from the LM during fine-tuning and reduces the amount of hyperparameter tuning required. The authors report that DPO matches or surpasses existing methods such as PPO-based RLHF on sentiment control, summarization, and single-turn dialogue, as measured by win rates against several baselines, including human-written responses.
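For reference, the binary cross-entropy loss mentioned above is usually written as follows. The symbols are not defined on this page and follow the paper's notation as commonly cited: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is the frozen reference model, (x, y_w, y_l) is a prompt with its preferred and dispreferred responses drawn from the preference dataset \mathcal{D}, \sigma is the logistic function, and \beta controls how strongly the policy is kept close to the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Each term \beta \log(\pi_\theta / \pi_{\mathrm{ref}}) plays the role of an implicit reward, so the loss is a logistic classification loss on the reward difference between the preferred and dispreferred response.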

Methods

This paper employs the following methods:

Direct Preference Optimization (DPO)
PPO
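A minimal sketch of the per-example DPO loss, assuming the summed log-probabilities of each response under the policy and the frozen reference model have already been computed. The function names, argument names, and the default beta=0.1 are illustrative assumptions, not the authors' code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; treat as a tunable assumption
) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair.

    Each argument is the log-probability of a full response given the prompt,
    summed over tokens, under the policy or the reference model.
    """
    # Log-ratios of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled difference of implicit rewards between chosen and rejected.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Binary cross-entropy with the chosen response as the positive class.
    return -log_sigmoid(margin)


def dpo_batch_loss(pairs, beta: float = 0.1) -> float:
    """Mean DPO loss over an iterable of 4-tuples of log-probabilities."""
    losses = [dpo_loss(*p, beta=beta) for p in pairs]
    return sum(losses) / len(losses)
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference.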

Models Used

GPT-2
GPT-J
Pythia-2.8B

Datasets

The following datasets were used in this research:

IMDB
Reddit TL;DR
Anthropic Helpful and Harmless dialogue

Evaluation Metrics

ROUGE
win rate against baseline
KL-divergence
binary cross-entropy loss
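Two of the metrics listed above reduce to short computations. The sketch below is one plausible way to compute them, not the paper's evaluation code: it assumes win rate is the fraction of pairwise comparisons a judge decides in favor of the policy, and it uses a simple Monte Carlo estimate of the KL divergence from the reference model based on responses sampled from the policy. All function and argument names are hypothetical.

```python
from typing import Sequence


def win_rate(policy_wins: Sequence[bool]) -> float:
    """Fraction of pairwise comparisons won by the policy.

    policy_wins[i] is True when the judge preferred the policy's response
    over the baseline's response on example i.
    """
    if not policy_wins:
        raise ValueError("need at least one comparison")
    return sum(policy_wins) / len(policy_wins)


def kl_estimate(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
) -> float:
    """Monte Carlo estimate of KL(policy || reference).

    Both sequences hold sequence-level log-probabilities of the same
    responses, which are assumed to be sampled from the policy.
    """
    if len(policy_logps) != len(ref_logps) or not policy_logps:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - r for p, r in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)
```

For example, `win_rate([True, True, False, True])` returns 0.75.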

Results

DPO is stable, performant, and computationally lightweight
DPO exceeds PPO-based RLHF in ability to control sentiment of generations
DPO matches or improves response quality in summarization and single-turn dialogue

Limitations

The authors identified the following limitations:

The generalization of DPO policies out of distribution needs more comprehensive study.
Exploration of scaling DPO to larger models is needed.

Technical Requirements

Number of GPUs: None specified
GPU Type: None specified

Keywords

Preference Optimization
Reward Model
Reinforcement Learning from Human Feedback
RLHF
Preference-based Reinforcement Learning
Language Models
Fine-tuning
Alignment

Papers Using Similar Methods

Deep Reinforcement Learning that Matters (2017)
Mixtral of Experts (2024)
Qwen2 Technical Report (2024)
Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning (2021)
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback (2023)

External Resources

Funding: Knight-Hennessy Graduate Fellowship; Stanford Accelerator for Learning (SAL); Stanford Institute for Human-Centered Artificial Intelligence (HAI); Stanford Center for Research on Foundation Models (CRFM); ONR grant N00014-20-1-2675
References: 59
Influential Citations: 831
