Venue
Neural Information Processing Systems
Domain
Natural Language Processing, Machine Learning
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its recent variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes. Code and models can be found at https://github.com/princeton-nlp/SimPO.
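Based on the abstract's description, the implicit reward is the length-normalized (average) log probability of a response, and a target margin is added inside the Bradley-Terry objective. A sketch of the resulting loss, writing β for the reward scale and γ for the target margin (notation assumed here, not quoted from the paper):

```latex
\mathcal{L}_{\text{SimPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
      \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
      \;-\; \gamma
    \right)
  \right]
```

Note that no reference policy appears in this objective, which is the source of the claimed compute and memory savings over DPO.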
The paper presents SimPO (Simple Preference Optimization), an offline preference optimization algorithm designed to improve the alignment of large language models (LLMs) using a reference-free reward. SimPO modifies Direct Preference Optimization (DPO) by using the average log probability of a generated sequence as the implicit reward, which the authors argue aligns more closely with the metric that guides generation, improving both performance and efficiency. They also introduce a target reward margin within the Bradley-Terry framework to enforce a larger separation between winning and losing responses. Applied to a range of models, SimPO delivers substantial gains on benchmarks such as AlpacaEval 2 and Arena-Hard, outperforming DPO by up to 6.4 and 7.5 points, respectively. The paper analyzes performance across training setups, emphasizes the importance of length normalization and the target reward margin, and compares SimPO against other ranking-based methods to demonstrate its robustness in preference optimization tasks.
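A minimal PyTorch sketch of such a reference-free, length-normalized preference loss is shown below; the function name, tensor layout, and default hyperparameter values are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lengths, rejected_lengths,
               beta=2.0, gamma=0.5):
    """Reference-free preference loss with length-normalized implicit rewards.

    chosen_logps / rejected_logps: summed token log-probabilities of the
        preferred / dispreferred responses under the policy being trained.
    chosen_lengths / rejected_lengths: token counts used for length normalization.
    beta, gamma: reward scale and target reward margin (hyperparameters).
    """
    # Implicit reward = average per-token log-probability, scaled by beta.
    chosen_reward = beta * chosen_logps / chosen_lengths
    rejected_reward = beta * rejected_logps / rejected_lengths
    # Bradley-Terry objective with a target margin gamma; no reference model.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Example usage with dummy values (batch of 2 preference pairs):
chosen_logps = torch.tensor([-40.0, -55.0])
rejected_logps = torch.tensor([-70.0, -80.0])
chosen_lengths = torch.tensor([20.0, 25.0])
rejected_lengths = torch.tensor([30.0, 28.0])
print(simpo_loss(chosen_logps, rejected_logps, chosen_lengths, rejected_lengths).item())
```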
This paper employs the following methods (see the sketch after this list):
- Direct Preference Optimization (DPO)
- Bradley-Terry model
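For contrast with SimPO's reference-free formulation, the standard Bradley-Terry preference model and the DPO objective built on it can be sketched as follows, where π_ref is a frozen reference policy and β a scaling hyperparameter:

```latex
p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```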
The following datasets were used in this research:
- UltraChat-200k
- UltraFeedback
The following benchmarks were used for evaluation:
- AlpacaEval 2
- MT-Bench
- Arena-Hard
The following evaluation metrics were reported:
- Length-controlled win rate
- Raw win rate
- KL divergence
The paper reports the following key results:
- SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2
- SimPO achieves a 59.1% win rate on Arena-Hard
- SimPO ranks 1st on Chatbot Arena among <10B models
The authors identified the following limitations:
- Requires more in-depth theoretical analysis
- Safety and honesty considerations are not explicitly handled
- Performance drop on reasoning-heavy tasks like GSM8K
The experiments used the following compute resources:
- Number of GPUs: 8
- GPU Type: NVIDIA H100
The paper covers the following topics:
- Preference Optimization
- RLHF
- Language Model Alignment
- Reward Function
- Reference-Free Reward