Venue
Neural Information Processing Systems
Domain
Natural Language Processing, Machine Learning
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its recent variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes. Code and models can be found at https://github.com/princeton-nlp/SimPO.
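Based on the abstract's description, the implicit reward is the length-normalized (average) log probability of a response, and a target margin is added inside the Bradley-Terry objective. A sketch of the resulting loss, writing β for the reward scale and γ for the target margin (notation assumed here, not quoted from the paper):

```latex
\mathcal{L}_{\text{SimPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
      \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
      \;-\; \gamma
    \right)
  \right]
```

Note that no reference policy appears in this objective, which is the source of the claimed compute and memory savings over DPO.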
The paper presents SimPO (Simple Preference Optimization), an offline preference optimization algorithm designed to improve the alignment of large language models (LLMs) using a reference-free reward. SimPO modifies Direct Preference Optimization (DPO) by using the average log probability of a generated sequence as the implicit reward, which the authors argue aligns more closely with the metric that guides generation, improving both performance and efficiency. They also introduce a target reward margin within the Bradley-Terry framework to enforce a larger separation between winning and losing responses. Applied to a range of models, SimPO delivers substantial gains on benchmarks such as AlpacaEval 2 and Arena-Hard, outperforming DPO by up to 6.4 and 7.5 points, respectively. The paper analyzes performance across training setups, emphasizes the importance of length normalization and the target reward margin, and compares SimPO against other ranking-based methods to demonstrate its robustness in preference optimization tasks.
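A minimal PyTorch sketch of such a reference-free, length-normalized preference loss is shown below; the function name, tensor layout, and default hyperparameter values are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lengths, rejected_lengths,
               beta=2.0, gamma=0.5):
    """Reference-free preference loss with length-normalized implicit rewards.

    chosen_logps / rejected_logps: summed token log-probabilities of the
        preferred / dispreferred responses under the policy being trained.
    chosen_lengths / rejected_lengths: token counts used for length normalization.
    beta, gamma: reward scale and target reward margin (hyperparameters).
    """
    # Implicit reward = average per-token log-probability, scaled by beta.
    chosen_reward = beta * chosen_logps / chosen_lengths
    rejected_reward = beta * rejected_logps / rejected_lengths
    # Bradley-Terry objective with a target margin gamma; no reference model.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Example usage with dummy values (batch of 2 preference pairs):
chosen_logps = torch.tensor([-40.0, -55.0])
rejected_logps = torch.tensor([-70.0, -80.0])
chosen_lengths = torch.tensor([20.0, 25.0])
rejected_lengths = torch.tensor([30.0, 28.0])
print(simpo_loss(chosen_logps, rejected_logps, chosen_lengths, rejected_lengths).item())
```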
This paper employs the following methods (see the sketch after this list):
- Direct Preference Optimization (DPO)
- Bradley-Terry model
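For contrast with SimPO's reference-free formulation, the standard Bradley-Terry preference model and the DPO objective built on it can be sketched as follows, where π_ref is a frozen reference policy and β a scaling hyperparameter:

```latex
p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```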
The following datasets were used in this research:
- UltraChat-200k
- UltraFeedback
The following benchmarks were used for evaluation:
- AlpacaEval 2
- MT-Bench
- Arena-Hard
The following evaluation metrics were reported:
- Length-controlled win rate
- Raw win rate
- KL divergence
The paper reports the following key results:
- SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2
- SimPO achieves a 59.1% win rate on Arena-Hard
- SimPO ranks 1st on Chatbot Arena among <10B models
The authors identified the following limitations:
- Requires more in-depth theoretical analysis
- Safety and honesty considerations are not explicitly handled
- Performance drop on reasoning-heavy tasks like GSM8K
The experiments used the following compute resources:
- Number of GPUs: 8
- GPU Type: NVIDIA H100
The paper covers the following topics:
- Preference Optimization
- RLHF
- Language Model Alignment
- Reward Function
- Reference-Free Reward