
SimPO: Simple Preference Optimization with a Reference-Free Reward

Yu Meng (Computer Science Department, University of Virginia); Mengzhou Xia (Princeton Language and Intelligence (PLI), Princeton University); Danqi Chen (Princeton Language and Intelligence (PLI), Princeton University). 2024.

Paper Information
arXiv ID
2405.14734
Venue
Neural Information Processing Systems
Domain
Natural Language Processing, Machine Learning
SOTA Claim
Yes
Code
https://github.com/princeton-nlp/SimPO
Reproducibility
8/10

Abstract

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its recent variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes. (*Equal contribution. Code and models are available at https://github.com/princeton-nlp/SimPO.)
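As a compact sketch of the design described above, the implicit reward and the resulting objective can be written as follows; the notation is reconstructed from the abstract's description (β denotes a reward-scaling constant and γ the target reward margin) rather than quoted from the paper:

```latex
% Implicit reward: length-normalized (average) log probability, no reference model
r_{\mathrm{SimPO}}(x, y) \;=\; \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
  \;=\; \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta\!\left(y_i \mid x, y_{<i}\right)

% Bradley-Terry objective with a target reward margin \gamma > 0
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) \;=\;
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
    \;-\; \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
    \;-\; \gamma \right) \right]
```

Because the reward depends only on the policy π_θ, no frozen reference model needs to be kept in memory during training, which is the source of the compute and memory savings mentioned in the abstract.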

Summary

The paper presents SimPO (Simple Preference Optimization), an offline preference optimization algorithm for aligning large language models (LLMs) with a reference-free reward. SimPO modifies Direct Preference Optimization (DPO) by using the average log probability of a generated sequence as the implicit reward, which aligns more closely with how the model generates text and removes the reference model, improving both performance and efficiency. The authors also introduce a target reward margin into the Bradley-Terry objective to enforce a larger separation between winning and losing responses. Applied to Mistral, Llama 3, and Gemma 2 models, SimPO yields substantial gains on benchmarks such as AlpacaEval 2 and Arena-Hard, outperforming DPO by a sizable margin. Ablations across training setups emphasize the importance of length normalization and the reward margin, and comparisons with other preference optimization methods support the robustness and effectiveness of SimPO.
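The following minimal PyTorch-style sketch illustrates the loss described above, assuming per-sequence log-probability sums and token lengths have already been computed from the policy model. The function name `simpo_loss` and the default β/γ values are illustrative placeholders, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen_sum, len_chosen, logp_rejected_sum, len_rejected,
               beta=2.0, gamma=0.5):
    """Illustrative SimPO-style loss: length-normalized log probabilities act as
    reference-free rewards, compared under a Bradley-Terry objective with a
    target reward margin gamma. Defaults for beta/gamma are placeholders.

    Args:
        logp_chosen_sum:   (batch,) summed token log-probs of the winning response
        len_chosen:        (batch,) token lengths of the winning response
        logp_rejected_sum: (batch,) summed token log-probs of the losing response
        len_rejected:      (batch,) token lengths of the losing response
    """
    # Average (length-normalized) log probability as the implicit reward
    reward_chosen = beta * logp_chosen_sum / len_chosen
    reward_rejected = beta * logp_rejected_sum / len_rejected
    # Bradley-Terry loss with a target margin: push the reward gap beyond gamma
    logits = reward_chosen - reward_rejected - gamma
    return -F.logsigmoid(logits).mean()

# Toy usage with random per-sequence statistics
if __name__ == "__main__":
    b = 4
    loss = simpo_loss(
        logp_chosen_sum=-torch.rand(b) * 50,
        len_chosen=torch.randint(20, 100, (b,)).float(),
        logp_rejected_sum=-torch.rand(b) * 80,
        len_rejected=torch.randint(20, 100, (b,)).float(),
    )
    print(loss.item())
```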

Methods

This paper employs the following methods:

  • SimPO (Simple Preference Optimization)
  • Direct Preference Optimization (DPO)
  • Bradley-Terry model (see the sketch below)
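For reference, the Bradley-Terry model expresses the pairwise preference probability as a logistic function of the reward difference; this standard form (shown here as a sketch) is what SimPO builds on:

```latex
p\!\left(y_w \succ y_l \mid x\right)
  \;=\; \frac{\exp\!\big(r(x, y_w)\big)}{\exp\!\big(r(x, y_w)\big) + \exp\!\big(r(x, y_l)\big)}
  \;=\; \sigma\!\big(r(x, y_w) - r(x, y_l)\big)
```

SimPO instantiates r with the length-normalized log probability of the policy and, in the loss, requires the reward difference to exceed the target margin γ (see the objective sketched after the abstract).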

Models Used

  • Mistral
  • Llama 3
  • Gemma 2

Datasets

The following datasets and evaluation benchmarks were used in this research:

  • UltraChat-200k (training data)
  • UltraFeedback (preference training data)
  • AlpacaEval 2 (evaluation benchmark)
  • MT-Bench (evaluation benchmark)
  • Arena-Hard (evaluation benchmark)

Evaluation Metrics

  • Length-controlled win rate
  • Raw win rate
  • KL divergence

Results

  • SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard
  • The top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2 and a 59.1% win rate on Arena-Hard
  • The same model ranks 1st on Chatbot Arena among <10B models with real user votes

Limitations

The authors identified the following limitations:

  • The method lacks a more in-depth theoretical analysis
  • Safety and honesty considerations are not explicitly addressed
  • Performance can drop on reasoning-heavy tasks such as GSM8K

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA H100

Keywords

Preference Optimization, RLHF, Language Model Alignment, Reward Function, Reference-Free Reward
