
Self-Rewarding Language Models

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston (2024)

Paper Information

arXiv ID

2401.10020

Venue

International Conference on Machine Learning

Domain

Artificial Intelligence, Natural Language Processing

SOTA Claim

Yes

Contents

Abstract
Methods
Datasets
Results
Limitations
Related Work
External Resources

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Summary

This paper introduces Self-Rewarding Language Models (SRLMs), in which a single model both follows instructions and judges its own outputs, so that reward modeling becomes part of training rather than a separate frozen component. The model creates new instructions and candidate responses, scores them via LLM-as-a-Judge prompting, and is then trained on the resulting preference data with Iterative Direct Preference Optimization (DPO). Over successive iterations, both instruction following and the quality of the self-assigned rewards improve. Fine-tuning Llama 2 70B for three iterations yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. The paper argues that this opens a path to continual improvement beyond the ceiling set by human preference data.
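
The step in which the model turns its own judgments into training data can be illustrated with a small sketch. It assumes that, for each prompt, the highest- and lowest-scored candidates form a (chosen, rejected) preference pair and that tied prompts are skipped; the pairing rule is an assumption of this sketch, not a detail stated on this page. The names `build_preference_pair` and `toy_judge` are hypothetical, and `toy_judge` is only a stand-in for a real LLM-as-a-Judge call.

```python
from typing import Callable, List, Optional, Tuple


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from self-judged candidates, or None on a tie."""
    # Score every candidate response with the judge (the model itself in SRLMs).
    scored = [(judge(prompt, c), c) for c in candidates]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    # Assumption: a pair with equal scores carries no preference signal, so skip it.
    if best[0] == worst[0]:
        return None
    return best[1], worst[1]


def toy_judge(prompt: str, response: str) -> float:
    # Hypothetical stand-in for an LLM-as-a-Judge call:
    # scores by word count, capped at 5. A real judge would prompt the model.
    return min(5.0, len(response.split()) / 2)
```

In an iterative setup, pairs collected this way from one model would be used to train the next model with DPO, which then acts as both generator and judge in the following round.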

Methods

This paper employs the following methods:

LLM-as-a-Judge
Direct Preference Optimization (DPO)
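
As a rough illustration of the DPO objective listed above, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The function name `dpo_loss` and the default `beta=0.1` are assumptions made for this example, not values taken from the paper.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value for illustration only
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    # Implicit reward of each response: beta * (policy log-prob - reference log-prob).
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the chosen response than the reference does.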

Models Used

Llama 2 70B
Claude 2
Gemini Pro
GPT-4

Datasets

The following datasets were used in this research:

Open Assistant
AlpacaEval 2.0
MT-Bench
ARC-Easy
ARC-Challenge
HellaSwag
SIQA
PIQA
GSM8K
MMLU
OBQA
NQ

Evaluation Metrics

Accuracy
Win Rate
Pairwise Accuracy
Spearman Correlation
Kendall's τ
ROUGE-L
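
Two of the ranking metrics above can be sketched in a few lines for comparing model-assigned scores with human scores on the same responses. This is a minimal sketch under stated assumptions: pairs that humans score equally are skipped in `pairwise_accuracy`, and `kendall_tau_a` is the simple tau-a variant without tie correction. The paper's exact tie handling may differ, and both function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(model: Sequence[float], human: Sequence[float]) -> float:
    """Fraction of human-untied pairs that the model orders the same way."""
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        h = human[i] - human[j]
        if h == 0:
            continue  # assumption: skip pairs with no human preference
        total += 1
        if (model[i] - model[j]) * h > 0:
            agree += 1
    return agree / total if total else 0.0


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    n = len(x)
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0
```

For example, with model scores `[5, 3, 4, 1]` and human scores `[4, 3, 2, 1]`, five of the six pairs are ordered consistently, so pairwise accuracy is 5/6 and tau-a is (5 - 1) / 6.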

Results

Outperformed many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613
Self-Rewarding training improved both instruction following and the model's ability to provide high-quality rewards to itself
Three iterations of Iterative DPO training on Llama 2 70B yielded the reported leaderboard result

Limitations

The authors identified the following limitations:

Not specified

Technical Requirements

Number of GPUs: Not specified
GPU Type: NVIDIA A100 80GB

Keywords

Self-Rewarding Language Models
Iterative DPO
AI Feedback
Language Model Alignment
Self-Instruction Creation

External Resources

Funding: Not specified
References: 51
Influential Citations: 22
