
KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela (2024)

Paper Information
  • arXiv ID: 2402.01306
  • Venue: International Conference on Machine Learning (ICML)
  • Domain: Artificial Intelligence, Machine Learning, Natural Language Processing
  • SOTA Claim: Yes
  • Code: None specified
  • Reproducibility: 8/10

Abstract

Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases; the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call human-aware losses (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
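For context, loss aversion in prospect theory is usually formalized with the value function from the 1992 Tversky & Kahneman paper the abstract cites; the display below restates that standard background formulation (it is not an equation taken from this wiki entry), where z is a gain or loss relative to a reference point and the empirically estimated parameters are roughly α ≈ 0.88 and λ ≈ 2.25:

```latex
v(z) =
\begin{cases}
  z^{\alpha} & \text{if } z \ge 0 \\
  -\lambda\,(-z)^{\alpha} & \text{if } z < 0
\end{cases}
\qquad \lambda > 1 \text{ captures loss aversion.}
```

Because λ > 1, a loss of a given size hurts more than an equally sized gain helps, which is the bias the abstract refers to.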

Summary

This paper presents KTO (Kahneman-Tversky Optimization), a method for aligning large language models (LLMs) with human feedback that is grounded in prospect theory. It observes that existing alignment objectives such as Direct Preference Optimization (DPO) implicitly encode human biases like loss aversion, yet the utility functions they imply still differ from those in the prospect theory literature. KTO is a human-aware loss (HALO) that directly maximizes the utility of generations rather than the log-likelihood of preferences. It requires a weaker feedback signal, using only a binary label of whether an output is desirable or undesirable rather than paired preferences, and it matches or exceeds DPO across model scales from 1B to 30B parameters. Key findings include robustness to extreme data imbalance and the ability to align models effectively without prior supervised fine-tuning (SFT). The paper concludes that the effectiveness of HALOs is context-dependent: no single loss is universally superior, and the best choice depends on the inductive biases appropriate to a given setting.
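To make the objective above concrete, here is a minimal PyTorch sketch of a KTO-style loss under the formulation summarized above. The function name `kto_loss`, its argument names, and the batch-mean estimate of the reference point are illustrative assumptions; the paper estimates the KL reference point from mismatched prompt-completion pairs in the batch, and the authors' released implementation should be treated as authoritative.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of a KTO-style objective (illustrative, not the authors' exact code).

    policy_logps, ref_logps: log pi(y|x) under the policy and the frozen
        reference model, shape (batch,).
    is_desirable: bool tensor, True where the output was labeled desirable.
    """
    # Implied reward: log-ratio of policy to reference model.
    rewards = policy_logps - ref_logps

    # Reference point z0: the paper uses a batch estimate of KL(pi_theta || pi_ref)
    # from mismatched pairs, with no gradient flowing through it. The batch mean
    # below is a crude stand-in for that estimator.
    z0 = torch.clamp(rewards.mean().detach(), min=0.0)

    # Kahneman-Tversky-style value: separate weights and opposite-signed
    # sigmoids for desirable vs. undesirable outputs.
    value_desirable = lambda_d * torch.sigmoid(beta * (rewards - z0))
    value_undesirable = lambda_u * torch.sigmoid(beta * (z0 - rewards))
    value = torch.where(is_desirable, value_desirable, value_undesirable)

    # Minimizing (lambda_y - value) maximizes the perceived utility of generations.
    lambda_y = torch.where(is_desirable,
                           torch.full_like(rewards, lambda_d),
                           torch.full_like(rewards, lambda_u))
    return (lambda_y - value).mean()
```

Note that nothing in this sketch requires desirable and undesirable outputs to come in pairs for the same prompt, which is why a simple binary signal suffices.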

Methods

This paper employs the following methods:

  • Kahneman-Tversky Optimization (KTO)
  • Direct Preference Optimization (DPO), the preference-based baseline (its standard objective is restated after this list for comparison)
  • PPO-Clip
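For comparison with the KTO sketch above, the standard DPO objective (as introduced by Rafailov et al., 2023, and used as the main baseline here) maximizes the log-likelihood of preferences between a chosen completion y_w and a rejected completion y_l:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

KTO replaces this paired log-likelihood with a per-example utility term, which is what lets it learn from unpaired binary feedback.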

Models Used

  • Pythia-{1.4B, 2.8B, 6.9B, 12B}
  • Llama-{7B, 13B, 30B}
  • Mistral-7B

Datasets

The following datasets were used in this research:

  • SHP (Stanford Human Preferences)
  • HH (Anthropic Helpful and Harmless)
  • OASST (OpenAssistant Conversations)

Evaluation Metrics

  • MMLU
  • GSM8K
  • HumanEval
  • BBH

Results

  • KTO matches or exceeds DPO performance at scales from 1B to 30B parameters
  • KTO can handle extreme data imbalances, requiring far fewer desirable examples than undesirable ones (see the weighting sketch after this list)
  • KTO can align models effectively without prior supervised fine-tuning
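The snippet below is a hypothetical helper (not from the paper's released code) illustrating one way the per-class weights in the KTO sketch above could be chosen when desirable and undesirable examples are imbalanced; the balancing target is an assumption for illustration, and the paper's own guidance on setting these weights should be followed in practice.

```python
# Hypothetical helper: choose per-class weights (lambda_d, lambda_u) so that the
# desirable and undesirable examples contribute comparably to a KTO-style loss.
# The specific balancing target is an illustrative assumption, not the paper's
# exact recommendation.

def imbalance_weights(n_desirable: int, n_undesirable: int,
                      target_ratio: float = 1.0) -> tuple[float, float]:
    """Return (lambda_d, lambda_u) so that
    lambda_d * n_desirable / (lambda_u * n_undesirable) == target_ratio."""
    lambda_u = 1.0  # fix one weight and solve for the other
    lambda_d = target_ratio * lambda_u * n_undesirable / n_desirable
    return lambda_d, lambda_u

# Example: 90% of the desirable examples have been discarded.
print(imbalance_weights(n_desirable=1_000, n_undesirable=10_000))  # (10.0, 1.0)
```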

Limitations

The authors identified the following limitations:

  • KTO's performance could depend on the model's initial capacity
  • Binary signals may lead to underfitting in complex distributions

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Model Alignment, LLMs, Human Feedback, Prospect Theory, Reward Optimization, KTO, Preferences, Loss Functions, Fine-tuning

Papers Using Similar Methods

External Resources