
KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela (2024)

Paper Information
  • arXiv ID: 2402.01306
  • Venue: International Conference on Machine Learning (ICML)
  • Domain: Artificial Intelligence, Machine Learning, Natural Language Processing
  • SOTA Claim: Yes
  • Code: None specified
  • Reproducibility: 8/10

Abstract

Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases; the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call human-aware losses (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
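For context, loss aversion in prospect theory is usually formalized with the value function from the 1992 Tversky & Kahneman paper the abstract cites; the display below restates that standard background formulation (it is not an equation taken from this wiki entry), where z is a gain or loss relative to a reference point and the empirically estimated parameters are roughly α ≈ 0.88 and λ ≈ 2.25:

```latex
v(z) =
\begin{cases}
  z^{\alpha} & \text{if } z \ge 0 \\
  -\lambda\,(-z)^{\alpha} & \text{if } z < 0
\end{cases}
\qquad \lambda > 1 \text{ captures loss aversion.}
```

Because λ > 1, a loss of a given size hurts more than an equally sized gain helps, which is the bias the abstract refers to.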

Summary

This paper presents KTO (Kahneman-Tversky Optimization), a method for aligning large language models (LLMs) with human feedback that is grounded in prospect theory. It observes that existing alignment objectives such as Direct Preference Optimization (DPO) implicitly encode human biases like loss aversion, yet the utility functions they imply still differ from those in the prospect theory literature. KTO is a human-aware loss (HALO) that directly maximizes the utility of generations rather than the log-likelihood of preferences. It requires a weaker feedback signal, using only a binary label of whether an output is desirable or undesirable rather than paired preferences, and it matches or exceeds DPO across model scales from 1B to 30B parameters. Key findings include robustness to extreme data imbalance and the ability to align models effectively without prior supervised fine-tuning (SFT). The paper concludes that the effectiveness of HALOs is context-dependent: no single loss is universally superior, and the best choice depends on the inductive biases appropriate to a given setting.
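To make the objective above concrete, here is a minimal PyTorch sketch of a KTO-style loss under the formulation summarized above. The function name `kto_loss`, its argument names, and the batch-mean estimate of the reference point are illustrative assumptions; the paper estimates the KL reference point from mismatched prompt-completion pairs in the batch, and the authors' released implementation should be treated as authoritative.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of a KTO-style objective (illustrative, not the authors' exact code).

    policy_logps, ref_logps: log pi(y|x) under the policy and the frozen
        reference model, shape (batch,).
    is_desirable: bool tensor, True where the output was labeled desirable.
    """
    # Implied reward: log-ratio of policy to reference model.
    rewards = policy_logps - ref_logps

    # Reference point z0: the paper uses a batch estimate of KL(pi_theta || pi_ref)
    # from mismatched pairs, with no gradient flowing through it. The batch mean
    # below is a crude stand-in for that estimator.
    z0 = torch.clamp(rewards.mean().detach(), min=0.0)

    # Kahneman-Tversky-style value: separate weights and opposite-signed
    # sigmoids for desirable vs. undesirable outputs.
    value_desirable = lambda_d * torch.sigmoid(beta * (rewards - z0))
    value_undesirable = lambda_u * torch.sigmoid(beta * (z0 - rewards))
    value = torch.where(is_desirable, value_desirable, value_undesirable)

    # Minimizing (lambda_y - value) maximizes the perceived utility of generations.
    lambda_y = torch.where(is_desirable,
                           torch.full_like(rewards, lambda_d),
                           torch.full_like(rewards, lambda_u))
    return (lambda_y - value).mean()
```

Note that nothing in this sketch requires desirable and undesirable outputs to come in pairs for the same prompt, which is why a simple binary signal suffices.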

Methods

This paper employs the following methods:

  • Kahneman-Tversky Optimization (KTO)
  • Direct Preference Optimization (DPO), the preference-based baseline (its standard objective is restated after this list for comparison)
  • PPO-Clip
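For comparison with the KTO sketch above, the standard DPO objective (as introduced by Rafailov et al., 2023, and used as the main baseline here) maximizes the log-likelihood of preferences between a chosen completion y_w and a rejected completion y_l:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

KTO replaces this paired log-likelihood with a per-example utility term, which is what lets it learn from unpaired binary feedback.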

Models Used

  • Pythia-{1.4B, 2.8B, 6.9B, 12B}
  • Llama-{7B, 13B, 30B}
  • Mistral-7B

Datasets

The following datasets were used in this research:

  • SHP (Stanford Human Preferences)
  • HH (Anthropic Helpful and Harmless)
  • OASST (OpenAssistant Conversations)

Evaluation Metrics

  • MMLU
  • GSM8K
  • HumanEval
  • BBH

Results

  • KTO matches or exceeds DPO performance at scales from 1B to 30B parameters
  • KTO can handle extreme data imbalances, requiring far fewer desirable examples than undesirable ones (see the weighting sketch after this list)
  • KTO can align models effectively without prior supervised fine-tuning
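The snippet below is a hypothetical helper (not from the paper's released code) illustrating one way the per-class weights in the KTO sketch above could be chosen when desirable and undesirable examples are imbalanced; the balancing target is an assumption for illustration, and the paper's own guidance on setting these weights should be followed in practice.

```python
# Hypothetical helper: choose per-class weights (lambda_d, lambda_u) so that the
# desirable and undesirable examples contribute comparably to a KTO-style loss.
# The specific balancing target is an illustrative assumption, not the paper's
# exact recommendation.

def imbalance_weights(n_desirable: int, n_undesirable: int,
                      target_ratio: float = 1.0) -> tuple[float, float]:
    """Return (lambda_d, lambda_u) so that
    lambda_d * n_desirable / (lambda_u * n_undesirable) == target_ratio."""
    lambda_u = 1.0  # fix one weight and solve for the other
    lambda_d = target_ratio * lambda_u * n_undesirable / n_desirable
    return lambda_d, lambda_u

# Example: 90% of the desirable examples have been discarded.
print(imbalance_weights(n_desirable=1_000, n_undesirable=10_000))  # (10.0, 1.0)
```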

Limitations

The authors identified the following limitations:

  • KTO's performance could depend on the model's initial capacity
  • Binary signals may lead to underfitting in complex distributions

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Model Alignment, LLMs, Human Feedback, Prospect Theory, Reward Optimization, KTO, Preferences, Loss Functions, Fine-tuning

Papers Using Similar Methods

External Resources