Venue
International Conference on Machine Learning
Domain
Artificial Intelligence, Machine Learning, Natural Language Processing
Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases: the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to their belonging to a family of loss functions that we call human-aware losses (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
This paper presents KTO (Kahneman-Tversky Optimization), a method for aligning large language models (LLMs) with human feedback that is grounded in prospect theory. It argues that existing alignment objectives such as Direct Preference Optimization (DPO) implicitly encode human biases like loss aversion, yet still diverge from the utility functions described in the prospect theory literature. KTO introduces a human-aware loss function (HALO) that directly maximizes the utility of generations rather than the log-likelihood of preferences. Because it learns from a binary signal of whether each output is desirable or undesirable, the required feedback is cheaper to collect than paired preferences, and KTO matches or exceeds DPO across model scales from 1B to 30B parameters. Key findings include KTO's robustness to extreme data imbalance and its ability to align models without prior supervised fine-tuning (SFT). The paper concludes that no single HALO is universally superior; the most effective loss depends on the inductive biases appropriate for a given setting.
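To make the objective concrete, below is a minimal PyTorch sketch of a KTO-style binary-feedback loss. The function name, the default beta, and the batch-mean estimate of the reference point are illustrative assumptions rather than the authors' reference implementation (which, for instance, estimates the reference point from mismatched prompt-completion pairs within a microbatch).

```python
import torch


def kto_loss(
    policy_logps: torch.Tensor,     # (B,) summed log pi_theta(y|x) per completion
    reference_logps: torch.Tensor,  # (B,) summed log pi_ref(y|x) per completion
    is_desirable: torch.Tensor,     # (B,) bool, True if the completion is labeled desirable
    beta: float = 0.1,              # illustrative default; controls risk aversion
    lambda_d: float = 1.0,          # weight on desirable examples
    lambda_u: float = 1.0,          # weight on undesirable examples
) -> torch.Tensor:
    """Binary-feedback alignment loss in the spirit of KTO (sketch, not the reference code)."""
    # Implied reward: log-ratio of policy to reference likelihood.
    rewards = policy_logps - reference_logps  # (B,)

    # Reference point z0: an estimate of the policy/reference KL, treated as a
    # constant (no gradient) and clamped to be non-negative. A simple batch mean
    # is used here as a stand-in for the paper's mismatched-pair estimate.
    z0 = rewards.detach().mean().clamp(min=0.0)

    # Prospect-theoretic value: a sigmoid of the reward relative to the reference
    # point, with separate weights for desirable vs. undesirable outputs.
    value_desirable = lambda_d * torch.sigmoid(beta * (rewards - z0))
    value_undesirable = lambda_u * torch.sigmoid(beta * (z0 - rewards))
    value = torch.where(is_desirable, value_desirable, value_undesirable)

    # Maximizing expected value is equivalent (up to a constant in the weights)
    # to minimizing its negative, which is what we return here.
    return -value.mean()
```

In training, policy_logps would come from the model being aligned and reference_logps from a frozen copy of the starting (typically SFT) model.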
This paper employs the following methods:
- Kahneman-Tversky Optimization (KTO)
- Direct Preference Optimization (DPO)
- PPO-Clip
The following models were evaluated:
- Pythia-{1.4B, 2.8B, 6.9B, 12B}
- Llama-{7B, 13B, 30B}
- Mistral-7B
The paper reports the following key findings:
- KTO matches or exceeds DPO performance at scales from 1B to 30B
- KTO can handle extreme data imbalances, requiring far fewer desirable examples than undesirable ones (see the weighting sketch after this list)
- KTO can align models effectively without prior supervised fine-tuning
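The imbalance finding above is typically operationalized by weighting the two feedback classes differently. The helper below is a hypothetical heuristic for picking the desirable/undesirable weights (lambda_d and lambda_u here) that upweights the rarer class; the paper treats these weights as tuned hyperparameters rather than prescribing a fixed rule.

```python
# Hypothetical helper (illustrative, not from the paper's codebase) for choosing the
# desirable/undesirable loss weights when the binary feedback is imbalanced: upweight
# the rarer class so that both classes contribute comparably to the KTO-style loss.

def imbalance_weights(n_desirable: int, n_undesirable: int) -> tuple[float, float]:
    """Return (lambda_d, lambda_u) that roughly balance the two classes' contributions."""
    if n_desirable <= n_undesirable:
        return n_undesirable / n_desirable, 1.0  # fewer desirable examples: upweight them
    return 1.0, n_desirable / n_undesirable      # fewer undesirable examples: upweight those


# Example: one desirable example for every ten undesirable ones.
lam_d, lam_u = imbalance_weights(n_desirable=1_000, n_undesirable=10_000)
print(lam_d, lam_u)  # 10.0 1.0
```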
The authors identified the following limitations:
- KTO's performance could depend on the model's initial capacity
- A binary desirable/undesirable signal carries less information than paired preferences and may underfit complex output distributions
- Number of GPUs: None specified
- GPU Type: None specified
Model Alignment
LLMs
Human Feedback
Prospect Theory
Reward Optimization
KTO
Preferences
Loss Functions
Fine-tuning