Domain
Natural Language Processing
LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as a preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's outputs had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM on a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity; it also increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98. We release the code and resulting leaderboard.
The paper presents a method for mitigating the length bias of LLM-based automatic evaluations, instantiated as a length-controlled version of the AlpacaEval metric. Auto-evaluators such as AlpacaEval are known to favor longer responses, so the authors correct for the bias introduced by differences in output length. The method fits a generalized linear model to the preferences of the auto-evaluator, which allows a counterfactual assessment of preferences conditioned on equal output lengths. The results show improved performance, including a higher Spearman correlation with human preferences measured through the LMSYS Chatbot Arena and reduced sensitivity to output length manipulations. The authors release their code and results to support reproducibility and further research in this area.
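To make the approach concrete, here is a minimal Python sketch of the idea; it is not the authors' released implementation. A binomial GLM (logistic regression) is fit to the auto-annotator's preferences with the length difference as a covariate, and the length-controlled win rate is obtained by predicting with that covariate set to zero. The DataFrame `df`, its columns `preference` and `len_diff`, and the helper `length_controlled_win_rate` are hypothetical; the paper's actual model also conditions on additional relevant features that are omitted here.

```python
# Minimal sketch of the length-controlled preference idea (not the authors'
# exact implementation). Assumes a pandas DataFrame `df` with hypothetical
# columns: `preference` (1 if the annotator preferred the model over the
# baseline, else 0) and `len_diff` (model output length minus baseline
# output length, standardized).
import pandas as pd
import statsmodels.api as sm


def length_controlled_win_rate(df: pd.DataFrame) -> float:
    # Design matrix: intercept (absorbs overall model quality) plus the
    # length difference, the mediator we want to control for.
    X = sm.add_constant(df[["len_diff"]])
    y = df["preference"]

    # Fit a binomial GLM (logistic regression) to the annotator's preferences.
    glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()

    # Counterfactual prediction: "what would the preference be if the model's
    # and baseline's outputs had the same length?" -> set len_diff to 0.
    X_cf = X.copy()
    X_cf["len_diff"] = 0.0
    return float(glm.predict(X_cf).mean())
```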
This paper employs the following methods:
- Generalized Linear Model (GLM)
- Regression Analysis
The following evaluation metrics are used in this research:
- Spearman correlation
- Win rate
Key results reported:
- Increased Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98
- Decreased length gameability from 25% to 10% for AlpacaEval-LC
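For reference, the Spearman correlation reported above is a rank correlation between per-model benchmark scores and Chatbot Arena ratings. A small illustrative computation (with made-up numbers, not the paper's data) might look like:

```python
# Illustrative only: how a rank correlation like the one reported above could
# be computed. The values below are hypothetical benchmark win rates and
# Chatbot Arena ratings for the same five models, not the paper's results.
from scipy.stats import spearmanr

benchmark_win_rates = [0.82, 0.65, 0.71, 0.50, 0.44]  # e.g. AlpacaEval-LC win rates
arena_ratings = [1250, 1130, 1180, 1060, 1015]         # e.g. Chatbot Arena ratings

rho, p_value = spearmanr(benchmark_win_rates, arena_ratings)
print(f"Spearman correlation: {rho:.2f}")
```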
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
auto-evaluation
length bias
regression analysis
AlpacaEval
bias control