Venue
International Conference on Machine Learning
Domain
artificial intelligence, natural language processing
Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.
This paper introduces Chatbot Arena, an open platform for evaluating large language models (LLMs) based on human preferences, using pairwise comparisons and crowdsourced user input. Operational since April 2023, the platform has garnered significant engagement, collecting over 240K votes from a diverse user base across many languages. The paper surveys existing benchmarks and their limitations in capturing nuanced aspects of LLM performance, arguing for the necessity of a live, human-preference-based evaluation method. It details the data collection methodology, the statistical techniques used for evaluation, and the outcomes of analyzing the collected data. Key contributions include the release of a human preference dataset with over 100K votes and a newly designed efficient sampling algorithm for model evaluation.
This paper employs the following methods:
- pairwise comparison
- crowdsourced evaluation
- Bradley-Terry model (a minimal fitting sketch follows this list)
- efficient sampling of model pairs
Models evaluated on the platform include:
- GPT-4 (including gpt-4-turbo)
- gpt-3.5-turbo
- Claude
- LLaMA
- Mistral
- Gemini
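To make the ranking method concrete, here is a minimal sketch of Bradley-Terry fitting via logistic regression: encoding each battle as a +1/-1 design row recovers per-model log-strengths, which can then be rescaled to an Elo-like scale. The battle records and model names below are hypothetical, and this is an illustration of the general technique, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical battle log: (model_a, model_b, winner), winner is "a" or "b".
battles = [
    ("gpt-4-turbo", "gpt-3.5-turbo", "a"),
    ("claude", "llama", "a"),
    ("gpt-3.5-turbo", "mistral", "a"),
    ("gpt-4-turbo", "claude", "a"),
    ("llama", "mistral", "b"),
    ("claude", "gpt-4-turbo", "a"),
    ("mistral", "llama", "b"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# Design matrix: +1 for model_a, -1 for model_b; label is 1 when model_a won.
# Logistic regression on this encoding fits
# log P(a beats b) / P(b beats a) = strength_a - strength_b,
# which is exactly the Bradley-Terry model.
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for row, (a, b, winner) in enumerate(battles):
    X[row, idx[a]], X[row, idx[b]] = 1.0, -1.0
    y[row] = 1.0 if winner == "a" else 0.0

lr = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000)
lr.fit(X, y)

# Rescale natural-log strengths to an Elo-like scale (base 10, 400 points).
ratings = 400.0 * lr.coef_[0] / np.log(10) + 1000.0
for m in sorted(models, key=lambda name: -ratings[idx[name]]):
    print(f"{m:>15}: {ratings[idx[m]]:.0f}")
```

Fitting all battles jointly this way, rather than updating ratings sequentially as classic Elo does, makes the result independent of the order in which votes arrive.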
The following dataset was used and released in this research:
- A human preference dataset of over 100K pairwise votes collected on the platform
Evaluation metrics include:
- Accuracy
- Win rate
- Vote quality
Key results:
- Amassed over 240K votes from users
- Crowdsourced votes show good agreement with expert raters
- Efficient sampling algorithms select which model pairs to show (an illustrative sketch follows this list)
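The idea behind efficient pair sampling is to spend votes where they most reduce ranking uncertainty. The sketch below is a simple heuristic in that spirit, not the paper's actual algorithm: it weights candidate pairs by how close their ratings are (uncertain outcomes are informative) and by how few votes they have received so far. The model names, vote counts, and the `tau` parameter are all assumptions for illustration.

```python
import itertools
import math
import random

def pick_next_pair(ratings, vote_counts, tau=100.0):
    """Illustrative heuristic: favor pairs whose ratings are close
    (outcome uncertain) and that have few votes so far."""
    pairs = list(itertools.combinations(sorted(ratings), 2))
    weights = []
    for a, b in pairs:
        closeness = math.exp(-abs(ratings[a] - ratings[b]) / tau)  # near-ties score high
        scarcity = 1.0 / (1.0 + vote_counts.get((a, b), 0))        # under-sampled pairs score high
        weights.append(closeness * scarcity)
    return random.choices(pairs, weights=weights, k=1)[0]

# Hypothetical current state of the leaderboard.
ratings = {"gpt-4-turbo": 1250, "claude": 1180, "llama": 1050, "mistral": 1040}
vote_counts = {("claude", "gpt-4-turbo"): 50, ("llama", "mistral"): 3}
print(pick_next_pair(ratings, vote_counts))
```

Under this weighting, the near-tied and lightly sampled (llama, mistral) pair is drawn far more often than the well-covered (claude, gpt-4-turbo) pair, which is the qualitative behavior an adaptive sampler aims for.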
The authors identified the following limitations:
- User base may be biased towards LLM hobbyists and researchers
- Data mostly comes from a single online interface, potentially skewing prompt distribution
- Evaluation focuses on helpfulness and lacks dedicated safety assessments
Compute resources:
- Number of GPUs: not specified
- GPU type: not specified
Keywords:
- LLMs
- human preferences
- evaluation platform
- crowdsourcing
- model ranking
- pairwise comparison