
A Survey on Evaluation of Large Language Models

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie (2023)

Affiliations: School of Artificial Intelligence, Jilin University, Changchun, China; Microsoft Research Asia, Beijing, China; Westlake University, Hangzhou, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Carnegie Mellon University, Pennsylvania, USA; Peking University, Beijing, China; University of Illinois at Chicago, Illinois, USA; Hong Kong University of Science and Technology, Kowloon, Hong Kong, China

Paper Information
arXiv ID
2307.03109
Venue
ACM Transactions on Intelligent Systems and Technology
Domain
artificial intelligence, natural language processing, machine learning
Code
Available

Abstract

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation.

Summary

The paper provides a comprehensive survey of evaluation methods for large language models (LLMs), organized around three key dimensions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (evaluation processes). It covers natural language processing tasks such as sentiment analysis and text classification, as well as reasoning capabilities and ethical considerations. The paper highlights challenges faced by LLMs, such as limited robustness, bias, and factual inaccuracy, and summarizes success and failure cases of LLMs across different tasks. It also identifies the limitations of existing evaluation methods and outlines future challenges for improving evaluation practices. Ultimately, the paper aims to foster a better understanding of LLMs and to inform future developments in their evaluation.

Methods

This paper employs the following methods (a minimal evaluation-loop sketch follows the list):

  • Evaluation Methods
  • Benchmarking Techniques
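
In practice, benchmarking an LLM amounts to running it over a fixed set of items and scoring its outputs against references. The sketch below is a hypothetical illustration of such a loop, not code from the paper; `query_model` and the `{"prompt", "answer"}` item format are assumptions made for this example.

```python
# Hypothetical benchmarking-loop sketch (illustration only, not from the paper).
# `query_model` stands in for whichever LLM is under evaluation (e.g. ChatGPT,
# GPT-4); the {"prompt", "answer"} item format is an assumption for this example.
from typing import Callable, Dict, Iterable


def run_benchmark(items: Iterable[Dict[str, str]],
                  query_model: Callable[[str], str]) -> float:
    """Return exact-match accuracy of `query_model` over the benchmark items."""
    correct, total = 0, 0
    for item in items:
        prediction = query_model(item["prompt"]).strip().lower()
        reference = item["answer"].strip().lower()
        correct += int(prediction == reference)
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    demo_items = [{"prompt": "2 + 2 = ?", "answer": "4"}]
    print(run_benchmark(demo_items, lambda prompt: "4"))  # -> 1.0
```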

Models Used

  • ChatGPT
  • GPT-3
  • InstructGPT
  • GPT-4

Datasets

The following datasets were used in this research (a loading sketch is shown after the list):

  • GLUE
  • SuperGLUE
  • TruthfulQA
  • MNLI
  • SQuAD
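
A common way to obtain these benchmarks is through the Hugging Face `datasets` library. The snippet below is an illustrative sketch of that workflow; the Hub identifiers are the commonly used ones and are an assumption, since the survey itself does not prescribe any loading mechanism.

```python
# Illustrative only: loading the benchmarks listed above via the Hugging Face
# `datasets` library. The Hub identifiers below are the commonly used ones and
# are an assumption; the survey does not prescribe a loading mechanism.
from datasets import load_dataset

glue_mnli = load_dataset("glue", "mnli")                # MNLI is a GLUE task
superglue = load_dataset("super_glue", "boolq")         # one SuperGLUE task
truthfulqa = load_dataset("truthful_qa", "generation")
squad = load_dataset("squad")

print(glue_mnli["train"][0])    # inspect one MNLI example
print(squad["validation"][0])   # inspect one SQuAD example
```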

Evaluation Metrics

  • Accuracy
  • F1-score
  • ROUGE
  • Exact Match
  • BLEU
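
Accuracy and Exact Match are simple label or string comparisons, and F1 for question answering is usually the SQuAD-style token-level variant; ROUGE and BLEU are typically computed with dedicated packages and are omitted here. Below is a minimal, dependency-free sketch of the first three, written for illustration rather than taken from the paper.

```python
# Minimal sketches of Accuracy, Exact Match, and SQuAD-style token-level F1.
# ROUGE and BLEU are usually computed with dedicated libraries and are omitted.
from collections import Counter


def accuracy(preds, refs):
    """Fraction of predictions equal to their references."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def exact_match(pred: str, ref: str) -> int:
    """1 if the normalized prediction equals the reference, else 0."""
    return int(pred.strip().lower() == ref.strip().lower())


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))                    # -> 1
print(round(token_f1("the cat sat", "a cat sat"), 3))   # -> 0.667
```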

Results

  • Comprehensive overview of LLM evaluation methods
  • Identification of success and failure cases in LLMs
  • Discussion of future challenges in evaluation

Limitations

The authors identified the following limitations:

  • Existing evaluation protocols are insufficient for robust evaluation
  • Need for dynamic and evolving evaluation systems
  • Challenges in measuring AGI capabilities

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large language models, evaluation, benchmarks, robustness, ethics, societal impact
