
A Survey on Evaluation of Large Language Models

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie (2023)

Affiliations: School of Artificial Intelligence, Jilin University, Changchun, China; Microsoft Research Asia, Beijing, China; Westlake University, Hangzhou, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Carnegie Mellon University, Pennsylvania, USA; Peking University, Beijing, China; University of Illinois at Chicago, Illinois, USA; Hong Kong University of Science and Technology, Kowloon, Hong Kong, China

Paper Information
arXiv ID
2307.03109
Venue
ACM Transactions on Intelligent Systems and Technology
Domain
artificial intelligence, natural language processing, machine learning
Code
Available

Abstract

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLM evaluation.

Summary

The paper provides a comprehensive survey of evaluation methods for large language models (LLMs), organized around three key dimensions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (evaluation processes). It covers natural language processing tasks such as sentiment analysis and text classification, as well as reasoning capabilities and ethical considerations. The paper highlights challenges faced by LLMs, such as limited robustness, bias, and factual inaccuracy, and summarizes success and failure cases of LLMs across different tasks. It also identifies the limitations of existing evaluation methods and outlines future challenges for improving evaluation practices. Ultimately, the paper aims to foster a better understanding of LLMs and to inform future developments in their evaluation.

Methods

This paper employs the following methods (a minimal evaluation-loop sketch follows the list):

  • Evaluation Methods
  • Benchmarking Techniques
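
In practice, benchmarking an LLM amounts to running it over a fixed set of items and scoring its outputs against references. The sketch below is a hypothetical illustration of such a loop, not code from the paper; `query_model` and the `{"prompt", "answer"}` item format are assumptions made for this example.

```python
# Hypothetical benchmarking-loop sketch (illustration only, not from the paper).
# `query_model` stands in for whichever LLM is under evaluation (e.g. ChatGPT,
# GPT-4); the {"prompt", "answer"} item format is an assumption for this example.
from typing import Callable, Dict, Iterable


def run_benchmark(items: Iterable[Dict[str, str]],
                  query_model: Callable[[str], str]) -> float:
    """Return exact-match accuracy of `query_model` over the benchmark items."""
    correct, total = 0, 0
    for item in items:
        prediction = query_model(item["prompt"]).strip().lower()
        reference = item["answer"].strip().lower()
        correct += int(prediction == reference)
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    demo_items = [{"prompt": "2 + 2 = ?", "answer": "4"}]
    print(run_benchmark(demo_items, lambda prompt: "4"))  # -> 1.0
```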

Models Used

  • ChatGPT
  • GPT-3
  • InstructGPT
  • GPT-4

Datasets

The following datasets were used in this research (a loading sketch is shown after the list):

  • GLUE
  • SuperGLUE
  • TruthfulQA
  • MNLI
  • SQuAD
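
A common way to obtain these benchmarks is through the Hugging Face `datasets` library. The snippet below is an illustrative sketch of that workflow; the Hub identifiers are the commonly used ones and are an assumption, since the survey itself does not prescribe any loading mechanism.

```python
# Illustrative only: loading the benchmarks listed above via the Hugging Face
# `datasets` library. The Hub identifiers below are the commonly used ones and
# are an assumption; the survey does not prescribe a loading mechanism.
from datasets import load_dataset

glue_mnli = load_dataset("glue", "mnli")                # MNLI is a GLUE task
superglue = load_dataset("super_glue", "boolq")         # one SuperGLUE task
truthfulqa = load_dataset("truthful_qa", "generation")
squad = load_dataset("squad")

print(glue_mnli["train"][0])    # inspect one MNLI example
print(squad["validation"][0])   # inspect one SQuAD example
```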

Evaluation Metrics

  • Accuracy
  • F1-score
  • ROUGE
  • Exact Match
  • BLEU
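
Accuracy and Exact Match are simple label or string comparisons, and F1 for question answering is usually the SQuAD-style token-level variant; ROUGE and BLEU are typically computed with dedicated packages and are omitted here. Below is a minimal, dependency-free sketch of the first three, written for illustration rather than taken from the paper.

```python
# Minimal sketches of Accuracy, Exact Match, and SQuAD-style token-level F1.
# ROUGE and BLEU are usually computed with dedicated libraries and are omitted.
from collections import Counter


def accuracy(preds, refs):
    """Fraction of predictions equal to their references."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def exact_match(pred: str, ref: str) -> int:
    """1 if the normalized prediction equals the reference, else 0."""
    return int(pred.strip().lower() == ref.strip().lower())


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))                    # -> 1
print(round(token_f1("the cat sat", "a cat sat"), 3))   # -> 0.667
```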

Results

  • Comprehensive overview of LLM evaluation methods
  • Identification of success and failure cases in LLMs
  • Discussion of future challenges in evaluation

Limitations

The authors identified the following limitations:

  • Existing evaluation protocols are insufficient for robust evaluation
  • Need for dynamic and evolving evaluation systems
  • Challenges in measuring AGI capabilities

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large language models, evaluation, benchmarks, robustness, ethics, societal impact
