Yupeng Chang, Xu Wang, Yuan Wu, and Yi Chang (School of Artificial Intelligence, Jilin University, 2699 Qianjin St, Changchun 130012, China); Jindong Wang, Xiaoyuan Yi, and Xing Xie (Microsoft Research Asia, Beijing, China); Kaijie Zhu (Institute of Automation, Chinese Academy of Sciences, Beijing, China); Hao Chen (Carnegie Mellon University, Pennsylvania, USA); Linyi Yang, Cunxiang Wang, and Yue Zhang (Westlake University, Hangzhou, China); Yidong Wang and Wei Ye (Peking University, Beijing, China); Philip S. Yu (University of Illinois at Chicago, Illinois, USA); Qiang Yang (Hong Kong University of Science and Technology, Kowloon, Hong Kong, China). (2023)
The paper provides a comprehensive survey of evaluation methods for large language models (LLMs), organized along three dimensions: what to evaluate (tasks), where to evaluate (datasets and benchmarks), and how to evaluate (evaluation processes). It covers evaluation of LLMs on natural language processing tasks such as sentiment analysis and text classification, as well as on reasoning capabilities and ethical considerations. The paper highlights challenges LLMs face, including robustness, bias, and factual accuracy, and summarizes success and failure cases of LLMs across different tasks. It also identifies the limitations of existing evaluation methods and outlines future challenges for improving evaluation practices. Ultimately, the paper aims to foster a better understanding of LLMs and to inform future developments in their evaluation.
This paper employs the following methods: a survey methodology that organizes LLM evaluation along three dimensions (what, where, and how to evaluate) and synthesizes reported success and failure cases across tasks.
The following datasets were used in this research: as a survey, the paper does not introduce new datasets; instead, it reviews existing evaluation datasets and benchmarks as part of the "where to evaluate" dimension.
The authors identified the following limitations: existing evaluation methods remain inadequate, and LLMs continue to struggle with robustness, bias, and factual accuracy; the paper outlines future challenges for enhancing evaluation practices.
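To make the summary's "how to evaluate" dimension concrete, below is a minimal sketch of an automatic evaluation loop that scores a model's outputs against a small benchmark. The benchmark items, the `dummy_model` stand-in, the `evaluate` helper, and the exact-match metric are illustrative assumptions for this summary and do not come from the surveyed paper.

```python
# Minimal sketch of an LLM evaluation loop: run a model over a benchmark
# ("where to evaluate") for a task ("what to evaluate") and aggregate a
# metric ("how to evaluate"). All names here are hypothetical examples.

from typing import Callable, Dict, List, Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(model: Callable[[str], str],
             benchmark: List[Tuple[str, str]],
             metric: Callable[[str, str], float] = exact_match) -> Dict[str, float]:
    """Query the model on every benchmark item and average the metric."""
    scores = [metric(model(prompt), reference) for prompt, reference in benchmark]
    return {"accuracy": sum(scores) / len(scores), "num_items": float(len(scores))}


if __name__ == "__main__":
    # Toy sentiment-classification benchmark (a hypothetical "where" choice).
    toy_benchmark = [
        ("Classify the sentiment: 'I loved this movie.'", "positive"),
        ("Classify the sentiment: 'The service was terrible.'", "negative"),
    ]

    # Stand-in for an LLM call; a real harness would query an actual model API.
    def dummy_model(prompt: str) -> str:
        return "positive" if "loved" in prompt else "negative"

    print(evaluate(dummy_model, toy_benchmark))
```

A real harness would swap in an actual model call and task-appropriate metrics (e.g., human or model-based judgments for open-ended generation), but the structure of the loop stays the same.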