Yejin Bang [email protected], Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung [email protected] — Centre for Artificial Intelligence Research (CAiRE), The Hong Kong University of Science and Technology (2023)
This paper investigates the capabilities and limitations of ChatGPT through a structured evaluation framework spanning a wide range of NLP tasks. The authors evaluate ChatGPT's performance on 21 datasets across eight NLP application areas, examining its multitask, multilingual, and multimodal capabilities. They find that ChatGPT performs strongly in zero-shot settings and shows notable ability to understand non-Latin-script languages. However, it struggles with inductive reasoning and generates a significant amount of hallucinated information. The study reports the model's accuracy across different reasoning categories, where it achieves 64.33% overall accuracy on reasoning tasks. The authors also show that interactive, multi-turn dialogue improves task performance, yielding better summarization and machine-translation results.
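The zero-shot evaluation setup described above can be sketched as a simple loop: the model receives only a task instruction and the input (no in-context examples), and accuracy is aggregated overall and per reasoning category. This is a minimal illustrative sketch, not the paper's actual harness; the `query_model` stub, the instruction string, and the sample triples are all hypothetical stand-ins.

```python
from collections import defaultdict

def query_model(instruction: str, example_input: str) -> str:
    """Hypothetical stand-in for a zero-shot ChatGPT call (no demonstrations).
    A real harness would send `instruction` and `example_input` to the model API."""
    return "yes"  # fixed placeholder response for illustration

def evaluate(examples):
    """Compute overall and per-category accuracy from (category, input, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, example_input, gold in examples:
        pred = query_model("Answer yes or no.", example_input)
        total[category] += 1
        if pred.strip().lower() == gold.strip().lower():
            correct[category] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category

# Hypothetical sample items, one per reasoning category
examples = [
    ("deductive", "All birds fly. A sparrow is a bird. Does it fly?", "yes"),
    ("inductive", "The sun has risen every day so far. Will it rise tomorrow?", "yes"),
    ("commonsense", "Can a fish ride a bicycle?", "no"),
]
overall, per_category = evaluate(examples)
```

Reporting accuracy per category, rather than only in aggregate, is what lets the study separate strengths (e.g. deductive reasoning) from weaknesses (e.g. inductive reasoning).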