
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung. Centre for Artificial Intelligence Research (CAiRE), The Hong Kong University of Science and Technology (2023)

Paper Information
  • arXiv ID: 2302.04023
  • Venue: International Joint Conference on Natural Language Processing
  • Domain: Natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 7/10

Abstract

This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 21 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multimodal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 64.33% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion.
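
The abstract's "intermediate code generation step" means prompting the model to emit renderable code (for example, SVG markup) for a described image, which is then rendered outside the model. Below is a minimal sketch of that idea, assuming the OpenAI Python client; the model name, prompt wording, and choice of SVG are illustrative and not the paper's exact setup (the paper itself evaluated through the ChatGPT interface).

```python
# Minimal sketch of text-to-image via an intermediate code-generation step:
# ask the model for SVG markup, then save it for rendering.
# Assumes the OpenAI Python client; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def text_to_svg(description: str) -> str:
    """Return SVG markup drawn by the model for the given description."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[{
            "role": "user",
            "content": f"Output only SVG code (no explanation) that draws: {description}",
        }],
    )
    return response.choices[0].message.content

svg = text_to_svg("a red circle above a blue square on a white background")
with open("drawing.svg", "w") as f:
    f.write(svg)  # open in a browser or SVG viewer to inspect the image
```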

Summary

This paper investigates the capabilities and limitations of ChatGPT through a structured evaluation framework covering a range of NLP tasks. The authors assess ChatGPT's performance on 21 datasets spanning eight NLP application areas, examining its multitask, multilingual, and multimodal capabilities. The findings show that ChatGPT performs strongly in zero-shot settings and is better at understanding non-Latin script languages than at generating them. However, it struggles with inductive reasoning and produces a substantial amount of hallucinated content. Averaged over 10 reasoning categories, ChatGPT reaches 64.33% accuracy, making it an unreliable reasoner. The paper also shows that interactive, multi-turn dialogue with the model improves task performance, yielding better summarization and translation results.
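
The multi-turn improvement works by feeding the model's own output back with a follow-up instruction in the same conversation. The snippet below is a minimal sketch of that loop for summarization, assuming the OpenAI Python client; the prompts, feedback wording, and model name are illustrative rather than the paper's exact procedure.

```python
# Minimal sketch of multi-turn "prompt engineering": a first-pass summary is
# refined by returning the draft to the model with a targeted revision request.
from openai import OpenAI

client = OpenAI()

def chat(messages: list[dict]) -> str:
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content

article = "..."  # source document to summarize
history = [{"role": "user", "content": f"Summarize the following article:\n{article}"}]
draft = chat(history)

# Human-in-the-loop step: keep the conversation and ask for a revision.
history += [
    {"role": "assistant", "content": draft},
    {"role": "user", "content": "Shorten the summary to three sentences and keep only the main events."},
]
refined = chat(history)
print(refined)
```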

Methods

This paper employs the following methods:

  • Multitask Evaluation
  • Zero-shot Learning
  • Multimodal Generation
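
For the zero-shot setting, the model receives only a task instruction and the raw input, with no in-context examples. The following is a minimal sketch under that assumption, using the OpenAI Python client; the task, label set, and prompt wording are illustrative, since the paper uses different prompts for each of its 21 datasets.

```python
# Minimal sketch of zero-shot prediction: instruction + input, no exemplars.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Classify the sentiment of the following review as Positive or Negative.\n"
    "Review: {text}\n"
    "Answer:"
)

def zero_shot_predict(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip()

print(zero_shot_predict("The plot was thin, but the acting carried the film."))
```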

Models Used

  • ChatGPT
  • InstructGPT

Datasets

The following datasets were used in this research:

  • Not enumerated on this page; per the abstract, the paper evaluates on 21 publicly available datasets covering 8 NLP application tasks, plus a newly designed multimodal dataset.

Evaluation Metrics

  • ROUGE-1
  • ChrF++
  • Accuracy
  • F1-score
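
ROUGE-1 and ChrF++ can be computed with common open-source tooling, as sketched below using the rouge-score and sacrebleu packages (accuracy and F1-score are standard classification metrics and are omitted). This is a sketch of how such scores are typically obtained; the paper's exact evaluation scripts are not specified on this page.

```python
# Sketch of computing ROUGE-1 (summarization) and ChrF++ (machine translation)
# on a single prediction/reference pair with rouge-score and sacrebleu.
from rouge_score import rouge_scorer
import sacrebleu

prediction = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

# ROUGE-1: unigram-overlap F1.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
rouge1 = scorer.score(reference, prediction)["rouge1"].fmeasure

# ChrF++: character n-gram F-score with word bigrams (word_order=2).
chrfpp = sacrebleu.corpus_chrf([prediction], [[reference]], word_order=2).score

print(f"ROUGE-1 F1 = {rouge1:.3f}, ChrF++ = {chrfpp:.2f}")
```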

Results

  • ChatGPT outperforms other LLMs with zero-shot learning on 9 out of 13 datasets.
  • ChatGPT achieves 64.33% average accuracy in reasoning categories.
  • The interactive feature improves performance by 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation.
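
For reference, the 64.33% figure is the average of the per-category reasoning accuracies. Assuming a simple (macro) average over the 10 categories, as the abstract's wording suggests, it corresponds to:

```latex
\[
\overline{\mathrm{Acc}} \;=\; \frac{1}{10}\sum_{c=1}^{10} \mathrm{Acc}_c \;=\; 64.33\%,
\qquad
\mathrm{Acc}_c \;=\; \frac{\#\,\text{correct answers in category } c}{\#\,\text{questions in category } c}.
\]
```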

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

ChatGPT, evaluation, reasoning, hallucination, interactivity

Papers Using Similar Methods

  • None listed

External Resources

  • None listed