
Is ChatGPT a General-Purpose Natural Language Processing Task Solver?

Chengwei Qin (Nanyang Technological University), Aston Zhang, Zhuosheng Zhang (Shanghai Jiao Tong University), Jiaao Chen (Georgia Institute of Technology), Michihiro Yasunaga, Diyi Yang (Stanford University), 2023

Paper Information
arXiv ID
2302.06476
Venue
Conference on Empirical Methods in Natural Language Processing
Domain
Natural Language Processing
Reproducibility
8/10

Abstract

Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot, i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community due to the fact that it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning) while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.

Summary

This paper investigates whether ChatGPT can function as a general-purpose natural language processing (NLP) task solver. It examines the model's zero-shot learning capabilities on 20 popular NLP datasets spanning 7 representative task categories. The results indicate that while ChatGPT performs well on reasoning and dialogue tasks, it struggles with specific tasks such as sequence tagging and summarization. The study also provides qualitative case studies and comparative evaluations against GPT-3.5, highlighting both the strengths and limitations of ChatGPT in zero-shot NLP scenarios.

Methods

This paper employs the following methods (a minimal prompting sketch follows the list):

  • Zero-shot learning
  • Chain-of-thought prompting
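
Concretely, the paper evaluates models in a zero-shot setting (a task instruction plus the test input, with no demonstrations) and, for reasoning tasks, with two-stage zero-shot chain-of-thought prompting in the style of Kojima et al. (2022). Below is a minimal Python sketch of the two prompting styles; `query_model` is a hypothetical placeholder for an LLM API call, and the prompt templates are illustrative rather than the authors' exact wording.

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to ChatGPT / GPT-3.5."""
    raise NotImplementedError


def zero_shot(question: str) -> str:
    # Plain zero-shot: the question alone, answered in a single turn.
    prompt = f"Q: {question}\nA: The answer is"
    return query_model(prompt)


def zero_shot_cot(question: str) -> str:
    # Two-stage zero-shot chain-of-thought (Kojima et al., 2022):
    # stage 1 elicits a step-by-step rationale,
    stage1 = f"Q: {question}\nA: Let's think step by step."
    rationale = query_model(stage1)
    # stage 2 appends the rationale and asks for the final answer.
    stage2 = f"{stage1} {rationale}\nTherefore, the answer is"
    return query_model(stage2)
```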

Models Used

  • ChatGPT
  • GPT-3.5
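
For reference, the sketch below shows how the two models could be queried through the OpenAI API as it existed in early 2023, assuming the legacy pre-v1.0 `openai` Python client; the model identifiers `gpt-3.5-turbo` (ChatGPT) and `text-davinci-003` (GPT-3.5) are assumptions based on what was publicly available at the time, not details taken from the paper.

```python
import openai  # legacy pre-v1.0 client (pip install "openai<1.0")

openai.api_key = "YOUR_API_KEY"


def ask_chatgpt(prompt: str) -> str:
    # ChatGPT is served through the chat-completions endpoint.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # greedy decoding for reproducible evaluation
    )
    return resp["choices"][0]["message"]["content"]


def ask_gpt35(prompt: str) -> str:
    # GPT-3.5 (text-davinci-003) uses the plain completions endpoint.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return resp["choices"][0]["text"]
```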

Datasets

The following datasets were used in this research (an illustrative evaluation loop follows the list):

  • MultiArith
  • GSM8K
  • AddSub
  • AQUA-RAT
  • SingleEq
  • SVAMP
  • CSQA
  • StrategyQA
  • COPA
  • SAMSum
  • CoNLL03
  • SST2
  • BoolQ
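
As an illustration, a hypothetical zero-shot evaluation loop over one of these datasets (GSM8K, loaded here via the Hugging Face `datasets` library, an assumption; the paper does not specify its data-loading code) could look like the following, reusing `zero_shot_cot` from the methods sketch above.

```python
from datasets import load_dataset

# GSM8K's "main" config stores the gold rationale in `answer`,
# with the final numeric answer after a "####" marker.
ds = load_dataset("gsm8k", "main", split="test")

correct = 0
for ex in ds:
    pred = zero_shot_cot(ex["question"])
    gold = ex["answer"].split("####")[-1].strip()
    correct += gold in pred  # crude answer matching, for illustration only
print(f"accuracy: {correct / len(ds):.3f}")
```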

Evaluation Metrics

  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • F1
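
A simplified sketch of how these metrics can be computed (accuracy for the classification and reasoning tasks, ROUGE-N F-measure for summarization); this is an illustrative implementation using whitespace tokenization and no stemming, not the authors' evaluation scripts.

```python
from collections import Counter


def accuracy(preds, golds):
    # Fraction of exact matches between predictions and gold labels.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def rouge_n_f1(pred: str, ref: str, n: int = 1) -> float:
    # ROUGE-N as an F-measure over clipped n-gram overlap.
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    p, r = ngrams(pred), ngrams(ref)
    overlap = sum((p & r).values())  # & keeps the minimum count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-L, which the paper also reports, additionally requires a longest-common-subsequence computation and is omitted here for brevity.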

Results

  • ChatGPT demonstrates strong performance in zero-shot reasoning tasks while underperforming in sequence tagging and summarization.
  • It outperforms GPT-3.5 on natural language inference and dialogue tasks.

Limitations

The authors identified the following limitations:

  • Excludes larger-scale datasets and more task categories due to cost.
  • Offers limited insight into ChatGPT's full capabilities relative to task-specific fine-tuned models.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large Language Models · Zero-shot learning · Chain-of-Thought prompting · GPT-3.5 · ChatGPT · NLP datasets
