
How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection

Biyang Guo (AI Lab, School of Information Management and Engineering, Shanghai University of Finance and Economics), Xin Zhang (Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen)), Ziyuan Wang (AI Lab, School of Information Management and Engineering, Shanghai University of Finance and Economics), Minqi Jiang (AI Lab, School of Information Management and Engineering, Shanghai University of Finance and Economics), Jinran Nie (School of Information Science, Beijing Language and Culture University), Yuxuan Ding (School of Electronic Engineering, Xidian University), Jianwei Yue (School of Computing, Queen's University), Yupeng Wu (Wind Information Co., Ltd.) (2023)

Paper Information
arXiv ID
2301.07597
Venue
arXiv.org
Domain
Natural language processing, Artificial intelligence
SOTA Claim
Yes
Code
https://github.com/Hello-SimpleAI/chatgpt-comparison-detection
Reproducibility
8/10

Abstract

The introduction of ChatGPT (launched by OpenAI in November 2022, https://chat.openai.com/chat) has garnered widespread attention in both academic and industrial communities. ChatGPT is able to respond effectively to a wide range of human questions, providing fluent and comprehensive answers that significantly surpass previous public chatbots in terms of security and usefulness. On one hand, people are curious about how ChatGPT is able to achieve such strength and how far it is from human experts. On the other hand, people are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society, such as fake news, plagiarism, and social security issues. In this work, we collected tens of thousands of comparison responses from both human experts and ChatGPT, with questions covering open-domain, financial, medical, legal, and psychological areas. We call the collected dataset the Human ChatGPT Comparison Corpus (HC3). Based on the HC3 dataset, we study the characteristics of ChatGPT's responses, the differences and gaps from human experts, and future directions for LLMs. We conducted comprehensive human evaluations and linguistic analyses of ChatGPT-generated content compared with that of humans, where many interesting results are revealed. After that, we conducted extensive experiments on how to effectively detect whether a certain text is generated by ChatGPT or humans. We build three different detection systems, explore several key factors that influence their effectiveness, and evaluate them in different scenarios. The dataset, code, and models are all publicly available at https://github.com/Hello-SimpleAI/chatgpt-comparison-detection.

ChatGPT can handle a wide variety of tasks, such as translating natural language to code [5], completing extremely masked text [15], or generating stories given user-defined elements and styles [40], let alone typical NLP tasks like text classification, entity extraction, and translation. Furthermore, the carefully collected human-written demonstrations also make ChatGPT able to admit its mistakes, challenge incorrect premises, and reject inappropriate requests, as claimed by OpenAI. The surprisingly strong capabilities of ChatGPT have raised much interest, as well as concerns. On the one hand, people are curious about how close ChatGPT is to human experts. Different from previous LLMs, which usually fail to respond properly to human queries, InstructGPT [25] and the stronger ChatGPT have improved greatly in their interactions with humans. Therefore, ChatGPT has great potential to become a daily assistant for general or professional consulting purposes [20, 21]. From the linguistic or NLP perspective, we are also interested in where the remaining gaps between ChatGPT and humans lie and what their implicit linguistic differences are [14, 18]. On the other hand, people are worried about the potential risks brought by LLMs like ChatGPT. With the free preview demo of ChatGPT going viral, a large amount of ChatGPT-generated content has crowded into all kinds of UGC (User-Generated Content) platforms, threatening their quality and reliability. For example, Stack Overflow, the famous programming question-answering website, temporarily banned ChatGPT-generated content because it believes "the average rate of getting correct answers from ChatGPT is too low; the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking and looking for correct answers". Many other applications and activities face similar issues, such as online exams [33] and medical analysis [20]. Our empirical evaluation of ChatGPT on legal, medical, and financial questions also reveals that potentially harmful or fake information can be generated.

Considering the opaqueness of ChatGPT and the potential social risks associated with model misuse, we make the following contributions to both academia and society:

1. To facilitate LLM-related research, especially the study of the comparison between humans and LLMs, we collect nearly 40K questions and their corresponding answers from human experts and ChatGPT, covering a wide range of domains (open-domain, computer science, finance, medicine, law, and psychology), named the Human ChatGPT Comparison Corpus (HC3). The HC3 dataset is a valuable resource for analyzing the linguistic and stylistic characteristics of both humans and ChatGPT, which helps to investigate future improvement directions for LLMs.
2. We conduct comprehensive human evaluations as well as linguistic analyses of human- and ChatGPT-generated answers, discovering many interesting patterns exhibited by humans and ChatGPT. These findings can help to distinguish whether certain content is generated by LLMs, and also provide insights about where language models should be heading in the future.
3. Based on the HC3 dataset and the analysis, we develop several ChatGPT detection models targeting different detection scenarios. These detectors show decent performance on our held-out test sets. We also identify several key factors that are essential to the detectors' effectiveness.
4. We open-source all of the collected comparison corpus, evaluations, and detection models, to facilitate future academic research and online platform regulation of AI-generated content.

Each sample in HC3 groups one question with its human and ChatGPT answers in the following JSON format:

    {
      "question": "Q1",
      "human_answers": ["A1", "A2"],
      "chatgpt_answers": ["B1"]
    }

Overall, we collected 24,322 questions, 58,546 human answers, and 26,903 ChatGPT answers for the English version, and 12,853 questions, 22,259 human answers, and 17,522 ChatGPT answers for the Chinese version. The meta-information of each dataset split is illustrated in Table 1.
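
The JSON layout above makes it straightforward to flatten the corpus into labelled (text, label) pairs for analysis or detector training. The following is a minimal sketch, not taken from the paper, that assumes each split is stored as one JSON object per line; the file name hc3_english.jsonl is a placeholder.

    import json

    # Flatten an HC3 split (one {"question", "human_answers", "chatgpt_answers"}
    # object per line) into (answer_text, label) pairs: 0 = human, 1 = ChatGPT.
    def load_hc3(path):
        pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                pairs.extend((ans, 0) for ans in record["human_answers"])
                pairs.extend((ans, 1) for ans in record["chatgpt_answers"])
        return pairs

    pairs = load_hc3("hc3_english.jsonl")  # placeholder path
    print(len(pairs), "labelled answers")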

For the human evaluation, we invited many volunteer testers and conducted extensive evaluations from different aspects. After the human evaluation, we made the collected comparison corpus available to the volunteers and asked them to manually summarize some characteristics; we then combined their feedback with our own observations.

The human evaluation is divided into the Turing Test and the Helpfulness Test. The Turing Test [34] is a test of a machine's ability to exhibit intelligent behavior indistinguishable from that of a human. We invited 17 volunteers, divided into two groups: 8 experts (frequent users of ChatGPT) and 9 amateurs (people who had never heard of ChatGPT). The split matters because people who are familiar with ChatGPT may have memorized some of the patterns it exhibits, which helps them distinguish its answers more easily. We designed four types of evaluations, using different query formats or testing groups:

A. Expert Turing Test, Paired Text (pair-expert). The pair-expert test is conducted in the expert group. Each tester completes a series of tests, each containing one question and a pair of answers (one from a human and one from ChatGPT). The tester must determine which answer was generated by ChatGPT.

B. Expert Turing Test, Single Text (single-expert). The single-expert test is also conducted in the expert group. Each test contains one question and a single answer, randomly drawn from either a human or ChatGPT. The tester must determine whether the answer was generated by ChatGPT.

C. Amateur Turing Test, Single Text (single-amateur). The single-amateur test is conducted in the amateur group and uses the same format as the single-expert test: one question, one answer of unknown origin, and a human-or-ChatGPT judgment.

D. Helpfulness Test (helpfulness). We are also curious about how helpful ChatGPT's answers are compared with humans' answers to the same question. Note that helpfulness is a very subjective metric that can be influenced by many factors, including emotion, tester personality, and personal preference; simply providing more accurate information or a more detailed analysis does not always lead to a more helpful answer. The helpfulness test is conducted in the expert group. Each tester completes a series of tests, each containing one question and a pair of answers (one from a human and one from ChatGPT).
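
For the three Turing-test settings above, the reported outcome is essentially an accuracy per test type and group: the fraction of tests in which the tester correctly identified the ChatGPT answer. A small illustrative sketch follows, using a hypothetical record format that is not from the paper.

    from collections import defaultdict

    # Hypothetical judgment records: (test_type, tester_says_chatgpt, is_chatgpt).
    # For pair tests, "tester_says_chatgpt" refers to the answer the tester picked;
    # for single tests it is a yes/no judgment on the one answer shown.
    records = [
        ("pair-expert", True, True),
        ("single-expert", False, True),
        ("single-amateur", True, False),
    ]

    def accuracy_by_test(records):
        correct, total = defaultdict(int), defaultdict(int)
        for test_type, guess, truth in records:
            total[test_type] += 1
            correct[test_type] += int(guess == truth)
        return {t: correct[t] / total[t] for t in total}

    print(accuracy_by_test(records))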

Summary

This paper investigates the capabilities of ChatGPT in comparison to human experts across various domains such as finance, medicine, and law. The authors created a substantial dataset termed the Human ChatGPT Comparison Corpus (HC3), comprising nearly 40,000 questions and corresponding answers from both ChatGPT and human experts. The research includes detailed human evaluations to analyze the differences in response characteristics between ChatGPT and humans, as well as detection experiments to identify machine-generated text. The results indicate that despite the strong performance of ChatGPT, there are significant gaps in areas like helpfulness, particularly in medical queries. The findings also highlight the potential risks associated with ChatGPT-generated content, such as the propagation of misinformation. The paper underscores the importance of developing robust detection systems for AI-generated content and proposes several models for this purpose, offering valuable resources for future research.

Methods

This paper employs the following methods:

  • Reinforcement Learning from Human Feedback (RLHF)
  • Turing Test
  • Helpfulness Test

Models Used

  • ChatGPT
  • RoBERTa

Datasets

The following datasets were used in this research:

  • Human ChatGPT Comparison Corpus (HC3)
  • ELI5
  • WikiQA
  • Medical Dialog dataset
  • FiQA
  • WebTextQA
  • BaiduBaike
  • NLPCC-DBQA

Evaluation Metrics

  • F1-score
  • Helpfulness
  • Turing Test

Results

  • ChatGPT's responses express less emotion than human responses.
  • In specific domains such as finance and psychology, ChatGPT's answers were found to be more helpful than those from humans.
  • The RoBERTa-based detectors outperformed GLTR-based detectors in most scenarios (see the sketch below).
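
As a rough illustration of the RoBERTa-based approach named in the last bullet, the sketch below runs an off-the-shelf sequence-classification checkpoint over a candidate answer with Hugging Face transformers. The model id Hello-SimpleAI/chatgpt-detector-roberta follows the naming used by the authors' public organization, but treat it as an assumption here; any human-vs-ChatGPT binary classifier fine-tuned on HC3 could be substituted. GLTR-style detectors instead derive per-token rank statistics from a language model such as GPT-2 and feed them to a simple classifier, rather than fine-tuning an encoder end to end.

    from transformers import pipeline

    # Sketch of a RoBERTa-based detector: a binary text classifier that predicts
    # whether an answer was written by a human or generated by ChatGPT.
    # The model id is assumed to be a released HC3 detector checkpoint; swap in
    # any compatible sequence-classification model if it is unavailable.
    detector = pipeline(
        "text-classification",
        model="Hello-SimpleAI/chatgpt-detector-roberta",
    )

    text = "Quantum computers use qubits, which can represent 0 and 1 at the same time."
    print(detector(text))  # e.g. [{'label': ..., 'score': ...}]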

Limitations

The authors identified the following limitations:

  • The dataset lacks diversity and is smaller than desired.
  • ChatGPT answers were generated without special prompts, limiting the analysis scope.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

ChatGPT, human vs. AI response comparison, corpus creation, response evaluation, AI detection

Papers Using Similar Methods

External Resources