
FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min (University of Washington), Kalpesh Krishna (University of Massachusetts Amherst), Xinxi Lyu (University of Washington), Mike Lewis (Meta AI), Wen-tau Yih (Meta AI), Pang Wei Koh (University of Washington), Mohit Iyyer (University of Massachusetts Amherst), Luke Zettlemoyer (University of Washington, Meta AI), Hannaneh Hajishirzi (University of Washington, Allen Institute for AI) (2023)

Paper Information
arXiv ID
2305.14251
Venue
Conference on Empirical Methods in Natural Language Processing
Domain
natural language processing
Code
Available
Reproducibility
7/10

Abstract

Evaluating the factuality of long-form text generated by large language models (LMs) is nontrivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs (InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI) and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
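
As a minimal illustration of the metric described in the abstract, the sketch below computes a FACTSCORE-style precision: each generation is assumed to already be decomposed into atomic facts, each labeled as supported or unsupported by the knowledge source. The data layout and helper names are hypothetical placeholders for illustration, not the released `factscore` package API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Generation:
    # Each atomic fact is paired with a boolean: was it supported
    # by the knowledge source (e.g., Wikipedia)?
    atomic_fact_supported: List[bool]
    abstained: bool = False  # model declined to answer ("I don't know")

def fact_score(generations: List[Generation]) -> Optional[float]:
    """Mean per-generation precision of atomic facts, skipping abstentions."""
    per_gen_scores = []
    for gen in generations:
        if gen.abstained or not gen.atomic_fact_supported:
            continue  # abstentions are excluded from the average
        supported = sum(gen.atomic_fact_supported)
        per_gen_scores.append(supported / len(gen.atomic_fact_supported))
    if not per_gen_scores:
        return None  # model abstained on every prompt
    return sum(per_gen_scores) / len(per_gen_scores)

# Example: one biography with 3 of 4 atomic facts supported -> 0.75
print(fact_score([Generation([True, True, True, False])]))
```

Note that in this sketch abstentions are simply excluded from the average rather than penalized, which is why a responding rate is usually reported alongside the score.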

Summary

This paper presents FACTSCORE, a fine-grained metric for assessing the factual precision of long-form text generated by large language models (LMs). The authors argue that binary factuality judgments are inadequate because generations often mix supported and unsupported pieces of information. FACTSCORE instead measures the percentage of atomic facts in a generation that are supported by a reliable knowledge source, and the paper first establishes these scores through an extensive human evaluation. The study assesses state-of-the-art LMs including InstructGPT, ChatGPT, and PerplexityAI, revealing lower-than-expected FACTSCOREs that degrade with the rarity of the entities being described. An automated estimator of FACTSCORE, built from retrieval and a strong language model, is also introduced and shown to approximate human judgments with less than a 2% error rate. The paper closes by discussing limitations and suggesting extensions of FACTSCORE beyond biographical text.
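
A rough sketch of the automated estimator mentioned above: for each atomic fact, retrieve relevant passages from the knowledge source and ask a strong LM whether those passages support the fact. The `retrieve` and `llm_judge` callables and the prompt wording are assumptions made for illustration, not the exact retriever, model, or prompts used in the paper.

```python
from typing import Callable, List

def estimate_support(
    atomic_facts: List[str],
    topic: str,
    retrieve: Callable[[str, str], List[str]],   # (topic, query) -> passages
    llm_judge: Callable[[str], str],             # prompt -> "True" / "False"
) -> List[bool]:
    """Label each atomic fact as supported/unsupported via retrieval + an LM."""
    labels = []
    for fact in atomic_facts:
        passages = retrieve(topic, fact)  # e.g., top-k Wikipedia passages
        context = "\n\n".join(passages)
        prompt = (
            f"{context}\n\n"
            f"Input: {fact} True or False?\n"
            "Output:"
        )
        answer = llm_judge(prompt)
        labels.append(answer.strip().lower().startswith("true"))
    return labels
```

The resulting boolean labels can be fed directly into a FACTSCORE computation like the sketch shown after the abstract.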

Methods

This paper employs the following methods:

  • FACTSCORE
  • automated estimator

Models Used

  • InstructGPT
  • ChatGPT
  • PerplexityAI
  • GPT-4
  • Vicuna
  • Alpaca

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • FACTSCORE
  • Error Rate

Results

  • ChatGPT achieves a FACTSCORE of 58%
  • InstructGPT achieves a FACTSCORE of 42%
  • PerplexityAI achieves a FACTSCORE of 71%
  • GPT-4 and ChatGPT are more factual than public models; Vicuna and Alpaca are among the best public models

Limitations

The authors identified the following limitations:

  • Evaluation is limited to biographical text; the authors suggest extending FACTSCORE to other domains and knowledge sources as future work

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

factuality, long-form text generation, evaluation metric, FACTSCORE, automated evaluation, human annotation
