Sewon Min (University of Washington), Kalpesh Krishna (University of Massachusetts Amherst), Xinxi Lyu (University of Washington), Mike Lewis (Meta AI), Wen-tau Yih (Meta AI), Pang Wei Koh (University of Washington), Mohit Iyyer (University of Massachusetts Amherst), Luke Zettlemoyer (University of Washington; Meta AI), Hannaneh Hajishirzi (University of Washington; Allen Institute for AI) (2023)
This paper presents FACTSCORE, a fine-grained metric for evaluating the factual precision of long-form text generated by large language models (LMs). The authors argue that binary judgments of factuality are inadequate because a single generation typically mixes supported and unsupported pieces of information. FACTSCORE is the percentage of atomic facts in a generation that are supported by a reliable knowledge source. Using extensive human evaluation, the study assesses state-of-the-art LMs, including InstructGPT, ChatGPT, and PerplexityAI, and finds lower-than-expected FACTSCOREs that vary with the rarity of the entities being described. The paper also introduces an automated estimator of FACTSCORE that approximates human evaluation with under 2% error, and it discusses limitations and avenues for extending FACTSCORE beyond biographical text.
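As summarized above, FACTSCORE reduces to the fraction of atomic facts in a generation that the knowledge source supports. The Python sketch below illustrates that computation under the assumption that the atomic facts have already been extracted; `is_supported` is a hypothetical verification callback, not an interface from the paper.

```python
from typing import Callable, List, Optional


def factscore(atomic_facts: List[str],
              is_supported: Callable[[str], bool]) -> Optional[float]:
    """Fraction of atomic facts judged as supported by the knowledge source.

    `atomic_facts` and `is_supported` are hypothetical stand-ins for the
    paper's fact-decomposition and verification steps.
    """
    if not atomic_facts:
        return None  # nothing to score, e.g. the model abstained
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Toy usage with a trivially simple checker (for illustration only):
facts = ["Marie Curie was born in Warsaw.", "Marie Curie won three Nobel Prizes."]
print(factscore(facts, is_supported=lambda f: "Warsaw" in f))  # 0.5
```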
This paper employs the following methods:
- FACTSCORE: decomposing a generation into atomic facts and scoring the fraction that a reliable knowledge source supports
- Extensive human evaluation of long-form generations from InstructGPT, ChatGPT, and PerplexityAI
- Automated estimation of FACTSCORE that combines retrieval from the knowledge source with an LM-based support judgment (see the sketch after this list)
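The automated estimator listed above pairs retrieval from the knowledge source with an LM that judges whether each atomic fact is supported. The sketch below is a simplified retrieve-then-judge loop in that spirit; `retrieve_passages` and `llm_judges_supported` are hypothetical helpers, not the authors' implementation.

```python
from typing import Callable, List


def estimate_factscore(topic: str,
                       atomic_facts: List[str],
                       retrieve_passages: Callable[[str, str], List[str]],
                       llm_judges_supported: Callable[[List[str], str], bool]) -> float:
    """Retrieve-then-judge estimate of FACTSCORE for a single generation.

    For each atomic fact, fetch passages about `topic` from the knowledge
    source and ask an LM whether those passages support the fact.
    """
    if not atomic_facts:
        raise ValueError("no atomic facts to score")
    supported = 0
    for fact in atomic_facts:
        passages = retrieve_passages(topic, fact)       # e.g. top-k passages about the topic
        if llm_judges_supported(passages, fact):        # e.g. a "True or False?" style prompt
            supported += 1
    return supported / len(atomic_facts)
```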
The following datasets were used in this research:
- Wikipedia, used as the knowledge source against which atomic facts are verified, with model-generated biographies of people entities as the evaluated long-form text
The authors identified the following limitations:
- FACTSCORE is currently evaluated only on biographical text and measures factual precision; the authors suggest extending it to other domains and knowledge sources as future work