Venue: Neural Information Processing Systems
Domain: Natural Language Processing
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures; then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures); and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18]. This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons. First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models.
There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

[Figure: an example few-shot prompt for correcting informal English, given as alternating "Poor English input:" / "Good English output:" pairs, e.g. "Poor English input: I'd be more than happy to work with you in another project." followed by "Good English output: I'd be more than happy to work with you on another project."]
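In the few-shot setting described above, "learning" happens entirely through the prompt: a natural-language task description and K demonstrations are placed in the model's context window, and the weights are never updated. The sketch below assembles such a prompt using the grammar-correction pairs from the figure; the final query sentence and the `complete` call are hypothetical placeholders for whatever text-completion interface is used, and the last unanswered line is what the model is expected to fill in.

```python
# Sketch: building a few-shot prompt in the style of the figure above.
# K = 0, 1, or >1 demonstrations correspond to the zero-, one-, and few-shot
# settings; in every case the model's weights stay frozen.

def build_prompt(instruction: str, demos: list[tuple[str, str]], query: str) -> str:
    lines = [instruction]
    for bad, good in demos:
        lines.append(f"Poor English input: {bad}")
        lines.append(f"Good English output: {good}")
    lines.append(f"Poor English input: {query}")
    lines.append("Good English output:")          # left for the model to complete
    return "\n".join(lines)

demos = [
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for choosing me as your designer. I appreciate it."),
    ("I'd be more than happy to work with you in another project.",
     "I'd be more than happy to work with you on another project."),
]

prompt = build_prompt("Correct the English in each sentence.",
                      demos,
                      "Please send me the files untill friday.")  # illustrative query
# corrected = complete(prompt)   # `complete` is a hypothetical completion interface
print(prompt)
```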
This paper introduces GPT-3, a 175-billion-parameter language model that shows significant improvements on natural language processing tasks via few-shot learning, in which the model performs tasks from only a handful of examples and without fine-tuning. The authors find that scaling up the model improves performance across multiple benchmarks, sometimes approaching or exceeding that of state-of-the-art fine-tuned systems. The paper evaluates GPT-3 across many tasks, including translation, question answering, arithmetic, and on-the-fly reasoning, and highlights its ability to generate coherent news articles that human evaluators have difficulty distinguishing from human-written text. The study also addresses limitations concerning data contamination, bias, and potential societal impacts, and advocates for future research to mitigate these issues and improve model efficiency.
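The claim that performance improves smoothly with scale is often summarized as a power-law relationship between model size and loss or error. The sketch below fits such a trend; the data points are illustrative placeholders rather than numbers from the paper, and the functional form is only an assumption for demonstration.

```python
# Sketch: fitting an assumed power-law trend, error ≈ a * N^(-b),
# to (parameter count, error rate) pairs. The values below are illustrative
# placeholders, NOT results reported in the GPT-3 paper.
import numpy as np

params = np.array([1.3e8, 1.3e9, 1.3e10, 1.75e11])   # model sizes (placeholder)
error  = np.array([0.62, 0.48, 0.37, 0.29])          # few-shot error rate (placeholder)

# A power law is linear in log-log space: log(error) = log(a) - b * log(N).
slope, intercept = np.polyfit(np.log(params), np.log(error), deg=1)
a, b = np.exp(intercept), -slope

print(f"fitted trend: error ≈ {a:.3f} * N^(-{b:.3f})")
# Extrapolation (valid only under the power-law assumption) to a larger model:
print(f"predicted error at N=1e12: {a * (1e12) ** (-b):.3f}")
```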
This paper employs the following methods:
- GPT-3
- Few-shot Learning
- In-context Learning
- Fine-tuning (the prior-work paradigm GPT-3 is compared against; contrasted with in-context learning in the sketch after this list)
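The essential difference between the last two methods is whether the model's weights change. The sketch below contrasts them with a toy stand-in model (`TinyLM` is not GPT-3, and torch is used only to show the gradient step); the tensors are placeholder data.

```python
# Sketch: fine-tuning updates weights on task examples; in-context learning
# keeps weights frozen and supplies the examples inside the prompt instead.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A toy next-token predictor standing in for a large language model."""
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_ids))      # next-token logits

model = TinyLM()
task_x = torch.randint(0, 100, (4, 8))               # placeholder task tokens
task_y = torch.randint(0, 100, (4, 8))

# --- Fine-tuning: gradient updates on task-specific examples ---------------
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = F.cross_entropy(model(task_x).flatten(0, 1), task_y.flatten())
loss.backward()
opt.step()                                            # the weights change

# --- In-context learning: weights frozen, examples become the prompt -------
with torch.no_grad():                                 # no gradient updates at all
    prompt = torch.cat([task_x.flatten(), task_y.flatten()])[:16].unsqueeze(0)
    logits = model(prompt)                            # prediction conditioned on prompt
    print(logits.shape)
```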
The following datasets were used in this research (the first five form the pre-training corpus mixture; a sampling sketch follows the list):
- Common Crawl
- WebText
- Books1
- Books2
- Wikipedia
- LAMBADA
- HellaSwag
- TriviaQA
- Natural Questions (NQ)
- WebQuestions
- CoQA
- SQuAD 2.0
- ARC
- Winograd
- Winogrande
- PIQA
- DROP
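The pre-training corpora above (Common Crawl, WebText, Books1, Books2, Wikipedia) are not sampled in proportion to their raw size; higher-quality sources are sampled more often. The sketch below shows this kind of weighted mixture sampling; the weights are illustrative placeholders, not the exact values used for GPT-3.

```python
# Sketch: sampling pre-training documents from a weighted corpus mixture.
# The weights below are illustrative placeholders, not the exact GPT-3 mixture.
import random

MIXTURE = {
    "common_crawl": 0.60,
    "webtext":      0.22,
    "books1":       0.08,
    "books2":       0.07,
    "wikipedia":    0.03,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick a corpus with probability proportional to its mixture weight."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_corpus(rng) for _ in range(10_000)]
for name in MIXTURE:
    print(f"{name:13s} {draws.count(name) / len(draws):.3f}")
```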
The paper reports the following results:
- Achieves strong performance on numerous NLP datasets
- Can generate human-like text that evaluators find difficult to distinguish from actual human writing (see the evaluation sketch after this list)
- Displays notable capabilities in tasks requiring on-the-fly reasoning
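One way to quantify the second result is to measure how accurately human raters can tell model-written articles from human-written ones; accuracy near the 50% chance level means the texts are hard to distinguish. The sketch below computes rater accuracy with a simple normal-approximation confidence interval; the labels and guesses are illustrative placeholders, not data from the paper's human study.

```python
# Sketch: scoring human raters on "machine vs. human" article judgments.
# Accuracy close to 0.5 (chance) means generated articles are hard to spot.
# The example labels/guesses are placeholders, not the paper's actual study data.
import math

def detection_accuracy(true_is_model: list[bool], guessed_model: list[bool]) -> tuple[float, float]:
    """Return (accuracy, half-width of a 95% normal-approximation CI)."""
    n = len(true_is_model)
    correct = sum(t == g for t, g in zip(true_is_model, guessed_model))
    acc = correct / n
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return acc, half_width

# Placeholder judgments for 8 articles (True = article was model-generated).
truth   = [True, False, True, True, False, False, True, False]
guesses = [True, True,  False, True, False, True,  False, False]

acc, ci = detection_accuracy(truth, guesses)
print(f"rater accuracy = {acc:.2f} ± {ci:.2f} (chance = 0.50)")
```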
The authors identified the following limitations:
- Struggles with certain natural language inference tasks
- Vulnerable to biases present in training data
- Faces data-contamination concerns, since portions of some benchmark test sets overlap with the web-scale training corpus (an overlap check is sketched after this list)
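Contamination of this kind is typically investigated by checking for n-gram overlap between benchmark test examples and the training corpus, then re-evaluating on the "clean" subset. The sketch below shows such an overlap check; the choice of n and the whitespace tokenization are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: flagging benchmark examples whose n-grams also occur in the training data.
# n = 8 and whitespace tokenization are illustrative choices, not the paper's exact setup.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_examples: list[str], training_docs: list[str], n: int = 8) -> list[bool]:
    """Mark each test example that shares at least one n-gram with any training document."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [bool(ngrams(ex, n) & train_ngrams) for ex in test_examples]

# Toy usage: the second "test" example is copied verbatim from the training document.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
tests = ["a completely unrelated question about arithmetic and word problems here",
         "quick brown fox jumps over the lazy dog near the river bank"]
print(flag_contaminated(tests, train))   # expected: [False, True]
```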
- Number of GPUs: None specified
- GPU Type: None specified
Keywords:
- language models
- few-shot, one-shot, and zero-shot learning
- GPT-3
- scaling laws
- natural language understanding