Venue: Neural Information Processing Systems
Domain: Natural Language Processing
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures; then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures); and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18]. This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons. First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models.
There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

[Figure: an example few-shot prompt for correcting informal English, given as alternating "Poor English input:" / "Good English output:" pairs, e.g. "Poor English input: I'd be more than happy to work with you in another project." followed by "Good English output: I'd be more than happy to work with you on another project."]
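In the few-shot setting described above, "learning" happens entirely through the prompt: a natural-language task description and K demonstrations are placed in the model's context window, and the weights are never updated. The sketch below assembles such a prompt using the grammar-correction pairs from the figure; the final query sentence and the `complete` call are hypothetical placeholders for whatever text-completion interface is used, and the last unanswered line is what the model is expected to fill in.

```python
# Sketch: building a few-shot prompt in the style of the figure above.
# K = 0, 1, or >1 demonstrations correspond to the zero-, one-, and few-shot
# settings; in every case the model's weights stay frozen.

def build_prompt(instruction: str, demos: list[tuple[str, str]], query: str) -> str:
    lines = [instruction]
    for bad, good in demos:
        lines.append(f"Poor English input: {bad}")
        lines.append(f"Good English output: {good}")
    lines.append(f"Poor English input: {query}")
    lines.append("Good English output:")          # left for the model to complete
    return "\n".join(lines)

demos = [
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for choosing me as your designer. I appreciate it."),
    ("I'd be more than happy to work with you in another project.",
     "I'd be more than happy to work with you on another project."),
]

prompt = build_prompt("Correct the English in each sentence.",
                      demos,
                      "Please send me the files untill friday.")  # illustrative query
# corrected = complete(prompt)   # `complete` is a hypothetical completion interface
print(prompt)
```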
This paper introduces GPT-3, a 175-billion-parameter language model that shows significant improvements on natural language processing tasks via few-shot learning, in which the model performs tasks from only a handful of examples and without fine-tuning. The authors find that scaling up the model improves performance across multiple benchmarks, sometimes approaching or exceeding that of state-of-the-art fine-tuned systems. The paper evaluates GPT-3 across many tasks, including translation, question answering, arithmetic, and on-the-fly reasoning, and highlights its ability to generate coherent news articles that human evaluators have difficulty distinguishing from human-written text. The study also addresses limitations concerning data contamination, bias, and potential societal impacts, and advocates for future research to mitigate these issues and improve model efficiency.
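The claim that performance improves smoothly with scale is often summarized as a power-law relationship between model size and loss or error. The sketch below fits such a trend; the data points are illustrative placeholders rather than numbers from the paper, and the functional form is only an assumption for demonstration.

```python
# Sketch: fitting an assumed power-law trend, error ≈ a * N^(-b),
# to (parameter count, error rate) pairs. The values below are illustrative
# placeholders, NOT results reported in the GPT-3 paper.
import numpy as np

params = np.array([1.3e8, 1.3e9, 1.3e10, 1.75e11])   # model sizes (placeholder)
error  = np.array([0.62, 0.48, 0.37, 0.29])          # few-shot error rate (placeholder)

# A power law is linear in log-log space: log(error) = log(a) - b * log(N).
slope, intercept = np.polyfit(np.log(params), np.log(error), deg=1)
a, b = np.exp(intercept), -slope

print(f"fitted trend: error ≈ {a:.3f} * N^(-{b:.3f})")
# Extrapolation (valid only under the power-law assumption) to a larger model:
print(f"predicted error at N=1e12: {a * (1e12) ** (-b):.3f}")
```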
This paper employs the following methods:
- GPT-3
- Few-shot Learning
- In-context Learning
- Fine-tuning (the prior-work paradigm GPT-3 is compared against; contrasted with in-context learning in the sketch after this list)
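The essential difference between the last two methods is whether the model's weights change. The sketch below contrasts them with a toy stand-in model (`TinyLM` is not GPT-3, and torch is used only to show the gradient step); the tensors are placeholder data.

```python
# Sketch: fine-tuning updates weights on task examples; in-context learning
# keeps weights frozen and supplies the examples inside the prompt instead.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A toy next-token predictor standing in for a large language model."""
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_ids))      # next-token logits

model = TinyLM()
task_x = torch.randint(0, 100, (4, 8))               # placeholder task tokens
task_y = torch.randint(0, 100, (4, 8))

# --- Fine-tuning: gradient updates on task-specific examples ---------------
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = F.cross_entropy(model(task_x).flatten(0, 1), task_y.flatten())
loss.backward()
opt.step()                                            # the weights change

# --- In-context learning: weights frozen, examples become the prompt -------
with torch.no_grad():                                 # no gradient updates at all
    prompt = torch.cat([task_x.flatten(), task_y.flatten()])[:16].unsqueeze(0)
    logits = model(prompt)                            # prediction conditioned on prompt
    print(logits.shape)
```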
The following datasets were used in this research (the first five form the pre-training corpus mixture; a sampling sketch follows the list):
- Common Crawl
- WebText
- Books1
- Books2
- Wikipedia
- LAMBADA
- HellaSwag
- TriviaQA
- Natural Questions (NQ)
- WebQuestions
- CoQA
- SQuAD 2.0
- ARC
- Winograd
- Winogrande
- PIQA
- DROP
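The pre-training corpora above (Common Crawl, WebText, Books1, Books2, Wikipedia) are not sampled in proportion to their raw size; higher-quality sources are sampled more often. The sketch below shows this kind of weighted mixture sampling; the weights are illustrative placeholders, not the exact values used for GPT-3.

```python
# Sketch: sampling pre-training documents from a weighted corpus mixture.
# The weights below are illustrative placeholders, not the exact GPT-3 mixture.
import random

MIXTURE = {
    "common_crawl": 0.60,
    "webtext":      0.22,
    "books1":       0.08,
    "books2":       0.07,
    "wikipedia":    0.03,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick a corpus with probability proportional to its mixture weight."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_corpus(rng) for _ in range(10_000)]
for name in MIXTURE:
    print(f"{name:13s} {draws.count(name) / len(draws):.3f}")
```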
The paper reports the following results:
- Achieves strong performance on numerous NLP datasets
- Can generate human-like text that evaluators find difficult to distinguish from actual human writing (see the evaluation sketch after this list)
- Displays notable capabilities in tasks requiring on-the-fly reasoning
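One way to quantify the second result is to measure how accurately human raters can tell model-written articles from human-written ones; accuracy near the 50% chance level means the texts are hard to distinguish. The sketch below computes rater accuracy with a simple normal-approximation confidence interval; the labels and guesses are illustrative placeholders, not data from the paper's human study.

```python
# Sketch: scoring human raters on "machine vs. human" article judgments.
# Accuracy close to 0.5 (chance) means generated articles are hard to spot.
# The example labels/guesses are placeholders, not the paper's actual study data.
import math

def detection_accuracy(true_is_model: list[bool], guessed_model: list[bool]) -> tuple[float, float]:
    """Return (accuracy, half-width of a 95% normal-approximation CI)."""
    n = len(true_is_model)
    correct = sum(t == g for t, g in zip(true_is_model, guessed_model))
    acc = correct / n
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return acc, half_width

# Placeholder judgments for 8 articles (True = article was model-generated).
truth   = [True, False, True, True, False, False, True, False]
guesses = [True, True,  False, True, False, True,  False, False]

acc, ci = detection_accuracy(truth, guesses)
print(f"rater accuracy = {acc:.2f} ± {ci:.2f} (chance = 0.50)")
```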
The authors identified the following limitations:
- Struggles with certain natural language inference tasks
- Vulnerable to biases present in training data
- Faces data-contamination concerns, since portions of some benchmark test sets overlap with the web-scale training corpus (an overlap check is sketched after this list)
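Contamination of this kind is typically investigated by checking for n-gram overlap between benchmark test examples and the training corpus, then re-evaluating on the "clean" subset. The sketch below shows such an overlap check; the choice of n and the whitespace tokenization are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: flagging benchmark examples whose n-grams also occur in the training data.
# n = 8 and whitespace tokenization are illustrative choices, not the paper's exact setup.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_examples: list[str], training_docs: list[str], n: int = 8) -> list[bool]:
    """Mark each test example that shares at least one n-gram with any training document."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [bool(ngrams(ex, n) & train_ngrams) for ex in test_examples]

# Toy usage: the second "test" example is copied verbatim from the training document.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
tests = ["a completely unrelated question about arithmetic and word problems here",
         "quick brown fox jumps over the lazy dog near the river bank"]
print(flag_contaminated(tests, train))   # expected: [False, True]
```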
- Number of GPUs: None specified
- GPU Type: None specified
Keywords:
- language models
- few-shot, one-shot, and zero-shot learning
- GPT-3
- scaling laws
- natural language understanding