Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu (2019)

Paper Information

  • arXiv ID: 1910.10683
  • Venue: Journal of Machine Learning Research
  • Domain: artificial intelligence, machine learning, NLP
  • SOTA Claim: Yes
  • Reproducibility: 7/10

Abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Summary

This paper presents a comprehensive examination of transfer learning in natural language processing (NLP) using a unified text-to-text framework. It introduces the Text-to-Text Transfer Transformer (T5), which casts every NLP task as a text-to-text problem, so that the same model architecture, training procedure, and decoding process can be used for tasks as varied as translation, summarization, and question answering. The paper describes extensive experiments that explore the impact of different pre-training objectives, architectures, model sizes, and datasets on performance across multiple NLP benchmarks. The authors also introduce the Colossal Clean Crawled Corpus (C4), a large cleaned web-crawl dataset, for unsupervised pre-training. Their findings indicate that scaling up model size and training data significantly improves performance, leading to state-of-the-art results on several benchmarks. Finally, the authors emphasize the importance of transferring knowledge from large pre-trained models to downstream tasks and release their pre-trained models, code, and dataset to support future research.
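
To make the unified text-to-text interface concrete, the sketch below runs a few task-prefixed inputs through a released checkpoint. This is a minimal illustration assuming the Hugging Face transformers port of T5 and the public t5-small checkpoint (neither is mentioned on this page; the original release is built on TensorFlow and Mesh TensorFlow). The task prefixes follow the conventions described in the paper.

```python
# Minimal sketch of the text-to-text interface: every task is expressed as
# "task prefix + input text" -> "output text". Assumes the Hugging Face
# `transformers` port of T5 and the public "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    "translate English to German: That is good.",               # WMT-style translation
    "summarize: state authorities dispatched emergency crews "
    "tuesday to survey damage after an onslaught of severe weather.",  # CNN/DM-style
    "cola sentence: The course is jumping well.",                # GLUE (CoLA) classification
]

for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**batch, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because classification labels are also emitted as literal text strings, no task-specific output layer or classification head is needed.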

Methods

This paper employs the following methods:

  • Text-to-Text Transfer Transformer (T5): all tasks are cast to a text-to-text format, and an encoder-decoder Transformer is pre-trained on C4 with a span-corruption (denoising) objective before being fine-tuned on each downstream task (sketched below)
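
A minimal sketch of the span-corruption objective mentioned above. It is an illustration only, operating on word strings for readability; the paper's implementation works on SentencePiece token IDs in Mesh TensorFlow, and the `span_corrupt` helper, its defaults, and the span-sampling scheme here are simplifications.

```python
# Illustrative sketch of the span-corruption (denoising) objective: random
# contiguous spans of the input are replaced by sentinel tokens, and the
# target reconstructs the dropped spans in order, each preceded by its
# sentinel. Not the authors' implementation.
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Return (corrupted_input, target) as whitespace-joined strings."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        span_len = max(1, round(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(len(tokens))
        masked.update(range(start, min(start + span_len, len(tokens))))

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            inputs.append(f"<extra_id_{sentinel}>")   # one sentinel per dropped span
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")          # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

print(span_corrupt("Thank you for inviting me to your party last week .".split()))
```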

Models Used

  • T5-11B
  • T5-3B

Datasets

The following datasets were used in this research:

  • Colossal Clean Crawled Corpus (C4)
  • SQuAD (cast into the text-to-text format as sketched after this list)
  • GLUE
  • SuperGLUE
  • CNN/Daily Mail
  • WMT
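
As an example of how these datasets are fed to the model, the hedged sketch below casts a SQuAD-style record into the text-to-text format. The "question: ... context: ..." prefix form follows the paper's task preprocessing; the record layout mirrors the public SQuAD schema, and the example record itself is invented for illustration.

```python
# Hedged sketch: casting a SQuAD-style record into the text-to-text format.
# The record below is invented purely for illustration.
def squad_to_text_to_text(example):
    source = f"question: {example['question']} context: {example['context']}"
    # SQuAD stores a list of acceptable answer strings; train against the first.
    target = example["answers"]["text"][0]
    return {"inputs": source, "targets": target}

record = {
    "question": "What corpus does the paper introduce for pre-training?",
    "context": "The paper introduces the Colossal Clean Crawled Corpus (C4), "
               "a cleaned subset of Common Crawl used for unsupervised pre-training.",
    "answers": {"text": ["the Colossal Clean Crawled Corpus (C4)"]},
}
print(squad_to_text_to_text(record))
```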

Evaluation Metrics

  • GLUE
  • Exact Match (SQuAD; see the sketch after this list)
  • F1 (SQuAD; see the sketch after this list)
  • BLEU (WMT)
  • ROUGE (CNN/Daily Mail)
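
A hedged sketch of the SQuAD-style Exact Match and F1 metrics listed above, following the standard SQuAD evaluation recipe (lowercase, strip punctuation and articles, then compare). This is an illustrative reimplementation, not the paper's evaluation code.

```python
# Standard SQuAD-style answer normalization, Exact Match, and token-level F1.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop English articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Colossal Clean Crawled Corpus", "Colossal Clean Crawled Corpus"))  # 1.0
print(f1("a cleaned subset of Common Crawl", "cleaned version of Common Crawl"))          # ~0.8
```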

Results

  • Achieved state-of-the-art performance on 18 of 24 tasks
  • Achieved an average GLUE score of 90.3
  • Improved the SQuAD exact-match score beyond previously published results
  • Exceeded human performance on certain reading comprehension tasks in SuperGLUE

Limitations

The authors identified the following limitations:

  • High computational resources required for model training and inference
  • Dependence on large-scale clean datasets for effective training

Technical Requirements

  • Number of GPUs: None specified (the paper reports training on Cloud TPU Pods rather than GPUs)
  • GPU Type: None specified

Keywords

transfer learning, text-to-text transfer transformer, T5, unsupervised learning, pre-training, fine-tuning, scale, benchmark

Papers Using Similar Methods

External Resources