
RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu*, Myle Ott*, Naman Goyal*, Jingfei Du*, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov (2019). *Equal contribution. Affiliations: Facebook AI; Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA.

Paper Information
arXiv ID: 1907.11692
Venue: arXiv.org
Domain: Natural language processing
SOTA Claim: Yes
Code: Released by the authors
Reproducibility: 8/10

Abstract

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Summary

This paper presents RoBERTa, an improved recipe for BERT pretraining that emphasizes the importance of training duration, data size, and hyperparameter tuning. The authors perform a replication study of BERT, find that it was significantly undertrained, and propose modifications that lead to superior performance on a range of benchmarks. The key changes are training the model longer with larger batches over more data (over 160GB of uncompressed text, including the newly collected CC-NEWS corpus), removing the next sentence prediction (NSP) objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data. The resulting model, RoBERTa, achieves state-of-the-art results on GLUE, RACE, and SQuAD, showcasing the effectiveness of these changes. The authors provide an in-depth evaluation of the effect of each change and release their models and code for further research.
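One of the simplest of these changes to illustrate is dynamic masking. The sketch below is a minimal, hypothetical implementation (not the authors' released code) of BERT-style masking applied on the fly each time a sequence is sampled, so every epoch sees a different mask pattern; the 80/10/10 replacement split follows the original BERT recipe, and all names are illustrative.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask symbol; BERT uses "[MASK]"

def dynamic_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking to a token sequence on the fly.

    Because this runs each time a sequence is fed to the model, every
    epoch sees a different mask pattern (dynamic masking), instead of
    the single pattern fixed during preprocessing (static masking).
    """
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # target: the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_TOKEN         # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: two passes over the same sentence yield different mask patterns.
sentence = "roberta is a robustly optimized bert variant".split()
print(dynamic_mask(sentence, vocab=sentence))
print(dynamic_mask(sentence, vocab=sentence))
```

In the paper's ablation, dynamic masking performs comparably to, or slightly better than, the static masking used in the original BERT preprocessing, while avoiding the need to duplicate the training data with several fixed mask patterns.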

Methods

This paper employs the following methods:

  • Transformer
  • Dynamic Masking
  • Masked Language Modeling
  • Next Sentence Prediction (NSP) — evaluated in ablations and removed from the final RoBERTa recipe (see the input-packing sketch after this list)
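
Since RoBERTa drops the NSP objective, its inputs are built by packing contiguous full sentences until the 512-token budget is reached (the paper's FULL-SENTENCES and DOC-SENTENCES formats). The function below is an illustrative sketch of that packing under simplified assumptions; the released code additionally inserts a separator token when an input crosses a document boundary, which is omitted here.

```python
def pack_full_sentences(tokenized_sentences, max_len=512):
    """Greedily pack tokenized sentences into training inputs of at most
    max_len tokens, with no NSP label. Inputs may span sentence (and, in
    the paper, document) boundaries, mirroring the FULL-SENTENCES format."""
    inputs, current = [], []
    for sent in tokenized_sentences:          # each sent is a list of token ids
        if current and len(current) + len(sent) > max_len:
            inputs.append(current)            # flush the finished input
            current = []
        current.extend(sent[:max_len])        # guard against overlong sentences
    if current:
        inputs.append(current)
    return inputs

# Example with toy "token id" sentences of varying lengths.
sentences = [list(range(n)) for n in (200, 180, 300, 50)]
print([len(x) for x in pack_full_sentences(sentences)])  # [380, 350]
```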

Models Used

  • BERT
  • RoBERTa
  • XLNet
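
For experimentation, pretrained checkpoints of the models listed above are commonly loaded through the Hugging Face `transformers` library. Note that this is a downstream re-implementation rather than the fairseq code released with the paper; the model identifiers below are the standard hub names.

```python
from transformers import AutoModel, AutoTokenizer

# Standard Hugging Face hub identifiers for the pretrained checkpoints.
for name in ("roberta-base", "bert-base-uncased", "xlnet-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    batch = tokenizer("RoBERTa is a robustly optimized BERT.", return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    print(name, hidden.shape)  # (batch_size, sequence_length, hidden_size)
```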

Datasets

The following datasets were used in this research:

  • CC-NEWS
  • GLUE
  • RACE
  • SQuAD
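
Of these, GLUE, RACE, and SQuAD are public benchmarks, while CC-NEWS was collected by the authors from CommonCrawl news data. The snippet below shows one possible way to fetch the public benchmarks, assuming the Hugging Face `datasets` library and its standard dataset identifiers (these are not artifacts from the paper).

```python
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")   # one GLUE task; the others follow the same pattern
squad = load_dataset("squad_v2")      # SQuAD 2.0 question answering
race = load_dataset("race", "high")   # RACE, high-school subset

print({split: len(data) for split, data in mnli.items()})  # split sizes
```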

Evaluation Metrics

  • GLUE score (benchmark average over per-task metrics)
  • SQuAD F1 and exact match (EM)
  • Accuracy (RACE and most individual GLUE tasks)
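
To make the SQuAD metric concrete, the function below is a simplified version of the token-overlap F1 used for extractive question answering; standard evaluation scripts additionally strip articles and punctuation and handle multiple gold answers, which is omitted here for brevity.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Simplified SQuAD-style token-overlap F1 between two answer strings."""
    pred = prediction.lower().split()
    gold = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the robustly optimized model", "a robustly optimized model"))  # 0.75
```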

Results

  • State-of-the-art results on GLUE, RACE, and SQuAD
  • Achieves a score of 88.5 on the GLUE leaderboard
  • Matches or exceeds performance of all post-BERT models
  • Achieves new state-of-the-art results on 4 of the 9 GLUE tasks (MNLI, QNLI, RTE, and STS-B)

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • GPUs per node: 8 × 32GB Nvidia V100 (DGX-1 machines)
  • Full-scale pretraining used up to 1024 V100 GPUs for approximately one day

Keywords

RoBERTa, BERT, transformer, pretraining, language model
