
RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu*, Myle Ott*, Naman Goyal*, Jingfei Du*, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov (2019). *Equal contribution. Affiliations: Facebook AI; Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA.

Paper Information
arXiv ID: 1907.11692
Venue: arXiv.org
Domain: Natural language processing
SOTA Claim: Yes
Code: Released by the authors
Reproducibility: 8/10

Abstract

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Summary

This paper presents RoBERTa, an improved recipe for BERT pretraining that emphasizes the importance of training duration, data size, and hyperparameter tuning. The authors perform a replication study of BERT, find that it was significantly undertrained, and propose modifications that lead to superior performance on a range of benchmarks. The key changes are training the model longer with larger batches over more data (over 160GB of uncompressed text, including the newly collected CC-NEWS corpus), removing the next sentence prediction (NSP) objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data. The resulting model, RoBERTa, achieves state-of-the-art results on GLUE, RACE, and SQuAD, showcasing the effectiveness of these changes. The authors provide an in-depth evaluation of the effect of each change and release their models and code for further research.
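One of the simplest of these changes to illustrate is dynamic masking. The sketch below is a minimal, hypothetical implementation (not the authors' released code) of BERT-style masking applied on the fly each time a sequence is sampled, so every epoch sees a different mask pattern; the 80/10/10 replacement split follows the original BERT recipe, and all names are illustrative.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask symbol; BERT uses "[MASK]"

def dynamic_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking to a token sequence on the fly.

    Because this runs each time a sequence is fed to the model, every
    epoch sees a different mask pattern (dynamic masking), instead of
    the single pattern fixed during preprocessing (static masking).
    """
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # target: the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_TOKEN         # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: two passes over the same sentence yield different mask patterns.
sentence = "roberta is a robustly optimized bert variant".split()
print(dynamic_mask(sentence, vocab=sentence))
print(dynamic_mask(sentence, vocab=sentence))
```

In the paper's ablation, dynamic masking performs comparably to, or slightly better than, the static masking used in the original BERT preprocessing, while avoiding the need to duplicate the training data with several fixed mask patterns.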

Methods

This paper employs the following methods:

  • Transformer
  • Dynamic Masking
  • Masked Language Modeling
  • Next Sentence Prediction (NSP) — evaluated in ablations and removed from the final RoBERTa recipe (see the input-packing sketch after this list)
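
Since RoBERTa drops the NSP objective, its inputs are built by packing contiguous full sentences until the 512-token budget is reached (the paper's FULL-SENTENCES and DOC-SENTENCES formats). The function below is an illustrative sketch of that packing under simplified assumptions; the released code additionally inserts a separator token when an input crosses a document boundary, which is omitted here.

```python
def pack_full_sentences(tokenized_sentences, max_len=512):
    """Greedily pack tokenized sentences into training inputs of at most
    max_len tokens, with no NSP label. Inputs may span sentence (and, in
    the paper, document) boundaries, mirroring the FULL-SENTENCES format."""
    inputs, current = [], []
    for sent in tokenized_sentences:          # each sent is a list of token ids
        if current and len(current) + len(sent) > max_len:
            inputs.append(current)            # flush the finished input
            current = []
        current.extend(sent[:max_len])        # guard against overlong sentences
    if current:
        inputs.append(current)
    return inputs

# Example with toy "token id" sentences of varying lengths.
sentences = [list(range(n)) for n in (200, 180, 300, 50)]
print([len(x) for x in pack_full_sentences(sentences)])  # [380, 350]
```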

Models Used

  • BERT
  • RoBERTa
  • XLNet
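
For experimentation, pretrained checkpoints of the models listed above are commonly loaded through the Hugging Face `transformers` library. Note that this is a downstream re-implementation rather than the fairseq code released with the paper; the model identifiers below are the standard hub names.

```python
from transformers import AutoModel, AutoTokenizer

# Standard Hugging Face hub identifiers for the pretrained checkpoints.
for name in ("roberta-base", "bert-base-uncased", "xlnet-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    batch = tokenizer("RoBERTa is a robustly optimized BERT.", return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    print(name, hidden.shape)  # (batch_size, sequence_length, hidden_size)
```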

Datasets

The following datasets were used in this research:

  • CC-NEWS
  • GLUE
  • RACE
  • SQuAD
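
Of these, GLUE, RACE, and SQuAD are public benchmarks, while CC-NEWS was collected by the authors from CommonCrawl news data. The snippet below shows one possible way to fetch the public benchmarks, assuming the Hugging Face `datasets` library and its standard dataset identifiers (these are not artifacts from the paper).

```python
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")   # one GLUE task; the others follow the same pattern
squad = load_dataset("squad_v2")      # SQuAD 2.0 question answering
race = load_dataset("race", "high")   # RACE, high-school subset

print({split: len(data) for split, data in mnli.items()})  # split sizes
```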

Evaluation Metrics

  • GLUE score (benchmark average over per-task metrics)
  • SQuAD F1 and exact match (EM)
  • Accuracy (RACE and most individual GLUE tasks)
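
To make the SQuAD metric concrete, the function below is a simplified version of the token-overlap F1 used for extractive question answering; standard evaluation scripts additionally strip articles and punctuation and handle multiple gold answers, which is omitted here for brevity.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Simplified SQuAD-style token-overlap F1 between two answer strings."""
    pred = prediction.lower().split()
    gold = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the robustly optimized model", "a robustly optimized model"))  # 0.75
```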

Results

  • State-of-the-art results on GLUE, RACE, and SQuAD
  • Achieves a score of 88.5 on the GLUE leaderboard
  • Matches or exceeds performance of all post-BERT models
  • Achieves new state-of-the-art results on 4 of the 9 GLUE tasks (MNLI, QNLI, RTE, and STS-B)

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • GPUs per node: 8 × 32GB Nvidia V100 (DGX-1 machines)
  • Full-scale pretraining used up to 1024 V100 GPUs for approximately one day

Keywords

RoBERTa, BERT, transformer, pretraining, language model
