BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language), 2019

Paper Information

  • arXiv ID: 1810.04805
  • Venue: North American Chapter of the Association for Computational Linguistics (NAACL)
  • Domain: Natural language processing
  • SOTA Claim: Yes
  • Code: not listed
  • Reproducibility: 8/10

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Summary

This paper presents BERT (Bidirectional Encoder Representations from Transformers), a language representation model that pre-trains deep bidirectional representations from unlabeled text. Unlike previous left-to-right models, BERT is pre-trained with a masked language model objective, which lets every layer condition on both left and right context. After pre-training, the model is fine-tuned with a single additional output layer and achieves state-of-the-art results on eleven NLP tasks, including question answering and natural language inference, with significant gains on benchmarks such as GLUE and SQuAD. The architecture is a multi-layer bidirectional Transformer encoder trained with two pre-training tasks: Masked LM (MLM) and Next Sentence Prediction (NSP), the latter teaching the model relationships between sentence pairs. Results show substantial improvements over left-to-right baselines, highlighting the effectiveness of deep bidirectional representations.
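
As an illustration of the fine-tuning setup described above (a single task-specific output layer on top of the pre-trained encoder), below is a minimal sketch using the Hugging Face transformers library with PyTorch. The paper itself released TensorFlow code; the checkpoint name, example sentence pair, and label count here are illustrative choices rather than details taken from the paper.

```python
# Minimal sketch: fine-tune BERT for sentence-pair classification by adding
# a single output layer on top of the pre-trained encoder (hedged example,
# not the authors' original TensorFlow implementation).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # e.g. MNLI: entailment / neutral / contradiction
)

# Sentence pairs are packed as [CLS] premise [SEP] hypothesis [SEP]
batch = tokenizer(
    ["A man is playing a guitar."],          # premise (illustrative)
    ["A person is making music."],           # hypothesis (illustrative)
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([0])                   # hypothetical gold label

outputs = model(**batch, labels=labels)
outputs.loss.backward()                      # all parameters are fine-tuned end to end
```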

Methods

This paper employs the following methods (the MLM masking rule is sketched after the list):

  • Transformer
  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)
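
The MLM objective corrupts the input before prediction: 15% of token positions are chosen, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The toy Python sketch below illustrates that 80/10/10 rule; the function name, toy vocabulary, and return format are our own and not taken from the released code.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "dog", "ran", "fast", "cat"]  # stand-in for the WordPiece vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (corrupted tokens, prediction targets); None means 'not predicted'."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                          # model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)                   # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(TOY_VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)                    # 10%: keep unchanged
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

print(mask_tokens("my dog is hairy".split()))
```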

Models Used

  • BERT BASE
  • BERT LARGE (reported hyperparameters are summarized after this list)
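
The two released sizes differ only in scale. The values below are the hyperparameters reported in the paper (number of Transformer layers, hidden size, attention heads, total parameters); recording them in a Python dictionary is just an illustrative convention, not part of the released code.

```python
# Architecture hyperparameters reported in the paper.
BERT_CONFIGS = {
    "BERT-Base":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "110M"},
    "BERT-Large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "340M"},
}
```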

Datasets

The following datasets were used in this research:

  • GLUE
  • SQuAD v1.1
  • SQuAD v2.0
  • MNLI
  • SWAG
  • CoLA
  • STS-B
  • MRPC
  • QQP

Evaluation Metrics

  • GLUE score
  • F1 (see the token-overlap sketch after this list)
  • Accuracy
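
As a hedged sketch, SQuAD-style F1 is computed as token-level overlap between the predicted and gold answer spans. The function below illustrates the idea but is not the official evaluation script, which additionally normalizes punctuation and articles before comparison.

```python
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer span and a gold answer span."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("the Denver Broncos", "Denver Broncos"))  # 0.8
```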

Results

  • GLUE score of 80.5%
  • MultiNLI accuracy of 86.7%
  • SQuAD v1.1 F1 score of 93.2
  • SQuAD v2.0 F1 score of 83.1

Limitations

The authors identified the following limitations:

  • Pre-training requires large amounts of unlabeled data
  • Fine-tuning can be computationally expensive

Technical Requirements

  • Number of GPUs: None specified (the paper reports pre-training on Cloud TPUs: 4 for BERT BASE and 16 for BERT LARGE)
  • GPU Type: None specified

Keywords

BERT, Transformer, Pre-training, Language understanding, Masked language model, Next sentence prediction
