BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language), 2019

Paper Information

  • arXiv ID: 1810.04805
  • Venue: North American Chapter of the Association for Computational Linguistics (NAACL)
  • Domain: Natural language processing
  • SOTA Claim: Yes
  • Code: not listed
  • Reproducibility: 8/10

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Summary

This paper presents BERT (Bidirectional Encoder Representations from Transformers), a language representation model that pre-trains deep bidirectional representations from unlabeled text. Unlike previous left-to-right models, BERT is pre-trained with a masked language model objective, which lets every layer condition on both left and right context. After pre-training, the model is fine-tuned with a single additional output layer and achieves state-of-the-art results on eleven NLP tasks, including question answering and natural language inference, with significant gains on benchmarks such as GLUE and SQuAD. The architecture is a multi-layer bidirectional Transformer encoder trained with two pre-training tasks: Masked LM (MLM) and Next Sentence Prediction (NSP), the latter teaching the model relationships between sentence pairs. Results show substantial improvements over left-to-right baselines, highlighting the effectiveness of deep bidirectional representations.
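
As an illustration of the fine-tuning setup described above (a single task-specific output layer on top of the pre-trained encoder), below is a minimal sketch using the Hugging Face transformers library with PyTorch. The paper itself released TensorFlow code; the checkpoint name, example sentence pair, and label count here are illustrative choices rather than details taken from the paper.

```python
# Minimal sketch: fine-tune BERT for sentence-pair classification by adding
# a single output layer on top of the pre-trained encoder (hedged example,
# not the authors' original TensorFlow implementation).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # e.g. MNLI: entailment / neutral / contradiction
)

# Sentence pairs are packed as [CLS] premise [SEP] hypothesis [SEP]
batch = tokenizer(
    ["A man is playing a guitar."],          # premise (illustrative)
    ["A person is making music."],           # hypothesis (illustrative)
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([0])                   # hypothetical gold label

outputs = model(**batch, labels=labels)
outputs.loss.backward()                      # all parameters are fine-tuned end to end
```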

Methods

This paper employs the following methods (the MLM masking rule is sketched after the list):

  • Transformer
  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)
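
The MLM objective corrupts the input before prediction: 15% of token positions are chosen, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The toy Python sketch below illustrates that 80/10/10 rule; the function name, toy vocabulary, and return format are our own and not taken from the released code.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "dog", "ran", "fast", "cat"]  # stand-in for the WordPiece vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (corrupted tokens, prediction targets); None means 'not predicted'."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                          # model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)                   # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(TOY_VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)                    # 10%: keep unchanged
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

print(mask_tokens("my dog is hairy".split()))
```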

Models Used

  • BERT BASE
  • BERT LARGE (reported hyperparameters are summarized after this list)
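
The two released sizes differ only in scale. The values below are the hyperparameters reported in the paper (number of Transformer layers, hidden size, attention heads, total parameters); recording them in a Python dictionary is just an illustrative convention, not part of the released code.

```python
# Architecture hyperparameters reported in the paper.
BERT_CONFIGS = {
    "BERT-Base":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "110M"},
    "BERT-Large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "340M"},
}
```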

Datasets

The following datasets were used in this research:

  • GLUE
  • SQuAD v1.1
  • SQuAD v2.0
  • MNLI
  • SWAG
  • CoLA
  • STS-B
  • MRPC
  • QQP

Evaluation Metrics

  • GLUE score
  • F1 (see the token-overlap sketch after this list)
  • Accuracy
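
As a hedged sketch, SQuAD-style F1 is computed as token-level overlap between the predicted and gold answer spans. The function below illustrates the idea but is not the official evaluation script, which additionally normalizes punctuation and articles before comparison.

```python
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer span and a gold answer span."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("the Denver Broncos", "Denver Broncos"))  # 0.8
```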

Results

  • GLUE score of 80.5%
  • MultiNLI accuracy of 86.7%
  • SQuAD v1.1 F1 score of 93.2
  • SQuAD v2.0 F1 score of 83.1

Limitations

The authors identified the following limitations:

  • Pre-training requires large amounts of unlabeled data
  • Fine-tuning can be computationally expensive

Technical Requirements

  • Number of GPUs: None specified (the paper reports pre-training on Cloud TPUs: 4 for BERT BASE and 16 for BERT LARGE)
  • GPU Type: None specified

Keywords

BERT, Transformer, Pre-training, Language understanding, Masked language model, Next sentence prediction
