
Spanish Pre-Trained BERT Model and Evaluation Data

José Cañete ([email protected]), Gabriel Chaperon ([email protected]), Rodrigo Fuentes ([email protected]), Jou-Hui Ho ([email protected]), Hojin Kang, and Jorge Pérez ([email protected]). Department of Computer Science and Department of Electrical Engineering, Universidad de Chile & Millennium Institute for Foundational Research on Data (IMFD). (2023)

Paper Information

  • arXiv ID: 2308.02976
  • Venue: arXiv.org
  • Domain: natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

Summary

The paper presents the first BERT-based language model pre-trained exclusively on Spanish data, alongside a compilation of Spanish-specific natural language processing (NLP) tasks into a benchmark called GLUES, analogous to the original English GLUE benchmark. The authors demonstrate that fine-tuning their Spanish-specific BERT model yields superior performance compared to multilingual BERT models across various tasks, achieving state-of-the-art results in some cases. The paper emphasizes the challenges in obtaining resources for training and evaluating Spanish language models and aims to contribute to the community by providing the model, pre-training data, and benchmark repository as public resources.
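
As a concrete illustration of the fine-tuning setup summarized above, the sketch below loads a Spanish BERT checkpoint with the Hugging Face transformers library and runs a single training step on an XNLI-style sentence pair. The checkpoint identifier, label mapping, and hyperparameters are assumptions for the example, not details reported in the paper.

```python
# Minimal fine-tuning sketch (not the authors' training code).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # assumed checkpoint id; swap in any local copy
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Encode one premise/hypothesis pair (XNLI labels: entailment / neutral / contradiction).
batch = tokenizer(
    "El modelo fue preentrenado solo con texto en español.",
    "El modelo usa datos en español.",
    return_tensors="pt",
    truncation=True,
)
labels = torch.tensor([0])  # 0 = entailment; this label mapping is an assumption

# One optimization step; a real fine-tuning run iterates over the full training set.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```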

Methods

This paper employs the following methods:

  • BERT
  • Transformer

Models Used

  • Spanish-BERT
  • mBERT

Datasets

The following datasets were used in this research:

  • XNLI
  • PAWS-X
  • CoNLL
  • Universal Dependencies v1.4
  • MLDoc
  • Universal Dependencies v2.2
  • MLQA
  • TAR
  • XQuAD
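
Several of the corpora above are distributed through the Hugging Face datasets hub; the snippet below is a minimal loading sketch under that assumption, using the Spanish configurations of XNLI and PAWS-X as examples.

```python
# Minimal loading sketch; dataset identifiers assume the Hugging Face hub
# mirrors of the corpora listed above.
from datasets import load_dataset

xnli_es = load_dataset("xnli", "es")      # premise / hypothesis / label
pawsx_es = load_dataset("paws-x", "es")   # sentence1 / sentence2 / label

print(xnli_es["train"][0])
print(pawsx_es["validation"][0])
```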

Evaluation Metrics

  • Accuracy
  • F1 score
  • Unlabeled Attachment Score (UAS)
  • Labeled Attachment Score (LAS)
  • Exact match
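
As a pointer to how the question-answering metrics in this list are typically computed, the sketch below implements exact match and token-level F1 with simple whitespace normalization; it is an illustrative approximation, not the authors' evaluation script.

```python
# Illustrative SQuAD-style metrics: exact match and token-level F1.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # F1 over the multiset of whitespace tokens shared by prediction and reference.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("la capital de Chile", "La capital de Chile"))  # 1.0
print(token_f1("la capital", "la capital de Chile"))              # ~0.67
```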

Results

  • Better results compared to other BERT-based models pre-trained on multilingual corpora for most tasks
  • New state-of-the-art on some GLUES tasks

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of accelerators: 1
  • Accelerator type: Google TPU v3-8

Keywords

Spanish BERT, pre-trained language model, NLP benchmarks, Spanish NLP, Transformers
