
Spanish Pre-Trained BERT Model and Evaluation Data

José Cañete ([email protected]), Gabriel Chaperon ([email protected]), Rodrigo Fuentes ([email protected]), Jou-Hui Ho ([email protected]), Hojin Kang, and Jorge Pérez ([email protected]). Department of Computer Science and Department of Electrical Engineering, Universidad de Chile & Millennium Institute for Foundational Research on Data (IMFD). (2023)

Paper Information

  • arXiv ID: 2308.02976
  • Venue: arXiv.org
  • Domain: natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository, much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

Summary

The paper presents the first BERT-based language model pre-trained exclusively on Spanish data, alongside a compilation of Spanish-specific natural language processing (NLP) tasks into a benchmark called GLUES, analogous to the original English GLUE benchmark. The authors demonstrate that fine-tuning their Spanish-specific BERT model yields superior performance compared to multilingual BERT models across various tasks, achieving state-of-the-art results in some cases. The paper emphasizes the challenges in obtaining resources for training and evaluating Spanish language models and aims to contribute to the community by providing the model, pre-training data, and benchmark repository as public resources.
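
As a concrete illustration of the fine-tuning setup summarized above, the sketch below loads a Spanish BERT checkpoint with the Hugging Face transformers library and runs a single training step on an XNLI-style sentence pair. The checkpoint identifier, label mapping, and hyperparameters are assumptions for the example, not details reported in the paper.

```python
# Minimal fine-tuning sketch (not the authors' training code).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # assumed checkpoint id; swap in any local copy
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Encode one premise/hypothesis pair (XNLI labels: entailment / neutral / contradiction).
batch = tokenizer(
    "El modelo fue preentrenado solo con texto en español.",
    "El modelo usa datos en español.",
    return_tensors="pt",
    truncation=True,
)
labels = torch.tensor([0])  # 0 = entailment; this label mapping is an assumption

# One optimization step; a real fine-tuning run iterates over the full training set.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```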

Methods

This paper employs the following methods:

  • BERT
  • Transformer

Models Used

  • Spanish-BERT
  • mBERT

Datasets

The following datasets were used in this research:

  • XNLI
  • PAWS-X
  • CoNLL
  • Universal Dependencies v1.4
  • MLDoc
  • Universal Dependencies v2.2
  • MLQA
  • TAR
  • XQuAD
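
Several of the corpora above are distributed through the Hugging Face datasets hub; the snippet below is a minimal loading sketch under that assumption, using the Spanish configurations of XNLI and PAWS-X as examples.

```python
# Minimal loading sketch; dataset identifiers assume the Hugging Face hub
# mirrors of the corpora listed above.
from datasets import load_dataset

xnli_es = load_dataset("xnli", "es")      # premise / hypothesis / label
pawsx_es = load_dataset("paws-x", "es")   # sentence1 / sentence2 / label

print(xnli_es["train"][0])
print(pawsx_es["validation"][0])
```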

Evaluation Metrics

  • Accuracy
  • F1 score
  • Unlabeled Attachment Score (UAS)
  • Labeled Attachment Score (LAS)
  • Exact match
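
As a pointer to how the question-answering metrics in this list are typically computed, the sketch below implements exact match and token-level F1 with simple whitespace normalization; it is an illustrative approximation, not the authors' evaluation script.

```python
# Illustrative SQuAD-style metrics: exact match and token-level F1.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # F1 over the multiset of whitespace tokens shared by prediction and reference.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("la capital de Chile", "La capital de Chile"))  # 1.0
print(token_f1("la capital", "la capital de Chile"))              # ~0.67
```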

Results

  • Better results compared to other BERT-based models pre-trained on multilingual corpora for most tasks
  • New state-of-the-art on some GLUES tasks

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of accelerators: 1
  • Accelerator type: Google TPU v3-8

Keywords

Spanish BERT, pre-trained language model, NLP benchmarks, Spanish NLP, Transformers
