José Cañete [email protected], Gabriel Chaperon [email protected], Rodrigo Fuentes [email protected], Jou-Hui Ho [email protected], Hojin Kang, Jorge Pérez [email protected]
Department of Computer Science and Department of Electrical Engineering, Universidad de Chile & Millennium Institute for Foundational Research on Data (IMFD) (2023)
The paper presents the first BERT-based language model pre-trained exclusively on Spanish data, together with a compilation of Spanish-specific natural language processing (NLP) tasks into a single benchmark, GLUES, analogous to the original English GLUE benchmark. The authors show that fine-tuning their Spanish-specific BERT model outperforms multilingual BERT models across a range of tasks, reaching state-of-the-art results on several of them. The paper also highlights the difficulty of obtaining resources for training and evaluating Spanish language models, and contributes to the community by publicly releasing the model, the pre-training data, and the benchmark repository.
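As an illustration of how such a pre-trained checkpoint is typically used, the sketch below loads the publicly released Spanish BERT through the Hugging Face transformers library and predicts a masked word. The checkpoint identifier dccuchile/bert-base-spanish-wwm-cased is the public release of this model; the example sentence and the use of the pipeline API are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: masked-word prediction with the released Spanish BERT.
# Assumes the public checkpoint "dccuchile/bert-base-spanish-wwm-cased";
# the example sentence is illustrative only.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="dccuchile/bert-base-spanish-wwm-cased",
)

# Predict the masked token in a Spanish sentence
# ("I live in the capital of [MASK].").
for prediction in fill_mask("Vivo en la capital de [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```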
This paper employs the following methods:
- Pre-training a BERT-based language model exclusively on Spanish data
- Fine-tuning the Spanish-specific model on the GLUES benchmark tasks and comparing it against multilingual BERT baselines (see the fine-tuning sketch below)
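The fine-tuning comparison above follows the standard BERT recipe: attach a task-specific head and train end-to-end on the target task. Below is a hedged sketch for an XNLI-style natural language inference task (one of the GLUES tasks); the hyperparameters, the datasets identifier "xnli" with configuration "es", and the use of the Trainer API are illustrative assumptions rather than the authors' exact setup.

```python
# Hedged sketch: fine-tuning the Spanish BERT on a 3-label NLI task.
# Hyperparameters and dataset loading are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# The Spanish split of XNLI on the Hugging Face Hub; premise/hypothesis
# pairs are encoded together, as in standard BERT fine-tuning.
dataset = load_dataset("xnli", "es")

def encode(batch):
    return tokenizer(
        batch["premise"],
        batch["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

dataset = dataset.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="beto-xnli",
        num_train_epochs=2,
        per_device_train_batch_size=32,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```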
The following datasets were used in this research:
- A large compiled corpus of Spanish text used for pre-training, released publicly by the authors
- The Spanish-specific NLP tasks collected in the GLUES benchmark, used for fine-tuning and evaluation
The authors identified the following limitations:
- Resources for training and evaluating Spanish language models are difficult to obtain, which constrains both the available pre-training data and the coverage of the evaluation benchmark