OpenSLR

Open Speech and Language Resources

Dataset Information
Modalities: Texts, Audio
Languages: English, Spanish, Bengali, Afrikaans, Basque, Catalan, Galician, Marathi, Tamil, Telugu, Yoruba, Gujarati, Javanese, Kannada, Central Khmer, Malayalam, Burmese, Nepali (individual language), Sinhala, Sundanese, Venda, Xhosa
Introduced: 2020

Overview

OpenSLR is a repository of open speech and language resources, including large-scale transcribed audio corpora and related software. It serves as a central platform where researchers and practitioners can access and share datasets for automatic speech recognition (ASR), text-to-speech (TTS), and linguistic research.

The OpenSLR collection includes over 30 diverse datasets spanning more than 25 languages, such as Javanese, Nepali, Malayalam, Yoruba, and various English and Spanish dialects. These datasets are contributed by institutions such as Google and North-West University. Most contain audio recordings paired with transcriptions, covering both crowd-sourced and professionally recorded material.

Many of the datasets are high-quality multi-speaker corpora intended for use in building ASR and TTS models, particularly for under-resourced languages. Use cases include multilingual speech recognition, dialect modeling, language technology research, and building open-source voice applications.
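
As a rough illustration of how audio recordings and their transcriptions might be paired before ASR or TTS training, the sketch below reads a tab-separated index and yields (audio path, text) pairs. The layout it assumes (utterance ID in the first column, transcription in the last), the `.wav` extension, and the names `Utterance` and `read_index` are illustrative assumptions, not something OpenSLR specifies; individual datasets may be organized differently.

```python
import csv
import io
from pathlib import Path
from typing import Iterator, NamedTuple, TextIO


class Utterance(NamedTuple):
    """One training example: an audio file and its transcription (hypothetical type)."""
    audio_path: Path
    text: str


def read_index(lines: TextIO, audio_dir: Path, ext: str = ".wav") -> Iterator[Utterance]:
    """Yield (audio path, transcription) pairs from a tab-separated index.

    Assumed layout: the first column is the utterance ID (used as the audio
    file stem) and the last column is the transcription. Any columns in
    between, such as a speaker ID, are ignored.
    """
    # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        utt_id, text = row[0].strip(), row[-1].strip()
        yield Utterance(audio_dir / f"{utt_id}{ext}", text)


# Usage with an in-memory sample; a real index would be opened from disk.
sample = io.StringIO("utt_0001\tfirst sentence\n\nutt_0002\tspk7\tsecond sentence\n")
for utt in read_index(sample, Path("audio")):
    print(utt.audio_path, "|", utt.text)
```

Taking the transcription from the last column lets the same reader handle both two-column and three-column indexes, at the cost of assuming no trailing metadata columns.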

The OpenSLR site also acts as a mirror for widely used tools and models to ensure continued availability.

Website: OpenSLR website

Source: openslr/openslr

Variants: OpenSLR

Associated Benchmarks

This dataset is used in 1 benchmark:
