MuST-C

Dataset Information
Modalities: Texts, Audio
Languages: French, Spanish, German, Italian, Chinese, Russian, Portuguese, Arabic, Czech, Dutch, Persian, Romanian, Turkish, Vietnamese
Introduced: 2019
License:
Homepage:

Overview

When it was introduced in 2019, MuST-C was described as the largest publicly available one-to-many multilingual corpus for speech translation. The release described here covers eight language directions, from English into German, Spanish, French, Italian, Dutch, Portuguese, Romanian, and Russian; the additional languages in the Languages field above appear to come from later releases of the corpus. The corpus consists of audio, transcriptions, and translations of English TED talks, and it comes with predefined training, validation, and test splits.

Source: One-to-Many Multilingual End-to-End Speech Translation
Image Source: https://mt.fbk.eu/must-c
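
The structure described in the overview (English audio, an English transcription, a translation into one target language, and a predefined split) can be sketched as a small record type. This is only an illustration under stated assumptions: the `Example` class, its field names, the `group_by_direction` helper, and the toy records are hypothetical and do not reflect the corpus's actual file layout or any official loader.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Target languages of the eight English-to-X directions named in the overview.
TARGET_LANGS = ("de", "es", "fr", "it", "nl", "pt", "ro", "ru")
# Split names are an assumption; the corpus may label its splits differently.
SPLITS = ("train", "validation", "test")


@dataclass(frozen=True)
class Example:
    """One speech translation example (hypothetical layout, not the real format)."""
    audio_path: str    # path to an English audio segment
    transcript: str    # English transcription of the segment
    translation: str   # translation into the target language
    target_lang: str   # one of TARGET_LANGS
    split: str         # one of SPLITS


def group_by_direction(
    examples: Iterable[Example],
) -> Dict[Tuple[str, str], List[Example]]:
    """Group examples by (target language, split), rejecting unknown values."""
    grouped: Dict[Tuple[str, str], List[Example]] = defaultdict(list)
    for ex in examples:
        if ex.target_lang not in TARGET_LANGS:
            raise ValueError(f"unknown target language: {ex.target_lang}")
        if ex.split not in SPLITS:
            raise ValueError(f"unknown split: {ex.split}")
        grouped[(ex.target_lang, ex.split)].append(ex)
    return dict(grouped)


# Toy placeholder records, invented purely for illustration.
toy = [
    Example("talk_1_seg_0.wav", "Thank you.", "Danke.", "de", "train"),
    Example("talk_1_seg_0.wav", "Thank you.", "Gracias.", "es", "train"),
    Example("talk_2_seg_0.wav", "Hello.", "Hallo.", "de", "test"),
]
grouped = group_by_direction(toy)
print(sorted(grouped))
```

Keying on (target language, split) mirrors the one-to-many setup: the same English segment can appear under several directions, while each direction keeps its own training, validation, and test partitions.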

Variants: MuST-C EN->DE, MuST-C

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Speech-to-Text Translation | Transformer with Adapters | Lightweight Adapter Tuning for Multilingual … | 2021-06-02
Speech-to-Text Translation | Dual-decoder Transformer | Dual-decoder Transformer for Joint Automatic … | 2020-11-02

Research Papers

Recent papers with results on this dataset: