MuST-C

Name: MuST-C
Published: 2019-01-01
License: CC BY-NC-ND 4.0

Dataset Information

Modalities

Texts, Audio

Languages

French, Spanish, German, Italian, Chinese, Russian, Portuguese, Arabic, Czech, Dutch, Persian, Romanian, Turkish, Vietnamese

Introduced

2019

Homepage

Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

MuST-C is a multilingual (one-to-many) corpus for speech translation. Its source paper describes it as the largest publicly available corpus of this kind at the time of publication. The release described there covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian, and Russian. The corpus consists of audio, transcriptions, and translations of English TED talks, and it comes with predefined training, validation, and test splits.

Source: One-to-Many Multilingual End-to-End Speech Translation
Image Source: https://mt.fbk.eu/must-c

Variants: MuST-C EN->DE, MuST-C
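
As a rough illustration of the structure described in the overview (English audio paired with a transcription and a translation, organized by language direction and split), the following sketch models one example as a record and groups records by direction and split. The class name, field names, file paths, and split labels are hypothetical and chosen for illustration only; they do not reflect the corpus's actual on-disk layout.

```python
from collections import defaultdict
from dataclasses import dataclass

# The eight target languages named in the overview, as ISO 639-1 codes.
TARGET_LANGUAGES = ("de", "es", "fr", "it", "nl", "pt", "ro", "ru")
# Hypothetical labels for the predefined splits.
SPLITS = ("train", "validation", "test")


@dataclass(frozen=True)
class SpeechTranslationExample:
    """One (audio, transcription, translation) triple. Field names are illustrative."""
    audio_path: str    # path to the English audio segment (hypothetical)
    transcript: str    # English transcription
    translation: str   # translation into the target language
    target_lang: str   # one of TARGET_LANGUAGES
    split: str         # one of SPLITS

    def __post_init__(self):
        # Reject records outside the directions and splits assumed above.
        if self.target_lang not in TARGET_LANGUAGES:
            raise ValueError(f"unknown target language: {self.target_lang!r}")
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")


def group_by_direction_and_split(examples):
    """Group examples under keys such as ("en-de", "train")."""
    grouped = defaultdict(list)
    for example in examples:
        grouped[(f"en-{example.target_lang}", example.split)].append(example)
    return dict(grouped)


examples = [
    SpeechTranslationExample("talk1_seg0.wav", "Thank you.", "Danke.", "de", "train"),
    SpeechTranslationExample("talk1_seg1.wav", "Good morning.", "Guten Morgen.", "de", "train"),
    SpeechTranslationExample("talk2_seg0.wav", "Thank you.", "Merci.", "fr", "test"),
]
grouped = group_by_direction_and_split(examples)
```

Keying on the direction and the split keeps each evaluation set separate per target language, which matches the one-to-many setup in which English is always the source.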

Associated Benchmarks

This dataset is used in 1 benchmark:

Speech-to-Text Translation - Metrics: SacreBLEU
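
To give a sense of what a BLEU-style score measures, here is a minimal sketch of corpus-level BLEU with a single reference per segment. It is a simplified stand-in, not SacreBLEU itself: it assumes plain whitespace tokenization and applies no smoothing, whereas SacreBLEU standardizes tokenization and reporting, so the numbers it produces are not comparable to benchmark entries. The function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (one reference per segment)."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-grams, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each match at the reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score when the output is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


perfect = simple_corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
shorter = simple_corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat today"])
```

In this usage, `perfect` is a full-score match, while `shorter` has perfect n-gram precision but is reduced by the brevity penalty because the hypothesis is one token shorter than its reference.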

Recent Benchmark Submissions

Task	Model	Paper	Date
Speech-to-Text Translation	Transformer with Adapters	Lightweight Adapter Tuning for Multilingual …	2021-06-02
Speech-to-Text Translation	Dual-decoder Transformer	Dual-decoder Transformer for Joint Automatic …	2020-11-02

Research Papers

Recent papers with results on this dataset:
