WenetSpeech

Dataset Information
Modalities: Speech
Languages: Mandarin Chinese
Introduced: 2021

Overview

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for a total of 22,400+ hours. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. For the YouTube data, an optical character recognition (OCR) based method generates audio/text segmentation candidates from the corresponding video captions.
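As a quick sanity check on the figures above, the following minimal sketch sums the three subsets and reports each one's share of the corpus. The hour values are the rounded figures quoted in the overview ("10,000+", "2,400+", "about 10,000"), so the results are approximate; the names `SUBSET_HOURS` and `summarize` are illustrative and not part of any WenetSpeech tooling.

```python
# Approximate subset sizes in hours, taken from the overview text.
# The "+" and "about" qualifiers mean these are rounded figures, not exact counts.
SUBSET_HOURS = {
    "high-quality labeled": 10_000,
    "weakly labeled": 2_400,
    "unlabeled": 10_000,
}


def summarize(hours):
    """Return the total hours and each subset's fraction of that total."""
    total = sum(hours.values())
    shares = {name: h / total for name, h in hours.items()}
    return total, shares


total_hours, subset_shares = summarize(SUBSET_HOURS)
print(f"total: {total_hours} hours")
for name, share in subset_shares.items():
    print(f"{name}: {share:.1%}")
```

Under these rounded figures, the labeled and unlabeled subsets are about the same size, and the weakly labeled subset makes up roughly a tenth of the corpus.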

Image source: https://github.com/wenet-e2e/wenetspeech

Variants: WenetSpeech

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Speech Recognition | Zipformer+pruned transducer (no external language model) | Zipformer: A faster and better … | 2023-10-17
Speech Recognition | Paraformer-large | FunASR: A Fundamental End-to-End Speech … | 2023-05-18
Speech Recognition | Conformer-MoE (32e) | 3M: Multi-loss, Multi-path and Multi-level … | 2022-04-07
Speech Recognition | Conformer-MoE (16e) | 3M: Multi-loss, Multi-path and Multi-level … | 2022-04-07
Speech Recognition | Conformer-MoE (64e) | 3M: Multi-loss, Multi-path and Multi-level … | 2022-04-07
Speech Recognition | Wenet | WenetSpeech: A 10000+ Hours Multi-domain … | 2021-10-07
Speech Recognition | Kaldi | WenetSpeech: A 10000+ Hours Multi-domain … | 2021-10-07
Speech Recognition | Espnet | WenetSpeech: A 10000+ Hours Multi-domain … | 2021-10-07

Research Papers

Recent papers with results on this dataset: