WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. For the YouTube data, an optical character recognition (OCR) based method generates audio/text segmentation candidates from the corresponding video captions.
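As a quick sanity check on the figures quoted in the description, the subset sizes can be summed in a few lines of Python. This is only an illustrative sketch: the dictionary keys and function names are hypothetical labels chosen here, not identifiers from the WenetSpeech release, and the hour counts are the rounded lower bounds from the description rather than exact durations.

```python
# Approximate subset sizes in hours, copied from the dataset description.
# The "10,000+" and "2,400+" figures are lower bounds, so the total is one too.
# Key names are illustrative, not official subset identifiers.
SUBSET_HOURS = {
    "high_quality_labeled": 10_000,
    "weakly_labeled": 2_400,
    "unlabeled": 10_000,
}


def total_hours(subsets):
    """Return the summed duration of all subsets, in hours."""
    return sum(subsets.values())


def subset_shares(subsets):
    """Return each subset's fraction of the total duration."""
    total = total_hours(subsets)
    return {name: hours / total for name, hours in subsets.items()}


if __name__ == "__main__":
    print(f"total: {total_hours(SUBSET_HOURS)} hours (lower bound)")
    for name, share in subset_shares(SUBSET_HOURS).items():
        print(f"{name}: {share:.1%}")
```

The sum matches the 22,400+ hours stated above, with the high-quality labeled and unlabeled portions each contributing a little under half of the corpus.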
Image source: https://github.com/wenet-e2e/wenetspeech
Variants: WenetSpeech
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Speech Recognition | Zipformer+pruned transducer (no external language model) | Zipformer: A faster and better … | 2023-10-17 |
Speech Recognition | Paraformer-large | FunASR: A Fundamental End-to-End Speech … | 2023-05-18 |
Speech Recognition | Conformer-MoE (32e) | 3M: Multi-loss, Multi-path and Multi-level … | 2022-04-07 |
Speech Recognition | Conformer-MoE (16e) | 3M: Multi-loss, Multi-path and Multi-level … | 2022-04-07 |
Speech Recognition | Conformer-MoE (64e) | 3M: Multi-loss, Multi-path and Multi-level … | 2022-04-07 |
Speech Recognition | Wenet | WenetSpeech: A 10000+ Hours Multi-domain … | 2021-10-07 |
Speech Recognition | Kaldi | WenetSpeech: A 10000+ Hours Multi-domain … | 2021-10-07 |
Speech Recognition | Espnet | WenetSpeech: A 10000+ Hours Multi-domain … | 2021-10-07 |
Recent papers with results on this dataset: