VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. It consists of over a million utterances from over 6,000 speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real-world noise, including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. Because the dataset is audio-visual, it is also useful for a number of other applications, for example visual speech synthesis, speech separation, cross-modal transfer from face to voice (or vice versa), and training face recognition from video to complement existing face recognition datasets.
Source: VoxCeleb2: Deep Speaker Recognition
Image Source: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
Variants: VoxCeleb2 - 1-shot learning, VoxCeleb2 - 8-shot learning, VoxCeleb2 - 32-shot learning, VoxCeleb2
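The k-shot variants above restrict training data to k utterances per speaker. A minimal sketch of how such a split could be constructed from a speaker-labelled utterance manifest (file names below are invented for illustration, not the real VoxCeleb2 layout):

```python
# Hypothetical sketch: sample a k-shot training split (k utterances per
# speaker) from a list of (speaker_id, path) pairs, mirroring the
# "VoxCeleb2 - k-shot learning" variants. Paths here are stand-ins.
import random
from collections import defaultdict

def k_shot_split(utterances, k, seed=0):
    """Return a dict mapping speaker_id -> k sampled utterance paths."""
    by_speaker = defaultdict(list)
    for speaker_id, path in utterances:
        by_speaker[speaker_id].append(path)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    return {spk: rng.sample(paths, min(k, len(paths)))
            for spk, paths in by_speaker.items()}

# Toy manifest standing in for the real dataset listing.
utts = [(f"id{i:05d}", f"id{i:05d}/clip{j}.wav")
        for i in range(3) for j in range(10)]
split = k_shot_split(utts, k=8)
```

With k set to 1, 8, or 32 this reproduces the three few-shot variants' per-speaker budgets.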
This dataset is used in 2 benchmarks:
| Task | Model | Paper | Date |
|---|---|---|---|
| Speech Separation | RTFS-Net-12 | RTFS-Net: Recurrent Time-Frequency Modelling for … | 2023-09-29 |
| Speech Separation | RTFS-Net-6 | RTFS-Net: Recurrent Time-Frequency Modelling for … | 2023-09-29 |
| Speech Separation | RTFS-Net-4 | RTFS-Net: Recurrent Time-Frequency Modelling for … | 2023-09-29 |
| Speech Separation | IIANet | IIANet: An Intra- and Inter-Modality … | 2023-08-16 |
| Speech Separation | CTCNet | An Audio-Visual Speech Separation Model … | 2022-12-21 |
| Speaker Verification | ResNet-50 | VoxCeleb2: Deep Speaker Recognition | 2018-06-14 |
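The speaker verification task in the last row is typically evaluated by scoring trial pairs with the cosine similarity of fixed-size speaker embeddings and reporting the equal error rate (EER). A minimal sketch of that scoring pipeline, using random embeddings as stand-ins (not the output of any model listed above):

```python
# Sketch of speaker-verification trial scoring: cosine similarity of
# speaker embeddings plus a simple equal-error-rate (EER) estimate.
# Embeddings are random placeholders for illustration only.
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """EER from trial scores (higher = same speaker) and 0/1 labels."""
    order = np.argsort(scores)[::-1]                # descending score
    labels = np.asarray(labels)[order]
    tar = np.cumsum(labels) / max(labels.sum(), 1)  # true-accept rate
    far = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)  # false-accept
    fnr = 1 - tar
    idx = int(np.argmin(np.abs(fnr - far)))         # where FNR crosses FAR
    return float((fnr[idx] + far[idx]) / 2)

rng = np.random.default_rng(0)
emb = {s: rng.normal(size=192) for s in ("spk_a", "spk_b")}
# One target trial (same speaker, slightly perturbed) and one impostor trial.
trials = [(emb["spk_a"], emb["spk_a"] + rng.normal(scale=0.01, size=192), 1),
          (emb["spk_a"], emb["spk_b"], 0)]
scores = [cosine_score(x, y) for x, y, _ in trials]
labels = [lab for _, _, lab in trials]
eer = equal_error_rate(scores, labels)
```

In practice the embeddings would come from a trained model (e.g. a ResNet speaker encoder) and the trial list from the standard VoxCeleb evaluation protocol.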