VoxCeleb2

Dataset Information

Modalities: Images, Videos, Texts, Audio
Languages: Multilingual
Introduced: 2018
License:
Homepage: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Overview

VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. It consists of over a million utterances from over 6,000 speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real-world noise, including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. Because the dataset is audio-visual, it is also useful for a number of other applications, for example visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to complement existing face recognition datasets.

Source: VoxCeleb2: Deep Speaker Recognition
Image Source: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
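
The audio in VoxCeleb2 is distributed as AAC (.m4a) segments grouped by speaker ID and source video. Below is a minimal sketch of indexing utterances per speaker, assuming the usual dev/aac/idXXXXX/<video_id>/<segment>.m4a layout and a hypothetical local path; adjust the glob pattern if your copy uses a different structure.

```python
from collections import defaultdict
from pathlib import Path

def index_voxceleb2(root: str) -> dict:
    """Group utterance files by speaker ID, assuming the common
    dev/aac/idXXXXX/<video_id>/<segment>.m4a directory layout."""
    utterances = defaultdict(list)
    for path in Path(root).glob("dev/aac/id*/*/*.m4a"):
        speaker_id = path.parts[-3]  # e.g. "id00012"
        utterances[speaker_id].append(path)
    return dict(utterances)

if __name__ == "__main__":
    index = index_voxceleb2("/data/voxceleb2")  # hypothetical local path
    n_utts = sum(len(v) for v in index.values())
    print(f"{len(index)} speakers, {n_utts} utterances")
```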

Variants: VoxCeleb2 - 1-shot learning, VoxCeleb2 - 8-shot learning, VoxCeleb2 - 32-shot learning, VoxCeleb2

Associated Benchmarks

This dataset is used in 2 benchmarks: Speech Separation and Speaker Verification.

Recent Benchmark Submissions

Task                 | Model       | Paper                                              | Date
Speech Separation    | RTFS-Net-12 | RTFS-Net: Recurrent Time-Frequency Modelling for … | 2023-09-29
Speech Separation    | RTFS-Net-6  | RTFS-Net: Recurrent Time-Frequency Modelling for … | 2023-09-29
Speech Separation    | RTFS-Net-4  | RTFS-Net: Recurrent Time-Frequency Modelling for … | 2023-09-29
Speech Separation    | IIANet      | IIANet: An Intra- and Inter-Modality …             | 2023-08-16
Speech Separation    | CTCNet      | An Audio-Visual Speech Separation Model …          | 2022-12-21
Speaker Verification | ResNet-50   | VoxCeleb2: Deep Speaker Recognition                | 2018-06-14
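
Speaker verification performance on VoxCeleb test pairs is typically reported as equal error rate (EER), the operating point where the false acceptance and false rejection rates are equal. Below is a minimal sketch of computing EER from trial scores with scikit-learn; the labels and scores are made up for illustration rather than taken from the actual VoxCeleb2 trial lists.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the point on the ROC curve where the false positive rate
    equals the false negative rate (1 - true positive rate)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy trial list: label 1 = same speaker, 0 = different speakers.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.92, 0.81, 0.35, 0.60, 0.74, 0.20])
print(f"EER: {equal_error_rate(labels, scores):.2%}")
```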
