The CATT benchmark dataset comprises 742 sentences, which were scraped from an internet news source in 2023. It covers multiple …
AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …
The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …
LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …
The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …
The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, and hh-rlhf. These were selected so that the AlpacaEval ranking …
The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train …
The quality of AI-generated images has rapidly increased, leading to concerns about authenticity and trustworthiness. CIFAKE is a dataset that …
The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos. The DFDC dataset …
FaceForensics is a video dataset consisting of more than 500,000 frames containing faces from 1004 videos that can be used …
FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation …
FakeAVCeleb is a novel Audio-Video Deepfake dataset that contains not only deepfake videos but also the corresponding synthesized cloned audio. …
Localized Audio Visual DeepFake Dataset (LAV-DF). Paper: Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method …
The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …
1,000 songs have been selected from the Free Music Archive (FMA). The excerpts that were annotated are available in the same …
Fer2013 contains approximately 30,000 grayscale facial images of different expressions, with size restricted to 48×48, and the main labels of …
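Since Fer2013 ships as a single CSV rather than image files, a loader has to rebuild each face from its pixel string. A minimal sketch, assuming the commonly distributed fer2013.csv layout (columns `emotion`, `pixels`, `Usage`; the column names and label order are assumptions taken from the standard release, not from this catalog):

```python
import csv
import numpy as np

# Label order follows the usual Fer2013 coding (0=angry ... 6=neutral);
# verify against your copy of the CSV.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def load_fer2013(path="fer2013.csv"):
    images, labels = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # "pixels" holds 2304 space-separated grayscale values (48 * 48)
            face = np.array(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
            images.append(face)
            labels.append(int(row["emotion"]))
    return np.stack(images), np.array(labels)
```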
The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …
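RAVDESS encodes its labels directly in each filename as seven two-digit fields (e.g. 03-01-06-01-02-01-12.wav). A hedged parsing sketch, assuming the field order and code tables documented in the dataset's Zenodo release:

```python
# Emotion codes per the RAVDESS documentation (assumed here; verify against
# the official release notes): 01=neutral ... 08=surprised.
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = (
        int(part) for part in stem.split("-"))
    return {
        "vocal_channel": "speech" if channel == 1 else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "strong" if intensity == 2 else "normal",
        "actor": actor,
        "actor_sex": "male" if actor % 2 else "female",  # odd actor IDs are male
    }

# parse_ravdess_name("03-01-06-01-02-01-12.wav")["emotion"] -> "fearful"
```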
The SEED dataset contains subjects' EEG signals recorded while they were watching film clips. The film clips are carefully selected so …
The football keyword dataset (FKD) is a new Persian keyword spotting dataset collected through crowdsourcing. This dataset contains …
The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …
The AliMeeting corpus consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as …
The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, …
VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
CN-Celeb is a large-scale speaker recognition dataset collected 'in the wild'. This dataset contains more than 130,000 utterances from 1,000 …
This is the forehead accelerometer variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for advancing …
This is the reference headset microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …
This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …
This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …
This is the temple vibration pickup variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …
This is the throat microphone (laryngophone) variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …
VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances …
We release the BERSt Dataset for various speech recognition tasks, including Automatic Speech Recognition (ASR) and Speech Emotion …
CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …
The EMODB database is a freely available German emotional speech database created by the Institute of Communication Science, …
The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for …
LSSED is a challenging large-scale English dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in …
We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and …
A Russian dataset of emotional speech dialogues, assembled from ~3.5 hours of live speech by actors who voiced …
The database includes 3,000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio …
The DNS Challenge at INTERSPEECH 2020 was intended to promote collaborative research in single-channel speech enhancement aimed at maximizing the perceptual …
The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise …
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …
Uses the same clean speech as VoiceBank+DEMAND but more noise types. Features much lower SNRs ([−10, −5, 0, 5, 10, 15, …
VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …
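The pairing WHAM! describes boils down to scaling a noise clip so the resulting mixture lands at a chosen signal-to-noise ratio. A minimal sketch of that operation, assuming single-channel numpy arrays at the same sample rate (the SNR handling is illustrative, not WHAM!'s exact recipe):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the result has the target SNR in dB."""
    noise = noise[: len(speech)]              # trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10     # guard against silent noise
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```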
AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin. Source: [AISHELL-1: An Open-Source Mandarin …
AISHELL-2 contains 1,000 hours of clean read-speech data recorded on iOS devices and is free for academic usage. Source: [AISHELL-2: Transforming Mandarin ASR …
Common Voice is an audio dataset that consists of unique MP3 files, each with a corresponding text file. There are 9,283 recorded …
GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, …
This noisy speech test set is created from Google Speech Commands v2 [1] and the Musan dataset [2]. It could …
Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …
MediaSpeech is a media speech dataset (you might have guessed this) built with the purpose of testing Automated Speech Recognition …
Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …
SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours …
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
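For experimentation, torchaudio bundles a ready-made wrapper for this dataset. A short hedged sketch; the tuple layout follows torchaudio's documented return signature, so verify it against your installed version:

```python
import torchaudio

# Downloads and indexes the Speech Commands corpus under ./data.
dataset = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)

# Each item is (waveform, sample_rate, label, speaker_id, utterance_number).
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate, waveform.shape)  # e.g. a one-second 16 kHz clip
```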
The TED-LIUM corpus consists of English-language TED talks. It includes transcriptions of these talks. The audio is sampled at 16kHz. …
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists …
Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test). Count of …
We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labelled speech, and about …
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …
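At its core, the reverberation step is a convolution of dry speech with a room impulse response before the noise is mixed in. A hedged sketch under that reading; `rir` here is any pre-computed impulse response array, whereas WHAMR!'s own RIRs are simulated with specific room parameters not shown:

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry_speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry speech with a room impulse response, keeping the length."""
    wet = fftconvolve(dry_speech, rir)[: len(dry_speech)]
    return wet / (np.max(np.abs(wet)) + 1e-10)  # renormalize to avoid clipping
```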
WSJ0-2mix is a speech separation corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus. Source: [Deep …
The English data for voice building was obtained, prepared, and provided to the challenge by Lessac Technologies Inc., having originally …
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …
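LJ Speech releases pair those clips with a pipe-delimited metadata.csv (clip id, raw transcription, normalized transcription). A minimal loading sketch, assuming that three-field layout and the standard LJSpeech-1.1 folder structure:

```python
def load_ljspeech_metadata(path="LJSpeech-1.1/metadata.csv"):
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on at most two pipes, in case a transcript contains the character.
            clip_id, raw, normalized = line.rstrip("\n").split("|", 2)
            entries.append({"wav": f"wavs/{clip_id}.wav",
                            "text": normalized or raw})  # prefer the normalized form
    return entries
```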
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24kHz sampling rate, prepared by …
CVSS is a massively multilingual-to-English speech to speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into …
The Taiwanese Across Taiwan (TAT) corpus is a large-scale database of native Taiwanese article/reading speech collected across Taiwan. This corpus contains …
Fluent Speech Commands is an open source audio dataset for spoken language understanding (SLU) experiments. Each utterance is labeled with …
The SmartLights benchmark from Snips tests the capability of controlling lights in different rooms. It consists of 1,660 requests which are …
The SmartSpeaker benchmark tests the performance of reacting to music player commands in English as well as in French. It …
In SpokenSQuAD, the document is in spoken form, the input question is in text form, and the answer …
Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. …
CNN/Daily Mail is a dataset for text summarization. Human-generated abstractive summary bullets were created from news stories in CNN …
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …
CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …
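A CSL (Circular Skip Link) graph is a cycle in which every node also links to the node a fixed skip R away, which is exactly a circulant graph. A hedged construction sketch with networkx; n=41 follows Murphy et al. (2019), but treat the exact parameters as assumptions:

```python
import networkx as nx

def csl_graph(n: int = 41, skip: int = 2) -> nx.Graph:
    """Cycle of n nodes where node i also connects to i +/- skip (mod n)."""
    return nx.circulant_graph(n, [1, skip])

g = csl_graph(skip=3)
print(g.number_of_nodes(), g.number_of_edges())  # 41 nodes, 82 edges (4-regular)
```

Because two CSL graphs with different skips are regular and locally identical, telling them apart is what stresses a GNN's expressivity, per the dataset's stated purpose.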
CommonGen is constructed through a combination of crowdsourced and existing caption corpora, and consists of 79k commonsense descriptions over 35k unique …
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It …
DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …
DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 …
As part of our research efforts toward making LLMs safer for public …
LCSTS is a large-scale Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which …
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …
ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …
ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in …
The Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …
The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …