Machine Learning Benchmarks

Browse 111 benchmarks across 25 tasks
Browse by Category

FQL-Driving

FQL-driving

📊 1 result
📏 Metrics: 0..5sec

Arabic Text Diacritization

CATT

The CATT benchmark dataset comprises 742 sentences, which were scraped from an internet news source in 2023. It covers multiple …

📊 11 results
📏 Metrics: DER(%), WER (%)

Audio Generation

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 23 results
📏 Metrics: FD_openl3, FAD, FD, KL_passt, IS, CLAP_LAION, CLAP_MS

Audio-Visual Speech Recognition

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 8 results
📏 Metrics: Test WER

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 12 results
📏 Metrics: Word Error Rate (WER)

Automatic Speech Recognition (ASR)

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 9 results
📏 Metrics: Test WER

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 2 results
📏 Metrics: WER, Word Error Rate (WER)

RealMAN

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …

📊 1 result
📏 Metrics: CER

Sagalee

Sagalee is a speech recognition dataset for the Oromo language. Key features: 100 hours of read speech; 283 gender balanced …

📊 2 results
📏 Metrics: Test WER

Chatbot

AlpacaEval

The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, hh-rlhf. Those were selected so that the AlpacaEval ranking …

📊 1 result
📏 Metrics: Average win rate

Cultural Vocal Bursts Intensity Prediction

HUME-VB

The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train …

📊 1 result
📏 Metrics: Concordance correlation coefficient (CCC)

DeepFake Detection

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 result
📏 Metrics: AUC, Validation Accuracy

DFDC

The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos. The DFDC dataset …

📊 3 results
📏 Metrics: AUC, LogLoss

FaceForensics

FaceForensics is a video dataset consisting of more than 500,000 frames containing faces from 1004 videos that can be used …

📊 1 result
📏 Metrics: DF, FS, FSF, NT, Real, Total Accuracy

FaceForensics++

FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation …

📊 6 results
📏 Metrics: AUC, LogLoss

FakeAVCeleb

FakeAVCeleb is a novel Audio-Video Deepfake dataset that not only contains deepfake videos but respective synthesized cloned audios as well. …

📊 10 results
📏 Metrics: ROC AUC, AP, Accuracy (%)

LAV-DF

Localized Audio Visual DeepFake Dataset (LAV-DF). Paper: Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method …

📊 1 result
📏 Metrics: AUC

Emotion Recognition

EMOTIC

The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …

📊 2 results
📏 Metrics: Top-3 Accuracy (%)

Emomusic

1,000 songs were selected from the Free Music Archive (FMA). The excerpts which were annotated are available in the same …

📊 5 results
📏 Metrics: EmoA, EmoV

FER2013

FER2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 1 result
📏 Metrics: 5-class test accuracy

MSP-Podcast

The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …

📊 1 result
📏 Metrics: Concordance correlation coefficient (CCC)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 2 results
📏 Metrics: Accuracy, WAR

SEED

The SEED dataset contains subjects' EEG signals when they were watching films clips. The film clips are carefully selected so …

📊 1 result
📏 Metrics: Accuracy

Keyword Spotting

FKD

The football keyword dataset (FKD), as a new keyword spotting dataset in Persian, is collected with crowdsourcing. This dataset contains …

📊 2 results
📏 Metrics: Accuracy

TAU Urban Acoustic Scenes 2019

The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …

📊 2 results
📏 Metrics: Accuracy

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …

📊 2 results
📏 Metrics: Accuracy (%)

Speaker Diarization

AliMeeting

The AliMeeting corpus consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as …

📊 1 result
📏 Metrics: DER(%)

DIHARD II

The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, …

📊 1 result
📏 Metrics: DER(%), DER - no overlap

Speaker Identification

VoxCeleb1

VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

📊 12 results
📏 Metrics: Top-1 (%), Top-5 (%), Number of Params, Accuracy

Speaker Recognition

VoxCeleb1

VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

📊 2 results
📏 Metrics: EER

Speaker Verification

CN-CELEB

CN-Celeb is a large-scale speaker recognition dataset collected 'in the wild'. This dataset contains more than 130,000 utterances from 1,000 …

📊 2 results
📏 Metrics: EER

VibraVox (forehead accelerometer)

This is the forehead accelerometer variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for advancing …

📊 1 result
📏 Metrics: Test EER, Test min-DCF

VibraVox (headset microphone)

This is the reference headset microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …

📊 1 result
📏 Metrics: Test EER, Test min-DCF

VibraVox (rigid in-ear microphone)

This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …

📊 1 result
📏 Metrics: Test EER, Test min-DCF

VibraVox (soft in-ear microphone)

This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …

📊 1 result
📏 Metrics: Test EER, Test min-DCF

VibraVox (temple vibration pickup)

This is the temple vibration pickup variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …

📊 1 result
📏 Metrics: Test EER, Test min-DCF

VibraVox (throat microphone)

This is the throat microphone (laryngophone) variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …

📊 1 result
📏 Metrics: Test EER, Test min-DCF

VoxCeleb1

VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

📊 16 results
📏 Metrics: EER

VoxCeleb2

VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances …

📊 1 result
📏 Metrics: EER

Speech Emotion Recognition

BERSt

The BERSt dataset is released for various speech recognition tasks, including Automatic Speech Recognition (ASR) and Speech Emotion …

📊 3 results
📏 Metrics: Unweighted Accuracy (UA), Weighted Accuracy (WA)

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …

📊 8 results
📏 Metrics: Accuracy

EmoDB Dataset

The EMODB database is the freely available German emotional database. The database is created by the Institute of Communication Science, …

📊 1 result
📏 Metrics: Accuracy, F1

IEMOCAP

The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for …

📊 5 results
📏 Metrics: UA CV, WA CV, UA, WA, F1

LSSED

LSSED is a challenging large-scale English dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in …

📊 1 result
📏 Metrics: Unweighted Accuracy (UA)

MSP-IMPROV

We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and …

📊 1 result
📏 Metrics: UA

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 1 result
📏 Metrics: Accuracy, F1 Score, Precision, Recall, F1

RESD

Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced …

📊 3 results
📏 Metrics: Weighted Accuracy (WA), Unweighted Accuracy (UA), Weighted F1

ShEMO

The database includes 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio …

📊 1 result
📏 Metrics: Unweighted Accuracy

Speech Enhancement

DNS Challenge

The DNS Challenge at INTERSPEECH 2020 was intended to promote collaborative research in single-channel speech enhancement aimed at maximizing the perceptual …

📊 4 results
📏 Metrics: PESQ-NB, PESQ-WB

EARS-WHAM

The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise …

📊 6 results
📏 Metrics: PESQ-WB, SI-SDR, ESTOI, SIGMOS, DNSMOS, POLQA

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 6 results
📏 Metrics: PESQ, STOI, ViSQOL, HASQI, Audio Quality MOS, SDR, ESTOI, HASPI, SI-SDR, SIIB, SNR, SegSNR

RealMAN

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …

📊 1 result
📏 Metrics: DNSMOS, DNSMOS BAK, DNSMOS OVRL, DNSMOS SIG, PESQ-WB

VB-DemandEx

Uses the same clean speech as VoiceBank+DEMAND but more noise types. Features much lower SNRs ([−10, −5, 0, 5, 10, 15, …

📊 4 results
📏 Metrics: ESTOI, Number of parameters (M), PESQ (wb), SI-SDR, SSNR

VoiceBank + DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …

📊 34 results
📏 Metrics: PESQ (wb), CBAK, COVL, CSIG, STOI, ESTOI, SSNR, SI-SDR, Para. (M)

VoiceBank+DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …

📊 2 results
📏 Metrics: PESQ, DNSMOS, DNSMOS BAK, DNSMOS OVRL, DNSMOS SIG, ESTOI, SI-SDR, PESQ (wb)

WHAM!

The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …

📊 1 result
📏 Metrics: PESQ, SDR, SI-SNR

WHAMR!

WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …

📊 2 results
📏 Metrics: PESQ, SI-SDR, ΔPESQ, SI-SNR, SDR

Speech Recognition

AISHELL-1

AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin. Source: [AISHELL-1: An Open-Source Mandarin …

📊 18 results
📏 Metrics: Word Error Rate (WER), Params(M)

AISHELL-2

AISHELL-2 contains 1,000 hours of clean read-speech data recorded on iOS devices and is free for academic usage. Source: [AISHELL-2: Transforming Mandarin ASR …

📊 2 results
📏 Metrics: Word Error Rate (WER)

Common Voice

Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded …

📊 2 results
📏 Metrics: Test WER

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 5 results
📏 Metrics: WER (%)

GigaSpeech

GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, …

📊 1 result
📏 Metrics: Word Error Rate (WER)

Google Speech Commands - Musan

This noisy speech test set is created from Google Speech Commands v2 [1] and the Musan dataset [2]. It could …

📊 1 result
📏 Metrics: Error rate - SNR 0dB

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 1 result
📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 4 results
📏 Metrics: Word Error Rate (WER)

LibriCSS

Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …

📊 2 results
📏 Metrics: Word Error Rate (WER)

MediaSpeech

MediaSpeech is a media speech dataset (you might have guessed this) built with the purpose of testing Automated Speech Recognition …

📊 8 results
📏 Metrics: WER for Arabic, WER for French, WER for Spanish, WER for Turkish

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 8 results
📏 Metrics: VoxPopuli (Dev), VoxPopuli (Test), VoxCeleb (Dev), VoxCeleb (Test)

SPGISpeech

SPGISpeech (pronounced "speegie-speech") is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours …

📊 2 results
📏 Metrics: Word Error Rate (WER)

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

📊 3 results
📏 Metrics: Accuracy (%)

TED-LIUM

The TED-LIUM corpus consists of English-language TED talks. It includes transcriptions of these talks. The audio is sampled at 16kHz. …

📊 2 results
📏 Metrics: Word Error Rate (WER)

TIMIT

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists …

📊 20 results
📏 Metrics: Percentage error

TUDA

Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test) Count of …

📊 3 results
📏 Metrics: Test WER

VietMed

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …

📊 8 results
📏 Metrics: Dev WER, Test WER

WenetSpeech

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours high-quality labeled speech, 2,400+ hours weakly labelled speech, and about …

📊 8 results
📏 Metrics: Character Error Rate (CER)

Speech Separation

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 8 results
📏 Metrics: SI-SNRi, SDRi, PESQ, STOI

LibriCSS

Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …

📊 2 results
📏 Metrics: 0S, 0L, 10%, 20%, 30%, 40%

VoxCeleb2

VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances …

📊 5 results
📏 Metrics: SI-SNRi, SDRi

WHAM!

The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …

📊 5 results
📏 Metrics: SI-SDRi

WHAMR!

WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …

📊 17 results
📏 Metrics: SI-SDRi, MACs (G), Number of parameters (M), SDRi

WSJ0-2mix

WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus. Source: [Deep …

📊 36 results
📏 Metrics: SI-SDRi, SDRi, Number of parameters (M), MACs (G)

Speech Synthesis

Blizzard Challenge 2013

The English data for voice building was obtained, prepared, and provided to the challenge by Lessac Technologies Inc., having originally …

📊 2 results
📏 Metrics: NLL

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 4 results
📏 Metrics: Mean Opinion Score

LibriTTS

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by …

📊 15 results
📏 Metrics: PESQ, M-STFT, MCD, Periodicity, V/UV F1

Speech-to-Speech Translation

CVSS

CVSS is a massively multilingual-to-English speech to speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into …

📊 2 results
📏 Metrics: ASR-BLEU, Parameters

TAT

Taiwanese Across Taiwan (TAT) corpus is a Large-Scale database of Native Taiwanese Article/Reading Speech collected across Taiwan. This corpus contains …

📊 8 results
📏 Metrics: ASR-BLEU (Dev), ASR-BLEU (Test)

Spoken Language Understanding

Fluent Speech Commands

Fluent Speech Commands is an open source audio dataset for spoken language understanding (SLU) experiments. Each utterance is labeled with …

📊 17 results
📏 Metrics: Accuracy (%)

Snips-SmartLights

The SmartLights benchmark from Snips tests the capability of controlling lights in different rooms. It consists of 1,660 requests which are …

📊 7 results
📏 Metrics: Accuracy (%)

Snips-SmartSpeaker

The SmartSpeaker benchmark tests the performance of reacting to music player commands in English as well as in French. It …

📊 5 results
📏 Metrics: Accuracy-EN (%), Accuracy-FR (%)

Spoken-SQuAD

In SpokenSQuAD, the document is in spoken form, the input question is in the form of text and the answer …

📊 4 results
📏 Metrics: F1 score

Timers and Such

Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. …

📊 3 results
📏 Metrics: Accuracy (%)

Text Generation

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN …

📊 1 result
📏 Metrics: ROUGE-L

COCO Captions

COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …

📊 4 results
📏 Metrics: BLEU-2, BLEU-3, BLEU-4, BLEU-5

CSL

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …

📊 1 result
📏 Metrics: ROUGE-L

CommonGen

CommonGen is constructed through a combination of crowdsourced and existing caption corpora, consists of 79k commonsense descriptions over 35k unique …

📊 4 results
📏 Metrics: CIDEr, METEOR, BLEU-4, SPICE

Czech restaurant information

Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It …

📊 3 results
📏 Metrics: METEOR

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 3 results
📏 Metrics: BLEU, METEOR, FactSpotter

DailyDialog

DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 …

📊 1 result
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4

HarmfulQA

As part of our research efforts toward making LLMs safer for public …

📊 1 result
📏 Metrics: ASR

LCSTS

LCSTS is a large corpus of Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which …

📊 1 result
📏 Metrics: ROUGE-L

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 2 results
📏 Metrics: eval_loss

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

📊 4 results
📏 Metrics: BLEU-1, Perplexity

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

📊 4 results
📏 Metrics: Distinct-3, Distinct-4, Distinct-2, Perplexity

SciQ

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in …

📊 3 results
📏 Metrics: Accuracy

Text-To-Speech Synthesis

20000 utterances

📊 1 result
📏 Metrics: 10-keyword Speech Commands dataset

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 15 results
📏 Metrics: Audio Quality MOS, Pleasantness MOS, Word Error Rate (WER), MOS, WER (%)

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

📊 1 result
📏 Metrics: MOS

Visual Speech Recognition

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 2 results
📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 3 results
📏 Metrics: Word Error Rate (WER)

Voice Conversion

VCTK

This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …

📊 1 result
📏 Metrics: Total Length Error (TLE), Word Length Error (WLE), Phone Length Error (PLE)