Machine Learning Benchmarks

Browse 202 benchmarks across 34 tasks
Browse by Category

1 Image, 2*2 Stitchi

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0..5sec

10-shot image generation

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0-shot MRR

FlyingThings3D

FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along …

📊 1 results
📏 Metrics: 0..5sec

MEAD

Multi-view Emotional Audio-visual Dataset

📊 1 results
📏 Metrics: 12k

Music21

Music21 is an untrimmed video dataset crawled by keyword query from Youtube. It contains music performances belonging to 21 categories. …

📊 1 results
📏 Metrics: 0..5sec

2D Semantic Segmentation

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 1 results
📏 Metrics: mIoU

GF-PA66 3D XCT

Stack of 2D gray images of glass fiber-reinforced polyamide 66 (GF-PA66) 3D X-ray Computed Tomography (XCT) specimen. Usage: 2D/3D image …

📊 1 results
📏 Metrics: Jaccard (Mean)

WaterScenes

A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces. WaterScenes, the first …

📊 1 results
📏 Metrics: mIoU

WildScenes

WildScenes is a bi-modal benchmark dataset consisting of multiple large-scale, sequential traversals in natural environments, including semantic annotations in high-resolution …

📊 5 results
📏 Metrics: mIoU, mIoU (Temporal DA), mIoU (Env DA)

xBD

The xBD dataset contains over 45,000 km² of polygon-labeled pre- and post-disaster imagery. The dataset provides the post-disaster imagery …

📊 5 results
📏 Metrics: Weighted Average F1-score, Localization F1-score, Classification F1-score

Acoustic Scene Classification

CochlScene

CochlScene is a dataset for acoustic scene classification. The dataset consists of 76k samples collected from 831 participants in 13 …

📊 2 results
📏 Metrics: 1:1 Accuracy

DCASE 2019 Mobile

The TAU Urban Acoustic Scenes 2019 Mobile development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping …

📊 1 results
📏 Metrics: Accuracy

TAU Urban Acoustic Scenes 2019

The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …

📊 1 results
📏 Metrics: 1:1 Accuracy

TUT Acoustic Scenes 2017

The TUT Acoustic Scenes 2017 dataset is a collection of recordings from various acoustic scenes all from distinct locations. For …

📊 1 results
📏 Metrics: 1:1 Accuracy

TUT Urban Acoustic Scenes 2018

The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. …

📊 1 results
📏 Metrics: Acc

Active Speaker Localization

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 1 results
📏 Metrics: ASL mAP

Audio Classification

AudioSet

Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 43 results
📏 Metrics: Test mAP, AUC, d-prime

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …

📊 3 results
📏 Metrics: Accuracy

DEEP-VOICE: DeepFake Voice Recognition

DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. This dataset contains examples of real human speech, and …

📊 1 results
📏 Metrics: Accuracy (10-fold)

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 4 results
📏 Metrics: Top-1 Action, Top-1 Noun, Top-1 Verb, Top-5 Action, Top-5 Noun, Top-5 Verb

EPIC-SOUNDS

EPIC-SOUNDS is a large scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of …

📊 3 results
📏 Metrics: Accuracy

ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. …

📊 26 results
📏 Metrics: Top-1 Accuracy, PRE-TRAINING DATASET, Accuracy (5-fold)

FSD50K

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally …

📊 10 results
📏 Metrics: mAP, Mean AP

ICBHI Respiratory Sound Database

The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics …

📊 20 results
📏 Metrics: ICBHI Score, Sensitivity, Specificity

MeerKAT: Meerkat Kalahari Audio Transcripts

A large-scale reference dataset for bioacoustics. MeerKAT is a 1068h large-scale dataset containing data from audio-recording collars worn by free-ranging …

📊 1 results
📏 Metrics: AP

Multimodal PISA

Dataset for multimodal skills assessment, focusing on assessing a piano player's skill level. Annotations include the player's skill level and song difficulty …

📊 1 results
📏 Metrics: Accuracy (%)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 1 results
📏 Metrics: Top-1 Accuracy

SHD

The Spiking Heidelberg Digits (SHD) dataset is an audio-based classification dataset of 1k spoken digits ranging from zero to nine.

📊 11 results
📏 Metrics: Percentage correct

SSC

The SSC dataset is a spiking version of the Speech Commands dataset released by Google (Speech Commands). SSC was generated …

📊 3 results
📏 Metrics: Accuracy

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

📊 7 results
📏 Metrics: Accuracy

UCR Time Series Classification Archive

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining …

📊 1 results
📏 Metrics: FruitFlies, MosquitoSound, RightWhaleCalls

VocalSound

VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from …

📊 2 results
📏 Metrics: Accuracy

Audio Generation

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 23 results
📏 Metrics: FD_openl3, FAD, FD, KL_passt, IS, CLAP_LAION, CLAP_MS

Audio Source Separation

AudioSet

Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 2 results
📏 Metrics: SDR, SAR, SIR

Audio Super-Resolution

DSD100

DSD100 is a dataset of 100 full-length music tracks of different styles along with their isolated drums, …

📊 1 results
📏 Metrics: SNR

Audio Tagging

AudioSet

Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 9 results
📏 Metrics: mean average precision

Audio captioning

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 13 results
📏 Metrics: SPIDEr, CIDEr, SPICE, BLEU-4, METEOR, ROUGE-L, FENSE, SPIDEr-FL, #params (M), ROUGE, Sentence-BERT

Clotho

Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total …

📊 9 results
📏 Metrics: SPIDEr, CIDEr, SPICE, BLEU-4, METEOR, ROUGE-L, FENSE, SPIDEr-FL, Sentence-BERT

Beat Tracking

ASAP

ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical …

📊 1 results
📏 Metrics: F1

Ballroom

This data set includes beat and bar annotations of the ballroom dataset, introduced by Gouyon et al. [1]. [1] Gouyon …

📊 1 results
📏 Metrics: F1

Beatles

This dataset includes the beat and downbeat annotations for Beatles albums. The annotations are provided by M. E. P. Davies …

📊 1 results
📏 Metrics: F1

Candombe

35 recordings of Candombe music with beat and downbeat annotations.

📊 1 results
📏 Metrics: F1

Filosax

48 multitrack jazz recordings with many annotations.

📊 1 results
📏 Metrics: F1

GTZAN

The gtzan8 audio dataset contains 1000 tracks of 30 second length. There are 10 genres, each containing 100 tracks which …

📊 1 results
📏 Metrics: F1

Groove

The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive …

📊 1 results
📏 Metrics: F1

GuitarSet

GuitarSet is a dataset of high-quality guitar recordings and rich annotations. It contains 360 excerpts 30 seconds in length. The …

📊 1 results
📏 Metrics: F1

HJDB

J. Hockman, M. E. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and …

📊 1 results
📏 Metrics: F1

Hainsworth

S. W. Hainsworth and M. D. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP Journal on Advances in Signal …

📊 1 results
📏 Metrics: F1

Harmonix

Beats, downbeats, and functional structural annotations for 912 Pop tracks. Nieto, O., McCallum, M., Davies., M., Robertson, A., Stark, A., …

📊 1 results
📏 Metrics: F1

JAAH

Eremenko, E. Demirel, B. Bozkurt, and X. Serra, “Audio-aligned jazz harmony dataset for automatic chord transcription and corpus-based research,” in …

📊 1 results
📏 Metrics: F1

SIMAC

F. Gouyon, “A computational approach to rhythm description — Audio features for the computation of rhythm periodicity functions and their …

📊 1 results
📏 Metrics: F1

SMC

A. Holzapfel, M. E. Davies, J. R. Zapata, J. L. Oliveira, and F. Gouyon, “Selective sampling for beat tracking evaluation,” …

📊 1 results
📏 Metrics: F1

TapCorrect

J. Driedger, H. Schreiber, W. B. de Haas, and M. Müller, “Towards automatically correcting tapped beat annotations for music recordings.” …

📊 1 results
📏 Metrics: F1

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 results
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, SARTAJ dataset, and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 results
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 results
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 results
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 results
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 results
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 results
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 results
📏 Metrics: 1 shot Micro-F1

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38,000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 results
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time( ms / 100ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 results
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, which were originally collected for remote sensing scene classification. We …

📊 1 results
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 results
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOTS-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 results
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 results
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 results
📏 Metrics: Accuracy

TCGA

📊 1 results
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarges the dataset to understand how image backgrounds affect computer vision ML models, with the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

DeepFake Detection

1

111

📊 1 results
📏 Metrics: 0L

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: AUC, Validation Accuracy

DFDC

The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos. The DFDC dataset …

📊 3 results
📏 Metrics: AUC, LogLoss

FaceForensics

FaceForensics is a video dataset consisting of more than 500,000 frames containing faces from 1004 videos that can be used …

📊 1 results
📏 Metrics: DF, FS, FSF, NT, Real, Total Accuracy

FaceForensics++

FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation …

📊 6 results
📏 Metrics: AUC, LogLoss

FakeAVCeleb

FakeAVCeleb is a novel Audio-Video Deepfake dataset that not only contains deepfake videos but respective synthesized cloned audios as well. …

📊 10 results
📏 Metrics: ROC AUC, AP, Accuracy (%)

LAV-DF

Localized Audio Visual DeepFake Dataset (LAV-DF). Paper: Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method …

📊 1 results
📏 Metrics: AUC

Directional Hearing

VCTK

This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …

📊 1 results
📏 Metrics: SI-SDRi

Downbeat Tracking

ASAP

ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical …

📊 1 results
📏 Metrics: F1

Ballroom

This data set includes beat and bar annotations of the ballroom dataset, introduced by Gouyon et al. [1]. [1] Gouyon …

📊 1 results
📏 Metrics: F1

Beatles

This dataset includes the beat and downbeat annotations for Beatles albums. The annotations are provided by M. E. P. Davies …

📊 1 results
📏 Metrics: F1

Candombe

35 recordings of Candombe music with beat and downbeat annotations.

📊 1 results
📏 Metrics: F1

Filosax

48 multitrack jazz recordings with many annotations.

📊 1 results
📏 Metrics: F1

GTZAN

The gtzan8 audio dataset contains 1000 tracks of 30 second length. There are 10 genres, each containing 100 tracks which …

📊 1 results
📏 Metrics: F1

Groove

The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive …

📊 1 results
📏 Metrics: F1

GuitarSet

GuitarSet is a dataset of high-quality guitar recordings and rich annotations. It contains 360 excerpts 30 seconds in length. The …

📊 1 results
📏 Metrics: F1

HJDB

J. Hockman, M. E. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and …

📊 1 results
📏 Metrics: F1

Hainsworth

S. W. Hainsworth and M. D. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP Journal on Advances in Signal …

📊 1 results
📏 Metrics: F1

Harmonix

Beats, downbeats, and functional structural annotations for 912 Pop tracks. Nieto, O., McCallum, M., Davies., M., Robertson, A., Stark, A., …

📊 1 results
📏 Metrics: F1

JAAH

Eremenko, E. Demirel, B. Bozkurt, and X. Serra, “Audio-aligned jazz harmony dataset for automatic chord transcription and corpus-based research,” in …

📊 1 results
📏 Metrics: F1

TapCorrect

J. Driedger, H. Schreiber, W. B. de Haas, and M. Müller, “Towards automatically correcting tapped beat annotations for music recordings.” …

📊 1 results
📏 Metrics: F1

Emotion Recognition

EMOTIC

The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …

📊 2 results
📏 Metrics: Top-3 Accuracy (%)

Emomusic

1000 songs have been selected from the Free Music Archive (FMA). The excerpts which were annotated are available in the same …

📊 5 results
📏 Metrics: EmoA, EmoV

FER2013

Fer2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 1 results
📏 Metrics: 5-class test accuracy

MSP-Podcast

The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …

📊 1 results
📏 Metrics: Concordance correlation coefficient (CCC)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 2 results
📏 Metrics: Accuracy, WAR

SEED

The SEED dataset contains subjects' EEG signals when they were watching film clips. The film clips are carefully selected so …

📊 1 results
📏 Metrics: Accuracy

Environmental Sound Classification

ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. …

📊 1 results
📏 Metrics: Accuracy

FSD50K

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally …

📊 1 results
📏 Metrics: mAP

UrbanSound8K

Urban Sound 8K is an audio dataset that contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: …

📊 3 results
📏 Metrics: Accuracy

Few-Shot Learning

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple choice questions to identify the …

📊 1 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 4 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy, Harmonic mean

EuroSAT

Eurosat is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Harmonic mean

Large COVID-19 CT scan slice dataset

"We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the …

📊 1 results
📏 Metrics: AUC-ROC, Accuracy, Macro F1, Macro Precision, Macro Recall, Micro Precision, Specificity

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 1 results
📏 Metrics: Acc

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 results
📏 Metrics: F1-score

MedConceptsQA

MedConceptsQA: Open Source Medical Concepts QA Benchmark. The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA

📊 12 results
📏 Metrics: Accuracy

MedNLI

The MedNLI dataset consists of the sentence pairs developed by Physicians from the Past Medical History section of MIMIC-III clinical …

📊 1 results
📏 Metrics: Accuracy

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 2 results
📏 Metrics: Accuracy

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 1 results
📏 Metrics: Harmonic mean

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 3 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 1 results
📏 Metrics: Harmonic mean

Instrument Recognition

NSynth

NSynth is a dataset of one shot instrumental notes, containing 305,979 musical notes with unique pitch, timbre and envelope. The …

📊 7 results
📏 Metrics: Accuracy

OpenMIC-2018

OpenMIC-2018 is an instrument recognition dataset containing 20,000 examples of Creative Commons-licensed music available on the Free Music Archive. Each …

📊 5 results
📏 Metrics: mean average precision

Language Identification

Nordic Language Identification

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine-learning …

📊 1 results
📏 Metrics: Accuracy

OpenSubtitles

OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles …

📊 1 results
📏 Metrics: Accuracy

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The …

📊 1 results
📏 Metrics: Accuracy

VOXLINGUA107

Language Identification Dataset

📊 2 results
📏 Metrics: Error rate

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …

📊 1 results
📏 Metrics: Accuracy

Lung Sound Classification

ICBHI Respiratory Sound Database

The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics …

📊 1 results
📏 Metrics: Accuracy

Music Generation

Song Describer Dataset

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in …

📊 1 results
📏 Metrics: FAD VGG

Online Beat Tracking

Ballroom

This data set includes beat and bar annotations of the ballroom dataset, introduced by Gouyon et al. [1]. [1] Gouyon …

📊 3 results
📏 Metrics: F1

GTZAN

The gtzan8 audio dataset contains 1000 tracks of 30 second length. There are 10 genres, each containing 100 tracks which …

📊 7 results
📏 Metrics: F1

Rock Corpus

This dataset contains 200 famous songs in different genres (mostly in rock) and the beats and downbeat annotations are provided …

📊 3 results
📏 Metrics: F1

Sound Event Detection

DESED

The DESED dataset is a dataset designed to recognize sound event classes in domestic environments. The dataset is designed to …

📊 10 results
📏 Metrics: event-based F1 score, PSDS1, PSDS2

L3DAS21

L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65-hour 3D audio corpus, accompanied by …

📊 5 results
📏 Metrics: Error Rate, SED-score, F-Score

WildDESED

WildDESED is an extension of the original DESED dataset, created to reflect various domestic scenarios by incorporating complex and unpredictable …

📊 5 results
📏 Metrics: PSDS1 (-5dB), PSDS1 (0dB), PSDS1 (5dB), PSDS1 (10dB), PSDS1 (Clean)

Sound Event Localization and Detection

L3DAS21

L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65-hour 3D audio corpus, accompanied by …

📊 1 results
📏 Metrics: SELD score

PodcastFillers

The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. …

📊 2 results
📏 Metrics: event-based F1 score

RWCP Sound Scene Database

The RWCP Sound Scene Database includes non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses …

📊 1 results
📏 Metrics: accuracy

STARSS22

The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset consists of recordings of real scenes captured with a high channel-count spherical microphone array …

📊 2 results
📏 Metrics: Class-dependent localization error, Class-dependent localization recall, location-dependent F1-score (macro), location-dependent F1-score (micro), Localization-dependent error rate (20°)

TAU-NIGENS Spatial Sound Events 2021

The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated …

📊 1 results
📏 Metrics: ER≤20°, F1≤20°, LE-CD, LR-CD

Speech Enhancement

DNS Challenge

The DNS Challenge at INTERSPEECH 2020 intended to promote collaborative research in single-channel Speech Enhancement aimed to maximize the perceptual …

📊 4 results
📏 Metrics: PESQ-NB, PESQ-WB

EARS-WHAM

The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise …

📊 6 results
📏 Metrics: PESQ-WB, SI-SDR, ESTOI, SIGMOS, DNSMOS, POLQA

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 6 results
📏 Metrics: PESQ, STOI, ViSQOL, HASQI, Audio Quality MOS, SDR, ESTOI, HASPI, SI-SDR, SIIB, SNR, SegSNR

RealMAN

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …

📊 1 results
📏 Metrics: DNSMOS, DNSMOS BAK, DNSMOS OVRL, DNSMOS SIG, PESQ-WB

VB-DemandEx

Uses same clean speech as VoiceBank+Demand but more noise types. Features much lower SNRs ([−10, −5, 0, 5, 10, 15, …

📊 4 results
📏 Metrics: ESTOI, Number of parameters (M), PESQ (wb), SI-SDR, SSNR

VoiceBank + DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …

📊 34 results
📏 Metrics: PESQ (wb), CBAK, COVL, CSIG, STOI, ESTOI, SSNR, SI-SDR, Para. (M)

VoiceBank+DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …

📊 2 results
📏 Metrics: PESQ, DNSMOS, DNSMOS BAK, DNSMOS OVRL, DNSMOS SIG, ESTOI, SI-SDR, PESQ (wb)

WHAM!

The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …

📊 1 results
📏 Metrics: PESQ, SDR, SI-SNR

WHAMR!

WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …

📊 2 results
📏 Metrics: PESQ, SI-SDR, ΔPESQ, SI-SNR, SDR

Speech Recognition

AISHELL-1

AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin. Source: [AISHELL-1: An Open-Source Mandarin …

📊 18 results
📏 Metrics: Word Error Rate (WER), Params(M)

AISHELL-2

AISHELL-2 contains 1000 hours of clean read-speech data from iOS and is free for academic usage. Source: [AISHELL-2: Transforming Mandarin ASR …

📊 2 results
📏 Metrics: Word Error Rate (WER)

Common Voice

Common Voice is an audio dataset in which each entry consists of a unique MP3 and a corresponding text file. There are 9,283 recorded …

📊 2 results
📏 Metrics: Test WER

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 5 results
📏 Metrics: WER (%)

GigaSpeech

GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, …

📊 1 results
📏 Metrics: Word Error Rate (WER)

Google Speech Commands - Musan

This noisy speech test set is created from the Google Speech Commands v2 [1] and the Musan dataset [2]. It could …

📊 1 results
📏 Metrics: Error rate - SNR 0dB

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 1 results
📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 4 results
📏 Metrics: Word Error Rate (WER)

LibriCSS

Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …

📊 2 results
📏 Metrics: Word Error Rate (WER)

MediaSpeech

MediaSpeech is a media speech dataset (you might have guessed this) built with the purpose of testing Automated Speech Recognition …

📊 8 results
📏 Metrics: WER for Arabic, WER for French, WER for Spanish, WER for Turkish

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 8 results
📏 Metrics: VoxPopuli (Dev), VoxPopuli (Test), VoxCeleb (Dev), VoxCeleb (Test)

SPGISpeech

SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours …

📊 2 results
📏 Metrics: Word Error Rate (WER)

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

📊 3 results
📏 Metrics: Accuracy (%)

TED-LIUM

The TED-LIUM corpus consists of English-language TED talks. It includes transcriptions of these talks. The audio is sampled at 16kHz. …

📊 2 results
📏 Metrics: Word Error Rate (WER)

TIMIT

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists …

📊 20 results
📏 Metrics: Percentage error

TUDA

Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test). Count of …

📊 3 results
📏 Metrics: Test WER

VietMed

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …

📊 8 results
📏 Metrics: Dev WER, Test WER

WenetSpeech

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about …

📊 8 results
📏 Metrics: Character Error Rate (CER)

Speech Synthesis

Blizzard Challenge 2013

The English data for voice building was obtained, prepared and provided to the challenge by Lessac Technologies Inc., having originally …

📊 2 results
📏 Metrics: NLL

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 4 results
📏 Metrics: Mean Opinion Score

LibriTTS

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by …

📊 15 results
📏 Metrics: PESQ, M-STFT, MCD, Periodicity, V/UV F1

Target Sound Extraction

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 1 results
📏 Metrics: SDRi, SI-SDRi

AudioSet

AudioSet is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 1 results
📏 Metrics: SDRi, SI-SDRi

FSDSoundScapes

A synthetic sound mixture specification dataset for the Target Sound Extraction (TSE) task. Dataset samples consist of a .jams file …

📊 1 results
📏 Metrics: SI-SNRi

Text to Audio Retrieval

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 11 results
📏 Metrics: R@1, R@5, R@10

Clotho

Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total …

📊 12 results
📏 Metrics: R@1, R@5, R@10, mAP@10

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe …

📊 1 results
📏 Metrics: Text-to-audio R@1, Text-to-audio R@10, Text-to-audio R@5

SoundDescs

We introduce a new audio dataset called SoundDescs that can be used for tasks such as text to audio retrieval, …

📊 4 results
📏 Metrics: R@1, R@10

Text-To-Speech Synthesis

20000 utterances

20000 utterances

📊 1 results
📏 Metrics: 10-keyword Speech Commands dataset

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 15 results
📏 Metrics: Audio Quality MOS, Pleasantness MOS, Word Error Rate (WER), MOS, WER (%)

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

📊 1 results
📏 Metrics: MOS

Text-to-Music Generation

MusicBench

The MusicBench dataset is a music audio-text pair dataset that was designed for text-to-music generation and released along with …

📊 1 results
📏 Metrics: FAD

MusicCaps

MusicCaps is a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. For each 10-second …

📊 20 results
📏 Metrics: FAD, FD_openl3, FD, KL_passt, IS, CLAP_LAION, CLAP_MS

Voice Conversion

VCTK

This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …

📊 1 results
📏 Metrics: Total Length Error (TLE), Word Length Error (WLE), Phone Length Error (PLE)