Machine Learning Benchmarks

Browse 417 benchmarks across 97 tasks
Browse by Category

10-shot image generation

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0-shot MRR

FlyingThings3D

FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along …

📊 1 results
📏 Metrics: 0..5sec

MEAD

Multi-view Emotional Audio-visual Dataset

📊 1 results
📏 Metrics: 12k

Music21

Music21 is an untrimmed video dataset crawled by keyword query from YouTube. It contains music performances belonging to 21 categories. …

📊 1 results
📏 Metrics: 0..5sec

16k

ConceptNet

ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected …

📊 1 results
📏 Metrics: 1'"

Anatomy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Anomaly Detection

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

📊 1 results
📏 Metrics: AUC

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: AUROC

BTAD

The BTAD (beanTech Anomaly Detection) dataset is a real-world industrial anomaly dataset. The dataset contains a total of 2830 …

📊 12 results
📏 Metrics: Detection AUROC, Segmentation AUROC, Segmentation AP, Segmentation AUPRO

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: Mean AUC

COCO-OOC

COCO-OOC goes beyond standard object detection to ask the question: Which objects are out-of-context (OOC)? Given an image with a …

📊 1 results
📏 Metrics: AUC

CUHK Avenue

Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 …

📊 29 results
📏 Metrics: AUC, RBDC, TBDC, FPS

DIOR

📊 4 results
📏 Metrics: ROC AUC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 10 results
📏 Metrics: ROC AUC

Fishyscapes

Fishyscapes is a public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates …

📊 8 results
📏 Metrics: AP, FPR95

Forest CoverType

Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given …

📊 1 results
📏 Metrics: AUC

Hyper-Kvasir Dataset

The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks and pathological and normal findings. A total …

📊 5 results
📏 Metrics: AUC

IITB Corridor

An abnormal activity dataset for research use that contains 483,566 annotated frames. Source: [Multi-timescale Trajectory Prediction for Abnormal Human Activity …

📊 1 results
📏 Metrics: AUC

ITDD

The Industrial Textile Defect Detection (ITDD) dataset includes 1885 industrial textile images categorized into 4 categories: cotton fabric, dyed fabric, …

📊 1 results
📏 Metrics: Detection AUROC, Segmentation AUROC

InsPLAD

InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicles colour images. It contains 17 …

📊 4 results
📏 Metrics: Detection AUROC

KDD Cup 1999

This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held …

📊 1 results
📏 Metrics: F1-Score

Kaggle-Credit Card Fraud Dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …

📊 1 results
📏 Metrics: AUC

LAG

Includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432). Source: [Attention Based Glaucoma Detection: A …

📊 4 results
📏 Metrics: AUC

Lost and Found

Lost and Found is a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of …

📊 4 results
📏 Metrics: AP, FPR

MIT-BIH Arrhythmia Database

The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the …

📊 1 results
📏 Metrics: F1 score

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 5 results
📏 Metrics: ROC AUC

MPDD

MPDD is a dataset aimed at benchmarking visual defect detection methods in industrial metal parts manufacturing. It consists of more …

📊 14 results
📏 Metrics: Detection AUROC, Segmentation AUROC, Segmentation AUPRO

MVTEC 3D-AD

MVTec 3D Anomaly Detection Dataset (MVTec 3D-AD) is a comprehensive 3D dataset for the task of unsupervised anomaly detection and …

📊 2 results
📏 Metrics: Segmentation AUPRO, Detection AUROC, Segmentation AUROC

MVTec LOCO AD

MVTec Logical Constraints Anomaly Detection (MVTec LOCO AD) dataset is intended for the evaluation of unsupervised anomaly localization algorithms. The …

📊 35 results
📏 Metrics: Avg. Detection AUROC, Detection AUROC (only logical), Detection AUROC (only structural), Segmentation AU-sPRO (until FPR 5%)

Musk v1

The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …

📊 1 results
📏 Metrics: F1-Score

ODDS

Outliers or anomalies are instances that do not conform to the norm of a dataset. Outlier detection is an important …

📊 3 results
📏 Metrics: AUROC, F1

PAD Dataset

Multi-pose Anomaly Detection (MAD) dataset, which represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD …

📊 2 results
📏 Metrics: Detection AUROC, Segmentation AUROC

Road Anomaly

This dataset contains images of unusual dangers which can be encountered by a vehicle on the road – animals, rocks, …

📊 9 results
📏 Metrics: AP, FPR95

SMD

A dataset for time-series anomaly detection.

📊 1 results
📏 Metrics: Recall, precision, F1, F1-score

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 1 results
📏 Metrics: Mean AUC

ShanghaiTech

The Shanghaitech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided …

📊 28 results
📏 Metrics: AUC, RBDC, TBDC

ShanghaiTech Campus

The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and …

📊 1 results
📏 Metrics: AUC-ROC

Street Scene

Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution …

📊 1 results
📏 Metrics: AUC, RBDC, TBDC

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results
📏 Metrics: AUC

Thyroid

Thyroid is a dataset for detection of thyroid diseases, in which patients diagnosed with hypothyroid or subnormal are anomalies against …

📊 2 results
📏 Metrics: AUC, Average Precision, F1-Score

UBnormal

UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, …

📊 13 results
📏 Metrics: AUC, RBDC, TBDC

UCF-Crime

The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world …

📊 1 results
📏 Metrics: AUC

UCR Anomaly Archive

The UCR Anomaly Archive is a collection of 250 uni-variate time series collected in human medicine, biology, meteorology and industry. …

📊 24 results
📏 Metrics: Average F1, AUC ROC

UCSD Ped2

The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …

📊 9 results
📏 Metrics: AUC, FPS

UEA time-series datasets

Five datasets used in the NeurTraL-AD paper. RacketSports (RS): Accelerometer and gyroscope recording of players playing four different racket sports. Each …

📊 3 results
📏 Metrics: Avg. ROC-AUC

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on GitHub - …

📊 2 results
📏 Metrics: AUC

VisA

The VisA dataset contains 12 subsets corresponding to 12 different objects. There are 10,821 …

📊 45 results
📏 Metrics: Detection AUROC, Segmentation AUPRO (until 30% FPR), F1-Score, Segmentation AUPRO, Segmentation AUROC

WFDD

WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric …

📊 1 results
📏 Metrics: Detection AUROC, Segmentation AUPRO, Segmentation AUROC

voraus-AD

voraus-AD contains machine data of a collaborative robot, which moves a can by performing an industrial pick-and-place task. The samples …

📊 3 results
📏 Metrics: Avg. Detection AUROC

Astronomy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Autonomous Driving

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Benchmarking

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …

📊 1 results
📏 Metrics: Perplexity

Brain Decoding

BCI Competition IV: ECoG to Finger Movements

Prediction of Finger Flexion (IV Brain-Computer Interface Data Competition). The goal of this dataset is to predict the flexion of …

📊 1 results
📏 Metrics: Pearson Correlation

Stanford ECoG library: ECoG to Finger Movements

Electrophysiological data from implanted electrodes in the human brain are rare, and therefore scientific access to it has remained somewhat …

📊 1 results
📏 Metrics: Pearson Correlation

Causal Inference

IHDP

The Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visit …

📊 9 results
📏 Metrics: Average Treatment Effect Error

Jobs

The Jobs dataset by LaLonde [36] is a widely used benchmark in the causal inference community, where the treatment is …

📊 3 results
📏 Metrics: Average Treatment Effect on the Treated Error

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 results
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, the SARTAJ dataset, and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 results
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 results
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 results
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 results
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper "A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 results
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 results
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 results
📏 Metrics: 1 shot Micro-F1

MixedWM38

The MixedWM38 dataset (WaferMap) has more than 38,000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 results
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time( ms / 100ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 results
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, which were originally collected for remote sensing scene classification. We …

📊 1 results
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 results
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOT-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 results
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 results
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 results
📏 Metrics: Accuracy

TCGA

📊 1 results
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarges the dataset to understand how image backgrounds affect computer vision ML models, with the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

Click-Through Rate Prediction

Criteo

Criteo contains 7 days of click-through data, which is widely used for CTR prediction benchmarking. There are 26 anonymous categorical …

📊 38 results
📏 Metrics: AUC, Log Loss

KDD12

A click-through prediction dataset; for more information, please see the Kaggle page.

📊 5 results
📏 Metrics: AUC, Log Loss

KKBox

The task is to predict the chances of a user listening to a song repetitively after the first observable listening …

📊 5 results
📏 Metrics: AUC

MovieLens

The MovieLens datasets, first released in 1998, describe people’s expressed preferences for movies. These preferences take the form of tuples, …

📊 3 results
📏 Metrics: AUC

iPinYou

The iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition was organized by iPinYou from April 1st, 2013 to December 31st, 2013. The …

📊 7 results
📏 Metrics: AUC, LogLoss

Clinical Knowledge

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Collaborative Filtering

Amazon-Book

N/A

📊 5 results
📏 Metrics: Recall@20, NDCG@20

Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and …

📊 11 results
📏 Metrics: Recall@20, NDCG@20

Yelp2018

The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars …

📊 9 results
📏 Metrics: NDCG@20, Recall@20

College Medicine

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Computer Security

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Crop Yield Prediction

SICKLE

The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation …

📊 1 results
📏 Metrics: MAPE (%)

De novo molecule generation from MS/MS spectrum

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: De novo …

📊 3 results
📏 Metrics: Top-1 Accuracy, Top-1 MCES, Top-1 Tanimoto, Top-10 Accuracy, Top-10 MCES, Top-10 Tanimoto

De novo molecule generation from MS/MS spectrum (bonus chemical formulae)

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: De novo …

📊 7 results
📏 Metrics: Top-1 Accuracy, Top-1 MCES, Top-1 Tanimoto, Top-10 Accuracy, Top-10 MCES, Top-10 Tanimoto

Deep Clustering

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NMI

USPS

USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 …

📊 1 results
📏 Metrics: NMI

DeepFake Detection

1

111

📊 1 results
📏 Metrics: 0L

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: AUC, Validation Accuracy

DFDC

The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos. The DFDC dataset …

📊 3 results
📏 Metrics: AUC, LogLoss

FaceForensics

FaceForensics is a video dataset consisting of more than 500,000 frames containing faces from 1004 videos that can be used …

📊 1 results
📏 Metrics: DF, FS, FSF, NT, Real, Total Accuracy

FaceForensics++

FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation …

📊 6 results
📏 Metrics: AUC, LogLoss

FakeAVCeleb

FakeAVCeleb is a novel Audio-Video Deepfake dataset that not only contains deepfake videos but respective synthesized cloned audios as well. …

📊 10 results
📏 Metrics: ROC AUC, AP, Accuracy (%)

LAV-DF

Localized Audio Visual DeepFake Dataset (LAV-DF). Paper: Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method …

📊 1 results
📏 Metrics: AUC

Econometrics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Emotion Recognition

EMOTIC

The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …

📊 2 results
📏 Metrics: Top-3 Accuracy (%)

Emomusic

1000 songs have been selected from the Free Music Archive (FMA). The excerpts which were annotated are available in the same …

📊 5 results
📏 Metrics: EmoA, EmoV

FER2013

Fer2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 1 results
📏 Metrics: 5-class test accuracy

MSP-Podcast

The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …

📊 1 results
📏 Metrics: Concordance correlation coefficient (CCC)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 2 results
📏 Metrics: Accuracy, WAR

SEED

The SEED dataset contains subjects' EEG signals when they were watching film clips. The film clips are carefully selected so …

📊 1 results
📏 Metrics: Accuracy

Ethics

Ethics (per ethics)

Ethics (per ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to …

📊 4 results
📏 Metrics: Accuracy

Fact Checking

AVeriTeC

AVeriTeC (Automated Verification of Textual Claims) is a dataset of 4568 real-world claims covering fact-checks by 50 different organizations. Each …

📊 2 results
📏 Metrics: Question Only score, Question + Answer score, AveriTeC

Fairness

DiveFace

A new face annotation dataset with balanced distribution between genders and ethnic origins. Source: [SensitiveNets: Learning Agnostic Representations with Application …

📊 1 results
📏 Metrics: Degree of Bias (DoB)

MORPH

MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 …

📊 1 results
📏 Metrics: Degree of Bias (DoB)

UTKFace

The UTKFace dataset is a large-scale face dataset with long age span (range from 0 to 116 years old). The …

📊 1 results
📏 Metrics: Degree of Bias (DoB)

Flood Inundation Mapping

Coastal Inundation Maps with Floodwater Depth Values

This dataset provides simulated flood inundation maps of Abu Dhabi's coast under 174 different shoreline protection scenarios. The maps were …

📊 1 results
📏 Metrics: Average MAE, Zero detection rate

Food recommendation

Oktoberfest Food Dataset

A realistic, diverse, and challenging dataset for object detection on images. The data was recorded at a beer tent in …

📊 1 results
📏 Metrics: 10 fold Cross validation

Formation Energy

JARVIS-DFT

JARVIS-DFT is a repository of density functional theory based calculation data for materials.

📊 4 results
📏 Metrics: MAE

Materials Project

The Materials Project is a collection of chemical compounds labelled with different attributes. The labelling is performed by different simulations, …

📊 6 results
📏 Metrics: MAE

OQM9HK

This is a large-scale dataset of quantum-mechanically calculated properties (DFT level) of crystalline materials for graph representation learning that contains …

📊 1 results
📏 Metrics: MAE

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 11 results
📏 Metrics: MAE

Fraud Detection

Amazon-Fraud

Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node …

📊 3 results
📏 Metrics: AUC-ROC, Averaged Precision, F1 Macro, G-mean

Elliptic Dataset

📊 6 results
📏 Metrics: AUC, AUPRC

Kaggle-Credit Card Fraud Dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …

📊 2 results
📏 Metrics: AUC, Accuracy, Average Precision

Yelp-Fraud

Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based …

📊 6 results
📏 Metrics: AUC-ROC, Averaged Precision, F1 Macro, G-mean

General Knowledge

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

High School European History

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Geography

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Government and Politics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Macroeconomics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Microeconomics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Psychology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School US History

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School World History

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Human Aging

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Human Organs Senses Multiple Choice

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Human Sexuality

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Image Generation

ARKitScenes

ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide …

📊 4 results
📏 Metrics: FID, FID (SwAV)

Binarized MNIST

A binarized version of MNIST. Source: Binarized MNIST

📊 10 results
📏 Metrics: nats, bits/dimension

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 72 results
📏 Metrics: FID, IS, NFE

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 6 results
📏 Metrics: FID, Inception Score, Model Size (MB)

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 6 results
📏 Metrics: FID-5k-training-steps

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 1 results
📏 Metrics: bpd (8-bits)

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 1 results
📏 Metrics: FLD

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 6 results
📏 Metrics: FID-10k-training-steps

FFHQ

Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity …

📊 12 results
📏 Metrics: FID, Clean-FID (70k), FID-10k-training-steps

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 5 results
📏 Metrics: FID, Precision, Recall

KMNIST

📊 1 results
📏 Metrics: FID

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30976 images (15488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 1 results
📏 Metrics: PSNR, SSIM

LSUN

The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN …

📊 1 results
📏 Metrics: Average FID

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 11 results
📏 Metrics: bits/dimension, FID, Precision, Recall, PSNR, SSIM

MetFaces

MetFaces is an image dataset of human faces extracted from works of art. The dataset consists of 1336 high-quality PNG …

📊 3 results
📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

Multi-dSprites

📊 1 results
📏 Metrics: FID

NASA Perseverance

Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.

📊 1 results
📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

ObjectsRoom

The ObjectsRoom dataset is based on the MuJoCo environment used by the Generative Query Network [4] and is a multi-object …

📊 3 results
📏 Metrics: FID

RC-49

RC-49 is a benchmark dataset for generating images conditional on a continuous scalar variable. It is made by rendering 49 …

📊 2 results
📏 Metrics: Intra-FID

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 4 results
📏 Metrics: FID, FID (SwAV)

SDSS Galaxies

This is a dataset of 306,006 galaxies whose coordinates are taken from the Sloan Digital Sky Survey Data Release 7 …

📊 1 results
📏 Metrics: FID

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 25 results
📏 Metrics: FID, Inception score, Model Size (MB), Recall, NFE

ShapeStacks

A simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives richly annotated regarding semantics and …

📊 3 results
📏 Metrics: FID

Stacked MNIST

The Stacked MNIST dataset is derived from the standard MNIST dataset with an increased number of discrete modes. 240,000 RGB …

📊 2 results
📏 Metrics: FID, Inception score

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 4 results
📏 Metrics: FID, Inception score

Stanford Dogs

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into …

📊 4 results
📏 Metrics: FID, Inception score

TextAtlasEval

A dense-text image benchmark that evaluates the text generation ability of large generative models.

📊 4 results
📏 Metrics: TextVisionBlend OCR (F1 Score), TextVisionBlend OCR (Accuracy), TextVisionBlend OCR (Cer), TextVisionBlend FID, TextVisionBlend Clip Score, StyledTextSynth OCR (F1 Score), StyledTextSynth OCR (Accuracy), StyledTextSynth OCR (Cer), StyledTextSynth FID, StyledTextSynth Clip Score, TextScenesHQ OCR (F1 Score), TextScenesHQ OCR (Accuracy), TextScenesHQ OCR (Cer), TextScenesHQ FID, TextScenesHQ Clip Score

VLN-CE

Vision and Language Navigation in Continuous Environments (VLN-CE) is an instruction-guided navigation task with crowdsourced instructions, realistic environments, and unconstrained …

📊 4 results
📏 Metrics: FID, FID (SwAV)

VizDoom

ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode …

📊 4 results
📏 Metrics: FID, FID (SwAV)

WISE

WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models …

📊 13 results
📏 Metrics: Overall, Cultural, Time, Space, Biology, Physics, Chemistry

Image Reconstruction

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 15 results
📏 Metrics: FID, LPIPS, PSNR, SSIM

Spike-X4K

The Spike-X4K Dataset is a high-resolution image reconstruction resource tailored for the latest advancements in spike camera technology. …

📊 1 results
📏 Metrics: Average PSNR

Ultra-High Resolution Image Reconstruction Benchmark

Ultra-high definition benchmark (UHDBench) includes 2293 images at 2k resolution sourced from the ground-truth test sets of HRSOD, LIU4k, UAVid, …

📊 6 results
📏 Metrics: rFID, PSNR, SSIM, LPIPS

Image Retrieval with Multi-Modal Query

MIT-States

The MIT-States dataset has 245 object classes, 115 attribute classes and ∼53K images. There is a wide range of objects …

📊 5 results
📏 Metrics: Recall@1, Recall@5, Recall@10

Intent Recognition

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

International Law

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Interpretability Techniques for Deep Learning

CausalGym

SyntaxGym, adapted for interventional interpretability.

📊 7 results
📏 Metrics: Log odds-ratio (pythia-6.9b)

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 7 results
📏 Metrics: Insertion AUC score

Intrusion Detection

20NewsGroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 1 results
📏 Metrics: Actions Top-1 (S2)

UNSW-NB15

UNSW-NB15 is a network intrusion dataset. It contains nine different attacks, including DoS, worms, backdoors, and fuzzers. The dataset contains …

📊 1 results
📏 Metrics: AUC

Jurisprudence

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Knowledge Tracing

EdNet

A large-scale hierarchical dataset of diverse student activities collected by Santa, a multi-platform self-study solution equipped with artificial intelligence tutoring …

📊 8 results
📏 Metrics: AUC, Acc

Language Modelling

2000 HUB5 English

2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English …

📊 1 results
📏 Metrics: 10-stage average accuracy

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: BPB

Books3

The Books3 dataset emerged as part of a broader effort to train AI models for natural language understanding and generation. …

📊 1 results
📏 Metrics: BPB

C4

C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. …

📊 9 results
📏 Metrics: Perplexity, TPUv3 Hours, Steps

Curation Corpus

The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. Source: …

📊 1 results
📏 Metrics: BPB

FreeLaw

Free Law Project is a leading nonprofit organization that aims to make the legal ecosystem more equitable and competitive through …

📊 1 results
📏 Metrics: BPB

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwiki8, is a byte-level dataset consisting of the first 100 million bytes …

📊 18 results
📏 Metrics: Bit per Character (BPC), Number of params

LAMBADA

The LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) benchmark is an open-ended cloze task which consists of about …

📊 34 results
📏 Metrics: Accuracy, Perplexity

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 12 results
📏 Metrics: eval_perplexity, eval_loss, parameters

PhilPapers

PhilPapers is a resource for the philosophical community. …

📊 1 results
📏 Metrics: BPB

PubMed Cognitive Control Abstracts

A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.

📊 1 results
📏 Metrics: BPB

SALMon

The SALMon dataset and benchmark was introduced in the paper "A Suite for Acoustic Language Model Evaluation", with the goal …

📊 8 results
📏 Metrics: Sentiment Consistency, Speaker Consistency, Gender Consistency, Background (Domain) Consistency, Background (Random) Consistency, Room Consistency, Sentiment Alignment, Background Alignment

Text8

📊 22 results
📏 Metrics: Bit per Character (BPC), Number of params

The Pile

The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets …

📊 39 results
📏 Metrics: Bits per byte, Test perplexity

VietMed

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …

📊 2 results
📏 Metrics: PPL

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …

📊 3 results
📏 Metrics: Perplexity

WikiText-103

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good …

📊 83 results
📏 Metrics: Test perplexity, Validation perplexity, Number of params

WikiText-2

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good …

📊 34 results
📏 Metrics: Test perplexity, Validation perplexity, Number of params

language-modeling-recommendation

This is the BIG-bench version of our language-based movie recommendation dataset: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/movie_recommendation. GPT-2 reaches 48.8% accuracy; chance is 25%.

📊 1 results
📏 Metrics: 1:1 Accuracy

Logical Fallacies

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Logical Reasoning

LingOly

This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems …

📊 11 results
📏 Metrics: Delta_NoContext, Exact Match Accuracy

RuWorldTree

RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts. Motivation The …

📊 4 results
📏 Metrics: Accuracy

Winograd Automatic

The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning. Motivation The dataset …

📊 4 results
📏 Metrics: Accuracy

MS/MS spectrum simulation

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: …

📊 4 results
📏 Metrics: Cosine Similarity, Jensen-Shannon Similarity, Hit Rate @ 1, Hit Rate @ 5, Hit Rate @ 20

MS/MS spectrum simulation (bonus chemical formulae)

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: …

📊 4 results
📏 Metrics: Hit Rate @ 1, Hit Rate @ 5, Hit Rate @ 20

Malware Classification

Microsoft Malware Classification Challenge

The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 …

📊 3 results
📏 Metrics: Accuracy (10-fold), LogLoss, Macro F1 (10-fold), Accuracy (5-fold), F1 score (5-fold), Accuracy

Management

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Marketing

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Medical Genetics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Model Compression

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 12 results
📏 Metrics: Top-1

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 2 results
📏 Metrics: Accuracy

Molecular Property Prediction

MUV

The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …

📊 2 results
📏 Metrics: ROC-AUC

MoleculeNet

MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and …

📊 5 results
📏 Metrics: AUC

PCBA

The PCBA dataset is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening …

📊 1 results
📏 Metrics: ROC-AUC

QM7

QM7 dataset is a subset of the GDB-13 database. GDB-13 contains nearly 1 billion stable and synthetically accessible organic molecules. …

📊 7 results
📏 Metrics: MAE

QM8

QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited state …

📊 7 results
📏 Metrics: MAE

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 7 results
📏 Metrics: MAE

SIDER

SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …

📊 16 results
📏 Metrics: ROC-AUC

Tox21

The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …

📊 17 results
📏 Metrics: ROC-AUC

clintox

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …

📊 18 results
📏 Metrics: ROC-AUC, Molecules (M)

Molecule retrieval from MS/MS spectrum

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: …

📊 8 results
📏 Metrics: Hit rate @ 1, Hit rate @ 5, Hit rate @ 20, MCES @ 1

Molecule retrieval from MS/MS spectrum (bonus chemical formulae)

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: …

📊 8 results
📏 Metrics: Hit rate @ 1, Hit rate @ 5, Hit rate @ 20, MCES @ 1

Multi-modal Classification

AudioSet

AudioSet is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 2 results
📏 Metrics: Average mAP

VGG-Sound

Consists of more than 210k videos for 310 audio classes. Source: VGGSound: A Large-scale Audio-Visual Dataset

📊 2 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

Nutrition

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Offline RL

D4RL

D4RL is a collection of environments for offline reinforcement learning. These environments include Maze2D, AntMaze, Adroit, Gym, Flow, FrankKitchen and …

📊 3 results
📏 Metrics: Average Reward

Philosophy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Physical Simulations

4D-DRESS

4D-DRESS is the first real-world 4D dataset of human clothing, capturing 64 human outfits in more than 520 motion sequences. …

📊 12 results
📏 Metrics: Chamfer (cm), Stretching Energy

Prehistory

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Product Recommendation

Coveo Data Challenge Dataset

The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". …

📊 1 results
📏 Metrics: F1, MRR

Professional Law

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Professional Medicine

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Professional Psychology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Public Relations

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Question Answering

AviationQA

AviationQA is introduced in the paper titled "There is No Big Brother or Small Brother: Knowledge Infusion in Language Models …"

📊 1 results
📏 Metrics: Hits@1

BBH

BIG-Bench Hard (BBH) is a subset of the BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a …

📊 1 results
📏 Metrics: Accuracy

BLURB

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …

📊 3 results
📏 Metrics: Accuracy

Bamboogle

The Bamboogle dataset is a collection of questions that was constructed to investigate the ability of language models to perform …

📊 9 results
📏 Metrics: Accuracy

BioASQ

BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), …

📊 6 results
📏 Metrics: Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 65 results
📏 Metrics: Accuracy

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 2 results
📏 Metrics: Accuracy

COPA

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. …

📊 55 results
📏 Metrics: Accuracy

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple choice questions to identify the …

📊 3 results
📏 Metrics: Macro F1 (10-fold)

ChAII - Hindi and Tamil Question Answering

The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions …

📊 1 results
📏 Metrics: Jaccard

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK. Motivation The task can be …

📊 4 results
📏 Metrics: Accuracy

Children's Book Test

📊 8 results
📏 Metrics: Accuracy-CN, Accuracy-NE

CliCR

CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case …

📊 2 results
📏 Metrics: F1

CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure …

📊 9 results
📏 Metrics: In-domain, Out-of-domain, Overall

Complex-CronQuestions

A filtered version of CronQuestions that better demonstrates the model's inference ability on complex temporal questions.

📊 3 results
📏 Metrics: Hits@1

ComplexWebQuestions

ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set …

📊 1 results
📏 Metrics: EM

ConditionalQA

ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable …

📊 3 results
📏 Metrics: Conditional (answers), Conditional (w/ conditions), Overall (answers), Overall (w/ conditions)

ConvFinQA

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …

📊 3 results
📏 Metrics: Execution Accuracy

CronQuestions

CRONQUESTIONS, the Temporal KGQA dataset consists of two parts: a KG with temporal annotations, and a set of natural language …

📊 10 results
📏 Metrics: Hits@1

DROP

Discrete Reasoning Over Paragraphs (DROP) is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a …

📊 6 results
📏 Metrics: Accuracy

DaNetQA

DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring: they are generated in unprompted and …

📊 6 results
📏 Metrics: Accuracy

DuoRC

DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in …

📊 3 results
📏 Metrics: Accuracy

EgoTaskQA

The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K questions programmatically generated over 2K egocentric videos. It …

📊 4 results
📏 Metrics: Direct

FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …

📊 7 results
📏 Metrics: EM

FQuAD

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ …

📊 6 results
📏 Metrics: EM, F1

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …

📊 4 results
📏 Metrics: F1, Rouge-L

FinQA

FinQA is a new large-scale dataset with Question-Answering pairs over Financial reports, written by financial experts. The dataset contains 8,281 …

📊 6 results
📏 Metrics: Execution Accuracy, Program Accuracy

GraphQuestions

GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …

📊 1 results
📏 Metrics: Accuracy

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 1 results
📏 Metrics: Accuracy

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 22 results
📏 Metrics: JOINT-F1, ANS-EM, ANS-F1, SUP-EM, SUP-F1, JOINT-EM

HybridQA

A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and …

📊 3 results
📏 Metrics: ANS-EM

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on …

📊 1 results
📏 Metrics: Exact Match, F1

KQA Pro

A large-scale dataset for Complex KBQA. Source: [KQA Pro: A Large-Scale Dataset with Interpretable Programs and Accurate SPARQLs for Complex …

📊 1 results
📏 Metrics: Accuracy

MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively …

📊 1 results
📏 Metrics: Accuracy

MRQA

The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems. …

📊 2 results
📏 Metrics: Average F1

MS MARCO

The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: Rouge-L, BLEU-1

MapEval-API

MapEval-API contains 300 question-answer pairs. The task is to answer questions by fetching the necessary information from external map APIs.

📊 2 results
📏 Metrics: Accuracy (%)

MapEval-Textual

MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer question …

📊 1 results
📏 Metrics: Accuracy (%)

Mathematics Dataset

This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This …

📊 3 results
📏 Metrics: Accuracy

MedQA

Multiple choice question answering based on the United States Medical License Exams (USMLE). The dataset is collected from the professional …

📊 27 results
📏 Metrics: Accuracy

MetaQA

The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written …

📊 1 results
📏 Metrics: AnswerExactMatch (Question Answering)

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …

📊 4 results
📏 Metrics: EM, F1

MultiQ

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. …

📊 4 results
📏 Metrics: Accuracy

MultiRC

MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions, i.e., questions that can be answered by …

📊 30 results
📏 Metrics: F1, EM

MultiTQ

MULTITQ is a large-scale dataset featuring ample relevant facts and multiple temporal granularities.

📊 9 results
📏 Metrics: Hits@1, Hits@10

NExT-QA (Open-ended VideoQA)

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …

📊 6 results
📏 Metrics: Accuracy, Confidence Score

NarrativeQA

The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers. Source: …

📊 8 results
📏 Metrics: Rouge-L, BLEU-1, BLEU-4, METEOR

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 46 results
📏 Metrics: EM

NewsQA

The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs. * Documents are CNN news articles. …

📊 16 results
📏 Metrics: EM, F1

OTT-QA

The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to …

📊 3 results
📏 Metrics: ANS-EM

OpenBookQA

OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. …

📊 40 results
📏 Metrics: Accuracy

PIQA

PIQA is a dataset for commonsense reasoning, and was created to investigate the physical knowledge of existing models in NLP. …

📊 67 results
📏 Metrics: Accuracy

PeerQA

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …

📊 5 results
📏 Metrics: Prometheus-2 Answer Correctness, Rouge-L, AlignScore

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship …

📊 2 results
📏 Metrics: Accuracy

PubChemQA

PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 26 results
📏 Metrics: Accuracy

QASPER

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language …

📊 1 results
📏 Metrics: Token F1

QuAC

Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer …

📊 2 results
📏 Metrics: F1, HEQD, HEQQ

QuALITY

QuALITY (Question Answering with Long Input Texts, Yes!) is a multiple-choice question answering dataset for long document comprehension. The dataset …

📊 1 results
📏 Metrics: Accuracy

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 19 results
📏 Metrics: Accuracy

RACE

The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …

📊 6 results
📏 Metrics: RACE-m, RACE-h, RACE

ReClor

Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …

📊 3 results
📏 Metrics: Accuracy, Accuracy (easy), Accuracy (hard)

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from …

📊 1 results
📏 Metrics: Accuracy

RuOpenBookQA

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts. Motivation RuOpenBookQA …

📊 4 results
📏 Metrics: Accuracy

SCDE

SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model …

📊 1 results
📏 Metrics: BA, PA, DE

SIQA

Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus …

📊 24 results
📏 Metrics: Accuracy

SQA3D

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it situates from an …

📊 7 results
📏 Metrics: AnswerExactMatch (Question Answering)

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 2 results
📏 Metrics: Exact Match, F1

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 1 results
📏 Metrics: Accuracy

SberQuAD

A large-scale analogue of Stanford SQuAD in the Russian language is a valuable resource that has not been …

📊 3 results
📏 Metrics: EM, F1

SchizzoSQUAD

The “Mental Health” forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts …

📊 1 results
📏 Metrics: Average F1, Averaged Precision

SimpleQuestions

SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding …

📊 1 results
📏 Metrics: F1

StepGame

A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

📊 1 results
📏 Metrics: 1-of-100 Accuracy

StoryCloze

Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. …

📊 20 results
📏 Metrics: Accuracy

StrategyQA

StrategyQA is a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred …

📊 11 results
📏 Metrics: Accuracy, EM

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress of QA research …

📊 1 results
📏 Metrics: Exact Match (EM)

TIQ

Existing benchmarks for temporal QA focus on a single information source (either a KB or a text corpus), and include …

📊 9 results
📏 Metrics: P@1

TempQA-WD

TempQA-WD is a benchmark dataset for temporal reasoning designed to encourage research in extending the present approaches to target a …

📊 1 results
📏 Metrics: F1

TempQuestions

Here, we take a key step in this direction and release a new benchmark, TempQuestions, containing 1,271 questions, that are …

📊 4 results
📏 Metrics: Hits@1, F1

TimeQuestions

Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class …

📊 16 results
📏 Metrics: P@1

Torque

Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Source: …

📊 2 results
📏 Metrics: F1, EM, C

TrecQA

Text Retrieval Conference Question Answering (TrecQA) is a dataset created from the TREC-8 (1999) to TREC-13 (2004) Question Answering tracks. …

📊 12 results
📏 Metrics: MAP, MRR

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 51 results
📏 Metrics: EM, F1

TruthfulQA

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises …

📊 30 results
📏 Metrics: MC1, MC2, % true, % info, % true (GPT-judge), BLEURT, ROUGE, BLEU, EM, Accuracy

TweetQA

With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering …

📊 3 results
📏 Metrics: BLEU-1, ROUGE-L

UniProtQA

UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 36 results
📏 Metrics: EM, F1

WebQuestionsSP

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …

📊 1 results
📏 Metrics: Accuracy

WebSRC

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …

📊 1 results
📏 Metrics: F1

WikiHop

WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …

📊 9 results
📏 Metrics: Test

WikiQA

The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain …

📊 23 results
📏 Metrics: MAP, MRR

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 2 results
📏 Metrics: Exact Match (EM)

WikiTableQuestions

WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was …

📊 2 results
📏 Metrics: Accuracy, Accuracy (Test)

catbAbI LM-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: Accuracy (mean)

catbAbI QA-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: 1:1 Accuracy

Recommendation Systems

Amazon Beauty

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links …

📊 5 results
📏 Metrics: Hit@10, nDCG@10, NDCG

Amazon Fashion

This dataset is a subset of the Amazon reviews dataset that contains fashion-related products.

📊 4 results
📏 Metrics: HitRatio@10 (100 Neg. Samples), nDCG@10 (100 Neg. Samples), AUC, nDCG@10 (500 Neg. Samples), Hit@10, NDCG

Amazon Men

This dataset is a subset of the Amazon reviews dataset that contains men-related products.

📊 3 results
📏 Metrics: Hit@10, nDCG@10, NDCG

Amazon Product Data

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This …

📊 1 results
📏 Metrics: AUC, F1

Amazon-Book

N/A

📊 15 results
📏 Metrics: nDCG@20, Recall@20, HR@10, NDCG@10, HR@50, NDCG@50

Ciao

The Ciao dataset contains ratings that users gave to items, and also contains item category information. The data comes …

📊 1 results
📏 Metrics: Hits@10, Hits@20, nDCG@10, nDCG@20

Delicious

Delicious: This data set contains tagged web pages retrieved from the website delicious.com. Source: [Text segmentation on multilabel documents: …

📊 1 results
📏 Metrics: NDCG, Recall@20

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 5 results
📏 Metrics: RMSE, NDCG, Recall@20, AUC, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Epinions

The Epinions dataset is built from a who-trust-whom online social network of a general consumer review site Epinions.com. Members of …

📊 4 results
📏 Metrics: MAE, RMSE, MAP@20, MRR@20, NDCG@20

Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and …

📊 13 results
📏 Metrics: nDCG@20, Recall@20, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Pinterest

The Pinterest dataset contains more than 1 million images associated with Pinterest users who have “pinned” them. Source: https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf

📊 1 results
📏 Metrics: nDCG@10, Hits@10, Hits@20, nDCG@20

PixelRec

An image cover dataset for short video recommendation.

📊 1 results
📏 Metrics: Hit@10

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 3 results
📏 Metrics: AUC, Accuracy

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

📊 7 results
📏 Metrics: Recall@1, Recall@10, Recall@50

WeChat

The WeChat dataset for fake news detection contains more than 20k news labelled as fake news or not.

📊 2 results
📏 Metrics: AUC, P@10

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

📊 2 results
📏 Metrics: NDCG, NDCG@20, Recall@20

Yelp2018

The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars …

📊 11 results
📏 Metrics: NDCG@20, Recall@20, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Robotic Grasping

GraspNet-1Billion

GraspNet-1Billion provides large-scale training data and a standard evaluation platform for the task of general robotic grasping. The dataset contains …

📊 5 results
📏 Metrics: mAP, AP_seen, AP_similar, AP_novel

NBMOD

NBMOD is a dataset created for researching the task of specific object grasp detection by robots in noisy …

📊 1 results
📏 Metrics: Acc

Security Studies

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Semi Supervised Learning for Image Captioning

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 1 results
📏 Metrics: CIDEr

FlickrStyle10K

FlickrStyle10K is collected and built on Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and …

📊 1 results
📏 Metrics: CIDEr

Sociology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Stochastic Optimization

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CoLA

The Corpus of Linguistic Acceptability (CoLA) consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by …

📊 1 results
📏 Metrics: Accuracy (max), Accuracy (mean)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NLL

Synthetic Data Generation

UNSW-NB15

UNSW-NB15 is a network intrusion dataset. It contains nine different attacks, including DoS, worms, backdoors, and fuzzers. The dataset contains …

📊 2 results
📏 Metrics: EMD

Table Detection

ICDAR 2019

Table is a compact and efficient form for summarizing and presenting correlative information in handwritten and printed archival documents, scientific …

📊 2 results
📏 Metrics: Weighted Average F1-score

STDW

STDW is a diverse large-scale dataset for table detection with more than seven thousand samples containing a wide variety of …

📊 2 results
📏 Metrics: IoU, AP

Tabular Data Generation

Adult Census Income

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, Parameters(M), RF Accuracy

California Housing Prices

Median house prices for California districts derived from the 1990 census. This is the dataset used in …

📊 6 results
📏 Metrics: Parameters(M), RF Mean Squared Error, DT Mean Squared Error, LR Mean Squared Error

Diabetes

What do the instances in this dataset represent? The instances represent records of hospitalized patients diagnosed with diabetes. Are there recommended …

📊 6 results
📏 Metrics: DT Accuracy, Parameters(M), LR Accuracy, RF Accuracy

HELOC

The HELOC dataset from FICO. Each entry in the dataset is a line of credit, typically offered by …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, Parameters(M), RF Accuracy

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, Parameters(M), RF Accuracy

Travel

A Tour & Travels Company Wants To Predict Whether A Customer Will Churn Or Not Based On Indicators Given Below. …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, RF Accuracy, Parameters(M)

Text-to-3D-Human Generation

DeepFashion

DeepFashion is a dataset containing around 800K diverse fashion images with their rich annotations (46 categories, 1,000 descriptive attributes, bounding …

📊 1 results
📏 Metrics: CLIP Score, Depth Error, Fashion Accuracy, Frechet Inception Distance, Percentage of Correct Keypoints

Transfer Learning

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 5 results
📏 Metrics: Accuracy

Retinal Fundus MultiDisease Image Dataset (RFMiD)

According to the WHO, World report on vision 2019, the number of visually impaired people worldwide is estimated to be …

📊 1 results
📏 Metrics: AUROC

Twitter Bot Detection

MGTAB

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types …

📊 4 results
📏 Metrics: Acc, F1

Two-sample testing

HIGGS Data Set

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by …

📊 1 results
📏 Metrics: Avg accuracy

US Foreign Policy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Unconditional Crystal Generation

MP20

MP20 (Xie et al., 2022) contains 45,231 metastable crystal structures from the Materials Project (Jain et al., 2013), each with …

📊 3 results
📏 Metrics: DFT Stable, Unique, Novel Rate, Validity

Unconditional Molecule Generation

GEOM-DRUGS

GEOM-DRUGS is a dataset of 430,000 large organic molecules of up to 180 atoms from [Axelrod and Gómez-Bombarelli, Nature Scientific …

📊 5 results
📏 Metrics: PoseBusters Validity, Validity, PoseBusters Atoms Connected

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 4 results
📏 Metrics: Validity, PoseBusters Internal Energy

Unsupervised Anomaly Detection

AnoShift

AnoShift is a large-scale anomaly detection benchmark, which focuses on splitting the test data based on its temporal distance to …

📊 15 results
📏 Metrics: ROC-AUC FAR, ROC-AUC IID, ROC-AUC NEAR, ROC-AUC-ID (In-Distribution setup)

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 1 results
📏 Metrics: AUC (outlier ratio = 0.5)

DAGM2007

This is a synthetic dataset for defect detection on textured surfaces. It was originally created for a competition at the …

📊 1 results
📏 Metrics: Detection AUROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC (outlier ratio = 0.5)

KolektorSDD

The dataset is constructed from images of defective production items that were provided and annotated by Kolektor Group d.o.o. The …

📊 1 results
📏 Metrics: Segmentation AUROC

KolektorSDD2

KolektorSDD2 is a surface-defect detection dataset with over 3000 images containing several types of defects, obtained while addressing a real-world …

📊 3 results
📏 Metrics: Segmentation AP, Segmentation AUROC, Detection AP, Segmentation AUPRO

PRONTO

The PRONTO heterogeneous benchmark dataset is based on an industrial-scale multiphase flow facility. It includes data from heterogeneous sources, including …

📊 1 results
📏 Metrics: AUC, Best Delay, Best F1, F1

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 1 results
📏 Metrics: AUC (outlier ratio = 0.5)

SMAP

Soil Moisture Active Passive (SMAP) dataset is a dataset of soil samples and telemetry information using the Mars rover by …

📊 7 results
📏 Metrics: F1, Precision, Recall, AUC

SMD

A dataset for time-series anomaly detection.

📊 1 results
📏 Metrics: Precision

TIMo

TIMo (Time-of-Flight Indoor Monitoring) is a dataset of infrared and depth videos intended for use in anomaly detection and …

📊 1 results
📏 Metrics: AUROC

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on github - …

📊 9 results
📏 Metrics: AUC

Virology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Vulnerability Detection

VulScribeR

Datasets are listed in the repository's readme file. This one is extra and yields 20K+ items after filtering with a …

📊 1 results
📏 Metrics: F1 Score

Vulnerability Java Dataset

The dataset consists of two versions: $X_1$ with $P_3$ and $X_1$ without $P_3$, where $P_3$ represents a set of random …

📊 2 results
📏 Metrics: AUC, F1

Weather Forecasting

NOAA Atmospheric Temperature Dataset

This dataset contains meteorological observations (temperature) at the land-based weather stations located in the United States, collected from the Online …

📊 4 results
📏 Metrics: MAE (t+1), MAE (t+10)

SEVIR

SEVIR is an annotated, curated and spatio-temporally aligned dataset containing over 10,000 weather events that each consist of 384 km …

📊 5 results
📏 Metrics: MSE, mCSI

Shifts

The Shifts Dataset is a dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has …

📊 2 results
📏 Metrics: R-AUC MSE

World Religions

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

regression

California Housing Prices

Median house prices for California districts derived from the 1990 census. This is the dataset used in …

📊 3 results
📏 Metrics: R2 Score, lambda

Car_Price_Prediction

In this dataset we added [Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower …

📊 1 results
📏 Metrics: R Squared

Concrete Compressive Strength

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age …

📊 3 results
📏 Metrics: R2 Score, lambda

Medical Cost Personal Dataset

This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. …

📊 3 results
📏 Metrics: R2 Score, lambda