Machine Learning Benchmarks

Browse 349 benchmarks across 60 tasks
Browse by Category

16k

ConceptNet

ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected …

📊 1 results
📏 Metrics: 1'"

2D Object Detection

CeyMo

CeyMo is a novel benchmark dataset for road marking detection which covers a wide variety of challenging urban, sub-urban and …

📊 5 results
📏 Metrics: mAP

Clear Weather

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 2 results
📏 Metrics: clear hard (AP)

DUO

DUO is a dataset for underwater object detection for robot picking. The dataset contains a collection of diverse underwater images …

📊 1 results
📏 Metrics: All mAP, AP50, AP75

Dense Fog

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 2 results
📏 Metrics: dense fog hard (AP), light fog hard (AP), snow/rain hard (AP)

DroneVehicle

The DroneVehicle dataset consists of a total of 56,878 images collected by the drone, half of which are RGB images, …

📊 7 results
📏 Metrics: test/mAP50, test/mAP, Val/mAP50

ETDII Dataset

Paper: GridTracer: Automatic Mapping of Power Grids using Deep Learning and Overhead Imagery Authors: Bohao Huang, Jichen Yang, Artem Streltsov, …

📊 1 results
📏 Metrics: [email protected]

ExDark

The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e., 10 …

📊 2 results
📏 Metrics: mAP

FishEye8K

With the advance of AI, road object detection has been a prominent topic in computer vision, mostly using perspective cameras. …

📊 1 results
📏 Metrics: mAP

RADIATE

RADIATE (RAdar Dataset In Adverse weaThEr) is a new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera …

📊 2 results
📏 Metrics: [email protected]

RF100

The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set …

📊 1 results
📏 Metrics: Average mAP

RadioGalaxyNET Dataset

Automating the creation of catalogues for radio galaxies in next-generation deep surveys necessitates the identification of components within extended sources …

📊 1 results
📏 Metrics: COCO-style AP

SARDet-100K

The SARDet-100K dataset encompasses a total of 116,598 images, and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, …

📊 13 results
📏 Metrics: box mAP, mAP, mAP@50, mAP@75

TRR360D

TRR360D is based on the ICDAR2019MTD modern table detection dataset; it follows the annotation format of the DOTA dataset. …

📊 1 results
📏 Metrics: AP50(T<90), AP90(T<90)

UAV-PDD2023

UAV-PDD2023: A benchmark dataset for pavement distress detection based on UAV images

📊 1 results
📏 Metrics: [email protected]

UAVDB

UAVDB is a high-resolution RGB video dataset meticulously designed for UAV detection tasks across diverse scales and complex backgrounds. Comprising …

📊 1 results
📏 Metrics: AP50

3D Reconstruction

300W

The 300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Aria Digital Twin Dataset

A real-world dataset, with hyper-accurate digital counterpart & comprehensive ground-truth annotation. Dataset Content - 200 sequences (~400 mins) - 398 …

📊 1 results
📏 Metrics: Accuracy, Completeness, Precision

Aria Synthetic Environments

Aria Synthetic Environments is a large-scale, fully simulated dataset created by …

📊 1 results
📏 Metrics: Accuracy, Completeness, Precision, Recall

DTU

DTU MVS 2014 is a multi-view stereo dataset, which is an order of magnitude larger in number of scenes and …

📊 20 results
📏 Metrics: Overall, Acc, Comp

Scan2CAD

Scan2CAD is an alignment dataset based on 1506 ScanNet scans with 97607 annotated keypoint pairs between 14225 (3049 unique) CAD …

📊 2 results
📏 Metrics: Average Accuracy

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 1 results
📏 Metrics: 3DIoU, Chamfer Distance, L1

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 8 results
📏 Metrics: IoU, Chamfer Distance, F-Score@1%

Anomaly Detection

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

📊 1 results
📏 Metrics: AUC

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: AUROC

BTAD

The BTAD (beanTech Anomaly Detection) dataset is a real-world industrial anomaly dataset. The dataset contains a total of 2830 …

📊 12 results
📏 Metrics: Detection AUROC, Segmentation AUROC, Segmentation AP, Segmentation AUPRO

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: Mean AUC

COCO-OOC

COCO-OOC goes beyond standard object detection to ask the question: Which objects are out-of-context (OOC)? Given an image with a …

📊 1 results
📏 Metrics: AUC

CUHK Avenue

Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 …

📊 29 results
📏 Metrics: AUC, RBDC, TBDC, FPS

DIOR

📊 4 results
📏 Metrics: ROC AUC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 10 results
📏 Metrics: ROC AUC

Fishyscapes

Fishyscapes is a public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates …

📊 8 results
📏 Metrics: AP, FPR95

Forest CoverType

Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given …

📊 1 results
📏 Metrics: AUC

Hyper-Kvasir Dataset

HyperKvasir dataset contains 110,079 images and 374 videos where it captures anatomical landmarks and pathological and normal findings. A total …

📊 5 results
📏 Metrics: AUC

IITB Corridor

An abnormal activity dataset for research use that contains 483,566 annotated frames. Source: [Multi-timescale Trajectory Prediction for Abnormal Human Activity …

📊 1 results
📏 Metrics: AUC

ITDD

The Industrial Textile Defect Detection (ITDD) dataset includes 1885 industrial textile images categorized into 4 categories: cotton fabric, dyed fabric, …

📊 1 results
📏 Metrics: Detection AUROC, Segmentation AUROC

InsPLAD

InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicles colour images. It contains 17 …

📊 4 results
📏 Metrics: Detection AUROC

KDD Cup 1999

This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held …

📊 1 results
📏 Metrics: F1-Score

Kaggle-Credit Card Fraud Dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …

📊 1 results
📏 Metrics: AUC

LAG

Includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432). Source: [Attention Based Glaucoma Detection: A …

📊 4 results
📏 Metrics: AUC

Lost and Found

Lost and Found is a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of …

📊 4 results
📏 Metrics: AP, FPR

MIT-BIH Arrhythmia Database

The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the …

📊 1 results
📏 Metrics: F1 score

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 5 results
📏 Metrics: ROC AUC

MPDD

MPDD is a dataset aimed at benchmarking visual defect detection methods in industrial metal parts manufacturing. It consists of more …

📊 14 results
📏 Metrics: Detection AUROC, Segmentation AUROC, Segmentation AUPRO

MVTEC 3D-AD

MVTec 3D Anomaly Detection Dataset (MVTec 3D-AD) is a comprehensive 3D dataset for the task of unsupervised anomaly detection and …

📊 2 results
📏 Metrics: Segmentation AUPRO, Detection AUROC, Segmentation AUROC

MVTec LOCO AD

MVTec Logical Constraints Anomaly Detection (MVTec LOCO AD) dataset is intended for the evaluation of unsupervised anomaly localization algorithms. The …

📊 35 results
📏 Metrics: Avg. Detection AUROC, Detection AUROC (only logical), Detection AUROC (only structural), Segmentation AU-sPRO (until FPR 5%)

Musk v1

The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …

📊 1 results
📏 Metrics: F1-Score

ODDS

Outliers or anomalies are instances that do not conform to the norm of a dataset. Outlier detection is an important …

📊 3 results
📏 Metrics: AUROC, F1

PAD Dataset

Multi-pose Anomaly Detection (MAD) dataset, which represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD …

📊 2 results
📏 Metrics: Detection AUROC, Segmentation AUROC

Road Anomaly

This dataset contains images of unusual dangers which can be encountered by a vehicle on the road – animals, rocks, …

📊 9 results
📏 Metrics: AP, FPR95

SMD

A dataset for time-series anomaly detection.

📊 1 results
📏 Metrics: Recall, precision, F1, F1-score

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 1 results
📏 Metrics: Mean AUC

ShanghaiTech

The Shanghaitech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided …

📊 28 results
📏 Metrics: AUC, RBDC, TBDC

ShanghaiTech Campus

The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and …

📊 1 results
📏 Metrics: AUC-ROC

Street Scene

Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution …

📊 1 results
📏 Metrics: AUC, RBDC, TBDC

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results
📏 Metrics: AUC

Thyroid

Thyroid is a dataset for detection of thyroid diseases, in which patients diagnosed with hypothyroid or subnormal are anomalies against …

📊 2 results
📏 Metrics: AUC, Average Precision, F1-Score

UBnormal

UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, …

📊 13 results
📏 Metrics: AUC, RBDC, TBDC

UCF-Crime

The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world …

📊 1 results
📏 Metrics: AUC

UCR Anomaly Archive

The UCR Anomaly Archive is a collection of 250 uni-variate time series collected in human medicine, biology, meteorology and industry. …

📊 24 results
📏 Metrics: Average F1, AUC ROC

UCSD Ped2

The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …

📊 9 results
📏 Metrics: AUC, FPS

UEA time-series datasets

Five datasets used in the NeurTraL-AD paper. RacketSports (RS): Accelerometer and gyroscope recording of players playing four different racket sports. Each …

📊 3 results
📏 Metrics: Avg. ROC-AUC

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on github - …

📊 2 results
📏 Metrics: AUC

VisA

The VisA dataset contains 12 subsets corresponding to 12 different objects as shown in the above figure. There are 10,821 …

📊 45 results
📏 Metrics: Detection AUROC, Segmentation AUPRO (until 30% FPR), F1-Score, Segmentation AUPRO, Segmentation AUROC

WFDD

WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric …

📊 1 results
📏 Metrics: Detection AUROC, Segmentation AUPRO, Segmentation AUROC

voraus-AD

voraus-AD contains machine data of a collaborative robot, which moves a can by performing an industrial pick-and-place task. The samples …

📊 3 results
📏 Metrics: Avg. Detection AUROC

AutoML

Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …

📊 1 results
📏 Metrics: accuracy

BIG-bench Machine Learning

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Chatbot

AlpacaEval

The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, hh-rlhf. Those were selected so that the AlpacaEval ranking …

📊 1 results
📏 Metrics: Average win rate

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 results
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, SARTAJ dataset and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 results
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 results
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 results
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 results
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

Dataset Introduction In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 results
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 results
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 results
📏 Metrics: 1 shot Micro-F1

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 results
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time( ms / 100ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 results
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, which is originally collected for remote sensing scene classification. We …

📊 1 results
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 results
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOT-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 results
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708 Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 results
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 results
📏 Metrics: Accuracy

TCGA

📊 1 results
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarge the dataset to understand how image background affects the Computer Vision ML model. With the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

Clustering Algorithms Evaluation

97 synthetic datasets

97 synthetic datasets consists of 97 datasets (as illustrated in the figure) and can be used to test graph-based clustering …

📊 1 results
📏 Metrics: HIT-THE-BEST, Rank difference

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 6 results
📏 Metrics: ARI, F1-score, NMI

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 6 results
📏 Metrics: ARI, F1-score, NMI

Olivetti face

This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.

📊 5 results
📏 Metrics: F1-score, ARI, NMI

Continual Learning

20Newsgroup (10 tasks)

This dataset has 20 classes and each class has about 1000 documents. The data split for train/validation/test is 1600/200/200. We …

📊 6 results
📏 Metrics: F1 - macro

AIDS

AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral …

📊 1 results
📏 Metrics: 1:3 Accuracy

DSC (10 tasks)

A set of 10 DSC datasets (reviews of 10 products) to produce sequences of tasks. The products are Sports, Toys, …

📊 6 results
📏 Metrics: F1 - macro

F-CelebA (10 tasks)

F-CelebA - This dataset is adapted from federated learning. Federated learning is an emerging machine learning paradigm with an emphasis …

📊 6 results
📏 Metrics: Acc

MLT17

📊 1 results
📏 Metrics: Acc

Permuted MNIST

Permuted MNIST is an MNIST variant that consists of 70,000 images of handwritten digits from 0 to 9, where 60,000 …

📊 3 results
📏 Metrics: Average Accuracy, MLP Hidden Layers-width, Pretrained/Transfer Learning, BWT

Continual Pretraining

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: F1 - macro

SciERC

SciERC dataset is a collection of 500 scientific abstract annotated with scientific entities, their relations, and coreference clusters. The abstracts …

📊 1 results
📏 Metrics: F1 (macro)

Contrastive Learning

10,000 People - Human Pose Recognition Data

Description: 10,000 People - Human Pose Recognition Data. This dataset includes indoor and outdoor scenes. This dataset covers males and females. …

📊 1 results
📏 Metrics: 0..5sec

Core set discovery

Abalone

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the …

📊 1 results
📏 Metrics: F1(10-fold)

Electricity

Abstract: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 …

📊 1 results
📏 Metrics: F1(10-fold)

Letter

Letter Recognition Data Set is a handwritten digit dataset. The task is to identify each of a large number of …

📊 1 results
📏 Metrics: F1(10-fold)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: F1(10-fold)

Data Augmentation

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 5 results
📏 Metrics: Percentage error

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 17 results
📏 Metrics: Accuracy (%)

Decision Making

NASA C-MAPSS

Engine degradation simulation was carried out using C-MAPSS. Four different sets were simulated under different combinations of operational conditions and …

📊 1 results
📏 Metrics: Average Remaining Cycles

Deep Clustering

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NMI

USPS

USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 …

📊 1 results
📏 Metrics: NMI

Density Estimation

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 15 results
📏 Metrics: NLL (bits/dim), Log-likelihood (nats)

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 3 results
📏 Metrics: Negative ELBO, NLL, MMD-L2, COV-L2

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 6 results
📏 Metrics: NLL (bits/dim), Log-likelihood (nats), MMD-L2, COV-L2, NLL

Dimensionality Reduction

EMNIST

EMNIST (extended MNIST) has 4 times more data than MNIST. It is a set of handwritten digits with a 28 …

📊 2 results
📏 Metrics: Classification Accuracy

Domain Adaptation

Canon RAW Low Light

The goal of this project is to present two new datasets that seek to expand the capability of the Learning …

📊 1 results
📏 Metrics: PSNR, SSIM

Comic2k

Comic2k is a dataset used for cross-domain object detection which contains 2k comic images with image and instance-level annotations. Image …

📊 2 results
📏 Metrics: mAP

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 4 results
📏 Metrics: Accuracy

Foggy Cityscapes

Foggy Cityscapes is a synthetic foggy dataset which simulates fog on real scenes. Each foggy image is rendered with a …

📊 1 results
📏 Metrics: mAP

ImageCLEF-DA

The ImageCLEF-DA dataset is a benchmark dataset for ImageCLEF 2014 domain adaptation challenge, which contains three domains: Caltech-256 (C), ImageNet …

📊 16 results
📏 Metrics: Accuracy

LeukemiaAttri

The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological …

📊 1 results
📏 Metrics: mAP

MSDA

5 domains: synthetic domain, document domain, street view domain, handwritten domain, and car license domain; over five million …

📊 1 results
📏 Metrics: Average Accuracy

Nikon RAW Low Light

Dataset release for the BMVC 2021 Paper "Few-Shot Domain Adaptation for Low Light RAW Image Enhancement" Abstract: Enhancing practical low …

📊 1 results
📏 Metrics: PSNR, SSIM

Office-31

The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …

📊 38 results
📏 Metrics: Average Accuracy

Office-Caltech-10

Office-Caltech-10 is a standard benchmark for domain adaptation, which consists of Office 10 and Caltech 10 datasets. It contains the 10 …

📊 1 results
📏 Metrics: Accuracy (%)

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 27 results
📏 Metrics: Accuracy

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 1 results
📏 Metrics: Accuracy

Sim10k

SIM10k is a synthetic dataset containing 10,000 images, which is rendered from the video game Grand Theft Auto V (GTA5). …

📊 1 results
📏 Metrics: mAP

Ensemble Learning

SMS Spam Collection Data Set

This corpus has been collected from free or free for research sources on the Internet: - A collection of 425 …

📊 1 results
📏 Metrics: Accuracy

Face Recognition

BTS3.1

Large, multimodal biometric dataset: It contains still images and videos of over 1,000 people captured at various ranges (up to …

📊 1 results
📏 Metrics: TAR @ FAR=0.01

CALFW

A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstrained face verification. Source: CALFW

📊 2 results
📏 Metrics: Accuracy

CASIA-WebFace+masks

The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical …

📊 5 results
📏 Metrics: Accuracy

CPLFW

A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstrained face verification. There are …

📊 1 results
📏 Metrics: Accuracy

CelebA+masks

The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical …

📊 5 results
📏 Metrics: Accuracy

Color FERET

The color FERET database is a dataset for face recognition. It contains 11,338 color images of size 512×768 pixels captured …

📊 4 results
📏 Metrics: FNMR [%] @ 10-3 FMR, 5-class test accuracy

IJB-B

The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …

📊 4 results
📏 Metrics: Rank-1, Rank-5, TAR @ FAR=0.0001, TAR @ FAR=1e-3, TAR @ FAR=1e-4, TAR @ FAR=1e-5

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of the 5749 identities with …

📊 14 results
📏 Metrics: Accuracy, FNMR [%] @ 10-3 FMR, F1-score, Precision, Recall

MFR

During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to face recognition. Traditional …

📊 1 results
📏 Metrics: MFR-ALL, MFR-MASK, African, Caucasian, South Asian, East Asian

MLFW

The Masked LFW (MLFW), based on Cross-Age LFW (CALFW) database, is built using a simple but effective tool that generates …

📊 6 results
📏 Metrics: Accuracy

MORPH

MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 …

📊 3 results
📏 Metrics: FNMR [%] @ 10-3 FMR

XQLFW

An evaluation protocol for face verification focusing on a large intra-pair image quality difference. Real-world face recognition applications often deal …

📊 1 results
📏 Metrics: Accuracy

Few-Shot Learning

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising more than 53,000 multiple-choice questions to identify the …

📊 1 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 4 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy, Harmonic mean

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Harmonic mean

Large COVID-19 CT scan slice dataset

"We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the …

📊 1 results
📏 Metrics: AUC-ROC, Accuracy, Macro F1, Macro Precision, Macro Recall, Micro Precision, Specificity

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 1 results
📏 Metrics: Acc

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 results
📏 Metrics: F1-score

MedConceptsQA

MedConceptsQA is an open-source medical concepts QA benchmark. The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA

📊 12 results
📏 Metrics: Accuracy

MedNLI

The MedNLI dataset consists of sentence pairs developed by physicians from the Past Medical History section of MIMIC-III clinical …

📊 1 results
📏 Metrics: Accuracy

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 2 results
📏 Metrics: Accuracy

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 1 results
📏 Metrics: Harmonic mean

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 3 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 1 results
📏 Metrics: Harmonic mean

General Classification

CVR

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified …

📊 1 results
📏 Metrics: Test error

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: Accuracy

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: Accuracy

Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …

📊 1 results
📏 Metrics: Accuracy

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 2 results
📏 Metrics: Accuracy

Generalized Few-Shot Learning

AwA2

Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute-based classification and zero-shot learning. …

📊 6 results
📏 Metrics: Per-Class Accuracy (1-shot), Per-Class Accuracy (2-shots), Per-Class Accuracy (5-shots), Per-Class Accuracy (10-shots), Per-Class Accuracy (20-shots)

SUN

When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow …

📊 5 results
📏 Metrics: Per-Class Accuracy (1-shot), Per-Class Accuracy (2-shots), Per-Class Accuracy (5-shots), Per-Class Accuracy (10-shots)

Incremental Learning

MLT17

📊 1 results
📏 Metrics: Acc

Inductive logic programming

RuDaS

Logical rules are a popular knowledge representation language in many domains. Recently, neural networks have been proposed to support the …

📊 4 results
📏 Metrics: H-Score, R-Score

Interpretable Machine Learning

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization tasks. It contains 11,788 images of …

📊 2 results
📏 Metrics: Top 1 Accuracy

Logical Reasoning

LingOly

This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems …

📊 11 results
📏 Metrics: Delta_NoContext, Exact Match Accuracy

RuWorldTree

RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts. Motivation: The …

📊 4 results
📏 Metrics: Accuracy

Winograd Automatic

The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning. Motivation: The dataset …

📊 4 results
📏 Metrics: Accuracy

Long-tail Learning

COCO-MLT

COCO-MLT is created from MS COCO-2017 and contains 1,909 images from 80 classes. The maximum number of training images per class …

📊 11 results
📏 Metrics: Average mAP

EGTEA

Extended GTEA Gaze+ (EGTEA Gaze+) is a large-scale dataset for FPV actions and gaze. It subsumes GTEA Gaze+ and comes …

📊 3 results
📏 Metrics: Average Precision, Average Recall

ImageNet-LT

ImageNet Long-Tailed is a subset of the ImageNet dataset consisting of 115.8K images from 1000 categories, with at most 1,280 images per …

📊 65 results
📏 Metrics: Top-1 Accuracy

Lot-insts

LoT-insts contains over 25k classes whose frequencies are naturally long-tail distributed. Its test set comprises four different subsets: many-, medium-, …

📊 1 results
📏 Metrics: Macro-F1

MIMIC-CXR-LT

MIMIC-CXR-LT. We construct a single-label, long-tailed version of MIMIC-CXR in a similar manner. MIMIC-CXR is a multi-label classification dataset with …

📊 15 results
📏 Metrics: Balanced Accuracy

NIH-CXR-LT

NIH-CXR-LT. NIH ChestXRay14 contains over 100,000 chest X-rays labeled with 14 pathologies, plus a “No Findings” class. We construct a …

📊 15 results
📏 Metrics: Balanced Accuracy

Places-LT

Places-LT has an imbalanced training set with 62,500 images for 365 classes from Places-2. The class frequencies follow a natural …

📊 28 results
📏 Metrics: Top-1 Accuracy, Top 1 Accuracy

VOC-MLT

We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with …

📊 11 results
📏 Metrics: Average mAP

mini-ImageNet-LT

mini-ImageNet was proposed in "Matching Networks for One Shot Learning" for few-shot learning evaluation, in an attempt to have a dataset …

📊 1 results
📏 Metrics: Error Rate

Medical Report Generation

HistGen WSI-Report Dataset

This dataset is composed of 7,753 pairs of whole slide images and their corresponding diagnostic reports, extracted from the TCGA …

📊 1 results
📏 Metrics: BLEU-4

IU X-Ray

IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The …

📊 1 results
📏 Metrics: BLEU-4, BLEU-1, BLEU-2, BLEU-3, CIDEr, METEOR, ROUGE

MIMIC-CXR

MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …

📊 2 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr, Example-F1-14, Example-Precision-14, Example-Recall-14, METEOR, Micro-F1-5, Micro-Precision-5, Micro-Recall-5, ROUGE-L, F1 RadGraph

Metric Learning

CARS196

CARS196 is composed of 16,185 car images of 196 classes.

📊 36 results
📏 Metrics: R@1

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization tasks. It contains 11,788 images of …

📊 2 results
📏 Metrics: R@1

DyML-Animal

DyML-Animal is based on animal images selected from ImageNet-5K [1]. It has 5 semantic scales (i.e., classes, order, family, genus, …

📊 2 results
📏 Metrics: Average-mAP

DyML-Product

DyML-Product is derived from iMaterialist-2019, a hierarchical online product dataset. The original iMaterialist-2019 offers up to 4 levels of hierarchical …

📊 2 results
📏 Metrics: Average-mAP

DyML-Vehicle

DyML-Vehicle merges two vehicle re-ID datasets PKU VehicleID [1], VERI-Wild [1]. Since these two datasets have only annotations on the …

📊 2 results
📏 Metrics: Average-mAP

In-Shop

In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop Clothes Retrieval. This is a large subset of DeepFashion, containing large …

📊 15 results
📏 Metrics: R@1

Stanford Online Products

Stanford Online Products (SOP) dataset has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are split …

📊 33 results
📏 Metrics: R@1

Model Compression

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 12 results
📏 Metrics: Top-1

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 2 results
📏 Metrics: Accuracy

Model extraction

UML Classes With Specs

Repository for UML-English data. This repository contains the data used for "Extraction of UML Class Diagrams from Natural Language …

📊 1 results
📏 Metrics: Exact Match

Molecular Property Prediction

MUV

The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …

📊 2 results
📏 Metrics: ROC-AUC

MoleculeNet

MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and …

📊 5 results
📏 Metrics: AUC

PCBA

PCBA dataset 11 is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening …

📊 1 results
📏 Metrics: ROC-AUC

QM7

QM7 dataset is a subset of the GDB-13 database. GDB-13 contains nearly 1 billion stable and synthetically accessible organic molecules. …

📊 7 results
📏 Metrics: MAE

QM8

QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited state …

📊 7 results
📏 Metrics: MAE

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 7 results
📏 Metrics: MAE

SIDER

SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …

📊 16 results
📏 Metrics: ROC-AUC

Tox21

The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …

📊 17 results
📏 Metrics: ROC-AUC

clintox

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …

📊 18 results
📏 Metrics: ROC-AUC, Molecules (M)

Multi-Label Classification

CheXpert

The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is …

📊 11 results
📏 Metrics: AVERAGE AUC ON 14 LABEL, NUM RADS BELOW CURVE

ChestX-ray14

ChestX-ray14 is a medical imaging dataset comprising 112,120 frontal-view X-ray images of 30,805 unique patients (collected from 1992 …

📊 4 results
📏 Metrics: Average AUC on 14 label, Macro F1

MIMIC-CXR

MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …

📊 1 results
📏 Metrics: Average AUC on 14 label

MLRSNet

MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It provides different perspectives of …

📊 2 results
📏 Metrics: F1-score

MRNet

The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. The dataset contains 1,104 (80.6%) …

📊 1 results
📏 Metrics: Average AUC, AUC on Abnormality (ABN), AUC on ACL Tear (ACL), AUC on Meniscus Tear (MEN), Average Accuracy, Accuracy on Abnormality (ABN), Accuracy on ACL Tear (ACL), Accuracy on Meniscus Tear (MEN)

NUS-WIDE

The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated …

📊 9 results
📏 Metrics: MAP

OpenImages-v6

OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. …

📊 4 results
📏 Metrics: mAP

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 16 results
📏 Metrics: mAP

Multi-Label Text Classification

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …

📊 4 results
📏 Metrics: Precision, Recall, F1, Accuracy, mAP

Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China

This data is for the Mis2-KDD 2021 under review paper: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of …

📊 1 results
📏 Metrics: 1:1 Accuracy, F1 - macro, Micro F1

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 3 results
📏 Metrics: AUC, Macro F1, Macro Precision, Macro Recall, Micro Precision, Micro Recall, Micro-F1, P@5, Precision, Recall

RCV1

The RCV1 dataset is a benchmark dataset for text categorization. It is a collection of newswire articles produced by Reuters …

📊 1 results
📏 Metrics: Macro-F1, Micro-F1

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 5 results
📏 Metrics: Micro-F1

Multi-Task Learning

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 1 results
📏 Metrics: Error

ChestX-ray14

ChestX-ray14 is a medical imaging dataset comprising 112,120 frontal-view X-ray images of 30,805 unique patients (collected from 1992 …

📊 1 results
📏 Metrics: delta_m

NYUv2

The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both …

📊 2 results
📏 Metrics: Mean IoU

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 2 results
📏 Metrics: ∆m%

UTKFace

The UTKFace dataset is a large-scale face dataset with a long age span (ranging from 0 to 116 years old). The …

📊 1 results
📏 Metrics: delta_m

Multi-agent Reinforcement Learning

SMAC-Exp

The StarCraft Multi-Agent Challenges+ requires agents to learn completion of multi-stage tasks and usage of environmental factors without precise reward …

📊 1 results
📏 Metrics: Median Win Rate

Multiple Instance Learning

CAMELYON16

The dataset consists of 400 whole-slide images (WSIs) of lymph node sections stained with hematoxylin and eosin (H&E), collected from …

📊 14 results
📏 Metrics: AUC, ACC, Expected Calibration Error, FROC, Patch AUC

Elephant

The Elephant MIL dataset is a benchmark used in multiple instance learning (MIL), which falls under the broader categories of …

📊 2 results
📏 Metrics: AUC, ACC

Musk v1

The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …

📊 2 results
📏 Metrics: ACC, AUC

Musk v2

The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks …

📊 2 results
📏 Metrics: ACC, AUC

TCGA

📊 8 results
📏 Metrics: ACC, AUC

Network Pruning

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 4 results
📏 Metrics: Accuracy, GFLOPs, Inference Time (ms)

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 5 results
📏 Metrics: Accuracy, GFLOPs, Inference Time (ms)

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 16 results
📏 Metrics: Accuracy, GFLOPs, MParams

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: Avg #Steps

Neural Architecture Search

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 36 results
📏 Metrics: Top-1 Error Rate, Parameters, FLOPS, Search Time (GPU days), Accuracy (% )

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 12 results
📏 Metrics: Percentage Error, FLOPS, PARAMS, Search Time (GPU days), Accuracy (% )

CINIC-10

CINIC-10 is a dataset for image classification. It has a total of 270,000 images, 4.5 times that of CIFAR-10. It …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 5 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS, Accuracy (% )

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 121 results
📏 Metrics: Top-1 Error Rate, Accuracy, Params, MACs, FLOPs

LIDC-IDRI

The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1010 lung …

📊 1 results
📏 Metrics: F1 score, Specificity (VEB+)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: R2

NAS-Bench-101

NAS-Bench-101 is the first public architecture dataset for NAS research. To build NASBench-101, the authors carefully constructed a compact, yet …

📊 3 results
📏 Metrics: Accuracy (%), Spearman Correlation

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

Novel Class Discovery

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 1 results
📏 Metrics: Clustering Accuracy

Optical Character Recognition (OCR)

FSNS - Test

Arabic handwriting dataset.

📊 3 results
📏 Metrics: Sequence error

I2L-140K

Introduced by Singh, Sumeet S. “Teaching Machines to Code: Neural Markup Generation with Visual Attention.” ArXiv abs/1802.05415 (2018): n. pag. …

📊 2 results
📏 Metrics: BLEU

VideoDB's OCR Benchmark Public Collection

This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It …

📊 5 results
📏 Metrics: Average Accuracy, Character Error Rate (CER), Word Error Rate (WER)

im2latex-100k

A prebuilt dataset for OpenAI's image-to-LaTeX task. Includes a total of ~100k formulas and images split into train, validation …

📊 1 results
📏 Metrics: BLEU

Outlier Detection

ECG5000

The original dataset for "ECG5000" is a 20-hour long ECG downloaded from Physionet. The name is BIDMC Congestive Heart Failure …

📊 2 results
📏 Metrics: Accuracy

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUROC

SKAB

SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ …

📊 1 results
📏 Metrics: Average F1

Partial Label Learning

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

M-VAD Names

The dataset contains the annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and the …

📊 1 results
📏 Metrics: Accuracy

Point Processes

AgeGroup Transactions MTPP

The dataset contains historical financial transactions, including time, category and cost fields. There are 50000 clients, 205 categories and 43.7M …

📊 5 results
📏 Metrics: T-mAP, MAE, OTD, Accuracy (%)

Amazon MTPP

The dataset includes time-stamped user product review behavior from January 2008 to October 2018. Each user has a sequence of …

📊 4 results
📏 Metrics: T-mAP, OTD, Accuracy (%), MAE

MemeTracker

The Memetracker corpus contains articles from mainstream media and blogs from August 1 to October 31, 2008 with about 1 …

📊 1 results
📏 Metrics: Accuracy, RMSE

RETWEET

RETWEET is a dataset of tweets and the overall predominant sentiment of their replies. Task: message-level polarity classification. …

📊 1 results
📏 Metrics: Accuracy, RMSE

Retweet MTPP

This dataset contains time-stamped user retweet event sequences. The events are categorized into 3 types: retweets by “small,” “medium” and …

📊 6 results
📏 Metrics: T-mAP, OTD, Accuracy (%), MAE

StackOverflow MTPP

The dataset has two years of user awards on a question-answering website: each user received a sequence of badges and …

📊 3 results
📏 Metrics: OTD, T-mAP, Accuracy (%), MAE

Quantization

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: MAP

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 1 results
📏 Metrics: MAP

IJB-B

The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …

📊 1 results
📏 Metrics: TAR @ FAR=1e-4

IJB-C

The IJB-C dataset is a video-based face recognition dataset. It is an extension of the IJB-A dataset with about 138,000 …

📊 1 results
📏 Metrics: TAR @ FAR=1e-4

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 27 results
📏 Metrics: Top-1 Accuracy (%), Weight bits, Activation bits

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5,749 identities with …

📊 1 results
📏 Metrics: Accuracy

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …

📊 1 results
📏 Metrics: Perplexity

Reinforcement Learning

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 1 results
📏 Metrics: 10 Images, 4*4 Stitching, Exact Accuracy

Reinforcement Learning (RL)

ProcGen

Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns …

📊 2 results
📏 Metrics: Mean Normalized Performance

Representation Learning

Animals-10

It contains about 28K medium-quality animal images belonging to 10 categories: dog, cat, horse, spider, butterfly, chicken, sheep, cow, …

📊 1 results
📏 Metrics: 1:1 Accuracy

SciDocs

The SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks. Source: Allen Institute for AI

📊 7 results
📏 Metrics: Avg.

Sports10

Games dataset containing 100,000 gameplay images of 175 video games across 10 sports genres: American football, basketball, bike …

📊 1 results
📏 Metrics: Silhouette Score

Retrieval

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 3 results
📏 Metrics: Queries per second

InfoSeek

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with …

📊 1 results
📏 Metrics: Recall@5

MVK

The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of a new Marine Video …

📊 1 results
📏 Metrics: text-to-video Mean Rank

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 3 results
📏 Metrics: Queries per second

OK-VQA

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer. Source: [OK-VQA: A …

📊 2 results
📏 Metrics: Recall@5

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 1 results
📏 Metrics: Recall@5

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 1 results
📏 Metrics: Accuracy (Top-1)

PubMedQA corpus with metadata

PubMedQA-MetaGen is a metadata-enriched version of the PubMedQA biomedical question-answering dataset, created using the …

📊 1 results
📏 Metrics: Accuracy (Top-1)

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 4 results
📏 Metrics: Queries per second

ToolLens

The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out …

📊 1 results
📏 Metrics: COMP@

Semantic Similarity

BIOSSES

The BIOSSES data set comprises a total of 100 sentence pairs, all of which were selected from the "TAC2 Biomedical Summarization Track …

📊 3 results
📏 Metrics: Pearson Correlation

CHIP-STS

CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for …

📊 1 results
📏 Metrics: Macro F1

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 5 results
📏 Metrics: MSE, Pearson Correlation, Spearman Correlation

Sparse Learning

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 9 results
📏 Metrics: Top-1 Accuracy

Stochastic Optimization

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CoLA

The Corpus of Linguistic Acceptability (CoLA) consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by …

📊 1 results
📏 Metrics: Accuracy (max), Accuracy (mean)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NLL

Transfer Learning

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 5 results
📏 Metrics: Accuracy

Retinal Fundus MultiDisease Image Dataset (RFMiD)

According to the WHO, World report on vision 2019, the number of visually impaired people worldwide is estimated to be …

📊 1 results
📏 Metrics: AUROC

Two-sample testing

HIGGS Data Set

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by …

📊 1 results
📏 Metrics: Avg accuracy

Zero-Shot Learning

AwA2

Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute-based classification and zero-shot learning. …

📊 4 results
📏 Metrics: average top-1 classification accuracy, Accuracy Seen, Accuracy Unseen, H

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Accuracy

COCO-MLT

COCO-MLT is created from MS COCO-2017 and contains 1,909 images from 80 classes. The maximum number of training images per class …

📊 2 results
📏 Metrics: Average mAP

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization tasks. It contains 11,788 images of …

📊 14 results
📏 Metrics: average top-1 classification accuracy, Accuracy Seen, Accuracy Unseen, H, Accuracy

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 2 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 2 results
📏 Metrics: Accuracy

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Accuracy

FGVC-Aircraft

FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which …

📊 2 results
📏 Metrics: Accuracy

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 2 results
📏 Metrics: Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 2 results
📏 Metrics: Top 1 Accuracy

ImageNet_CN

ImageNet_CN adapts the ImageNet-1K classification dataset for Chinese models by translating labels and prompts into Chinese.

📊 1 results
📏 Metrics: Accuracy

LSMDC

This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …

📊 1 results
📏 Metrics: Accuracy

MIT-States

The MIT-States dataset has 245 object classes, 115 attribute classes and ∼53K images. There is a wide range of objects …

📊 1 results
📏 Metrics: A-acc

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 1 results
📏 Metrics: Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 1 results
📏 Metrics: Accuracy

MedConceptsQA

MedConceptsQA is an open-source medical concepts QA benchmark. The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA

📊 12 results
📏 Metrics: Accuracy

Oxford 102 Flower

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are flowers commonly …

📊 2 results
📏 Metrics: average top-1 classification accuracy

Oxford-IIIT Pets

The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large …

📊 2 results
📏 Metrics: Accuracy

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results
📏 Metrics: k=10 mIOU

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 1 results
📏 Metrics: Accuracy

SUN Attribute

The SUN Attribute dataset consists of 14,340 images from 717 scene categories, and each category is annotated with a taxonomy …

📊 9 results
📏 Metrics: average top-1 classification accuracy, Accuracy Seen, Accuracy Unseen, H

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 2 results
📏 Metrics: Accuracy

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 2 results
📏 Metrics: Accuracy

TVQA

The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows …

📊 1 results
📏 Metrics: Accuracy

UCF101

The UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips classified into 101 categories. These …

📊 2 results
📏 Metrics: Accuracy

VOC-MLT

We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with …

📊 2 results
📏 Metrics: Average mAP

iVQA

An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …

📊 1 results
📏 Metrics: Accuracy

Zero-shot Generalization

CALVIN

CALVIN (Composing Actions from Language and Vision) is an open-source simulated benchmark for learning long-horizon, language-conditioned robot manipulation tasks.

📊 5 results
📏 Metrics: Avg. sequence length

Parameter-Efficient Fine-Tuning

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring – they are …

📊 4 results
📏 Metrics: Accuracy (%)

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 3 results
📏 Metrics: Accuracy (%)

WinoGrande

WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …

📊 3 results
📏 Metrics: Accuracy (%)

Regression

California Housing Prices

Median house prices for California districts derived from the 1990 census. This is the dataset used in …

📊 3 results
📏 Metrics: R2 Score, lambda
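For readers unfamiliar with the "R2 Score" metric used by the regression entries, here is a minimal sketch of the standard coefficient of determination, R² = 1 − SS_res / SS_tot. The function name `r2_score` and the sample values are illustrative assumptions; the listing does not say how each submission computed its score, and the "lambda" metric is not defined in the listing, so it is not covered here.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values for illustration, not taken from any dataset above.
    print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean of the targets, and negative values mean it does worse.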

Car_Price_Prediction

In this dataset we added [Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower …

📊 1 results
📏 Metrics: R Squared

Concrete Compressive Strength

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age …

📊 3 results
📏 Metrics: R2 Score, lambda

Medical Cost Personal Dataset

This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. …

📊 3 results
📏 Metrics: R2 Score, lambda