Machine Learning Benchmarks

Browse 3753 benchmarks across 897 tasks
Browse by Category

1 Image, 2*2 Stitchi

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0..5sec

10-shot image generation

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0-shot MRR

FlyingThings3D

FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along …

📊 1 results
📏 Metrics: 0..5sec

MEAD

Multi-view Emotional Audio-visual Dataset

📊 1 results
📏 Metrics: 12k

Music21

Music21 is an untrimmed video dataset crawled by keyword query from YouTube. It contains music performances belonging to 21 categories. …

📊 1 results
📏 Metrics: 0..5sec

16k

ConceptNet

ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected …

📊 1 results
📏 Metrics: 1'"

2D Human Pose Estimation

COCO-WholeBody

COCO-WholeBody is an extension of the COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 14 results
📏 Metrics: WB, body, foot, face, hand

Human-Art

Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It includes twenty high-quality human …

📊 10 results
📏 Metrics: AP, AP (gt bbox), Validation AP

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses, and instance masks. This dataset contains …

📊 10 results
📏 Metrics: Test AP, Validation AP

2D Object Detection

CeyMo

CeyMo is a novel benchmark dataset for road marking detection which covers a wide variety of challenging urban, sub-urban and …

📊 5 results
📏 Metrics: mAP

Clear Weather

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 2 results
📏 Metrics: clear hard (AP)

DUO

DUO is a dataset for underwater object detection for robot picking. The dataset contains a collection of diverse underwater images …

📊 1 results
📏 Metrics: All mAP, AP50, AP75

Dense Fog

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 2 results
📏 Metrics: dense fog hard (AP), light fog hard (AP), snow/rain hard (AP)

DroneVehicle

The DroneVehicle dataset consists of a total of 56,878 images collected by the drone, half of which are RGB images, …

📊 7 results
📏 Metrics: test/mAP50, test/mAP, Val/mAP50

ETDII Dataset

Paper: GridTracer: Automatic Mapping of Power Grids using Deep Learning and Overhead Imagery Authors: Bohao Huang, Jichen Yang, Artem Streltsov, …

📊 1 results
📏 Metrics: [email protected]

ExDark

The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e., 10 …

📊 2 results
📏 Metrics: mAP

FishEye8K

With the advance of AI, road object detection has been a prominent topic in computer vision, mostly using perspective cameras. …

📊 1 results
📏 Metrics: mAP

RADIATE

RADIATE (RAdar Dataset In Adverse weaThEr) is a new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera …

📊 2 results
📏 Metrics: [email protected]

RF100

The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set …

📊 1 results
📏 Metrics: Average mAP

RadioGalaxyNET Dataset

Automating the creation of catalogues for radio galaxies in next-generation deep surveys necessitates the identification of components within extended sources …

📊 1 results
📏 Metrics: COCO-style AP

SARDet-100K

The SARDet-100K dataset encompasses a total of 116,598 images, and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, …

📊 13 results
📏 Metrics: box mAP, mAP, mAP@50, mAP@75

TRR360D

TRR360D is based on the ICDAR2019MTD modern table detection dataset and refers to the annotation format of the DOTA dataset. …

📊 1 results
📏 Metrics: AP50(T<90), AP90(T<90)

UAV-PDD2023

UAV-PDD2023: A benchmark dataset for pavement distress detection based on UAV images

📊 1 results
📏 Metrics: [email protected]

UAVDB

UAVDB is a high-resolution RGB video dataset meticulously designed for UAV detection tasks across diverse scales and complex backgrounds. Comprising …

📊 1 results
📏 Metrics: AP50

2D Panoptic Segmentation

4D-OR

4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted to the ceiling of …

📊 1 results
📏 Metrics: VPQ

MM-OR

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing …

📊 1 results
📏 Metrics: VPQ

2D Pose Estimation

300W

The 300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large …

📊 1 results
📏 Metrics: Mean [email protected]

Animal Kingdom

Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of …

📊 1 results

Desert Locust

Desert Locust is an animal pose estimation dataset for desert locusts.

📊 1 results
📏 Metrics: Mean [email protected]

HARPER

We introduce HARPER, a novel dataset for 3D body pose estimation and forecast in dyadic interactions between users and Spot, …

📊 1 results
📏 Metrics: PCK

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 1 results
📏 Metrics: EPE

MP-100

The first large-scale pose dataset containing objects of multiple super-categories, termed Multi-category Pose (MP-100). In total, MP-100 dataset covers 100 …

📊 5 results
📏 Metrics: Mean [email protected] - 1shot, Mean [email protected] - 5shot

MacaquePose

MacaquePose is an animal pose estimation dataset containing pictures of macaque monkeys and manually labeled annotations on them.

📊 1 results
📏 Metrics: AP

Vinegar Fly

Vinegar Fly is a pose estimation dataset for fruit flies.

📊 1 results
📏 Metrics: Mean [email protected]

iRodent

Description: The "iRodent" dataset contains rodent species observations obtained using the iNaturalist API, with a focus on Suborder Myomorpha (Taxon …

📊 8 results
📏 Metrics: Average mAP

2D Semantic Segmentation

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 1 results
📏 Metrics: mIoU

GF-PA66 3D XCT

Stack of 2D gray images of glass fiber-reinforced polyamide 66 (GF-PA66) 3D X-ray Computed Tomography (XCT) specimen. Usage: 2D/3D image …

📊 1 results
📏 Metrics: Jaccard (Mean)

WaterScenes

A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces. Description of the dataset: WaterScenes, the first …

📊 1 results
📏 Metrics: mIoU

WildScenes

WildScenes is a bi-modal benchmark dataset consisting of multiple large-scale, sequential traversals in natural environments, including semantic annotations in high-resolution …

📊 5 results
📏 Metrics: mIoU, mIoU (Temporal DA), mIoU (Env DA)

xBD

The xBD dataset contains over 45,000 km² of polygon-labeled pre- and post-disaster imagery. The dataset provides the post-disaster imagery …

📊 5 results
📏 Metrics: Weighted Average F1-score, Localization F1-score, Classification F1-score

2D Semantic Segmentation task 3 (25 classes)

CaDIS

CaDIS: a Cataract Dataset for Image Segmentation is a dataset for semantic segmentation created by Digital Surgery Ltd. on top …

📊 6 results
📏 Metrics: Mean IoU (test), Mean IoU (val)

3D Absolute Human Pose Estimation

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 4 results
📏 Metrics: MRPE, Average MPJPE (mm), PA-MPJPE

3D Action Recognition

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 7 results
📏 Metrics: Actions Top-1, Verbs Top-1, Object Top-1

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 3 results
📏 Metrics: Cross Subject Accuracy, Cross View Accuracy

3D Anomaly Detection

Real 3D-AD

Real 3D-AD is the first point cloud anomaly detection dataset for industrial products. Real3D-AD comprises a total of 1,254 samples …

📊 19 results
📏 Metrics: Mean Performance of P. and O., Point AUROC, Object AUROC

3D Assembly

DeepCAD

DeepCAD is a CAD dataset consisting of 179,133 models and their CAD construction sequences. It can be used to train …

📊 1 results
📏 Metrics: 1-1

3D Canonical Hand Pose Estimation

STB

3D hand pose data set created using a stereo camera - contains 18,000 RGB images and paired depth images - 3D …

📊 1 results
📏 Metrics: AUC

3D Classification

U-10: United-10 COVID19 CT Dataset

This dataset supports the research detailed in the pre-print "Virtual Imaging Trials Improved the Transparency and Reliability of AI Systems …

📊 2 results
📏 Metrics: AUC

3D Depth Estimation

Relative Human

Relative Human (RH) contains multi-person in-the-wild RGB images with rich human annotations, including: Depth layers: relative depth relationship/ordering between all …

📊 3 results
📏 Metrics: PCDR, PCDR-Baby, PCDR-Kid, PCDR-Teen, PCDR-Adult, mPCDK

3D Face Animation

BEAT2

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, …

📊 5 results
📏 Metrics: MSE

Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2

BIWI 3D corpus comprises a total of 1109 sentences uttered by 14 native English speakers (6 males and 8 females). …

📊 5 results
📏 Metrics: Lip Vertex Error, FDD

VOCASET

VOCASET is a 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio. …

📊 2 results
📏 Metrics: Lip Vertex Error

3D Face Modelling

Voxceleb-3D

A dataset for voice and 3D face structure study. It contains about 1.4K identities with their 3D face models and …

📊 2 results
📏 Metrics: Mean ARE, ARE-ER, ARE-FR, ARE-MR, ARE-CR

3D Face Reconstruction

AFLW2000-3D

AFLW2000-3D is a dataset of 2000 images that have been annotated with image-level 68-point 3D facial landmarks. This dataset is …

📊 8 results
📏 Metrics: Mean NME

Florence

The Florence 3D faces dataset consists of: * High-resolution 3D scans of human faces from many subjects. * Several video …

📊 15 results
📏 Metrics: Mean NME, Average 3D Error, RMSE Cooperative, RMSE Indoor, RMSE Outdoor

NoW Benchmark

The goal of this benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D …

📊 15 results
📏 Metrics: Median Reconstruction Error, Mean Reconstruction Error (mm), Stdev Reconstruction Error (mm)

REALY

The REALY benchmark aims to introduce a region-aware evaluation pipeline to measure the fine-grained normalized mean square error (NMSE) of …

📊 24 results
📏 Metrics: all, @nose, @mouth, @forehead, @cheek

3D Generation

E.T. the Exceptional Trajectories

📊 4 results
📏 Metrics: FD_ClaTr, ClaTr-Score, Classifier-F1

3D Hand Pose Estimation

DexYCB

DexYCB is a dataset for capturing hand grasping of objects. It can be used for three relevant tasks: 2D object and …

📊 9 results
📏 Metrics: Average MPJPE (mm), Procrustes-Aligned MPJPE, MPVPE, VAUC, PA-MPVPE, PA-VAUC

FreiHAND

FreiHAND is a 3D hand pose dataset which records different hand actions performed by 32 people. For each hand image, …

📊 30 results
📏 Metrics: PA-MPJPE, PA-MPVPE, PA-F@5mm, PA-F@15mm

H3WB

Human3.6M 3D WholeBody (H3WB) is a large scale dataset with 133 whole-body keypoint annotations on 100K images, made possible by …

📊 15 results
📏 Metrics: Average MPJPE (mm)

HInt: Hand Interactions in the wild

The HInt dataset is frequently used as a generalizability benchmark for 3D Hand Reconstruction. It features three data subsets: HInt-NewDays, …

📊 9 results
📏 Metrics: [email protected] (New Days) All, [email protected] (VISOR) All, [email protected] (Ego4D) All, [email protected] (NewDays) Visible, [email protected] (VISOR) Visible, [email protected] (Ego4D) Visible, [email protected] (NewDays) Occ, [email protected] (VISOR) Occ, [email protected] (Ego4D) Occ

HO-3D v2

A hand-object interaction dataset with 3D pose annotations of hand and object. The dataset contains 66,034 training images and 11,524 …

📊 21 results
📏 Metrics: PA-MPJPE (mm), PA-MPVPE, F@5mm, F@15mm, AUC_J, AUC_V

HO-3D v3

HO-3D v3 is version 3 of the HO-3D dataset, with more accurate hand-object poses. HO-3D v3 provides more …

📊 8 results
📏 Metrics: PA-MPJPE, PA-MPVPE, F@5mm, F@15mm, AUC_J, AUC_V

InterHand2.6M

The InterHand2.6M dataset is a large-scale real-captured dataset with accurate GT 3D interacting hand poses, used for 3D hand pose …

📊 1 results
📏 Metrics: MPJPE

3D Human Pose Estimation

3DPW

The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. …

📊 115 results
📏 Metrics: MPVPE, PA-MPJPE, MPJPE, Acceleration Error, FLOPs (G), Number of parameters (M)

AGORA

AGORA is a synthetic human dataset with high realism and accurate ground truth. It consists of around 14K training and …

📊 11 results
📏 Metrics: B-NMVE, B-NMJE, B-MVE, B-MPJPE

AIST++

AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance …

📊 5 results
📏 Metrics: MPJPE, Single-view, Acceleration Error

DHP19

DHP19 is the first human pose dataset with data collected from DVS event cameras. It has recordings from 4 synchronized …

📊 2 results
📏 Metrics: MPJPE3D, GFLOPs, MPJPE2D, Params (M)

EMDB

EMDB contains in-the-wild videos of human activity recorded with a hand-held iPhone. It features reference SMPL body pose and shape …

📊 13 results
📏 Metrics: Average MPJPE-PA (mm), Average MPJPE (mm), Average MVE (mm), Average MVE-PA (mm), Average MPJAE (deg), Average MPJAE-PA (deg), Jitter (10m/s^3)

H3WB

Human3.6M 3D WholeBody (H3WB) is a large scale dataset with 133 whole-body keypoint annotations on 100K images, made possible by …

📊 17 results
📏 Metrics: MPJPE

HSPACE

HSPACE (Human-SPACE) is a large-scale photo-realistic dataset of animated humans placed in complex synthetic indoor and outdoor environments. For all …

📊 1 results
📏 Metrics: MPJPE, MPVPE, PA-MPJPE, PA-MPVPE

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 80 results
📏 Metrics: Average MPJPE (mm), Using 2D ground-truth joints, Multi-View or Monocular, PA-MPJPE, Acceleration Error, Angular Error, MPVE (mm)

JTA

JTA is a dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now …

📊 1 results
📏 Metrics: F1(t=0.4m), F1(t=0.8m), F1(t=1.2m)

MPI-INF-3DHP

MPI-INF-3DHP is a 3D human body pose estimation dataset consisting of both constrained indoor and complex outdoor scenes. It records …

📊 105 results
📏 Metrics: MPJPE, AUC, PCK, 3DPCK, PA-MPJPE, Acceleration Error

Panoptic

CMU Panoptic is a large-scale dataset providing 3D pose annotations (1.5 million) for multiple people engaging in social activities. It …

📊 6 results
📏 Metrics: Average MPJPE (mm)

RICH

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object …

📊 4 results
📏 Metrics: MPJPE, MPVPE, PA-MPJPE, BoSE

SLOPER4D

SLOPER4D is a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation …

📊 4 results
📏 Metrics: Average MPJPE (mm)

UBody

UBody is a large-scale Upper-Body dataset with the following annotations: * 2D whole-body keypoints * 3D SMPLX annotations * Frame …

📊 4 results
📏 Metrics: PVE-All, PVE-Hands, PVE-Face, PA-PVE-All, PA-PVE-Hands, PA-PVE-Face

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 1 results
📏 Metrics: MPJPE

3D Instance Segmentation

MitoEM

Contains mitochondria instances. Source: MitoEM

📊 1 results
📏 Metrics: AP75-R-Val, AP75-H-Val, AP75-R-Test, AP75-H-Test

PartNet

PartNet is a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. The dataset …

📊 3 results
📏 Metrics: mAP50

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 21 results
📏 Metrics: AP@50, mAP, mPrec, mRec, mIoU, mAcc, mCov, mWCov

STPLS3D

Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry dataset with synthetic and real annotated 3D point clouds for …

📊 8 results
📏 Metrics: AP, AP50, AP25

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 1 results
📏 Metrics: mAP

ScanNet++

ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, …

📊 3 results
📏 Metrics: mAP

ScanNet200

The ScanNet200 benchmark studies 200-class 3D semantic segmentation - an order of magnitude more class categories than previous 3D scene …

📊 5 results
📏 Metrics: mAP, mAP@25, mAP@50

SceneNN

SceneNN is an RGB-D scene dataset consisting of more than 100 indoor scenes. The scenes are captured at various places, …

📊 3 results
📏 Metrics: [email protected]

3D Interacting Hand Pose Estimation

InterHand2.6M

The InterHand2.6M dataset is a large-scale real-captured dataset with accurate GT 3D interacting hand poses, used for 3D hand pose …

📊 8 results
📏 Metrics: MPJPE Test, MPVPE Test, MRRPE Test

3D Multi-Object Tracking

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 3 results
📏 Metrics: MOTA/L2

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 10 results
📏 Metrics: AMOTA, MOTA, Recall

nuScenes LiDAR only

Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have …

📊 1 results
📏 Metrics: AMOTA

3D Multi-Person Pose Estimation

AGORA

AGORA is a synthetic human dataset with high realism and accurate ground truth. It consists of around 14K training and …

📊 4 results
📏 Metrics: B-NMVE, B-NMJE, B-MVE, B-MPJPE

MuPoTS-3D

MuPoTs-3D (Multi-person Pose estimation Test Set in 3D) is a dataset for pose estimation composed of more than 8,000 frames …

📊 10 results
📏 Metrics: 3DPCK

Panoptic

CMU Panoptic is a large-scale dataset providing 3D pose annotations (1.5 million) for multiple people engaging in social activities. It …

📊 17 results
📏 Metrics: Average MPJPE (mm)

3D Multi-Person Pose Estimation (absolute)

MuPoTS-3D

MuPoTs-3D (Multi-person Pose estimation Test Set in 3D) is a dataset for pose estimation composed of more than 8,000 frames …

📊 14 results
📏 Metrics: 3DPCK, MPJPE

3D Multi-Person Pose Estimation (root-relative)

MuPoTS-3D

MuPoTs-3D (Multi-person Pose estimation Test Set in 3D) is a dataset for pose estimation composed of more than 8,000 frames …

📊 19 results
📏 Metrics: 3DPCK, MPJPE, AUC

3D Object Captioning

Objaverse

Objaverse is a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse …

📊 6 results
📏 Metrics: GPT-4, Sentence-BERT, SimCSE, Precision, Correctness, Hallucination

3D Object Detection

3RScan

A novel dataset and benchmark, which features 1482 RGB-D scans of 478 environments across multiple time steps. Each scene includes …

📊 3 results

ARKitScenes

ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide …

📊 4 results

Aria Everyday Objects

A small-scale, real-world Project Aria dataset with high-quality static 3D oriented bounding box annotations. Dataset Contents - Project Aria …

📊 4 results
📏 Metrics: mAP

Aria Synthetic Environments

Aria Synthetic Environments is a large-scale, fully simulated dataset created by …

📊 4 results
📏 Metrics: MAP

Cityscapes 3D

Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. …

📊 1 results
📏 Metrics: mDS

Clear Weather

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results
📏 Metrics: mod. Car [email protected]

DAIR-V2X

DAIR-V2X is a large-scale, multi-modality, multi-view dataset from real scenarios for VICAD. DAIR-V2X comprises 71254 LiDAR frames and 71254 Camera …

📊 1 results
📏 Metrics: AP50

DTTD-Mobile

Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR? We introduce DTTD-Mobile …

📊 5 results
📏 Metrics: ADD AUC, ADD-S AUC

Dense Fog

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results
📏 Metrics: mod. Car [email protected], mod. Cyclist [email protected], mod. Pedestrian [email protected], mod. mAP

Heavy Snowfall

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results
📏 Metrics: mod. Car [email protected]

Light Snowfall

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results
📏 Metrics: mod. Car [email protected]

MultiScan

We introduce MultiScan, a scalable RGBD dataset construction pipeline leveraging commodity mobile devices to scan indoor scenes with articulated objects …

📊 3 results

ONCE

ONCE (One millioN sCenEs) is a dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists …

📊 2 results
📏 Metrics: mAP

OPV2V

OPV2V is a large-scale open simulated dataset for Vehicle-to-Vehicle perception. It contains over 70 interesting scenes, 11,464 frames, and 232,913 …

📊 5 results
📏 Metrics: [email protected]@Default, [email protected]@CulverCity

Rope3D

Roadside Perception 3D (Rope3D) is a dataset for autonomous driving and monocular 3D object detection task consisting of 50k images …

📊 7 results
📏 Metrics: [email protected]

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 7 results

ScanNet++

ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, …

📊 3 results

SimBEV

The SimBEV dataset is a collection of 320 scenes spread across all 11 CARLA maps and contains data from a …

📊 5 results
📏 Metrics: SDS, mAP, mATE, mAOE, mASE, mAVE

TruckScenes

Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public …

📊 4 results
📏 Metrics: NDS, mAP

V2X-SIM

V2X-Sim, short for vehicle-to-everything simulation, is a synthetic collaborative perception dataset in autonomous driving developed by AI4CE Lab at …

📊 5 results
📏 Metrics: mAP, mATE, mASE, mAOE

V2XSet

A large-scale V2X perception dataset using CARLA and OpenCDA

📊 6 results
📏 Metrics: AP0.5 (Perfect), AP0.7 (Perfect), AP0.5 (Noisy), AP0.7 (Noisy)

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 7 results
📏 Metrics: mAPH/L2

aiMotive Dataset

aiMotive dataset is a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with …

📊 3 results
📏 Metrics: BEV [email protected] Highway, BEV [email protected] Night, BEV [email protected] Rain, BEV [email protected] Urban

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 32 results
📏 Metrics: NDS, mAP, mATE, mASE, mAOE, mAVE, mAAE

nuScenes LiDAR only

Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have …

📊 7 results
📏 Metrics: NDS, NDS (val), mAP, mAP (val)

3D Object Reconstruction

BEHAVE

BEHAVE is a full body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits along …

📊 3 results
📏 Metrics: Chamfer Distance

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 1 results
📏 Metrics: 3DIoU

3D Object Tracking

RTB

The Robot Tracking Benchmark (RTB) is a synthetic dataset that facilitates the quantitative evaluation of 3D tracking algorithms for multi-body …

📊 1 results
📏 Metrics: ADDS AUC, Runtime [ms]

3D Open-Vocabulary Instance Segmentation

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 7 results
📏 Metrics: mAP

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 4 results
📏 Metrics: AP50 Base B8/N4 , AP50 Novel B8/N4, AP50 Base B6/N6, AP50 Novel B6/N6

STPLS3D

Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry dataset with synthetic and real annotated 3D point clouds for …

📊 3 results
📏 Metrics: AP50

ScanNet200

The ScanNet200 benchmark studies 200-class 3D semantic segmentation - an order of magnitude more class categories than previous 3D scene …

📊 6 results
📏 Metrics: mAP, AP50, AP25, AP Head, AP Common, AP Tail

3D Point Cloud Classification

IntrA

IntrA is an open-access 3D intracranial aneurysm dataset that makes the application of points-based and mesh-based classification and segmentation models …

📊 11 results
📏 Metrics: F1 score (5-fold)

ModelNet40-C

ModelNet40-C is a comprehensive dataset to benchmark the corruption robustness of 3D point cloud recognition. We create ModelNet40-C based on …

📊 11 results
📏 Metrics: Error Rate

ScanObjectNN

ScanObjectNN is a newly published real-world dataset comprising 2902 3D objects in 15 categories. It is a challenging point …

📊 67 results
📏 Metrics: Overall Accuracy, Mean Accuracy, OBJ-BG (OA), OBJ-ONLY (OA), FLOPs, Number of params

Sydney Urban Objects

This dataset contains a variety of common urban road objects scanned with a Velodyne HDL-64E LIDAR, collected in the CBD …

📊 2 results
📏 Metrics: F1

3D Point Cloud Interpolation

DHB Dataset

Dynamic Human Bodies dataset (DHB), containing 10 point cloud sequences from the MITAMA dataset and 4 from the 8IVFB dataset. …

📊 5 results
📏 Metrics: CD, EMD

NL-Drive

A challenging multi-frame interpolation dataset for autonomous driving scenarios. Based on the principle of hard-sample selection and the diversity of …

📊 4 results
📏 Metrics: CD, EMD

3D Point Cloud Linear Classification

ScanObjectNN

ScanObjectNN is a newly published real-world dataset comprising 2902 3D objects in 15 categories. It is a challenging point …

📊 2 results
📏 Metrics: Overall Accuracy

3D Pose Estimation

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

HARPER

We introduce HARPER, a novel dataset for 3D body pose estimation and forecast in dyadic interactions between users and Spot, …

📊 1 results
📏 Metrics: Average MPJPE (mm)

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 3 results
📏 Metrics: Average MPJPE (mm)

K2HPD

Includes 100K depth images under challenging scenarios. Source: Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

📊 1 results
📏 Metrics: FPS

3D Reconstruction

300W

The 300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Aria Digital Twin Dataset

A real-world dataset, with hyper-accurate digital counterpart & comprehensive ground-truth annotation. Dataset Content - 200 sequences (~400 mins) - 398 …

📊 1 results
📏 Metrics: Accuracy, Completeness, Precision

Aria Synthetic Environments

Aria Synthetic Environments is a large-scale, fully simulated dataset created by …

📊 1 results
📏 Metrics: Accuracy, Completeness, Precision, Recall

DTU

DTU MVS 2014 is a multi-view stereo dataset, which is an order of magnitude larger in number of scenes and …

📊 20 results
📏 Metrics: Overall, Acc, Comp

Scan2CAD

Scan2CAD is an alignment dataset based on 1506 ScanNet scans with 97607 annotated keypoint pairs between 14225 (3049 unique) CAD …

📊 2 results
📏 Metrics: Average Accuracy

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 1 results
📏 Metrics: 3DIoU, Chamfer Distance, L1

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 8 results
📏 Metrics: IoU, Chamfer Distance, F-Score@1%

3D Scene Graph Alignment

3DSSG

3DSSG provides 3D semantic scene graphs for 3RScan. A semantic scene graph is defined by a set of tuples between …

📊 2 results
📏 Metrics: MRR, F1, Hits@1

3D Semantic Scene Completion

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 7 results
📏 Metrics: mIoU

NYUv2

The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both …

📊 26 results
📏 Metrics: mIoU

PRO-teXt

PRO-teXt is an extension of PROXD with the inclusion of text prompts to synthesize objects. There are 180/20 interactions for …

📊 3 results
📏 Metrics: F1, CD, CMD

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 18 results
📏 Metrics: mIoU

3D Semantic Segmentation

DALES

We present the Dayton Annotated LiDAR Earth Scan (DALES) data set, a new large-scale aerial LiDAR data set with over …

📊 8 results
📏 Metrics: mIoU, Overall Accuracy, Model size

ECLAIR

ECLAIR (Extended Classification of Lidar for AI Recognition), a new outdoor large-scale aerial LiDAR dataset designed specifically for advancing research …

📊 1 results
📏 Metrics: F1, Mean IoU

Hypersim

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. …

📊 2 results
📏 Metrics: mIoU, mIoU (test)

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 6 results
📏 Metrics: mIoU, mIoU Category, mIoU Val, Model size

OpenTrench3D

OpenTrench3D is the first publicly available point cloud dataset of underground utilities from open trenches. It features 310 fully annotated point …

📊 3 results
📏 Metrics: mIoU, mAcc, Model Size

PartNet

PartNet is a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. The dataset …

📊 6 results
📏 Metrics: mIOU

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 6 results
📏 Metrics: mIoU (Area-5), mIoU (6-Fold), mAcc

STPLS3D

Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry dataset with synthetic and real annotated 3D point clouds for …

📊 4 results
📏 Metrics: mIOU

ScanNet++

ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, …

📊 8 results
📏 Metrics: Top-1 IoU, Top-3 IoU

ScanNet200

The ScanNet200 benchmark studies 200-class 3D semantic segmentation - an order of magnitude more class categories than previous 3D scene …

📊 16 results
📏 Metrics: val mIoU, test mIoU

ScribbleKITTI

ScribbleKITTI is a scribble-annotated dataset for LiDAR semantic segmentation.

📊 6 results
📏 Metrics: mIoU, mIoU-1%

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 44 results
📏 Metrics: test mIoU, val mIoU, mIoU-1%

SensatUrban

The SensatUrban dataset is an urban-scale photogrammetric point cloud dataset with nearly three billion richly annotated points, which is five …

📊 7 results
📏 Metrics: mIoU, oAcc

Toronto-3D

Toronto-3D is a large-scale urban outdoor point cloud dataset acquired by an MLS system in Toronto, Canada for semantic segmentation. …

📊 6 results
📏 Metrics: OA, mIoU

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 1 results
📏 Metrics: mIoU

WildScenes

WildScenes is a bi-modal benchmark dataset consisting of multiple large-scale, sequential traversals in natural environments, including semantic annotations in high-resolution …

📊 4 results
📏 Metrics: mIoU, mIoU (Temporal DA), mIoU (Env DA)

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 3 results
📏 Metrics: mIoU

3D Shape Reconstruction

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Pix3D

The Pix3D dataset is a large-scale benchmark of diverse image-shape pairs with pixel-level 2D-3D alignment. Pix3D has wide applications in …

📊 5 results
📏 Metrics: CD, EMD, IoU

3D Shape Reconstruction From A Single 2D Image

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

LLFF

Local Light Field Fusion (LLFF) is a practical and robust deep learning solution for capturing and rendering novel views of …

📊 1 results
📏 Metrics: CLIP

3D human pose and shape estimation

EgoBody

EgoBody dataset is a novel large-scale dataset for egocentric 3D human pose, shape and motions under interactions in complex 3D …

📊 2 results
📏 Metrics: Average MPJPE (mm), PA-MPJPE, MPVPE, PA-MPVPE

3D scene Editing

LLFF

Local Light Field Fusion (LLFF) is a practical and robust deep learning solution for capturing and rendering novel views of …

📊 1 results
📏 Metrics: CLIP

4D Panoptic Segmentation

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 5 results
📏 Metrics: LSTQ

6D Pose Estimation

3D-BSLS-6D

The dataset consists of both real captures from Photoneo PhoXi structured light scanner devices annotated by hand and synthetic samples produced …

📊 1 results
📏 Metrics: eRE, eTE

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

DTTD-Mobile

Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR? We introduce DTTD-Mobile …

📊 8 results
📏 Metrics: ADD AUC, ADD-S AUC, AR CoU, AR CH, AR pCH

OPT

Accurately tracking the six degree-of-freedom pose of an object in real scenes is an important task in computer vision and …

📊 2 results
📏 Metrics: AUC

YCB-Video

The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects …

📊 9 results
📏 Metrics: ADDS AUC

AMR Graph Similarity

Benchmark for AMR Metrics based on Overt Objectives

Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity …

📊 7 results
📏 Metrics: Pearson’s ρ (amean), Spearman Correlation

AMR Parsing

Bio

This corpus includes annotations of cancer-related PubMed articles, covering 3 full papers (PMID:24651010, PMID:11777939, PMID:15630473) as well as the result …

📊 3 results
📏 Metrics: Smatch

LDC2017T10

Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University …

📊 23 results
📏 Metrics: Smatch

LDC2020T02

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University …

📊 9 results
📏 Metrics: Smatch

New3

New3, a set of 527 instances from AMR 3.0, whose original source was the LORELEI DARPA project – not included …

📊 2 results
📏 Metrics: Smatch

The Little Prince

This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were …

📊 2 results
📏 Metrics: Smatch

Abnormal Event Detection In Video

UBI-Fights

UBI-Fights - Concerning a specific anomaly detection and still providing a wide diversity in fighting scenarios, the UBI-Fights dataset is …

📊 4 results
📏 Metrics: AUC, Decidability, EER

UCSD Ped2

The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …

📊 4 results
📏 Metrics: AUC

Abstractive Text Summarization

AESLC

To study the task of email subject line generation: automatically generating an email subject line from the email body. Source: …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN …

📊 3 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

WikiHow

WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base …

📊 1 results
📏 Metrics: Content F1, ROUGE-1, ROUGE-2, ROUGE-L

Acoustic Scene Classification

CochlScene

CochlScene is a dataset for acoustic scene classification. The dataset consists of 76k samples collected from 831 participants in 13 …

📊 2 results
📏 Metrics: 1:1 Accuracy

DCASE 2019 Mobile

TAU Urban Acoustic Scenes 2019 Mobile development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping …

📊 1 results
📏 Metrics: Accuracy

TAU Urban Acoustic Scenes 2019

TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …

📊 1 results
📏 Metrics: 1:1 Accuracy

TUT Acoustic Scenes 2017

The TUT Acoustic Scenes 2017 dataset is a collection of recordings from various acoustic scenes all from distinct locations. For …

📊 1 results
📏 Metrics: 1:1 Accuracy

TUT Urban Acoustic Scenes 2018

The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. …

📊 1 results
📏 Metrics: Acc

Action Anticipation

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 2 results
📏 Metrics: Verbs Recall@5, Objects Recall@5, Actions Recall@5

EGTEA

Extended GTEA Gaze+ (EGTEA Gaze+) is a large-scale dataset for FPV actions and gaze. It subsumes GTEA Gaze+ and comes …

📊 2 results
📏 Metrics: Top-1 Accuracy

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 8 results
📏 Metrics: Recall@5, Top-5 Verb, Top-5 Noun

EgoExoLearn

EgoExoLearn is a dataset designed to bridge the gap between egocentric and exocentric views of procedural activities. …

📊 2 results
📏 Metrics: Accuracy

Action Detection

Charades

The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving …

📊 15 results
📏 Metrics: mAP

MultiSports

Spatio-temporal action detection is an important and challenging problem in video understanding. The existing action detection benchmarks are limited in …

📊 2 results
📏 Metrics: Frame-mAP 0.5, Video-mAP 0.2, Video-mAP 0.5

MultiTHUMOS

The MultiTHUMOS dataset contains dense, multilabel, frame-level action annotations for 30 hours across 400 videos in the THUMOS'14 action detection …

📊 1 results
📏 Metrics: mAP

TSU

Toyota Smarthome Untrimmed (TSU) is a dataset for activity detection in long untrimmed videos. The dataset contains 536 videos with …

📊 1 results
📏 Metrics: Frame-mAP

TTStroke-21 ME21

This task offers researchers an opportunity to test their fine-grained classification methods for detecting and recognizing strokes in table tennis …

📊 2 results
📏 Metrics: IoU, mAP

TTStroke-21 ME22

TTStroke-21 for MediaEval 2022. The task is of interest to researchers in the areas of machine learning (classification), visual content …

📊 2 results
📏 Metrics: IoU, mAP

UCF Sports

The UCF Sports dataset consists of a set of actions collected from various sports which are typically featured on broadcast …

📊 5 results
📏 Metrics: Frame-mAP 0.5, Video-mAP 0.2, Video-mAP 0.5

UCF101-24

📊 15 results
📏 Metrics: Frame-mAP 0.5, Video-mAP 0.1, Video-mAP 0.2, Video-mAP 0.5

Action Parsing

JerichoWorld

JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive …

📊 3 results
📏 Metrics: Set accuracy

Action Quality Assessment

AQA-7

Consists of 1106 action samples from seven actions with quality scores as measured by expert human judges. Source: [Action Quality …

📊 9 results
📏 Metrics: Spearman Correlation, RL2(*100)

EgoExoLearn

EgoExoLearn is a dataset designed to bridge the gap between egocentric and exocentric views of procedural activities. …

📊 2 results
📏 Metrics: Accuracy

FineDiving

We construct a fine-grained video dataset organized by both semantic and temporal structures, where each structure contains two-level annotations. …

📊 4 results
📏 Metrics: Spearman Correlation, RL2(*100)

JIGSAWS

The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a surgical activity dataset for human motion modeling. The data …

📊 5 results
📏 Metrics: Spearman Correlation

MTL-AQA

A new multitask action quality assessment (AQA) dataset, the largest to date, comprising more than 1600 diving samples; contains …

📊 21 results
📏 Metrics: Spearman Correlation, RL2(*100)

Rhythmic Gymnastic

The Rhythmic Gymnastics dataset contains videos of four different types of gymnastics routines: ball, clubs, hoop and ribbon. Each type …

📊 1 results
📏 Metrics: Spearman Correlation

UI-PRMD

UI-PRMD is a data set of movements related to common exercises performed by patients in physical therapy and rehabilitation programs. …

📊 1 results
📏 Metrics: Average mean absolute error

Action Recognition

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 16 results
📏 Metrics: mAP

Animal Kingdom

Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of …

📊 2 results
📏 Metrics: mAP

BAR

Biased Action Recognition (BAR) dataset is a real-world image dataset categorized as six action classes which are biased to distinct …

📊 4 results
📏 Metrics: Accuracy

Charades

The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving …

📊 1 results
📏 Metrics: mAP

Charades-Ego

Contains 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most …

📊 6 results
📏 Metrics: mAP

DVS128 Gesture

Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions. Source: [A Low Power, Fully Event-Based Gesture Recognition …

📊 1 results
📏 Metrics: Accuracy (%)

Drone-Action

Website: https://asankagp.github.io/droneaction/

📊 2 results
📏 Metrics: Top 1 Accuracy, Top-1 Accuracy

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 30 results
📏 Metrics: Action@1, Verb@1, Noun@1, GFLOPs

EPIC-KITCHENS-55

The EPIC-KITCHENS-55 dataset comprises a set of 432 egocentric videos recorded by 32 participants in their kitchens at 60fps with …

📊 1 results
📏 Metrics: Top-1 Accuracy

EgoGesture

The EgoGesture dataset contains 2,081 RGB-D videos, 24,161 gesture samples and 2,953,224 frames from 50 distinct subjects. Source: http://www.nlpr.ia.ac.cn/iva/yfzhang/datasets/egogesture.html Image …

📊 1 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

H2O (2 Hands and Objects)

We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects. To this …

📊 10 results
📏 Metrics: Actions Top-1, RGB, Hand Pose, Object Pose, Object Label

HAA500

HAA500 is a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames. …

📊 4 results
📏 Metrics: Top-1 (%)

HACS

HACS is a dataset for human action recognition. It uses a taxonomy of 200 action classes, which is identical to …

📊 8 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy

HMDB51

The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset …

📊 1 results
📏 Metrics: Accuracy

IndustReal

IndustReal is an ego-centric, multi-modal dataset where 27 participants are challenged to perform assembly and maintenance procedures on a construction-toy …

📊 1 results
📏 Metrics: Top-1, Top-5

Jester (Gesture Recognition)

Jester Gesture Recognition dataset includes 148,092 labeled video clips of humans performing basic, pre-defined hand gestures in front of a …

📊 3 results
📏 Metrics: Val

MECCANO

The MECCANO dataset is the first dataset of egocentric videos to study human-object interactions in industrial-like settings. The MECCANO dataset …

📊 1 results
📏 Metrics: Top-1 Accuracy

MTL-AQA

A new multitask action quality assessment (AQA) dataset, the largest to date, comprising more than 1600 diving samples; contains …

📊 1 results
📏 Metrics: Position Accuracy, Armstand Accuracy, Rotation Type Accuracy, No. of Somersaults Accuracy, No. of Twists Accuracy

Mimetics

📊 2 results
📏 Metrics: mAP

N-UCLA

The Multiview 3D event dataset was captured by me and Xiaohan Nie at UCLA. It contains RGB, depth and human …

📊 1 results
📏 Metrics: Accuracy (Cross-Subject), Accuracy (Cross-View)

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 21 results
📏 Metrics: Accuracy (CS), Accuracy (CV)

NTU RGB+D 120

NTU RGB+D 120 is a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and …

📊 16 results
📏 Metrics: Accuracy (Cross-Setup), Accuracy (Cross-Subject)

Okutama-Action

A new video dataset for aerial view concurrent human action detection. It consists of 43 minute-long fully-annotated sequences with 12 …

📊 2 results
📏 Metrics: Accuracy

Penn Action

The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence. Source: …

📊 2 results
📏 Metrics: Accuracy

RareAct

RareAct is a video dataset of unusual actions, including actions like “blend phone”, “cut keyboard” and “microwave shoes”. It aims …

📊 3 results
📏 Metrics: mWAP

RoCoG-v2

RoCoG-v2 (Robot Control Gestures) is a dataset intended to support the study of synthetic-to-real and ground-to-air video domain adaptation. It …

📊 1 results
📏 Metrics: Top-1 Accuracy

Skeleton-Mimetics

A dataset derived from the recently introduced Mimetics dataset. Source: Quo Vadis, Skeleton Action Recognition?

📊 1 results
📏 Metrics: Accuracy

Something-Something V1

The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday …

📊 66 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy, Param., GFLOPs

Something-Something V2

The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with …

📊 116 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy, Parameters, GFLOPs

Sports-1M

The Sports-1M dataset consists of over a million videos from YouTube. The videos in the dataset can be obtained through …

📊 8 results
📏 Metrics: Video hit@1, Video hit@5, Clip Hit@1

THUMOS14

The THUMOS14 (THUMOS 2014) dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for …

📊 1 results
📏 Metrics: Accuracy

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects …

📊 4 results
📏 Metrics: Top 1 Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 76 results
📏 Metrics: 3-fold Accuracy, Accuracy, Accuracy 20%Test

UTD-MHAD

The UTD-MHAD dataset consists of 27 different actions performed by 8 subjects. Each subject repeated each action 4 times, …

📊 1 results
📏 Metrics: Accuracy

Volleyball

Volleyball is a video action recognition dataset. It has 4830 annotated frames that were handpicked from 55 videos with 9 …

📊 3 results
📏 Metrics: Accuracy

Win-Fail Action Understanding

First of its kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick …

📊 1 results
📏 Metrics: 2-Class Accuracy

Action Recognition In Videos

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 1 results
📏 Metrics: mAP

Jester (Gesture Recognition)

Jester Gesture Recognition dataset includes 148,092 labeled video clips of humans performing basic, pre-defined hand gestures in front of a …

📊 9 results
📏 Metrics: Val

Kinetics-600

The Kinetics-600 is a large-scale action recognition dataset which consists of around 480K videos from 600 action categories. The 480K …

📊 1 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 1 results
📏 Metrics: Accuracy (CS)

PKU-MMD

The PKU-MMD dataset is a large skeleton-based action detection dataset. It contains 1076 long untrimmed video sequences performed by 66 …

📊 2 results
📏 Metrics: X-Sub, X-View

Something-Something V1

The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday …

📊 3 results
📏 Metrics: Top 1 Accuracy

Something-Something V2

The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with …

📊 4 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

Sports-1M

The Sports-1M dataset consists of over a million videos from YouTube. The videos in the dataset can be obtained through …

📊 2 results
📏 Metrics: Video hit@1, Video hit@5

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 5 results
📏 Metrics: 3-fold Accuracy

Action Segmentation

50 Salads

Activity recognition research has shifted focus from distinguishing full-body motion patterns to recognizing complex interactions of multiple entities. Manipulative gestures …

📊 21 results
📏 Metrics: F1@50%, F1@25%, F1@10%, Acc, Edit

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 5 results
📏 Metrics: F1@10%, F1@25%, F1@50%, Edit, MoF

Breakfast

The Breakfast Actions Dataset comprises 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different …

📊 28 results
📏 Metrics: Average F1, F1@50%, F1@25%, F1@10%, Edit, Acc, mIoU, F1

COIN

The COIN dataset (a large-scale dataset for COmprehensive INstructional video analysis) consists of 11,827 videos related to 180 different tasks …

📊 9 results
📏 Metrics: Frame accuracy

GTEA

The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities such as making sandwich, tea, or coffee. …

📊 19 results
📏 Metrics: F1@50%, F1@25%, F1@10%, Acc, Edit

JIGSAWS

The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a surgical activity dataset for human motion modeling. The data …

📊 7 results
📏 Metrics: Edit Distance, Accuracy, F1@10, F1@25, F1@50

MPII Cooking 2 Dataset

A dataset which provides detailed annotations for activity recognition. Source: [Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script …

📊 1 results
📏 Metrics: Accuracy, mIoU

Youtube INRIA Instructional

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car …

📊 2 results
📏 Metrics: Acc, F1

Action Understanding

Win-Fail Action Understanding

First of its kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick …

📊 1 results
📏 Metrics: 2-Class Accuracy

Active Speaker Detection

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 1 results
📏 Metrics: Accuracy

Active Speaker Localization

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 1 results
📏 Metrics: ASL mAP

Activity Detection

AVA-Speech

Contains densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. …

📊 3 results
📏 Metrics: ROC-AUC

Activity Recognition

RWF-2000

A database with 2,000 videos captured by surveillance cameras in real-world scenes. Source: [RWF-2000: An Open Large Scale Video Database …

📊 4 results
📏 Metrics: Accuracy

Stanford40

The Stanford 40 Action Dataset contains images of humans performing 40 actions. In each image, we provide a bounding box …

📊 2 results
📏 Metrics: Top-3 Accuracy (%)

Ad-hoc video search

TRECVID-AVS16 (IACC.3)

Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).

📊 3 results
📏 Metrics: infAP

TRECVID-AVS17 (IACC.3)

Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).

📊 3 results
📏 Metrics: infAP

TRECVID-AVS18 (IACC.3)

Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).

📊 3 results
📏 Metrics: infAP

TRECVID-AVS19 (V3C1)

The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content …

📊 2 results
📏 Metrics: infAP

TRECVID-AVS20 (V3C1)

The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content …

📊 1 results
📏 Metrics: infAP

Adversarial Attack

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 6 results
📏 Metrics: Attack: PGD20, Attack: AutoAttack, Attack: DeepFool, Robust Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Attack: AutoAttack

WSJ0-2mix

WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus. Source: [Deep …

📊 1 results
📏 Metrics: SDR

Adversarial Defense

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 8 results
📏 Metrics: Accuracy, Attack: AutoAttack, Robust Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 3 results
📏 Metrics: autoattack

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 1 results
📏 Metrics: Accuracy

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 2 results
📏 Metrics: Accuracy, Inference speed

Adversarial Robustness

AdvGLUE

Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale …

📊 10 results
📏 Metrics: Accuracy

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 5 results
📏 Metrics: Accuracy, Robust Accuracy, Attack: AutoAttack

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Clean Accuracy, AutoAttacked Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 4 results
📏 Metrics: Accuracy

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 4 results
📏 Metrics: Accuracy

ImageNet-C

ImageNet-C is an open source data set that consists of algorithmically generated corruptions (blur, noise) applied to the ImageNet test-set. …

📊 4 results
📏 Metrics: mean Corruption Error (mCE)

Stylized ImageNet

The Stylized-ImageNet dataset is created by removing local texture cues in ImageNet while retaining global shape information on natural images …

📊 4 results
📏 Metrics: Accuracy

Affordance Detection

3D AffordanceNet

3D AffordanceNet is a dataset of 23k shapes for visual affordance. It consists of 56,307 well-defined affordance information annotations for …

📊 1 results
📏 Metrics: AIOU, mAP

Affordance Recognition

HICO-DET

HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and …

📊 4 results
📏 Metrics: COCO-Val2017, Object365, HICO, Novel classes

Age And Gender Classification

BN-AuthProf

Although research on author profiling has progressed considerably for high-resource languages, it is still in its infancy for low-resource languages …

📊 1 results
📏 Metrics: F1 score

Age Estimation

AFAD

The Asian Face Age Dataset (AFAD) is a new dataset proposed for evaluating the performance of age estimation, which contains …

📊 10 results
📏 Metrics: MAE

AgeDB

AgeDB contains 16,488 images of various famous people, such as actors/actresses, writers, scientists, politicians, etc. Every image is annotated …

📊 10 results
📏 Metrics: MAE

CACD

The Cross-Age Celebrity Dataset (CACD) contains 163,446 images from 2,000 celebrities collected from the Internet. The images are collected from …

📊 13 results
📏 Metrics: MAE

IMDB-Clean

We have cleaned the noisy IMDB-WIKI dataset using a constrained clustering method, resulting in this new benchmark for in-the-wild age estimation. …

📊 4 results
📏 Metrics: Average mean absolute error

KANFace

KANFace consists of 40K still images and 44K sequences (14.5M video frames in total) captured in unconstrained, real-world conditions from …

📊 1 results
📏 Metrics: Average mean absolute error

LAGENDA

The LAGENDA dataset is a large-scale dataset with age and gender annotations for face and body bounding boxes. The dataset …

📊 2 results
📏 Metrics: MAE

MORPH

MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 …

📊 1 results
📏 Metrics: MAE

UTKFace

The UTKFace dataset is a large-scale face dataset with long age span (range from 0 to 116 years old). The …

📊 14 results
📏 Metrics: MAE

Analogical Similarity

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Anatomy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Ancestor-descendant prediction

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …

📊 1 results
📏 Metrics: mAP-0%, mAP-50%, mAP-100%

Animal Pose Estimation

AP-10K

AP-10K is the first large-scale benchmark for general animal pose estimation, to facilitate the research in animal pose estimation. AP-10K …

📊 10 results
📏 Metrics: AP

Animal-Pose Dataset

Animal-Pose Dataset is an animal pose dataset to facilitate training and evaluation. This dataset provides animal pose annotations on five …

📊 1 results
📏 Metrics: AP

Animal3D

Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many …

📊 1 results
📏 Metrics: PA-MPJPE

Fish-100

Schools of inland silversides (Menidia beryllina, n=14 individuals per school) were recorded in the Lauder Lab at Harvard University while …

📊 3 results
📏 Metrics: mAP

Horse-10

Horse-10 is an animal pose estimation dataset. It comprises 30 diverse Thoroughbred horses, for which 22 body parts were labeled …

📊 8 results
📏 Metrics: PCK@0.3 (OOD), Normalized Error (OOD)

Marmoset-8K

All animal procedures are overseen by veterinary staff of the MIT and Broad Institute Department of Comparative Medicine, in compliance …

📊 3 results
📏 Metrics: mAP

StanfordExtra

An 'in the wild' dataset of 20,580 dog images for which 2D joint and silhouette annotations were collected. Source: [Who …

📊 3 results
📏 Metrics: [email protected]

TriMouse-161

Three wild-type (C57BL/6J) male mice ran on a paper spool following odor trails (Mathis et al 2018). These experiments were …

📊 5 results
📏 Metrics: mAP

Anomaly Classification

GoodsAD

The GoodsAD dataset contains 6124 images with 6 categories of common supermarket goods. Each category contains multiple goods. All images …

📊 10 results
📏 Metrics: AUPR, AUROC

MVTec-AC

MVTec-AC is a curated refinement of the widely-used MVTec-AD dataset, specifically designed for anomaly classification—distinguishing between different types of anomalies …

📊 1 results
📏 Metrics: Accuracy (%)

MVTecAD

MVTec AD is a dataset for benchmarking anomaly detection methods with a focus on industrial inspection. It contains over 5000 …

📊 2 results
📏 Metrics: Accuracy (%)

VisA

The VisA dataset contains 12 subsets corresponding to 12 different objects as shown in the above figure. There are 10,821 …

📊 1 results
📏 Metrics: Detection AUROC

VisA-AC

VisA-AC is a refined benchmark based on the VisA dataset, tailored for the task of anomaly classification—distinguishing between different types …

📊 1 results
📏 Metrics: Accuracy (%)

Anomaly Detection

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

📊 1 results
📏 Metrics: AUC

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: AUROC

BTAD

The BTAD (beanTech Anomaly Detection) dataset is a real-world industrial anomaly dataset. The dataset contains a total of 2830 …

📊 12 results
📏 Metrics: Detection AUROC, Segmentation AUROC, Segmentation AP, Segmentation AUPRO

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: Mean AUC

COCO-OOC

COCO-OOC goes beyond standard object detection to ask the question: Which objects are out-of-context (OOC)? Given an image with a …

📊 1 results
📏 Metrics: AUC

CUHK Avenue

Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 …

📊 29 results
📏 Metrics: AUC, RBDC, TBDC, FPS

DIOR

📊 4 results
📏 Metrics: ROC AUC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 10 results
📏 Metrics: ROC AUC

Fishyscapes

Fishyscapes is a public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates …

📊 8 results
📏 Metrics: AP, FPR95

Forest CoverType

Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given …

📊 1 results
📏 Metrics: AUC

Hyper-Kvasir Dataset

HyperKvasir dataset contains 110,079 images and 374 videos where it captures anatomical landmarks and pathological and normal findings. A total …

📊 5 results
📏 Metrics: AUC

IITB Corridor

An abnormal activity dataset for research use that contains 483,566 annotated frames. Source: [Multi-timescale Trajectory Prediction for Abnormal Human Activity …

📊 1 results
📏 Metrics: AUC

ITDD

The Industrial Textile Defect Detection (ITDD) dataset includes 1885 industrial textile images categorized into 4 categories: cotton fabric, dyed fabric, …

📊 1 results
📏 Metrics: Detection AUROC, Segmentation AUROC

InsPLAD

InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicles colour images. It contains 17 …

📊 4 results
📏 Metrics: Detection AUROC

KDD Cup 1999

This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held …

📊 1 results
📏 Metrics: F1-Score

Kaggle-Credit Card Fraud Dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …

📊 1 results
📏 Metrics: AUC

LAG

Includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432). Source: [Attention Based Glaucoma Detection: A …

📊 4 results
📏 Metrics: AUC

Lost and Found

Lost and Found is a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of …

📊 4 results
📏 Metrics: AP, FPR

MIT-BIH Arrhythmia Database

The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the …

📊 1 results
📏 Metrics: F1 score

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 5 results
📏 Metrics: ROC AUC

MPDD

MPDD is a dataset aimed at benchmarking visual defect detection methods in industrial metal parts manufacturing. It consists of more …

📊 14 results
📏 Metrics: Detection AUROC, Segmentation AUROC, Segmentation AUPRO

MVTEC 3D-AD

MVTec 3D Anomaly Detection Dataset (MVTec 3D-AD) is a comprehensive 3D dataset for the task of unsupervised anomaly detection and …

📊 2 results
📏 Metrics: Segmentation AUPRO, Detection AUROC, Segmentation AUROC

MVTec LOCO AD

MVTec Logical Constraints Anomaly Detection (MVTec LOCO AD) dataset is intended for the evaluation of unsupervised anomaly localization algorithms. The …

📊 35 results
📏 Metrics: Avg. Detection AUROC, Detection AUROC (only logical), Detection AUROC (only structural), Segmentation AU-sPRO (until FPR 5%)

Musk v1

The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …

📊 1 results
📏 Metrics: F1-Score

ODDS

Outliers or anomalies are instances that do not conform to the norm of a dataset. Outlier detection is an important …

📊 3 results
📏 Metrics: AUROC, F1

PAD Dataset

The Multi-pose Anomaly Detection (MAD) dataset represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD …

📊 2 results
📏 Metrics: Detection AUROC, Segmentation AUROC

Road Anomaly

This dataset contains images of unusual dangers which can be encountered by a vehicle on the road – animals, rocks, …

📊 9 results
📏 Metrics: AP, FPR95

SMD

A dataset for time-series anomaly detection.

📊 1 results
📏 Metrics: Recall, precision, F1, F1-score

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 1 results
📏 Metrics: Mean AUC

ShanghaiTech

The ShanghaiTech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided …

📊 28 results
📏 Metrics: AUC, RBDC, TBDC

ShanghaiTech Campus

The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and …

📊 1 results
📏 Metrics: AUC-ROC

Street Scene

Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution …

📊 1 results
📏 Metrics: AUC, RBDC, TBDC

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results
📏 Metrics: AUC

Thyroid

Thyroid is a dataset for detection of thyroid diseases, in which patients diagnosed with hypothyroid or subnormal are anomalies against …

📊 2 results
📏 Metrics: AUC, Average Precision, F1-Score

UBnormal

UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, …

📊 13 results
📏 Metrics: AUC, RBDC, TBDC

UCF-Crime

The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world …

📊 1 results
📏 Metrics: AUC

UCR Anomaly Archive

The UCR Anomaly Archive is a collection of 250 uni-variate time series collected in human medicine, biology, meteorology and industry. …

📊 24 results
📏 Metrics: Average F1, AUC ROC

UCSD Ped2

The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …

📊 9 results
📏 Metrics: AUC, FPS

UEA time-series datasets

Five datasets used in the NeurTraL-AD paper. RacketSports (RS): Accelerometer and gyroscope recording of players playing four different racket sports. Each …

📊 3 results
📏 Metrics: Avg. ROC-AUC

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on GitHub - …

📊 2 results
📏 Metrics: AUC

VisA

The VisA dataset contains 12 subsets corresponding to 12 different objects as shown in the above figure. There are 10,821 …

📊 45 results
📏 Metrics: Detection AUROC, Segmentation AUPRO (until 30% FPR), F1-Score, Segmentation AUPRO, Segmentation AUROC

WFDD

WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric …

📊 1 results
📏 Metrics: Detection AUROC, Segmentation AUPRO, Segmentation AUROC

voraus-AD

voraus-AD contains machine data of a collaborative robot, which moves a can by performing an industrial pick-and-place task. The samples …

📊 3 results
📏 Metrics: Avg. Detection AUROC

Anxiety Detection

Well-being Dataset

The dataset is a private dataset collected for automatic analysis of psychological distress. It contains self-reported distress labels provided by …

📊 1 results
📏 Metrics: F1-score

Arabic Text Diacritization

CATT

The CATT benchmark dataset comprises 742 sentences, which were scraped from an internet news source in 2023. It covers multiple …

📊 11 results
📏 Metrics: DER(%), WER (%)

Arithmetic Reasoning

GSM8K

GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. …

📊 144 results
📏 Metrics: Accuracy, Parameters (Billion)

Game of 24

Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations …

📊 1 results
📏 Metrics: Success
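
The Game of 24 task described above can be illustrated with a small brute-force search. This is only a sketch, assuming the usual rules (each of the four numbers is used exactly once, with +, -, * and /, to reach 24); the names `solve_24` and `_search` are hypothetical and are not part of any benchmark harness.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return one expression over `numbers` that equals `target`, or None."""
    # Exact rational arithmetic avoids floating-point error in divisions.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value is left; check whether it hits the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Combine every unordered pair with every operation, then recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None
```

Calling `solve_24([4, 9, 10, 13])` returns some valid expression string, while an unsolvable input such as `[1, 1, 1, 1]` returns `None`. A "Success" metric could then be read as the fraction of puzzles for which a model's proposed expression evaluates to 24, though the exact scoring used by the listed result is not stated here.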

MathMC

Existing arithmetic benchmarks have a limited number of multiple-choice questions. To address this gap, MathMC is created including 1,000 Chinese …

📊 1 results
📏 Metrics: Accuracy

MathToF

Existing arithmetic benchmarks have a limited number of True-or-False questions. To address this gap, MathToF is created including 1,000 Chinese …

📊 1 results
📏 Metrics: Accuracy

Aspect Category Detection

SemEval-2014 Task-4

Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of …

📊 1 results
📏 Metrics: Average Recall, Hit@5, MRR, NDCG

Aspect Extraction

SemEval-2014 Task-4

Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of …

📊 5 results
📏 Metrics: Laptop (F1), Restaurant (F1), Mean F1 (Laptop + Restaurant)

Aspect-Based Sentiment Analysis (ABSA)

ACOS

Most of the aspect based sentiment analysis research aims at identifying the sentiment polarities toward some explicit aspect terms while …

📊 8 results
📏 Metrics: F1 (Laptop), F1 (Restaurant)

ASQP

Aspect-based sentiment analysis (ABSA) typically focuses on extracting aspects and predicting their sentiments on individual sentences such as customer reviews. …

📊 9 results
📏 Metrics: F1 (R15), F1 (R16)

ASTE

Target-based sentiment analysis or aspect-based sentiment analysis (ABSA) refers to addressing various sentiment analysis tasks at a fine-grained level, which …

📊 9 results
📏 Metrics: F1 (L14), F1 (R14), F1 (R15), F1 (R16)

MAMS

MAMS is a challenge dataset for aspect-based sentiment analysis (ABSA), in which each sentence contains at least two aspects with …

📊 1 results
📏 Metrics: Acc, Macro-F1

SemEval-2014 Task-4

Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of …

📊 32 results
📏 Metrics: Mean Acc (Restaurant + Laptop), Restaurant (Acc), Laptop (Acc)

TASD

Aspect-based sentiment analysis (ABSA) aims to detect the targets (which are composed of continuous words), aspects and sentiment polarities in …

📊 9 results
📏 Metrics: F1 (R15), F1 (R16)

Astronomy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Atomic action recognition

CATER

Rendered synthetically using a library of standard 3D objects, and tests the ability to recognize compositions of object movements that …

📊 4 results
📏 Metrics: Average-mAP

Atomic number classification

CHILI-100K

The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined …

📊 9 results
📏 Metrics: F1-score (Weighted)

CHILI-3K

The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from …

📊 9 results
📏 Metrics: F1-score (Weighted)

Attribute Extraction

SWDE

This dataset is a real-world web page collection used for research on the automatic extraction of structured data (e.g., attribute-value …

📊 2 results
📏 Metrics: Avg F1

Attribute Mining

AE-110k

The dataset contains product information from AliExpress Sports & Entertainment category. Each attribute value in "Item Specific" is matched against …

📊 1 results
📏 Metrics: F1-score

MAVE

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It …

📊 1 results
📏 Metrics: F1-score

OA-Mine - annotations

The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The …

📊 1 results
📏 Metrics: F1-score

Attribute Value Extraction

AE-110k

The dataset contains product information from AliExpress Sports & Entertainment category. Each attribute value in "Item Specific" is matched against …

📊 2 results
📏 Metrics: F1-score

MAVE

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It …

📊 3 results
📏 Metrics: F1-score

OA-Mine - annotations

The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The …

📊 2 results
📏 Metrics: F1-score

WDC-PAVE

The dataset contains 1,420 human-annotated product offers, systematically selected from the Web Data Commons Product Matching Corpus, featuring 24,582 …

📊 5 results
📏 Metrics: F1-Score

Audio Classification

AudioSet

AudioSet is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 43 results
📏 Metrics: Test mAP, AUC, d-prime

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …

📊 3 results
📏 Metrics: Accuracy

DEEP-VOICE: DeepFake Voice Recognition

DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion This dataset contains examples of real human speech, and …

📊 1 results
📏 Metrics: Accuracy (10-fold)

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 4 results
📏 Metrics: Top-1 Action, Top-1 Noun, Top-1 Verb, Top-5 Action, Top-5 Noun, Top-5 Verb

EPIC-SOUNDS

EPIC-SOUNDS is a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of …

📊 3 results
📏 Metrics: Accuracy

ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. …

📊 26 results
📏 Metrics: Top-1 Accuracy, PRE-TRAINING DATASET, Accuracy (5-fold)

FSD50K

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally …

📊 10 results
📏 Metrics: mAP, Mean AP

ICBHI Respiratory Sound Database

The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics …

📊 20 results
📏 Metrics: ICBHI Score, Sensitivity, Specificity

MeerKAT: Meerkat Kalahari Audio Transcripts

A large-scale reference dataset for bioacoustics. MeerKAT is a 1068h large-scale dataset containing data from audio-recording collars worn by free-ranging …

📊 1 results
📏 Metrics: AP

Multimodal PISA

Dataset for multimodal skills assessment focusing on assessing a piano player's skill level. Annotations include the player's skill level, and song difficulty …

📊 1 results
📏 Metrics: Accuracy (%)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 1 results
📏 Metrics: Top-1 Accuracy

SHD

The Spiking Heidelberg Digits (SHD) dataset is an audio-based classification dataset of 1k spoken digits ranging from zero to nine.

📊 11 results
📏 Metrics: Percentage correct

SSC

The SSC dataset is a spiking version of the Speech Commands dataset released by Google (Speech Commands). SSC was generated …

📊 3 results
📏 Metrics: Accuracy

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

📊 7 results
📏 Metrics: Accuracy

UCR Time Series Classification Archive

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining …

📊 1 results
📏 Metrics: FruitFlies, MosquitoSound, RightWhaleCalls

VocalSound

VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from …

📊 2 results
📏 Metrics: Accuracy

Audio Generation

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 23 results
📏 Metrics: FD_openl3, FAD, FD, KL_passt, IS, CLAP_LAION, CLAP_MS

Audio Source Separation

AudioSet

AudioSet is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 2 results
📏 Metrics: SDR, SAR, SIR

Audio Super-Resolution

DSD100

DSD100 is a dataset of 100 full-length music tracks of different styles along with their isolated drums, …

📊 1 results
📏 Metrics: SNR

Audio Tagging

AudioSet

AudioSet is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 9 results
📏 Metrics: mean average precision

Audio captioning

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 13 results
📏 Metrics: SPIDEr, CIDEr, SPICE, BLEU-4, METEOR, ROUGE-L, FENSE, SPIDEr-FL, #params (M), ROUGE, Sentence-BERT

Clotho

Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total …

📊 9 results
📏 Metrics: SPIDEr, CIDEr, SPICE, BLEU-4, METEOR, ROUGE-L, FENSE, SPIDEr-FL, Sentence-BERT

Audio-Visual Speech Recognition

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 8 results
📏 Metrics: Test WER

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 12 results
📏 Metrics: Word Error Rate (WER)

Audio-visual Question Answering

MUSIC-AVQA

The large-scale MUSIC-AVQA dataset of musical performance contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All …

📊 5 results
📏 Metrics: Acc

AutoML

Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …

📊 1 results
📏 Metrics: accuracy

Automated Essay Scoring

ASAP-AES

There are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range …

📊 4 results
📏 Metrics: Quadratic Weighted Kappa

Automatic Sleep Stage Classification

ISRUC-Sleep

ISRUC-Sleep is a polysomnographic (PSG) dataset. The data were obtained from human adults, including healthy subjects, and subjects with sleep …

📊 1 results
📏 Metrics: AUROC, Accuracy, Kappa

Sleep-EDF

The sleep-edf database contains 197 whole-night PolySomnoGraphic sleep recordings, containing EEG, EOG, chin EMG, and event markers. Some records also …

📊 3 results
📏 Metrics: Accuracy, Cohen’s Kappa score, Number of parameters (M)

Automatic Speech Recognition (ASR)

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 9 results
📏 Metrics: Test WER

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 2 results
📏 Metrics: WER, Word Error Rate (WER)

RealMAN

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …

📊 1 results
📏 Metrics: CER

Sagalee

Speech Recognition Dataset for Oromo Language. 📊 Key features of Sagalee: 100 hours of read speech. 283 gender balanced …

📊 2 results
📏 Metrics: Test WER

Autonomous Driving

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Autonomous Vehicles

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

BEV Segmentation

SimBEV

The SimBEV dataset is a collection of 320 scenes spread across all 11 CARLA maps and contains data from a …

📊 5 results
📏 Metrics: mIoU, road, car, truck, bus, motorcycle, bicycle, rider, pedestrian

BIG-bench Machine Learning

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Beat Tracking

ASAP

ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical …

📊 1 results
📏 Metrics: F1

Ballroom

This data set includes beat and bar annotations of the ballroom dataset, introduced by Gouyon et al. [1]. [1] Gouyon …

📊 1 results
📏 Metrics: F1

Beatles

This dataset includes the beat and downbeat annotations for Beatles albums. The annotations are provided by M. E. P. Davies …

📊 1 results
📏 Metrics: F1

Candombe

35 recordings of Candombe music with beat and downbeat annotations.

📊 1 results
📏 Metrics: F1

Filosax

48 multitrack jazz recordings with many annotations.

📊 1 results
📏 Metrics: F1

GTZAN

The gtzan8 audio dataset contains 1000 tracks of 30 second length. There are 10 genres, each containing 100 tracks which …

📊 1 results
📏 Metrics: F1

Groove

The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive …

📊 1 results
📏 Metrics: F1

GuitarSet

GuitarSet is a dataset of high-quality guitar recordings and rich annotations. It contains 360 excerpts 30 seconds in length. The …

📊 1 results
📏 Metrics: F1

HJDB

J. Hockman, M. E. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and …

📊 1 results
📏 Metrics: F1

Hainsworth

S. W. Hainsworth and M. D. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP Journal on Advances in Signal …

📊 1 results
📏 Metrics: F1

Harmonix

Beats, downbeats, and functional structural annotations for 912 Pop tracks. Nieto, O., McCallum, M., Davies, M., Robertson, A., Stark, A., …

📊 1 results
📏 Metrics: F1

JAAH

Eremenko, E. Demirel, B. Bozkurt, and X. Serra, “Audio-aligned jazz harmony dataset for automatic chord transcription and corpus-based research,” in …

📊 1 results
📏 Metrics: F1

SIMAC

F. Gouyon, “A computational approach to rhythm description — Audio features for the computation of rhythm periodicity functions and their …

📊 1 results
📏 Metrics: F1

SMC

A. Holzapfel, M. E. Davies, J. R. Zapata, J. L. Oliveira, and F. Gouyon, “Selective sampling for beat tracking evaluation,” …

📊 1 results
📏 Metrics: F1

TapCorrect

J. Driedger, H. Schreiber, W. B. de Haas, and M. Müller, “Towards automatically correcting tapped beat annotations for music recordings.” …

📊 1 results
📏 Metrics: F1

Benchmarking

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …

📊 1 results
📏 Metrics: Perplexity

Bias Detection

StereoSet

A large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion. Source: [StereoSet: …

📊 11 results
📏 Metrics: ICAT Score, LMS, SS

rt-inod-bias

The Innodata Red Teaming Prompts aims to rigorously assess models’ factuality and safety. This dataset, due to its manual creation …

📊 5 results
📏 Metrics: Best-of

Binary Classification

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results
📏 Metrics: F1-Score

fake

[Real or Fake]: Fake Job Description Prediction. This dataset contains 18K job descriptions out of which about 800 are …

📊 8 results
📏 Metrics: AUROC

kickstarter

Kickstarter is a community of more than 10 million people comprising creative and tech enthusiasts who help in bringing creative …

📊 4 results
📏 Metrics: AUROC

Binary Condescension Detection

DPM

Don’t Patronize Me! (DPM) is an annotated dataset with Patronizing and Condescending Language towards vulnerable communities.

📊 2 results
📏 Metrics: F1-score

Binary text classification

TweepFake

The TweepFake dataset consists of 25,572 social media messages posted either by bots or humans on Twitter. Each bot imitated …

📊 2 results
📏 Metrics: F1 score, Accuracy (%)

Bird's-Eye View Semantic Segmentation

SimBEV

The SimBEV dataset is a collection of 320 scenes spread across all 11 CARLA maps and contains data from a …

📊 5 results
📏 Metrics: mIoU, road, car, truck, bus, motorcycle, bicycle, rider, pedestrian

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 15 results
📏 Metrics: IoU veh - 224x480 - Vis filter. - 100x100 at 0.5, IoU veh - 448x800 - Vis filter. - 100x100 at 0.5, IoU veh - 224x480 - No vis filter - 100x100 at 0.5, IoU veh - 448x800 - No vis filter - 100x100 at 0.5, IoU ped - 224x480 - Vis filter. - 100x100 at 0.5, IoU lane - 224x480 - 100x100 at 0.5, IoU veh - 224x480 - No vis filter - 100x50 at 0.25, IoU vehicle - Setting 3

Blind Face Restoration

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 6 results
📏 Metrics: FID, LPIPS, PSNR

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of the 5749 identities with …

📊 9 results
📏 Metrics: FID

WIDER

WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and …

📊 9 results
📏 Metrics: FID

Blood pressure estimation

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 1 results
📏 Metrics: MAE for SBP [mmHg], MAE for DBP [mmHg], Mean Squared Error, MAE

Boundary Detection

CoAuthor

CoAuthor is a dataset designed for revealing GPT-3's capabilities in assisting creative and argumentative writing. CoAuthor captures rich interactions between …

📊 3 results
📏 Metrics: Cohen’s Kappa score

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results
📏 Metrics: odsF

RoFT

RoFT is a dataset of 21,000 human annotations of generated text. The task is "Boundary detection" i.e. given a passage …

📊 4 results
📏 Metrics: Accuracy (%), MSE

RoFT-chatgpt

RoFT-chatgpt is a variation of RoFT dataset, where the same human prompts are continued with the gpt-3.5-turbo model. Each dataset …

📊 4 results
📏 Metrics: Accuracy (%), MSE

UruDendro

UruDendro is a database of wood cross section images of commercially grown Pinus taeda trees from northern Uruguay. It is …

📊 2 results
📏 Metrics: Average Precision, Average Recall, F1-score, FScore

Brain Decoding

BCI Competition IV: ECoG to Finger Movements

Prediction of Finger Flexion: IV Brain-Computer Interface Data Competition. The goal of this dataset is to predict the flexion of …

📊 1 results
📏 Metrics: Pearson Correlation

Stanford ECoG library: ECoG to Finger Movements

Electrophysiological data from implanted electrodes in the human brain are rare, and therefore scientific access to it has remained somewhat …

📊 1 results
📏 Metrics: Pearson Correlation

Breast Cancer Histology Image Classification

BreakHis

The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 …

📊 3 results
📏 Metrics: Accuracy (%), 1:1 Accuracy, Accuracy (Inter-Patient)

Breast Tumour Classification

PCam

PatchCamelyon is an image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of …

📊 16 results
📏 Metrics: AUC, Accuracy

CARLA longest6

CARLA

CARLA (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine …

📊 19 results
📏 Metrics: Driving Score, Route Completion, Infraction Score

CCG Supertagging

CCGbank

CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations …

📊 4 results
📏 Metrics: Accuracy

COVID-19 Diagnosis

COVIDGR

Under a close collaboration with an expert radiologist team of the Hospital Universitario San Cecilio, the COVIDGR-1.0 dataset of patients' …

📊 1 results
📏 Metrics: Accuracy

COVIDx CXR-3

COVIDx CXR-3 is an open access benchmark dataset that we generated, comprising 30,882 CXR images across 17,026 patient cases. Images …

📊 7 results
📏 Metrics: Per-Class Accuracy

Large COVID-19 CT scan slice dataset

"We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the …

📊 1 results
📏 Metrics: AUC-ROC, Accuracy, Macro F1, Macro Precision, Macro Recall, Micro Precision, Specificity

Calving Front Delineation In Synthetic Aperture Radar Imagery

CaFFe

The temporal variability in calving front positions of marine-terminating glaciers permits inference on the frontal ablation. Frontal ablation, the sum …

📊 1 results
📏 Metrics: Mean Distance Error

Camera Pose Estimation

KITTI Odometry Benchmark

The odometry benchmark consists of 22 stereo sequences, saved in lossless PNG format. We provide 11 sequences (00-10) with …

📊 7 results
📏 Metrics: Average Translational Error et[%], Average Rotational Error er[%], Absolute Trajectory Error [m]

Camouflaged Object Segmentation

CAMO

The Camouflaged Object (CAMO) dataset is specifically designed for the task of camouflaged object segmentation. We focus on two categories, i.e., naturally …

📊 10 results
📏 Metrics: S-Measure, Weighted F-Measure, MAE

Camouflaged Animal Dataset

The nine (moving camera) videos in this benchmark exhibit camouflaged animals that are difficult to see in a single frame, …

📊 2 results
📏 Metrics: S-measure, weighted F-measure, MAE, mDice, mIoU

MoCA-Mask

The original Moving Camouflaged Animals (MoCA) Dataset includes 37K frames from 141 YouTube Video sequences with resolution and sampling rate …

📊 3 results
📏 Metrics: S-measure, weighted F-measure, MAE, mDice, mIoU

NC4K

As far as we know, there only exists one large camouflaged object testing dataset, the COD10K, while the sizes of …

📊 6 results
📏 Metrics: S-measure, weighted F-measure, MAE

Cancer Classification

Multi-omics mRNA, miRNA, and DNA Methylation Dataset

The dataset contains multi-omics data, including mRNA, miRNA, and DNA methylation. The dataset comprises 8,464 samples involving 2,794 omics features …

📊 2 results
📏 Metrics: 1:1 Accuracy

Causal Inference

IHDP

The Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visit …

📊 9 results
📏 Metrics: Average Treatment Effect Error

Jobs

The Jobs dataset by LaLonde [36] is a widely used benchmark in the causal inference community, where the treatment is …

📊 3 results
📏 Metrics: Average Treatment Effect on the Treated Error

Cell Segmentation

DIC-C2DH-HeLa

HeLa cells on a flat glass. Dr. G. van Cappellen, Erasmus Medical Center, Rotterdam, The Netherlands

📊 2 results
📏 Metrics: SEG (~Mean IoU)

Fluo-C3DL-MDA231

MDA231 human breast carcinoma cells infected with a pMSCV vector including the GFP sequence, embedded in a collagen matrix. Dr. …

📊 1 results
📏 Metrics: SEG (~Mean IoU)

Fluo-N2DH-GOWT1

GFP-GOWT1 mouse stem cells. Dr. E. Bártová, Institute of Biophysics, Academy of Sciences of the Czech Republic, Brno, Czech Republic

📊 2 results
📏 Metrics: SEG (~Mean IoU)

Fluo-N2DH-SIM+

Simulated nuclei of HL60 cells stained with Hoechst. Dr. V. Ulman and Dr. D. Svoboda, Centre for Biomedical Image Analysis …

📊 2 results
📏 Metrics: SEG (~Mean IoU)

Fluo-N2DL-HeLa

HeLa cells stably expressing H2b-GFP. Mitocheck Consortium

📊 3 results
📏 Metrics: SEG (~Mean IoU)

PhC-C2DH-U373

Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate. Dr. S. Kumar, Department of Bioengineering, University of California at Berkeley, Berkeley CA …

📊 1 results
📏 Metrics: SEG (~Mean IoU)

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 1 results
📏 Metrics: AUC

Change Detection

CDD Dataset (season-varying)

📊 13 results
📏 Metrics: F1-Score, F1, Precision, Recall, Overall Accuracy, KC, IoU

CLCD

The CLCD dataset consists of 600 image pairs of cropland change samples, with 360 pairs for training, 120 pairs for …

📊 3 results
📏 Metrics: F1

ChangeSim

ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation …

📊 1 results
📏 Metrics: Category mIoU

DSIFN-CD

The dataset is manually collected from Google Earth. It consists of six large bi-temporal high resolution images covering six cities …

📊 7 results
📏 Metrics: F1, Precision, Recall, Overall Accuracy, KC, IoU, Params(M)

EGY-BCD

Bi-temporal images in the EGY-BCD dataset are taken from 4 different regions located in Egypt, including New Mansoura, El Galala …

📊 3 results
📏 Metrics: F1

GVLM

For change detection tasks, current open-source datasets mainly focus on building extraction (e.g., WHU building dataset and LEVIR-CD dataset) (Chen …

📊 4 results
📏 Metrics: F1

LEVIR-CD

LEVIR-CD is a new large-scale remote sensing building Change Detection dataset. The introduced dataset would be a new benchmark for …

📊 24 results
📏 Metrics: F1, IoU, Overall Accuracy, F1-score, Recall, Precision

PCD

The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total number of verses is …

📊 1 results
📏 Metrics: F1 score

S2Looking

S2Looking is a building change detection dataset that contains large-scale side-looking satellite images captured at varying off-nadir angles. The S2Looking …

📊 11 results
📏 Metrics: F1-Score, Precision, Recall, OA, KC, IoU, F1

SECOND

SECOND is a well-annotated semantic change detection dataset. To ensure data diversity, we first collect 4662 pairs of aerial images …

📊 2 results
📏 Metrics: SeK, Fscd, mIoU

WHU Building Dataset

We manually edited an aerial and a satellite imagery dataset of building samples and named it the WHU building dataset. …

📊 7 results
📏 Metrics: F1-score

Change Point Detection

SKAB

SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ …

📊 1 results
📏 Metrics: NAB (standard), NAB (lowFP), NAB (LowFN)

TSSB

The time series segmentation benchmark (TSSB) currently contains 75 annotated time series (TS) with 1-9 segments. Each TS is constructed …

📊 3 results
📏 Metrics: Relative Change Point Distance, Covering

Chart Question Answering

ChartQA

Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that …

📊 27 results
📏 Metrics: 1:1 Accuracy

PlotQA

PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and …

📊 6 results
📏 Metrics: 1:1 Accuracy

RealCQA

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic. Available on Hugging Face: https://huggingface.co/datasets/sal4ahm/RealCQA

📊 5 results
📏 Metrics: 1:1 Accuracy

Chatbot

AlpacaEval

The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, hh-rlhf. Those were selected so that the AlpacaEval ranking …

📊 1 results
📏 Metrics: Average win rate

Chunking

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 5 results
📏 Metrics: F1 score

Circulatory Failure

HiRID

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand patient admissions to the Department …

📊 8 results
📏 Metrics: AUPRC, Recall@50

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 results
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, SARTAJ dataset, and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 results
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 results
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 results
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 results
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 results
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 results
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 results
📏 Metrics: 1 shot Micro-F1

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 results
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time( ms / 100ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 results
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, which were originally collected for remote sensing scene classification. We …

📊 1 results
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 results
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOT-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 results
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 results
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 results
📏 Metrics: Accuracy

TCGA

📊 1 results
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarge the dataset to understand how image background affects the computer vision ML model, with the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

Click-Through Rate Prediction

Criteo

Criteo contains 7 days of click-through data, which is widely used for CTR prediction benchmarking. There are 26 anonymous categorical …

📊 38 results
📏 Metrics: AUC, Log Loss

KDD12

A click-through prediction dataset. For more information, please see the Kaggle page.

📊 5 results
📏 Metrics: AUC, Log Loss

KKBox

The task is to predict the chances of a user listening to a song repetitively after the first observable listening …

📊 5 results
📏 Metrics: AUC

MovieLens

The MovieLens datasets, first released in 1998, describe people’s expressed preferences for movies. These preferences take the form of tuples, …

📊 3 results
📏 Metrics: AUC

iPinYou

The iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition was organized by iPinYou from April 1st, 2013 to December 31st, 2013. The …

📊 7 results
📏 Metrics: AUC, LogLoss

Clinical Assertion Status Detection

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

📊 1 results
📏 Metrics: Micro F1

Clinical Concept Extraction

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

📊 3 results
📏 Metrics: Exact Span F1

Clinical Knowledge

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Clothing Attribute Recognition

Clothing Attributes Dataset

We introduce the Clothing Attribute Dataset for promoting research in learning visual attributes for objects. The dataset contains 1856 images, …

📊 3 results
📏 Metrics: Accuracy

Clustering Algorithms Evaluation

97 synthetic datasets

97 synthetic datasets consists of 97 datasets (as illustrated in the figure) and can be used to test graph-based clustering …

📊 1 results
📏 Metrics: HIT-THE-BEST, Rank difference

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 6 results
📏 Metrics: ARI, F1-score, NMI

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 6 results
📏 Metrics: ARI, F1-score, NMI

Olivetti face

This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.

📊 5 results
📏 Metrics: F1-score, ARI, NMI

Code Completion

Defects4J

Defects4J is a collection of reproducible bugs and a supporting infrastructure with the goal of advancing software engineering research. Defects4J …

📊 2 results
📏 Metrics: Compilation Rate, Pass@1, BLEU

DotPrompts

DotPrompts is a set of testcases derived from PragmaticCode, such that each testcase consists of a prompt to a dereference …

📊 2 results
📏 Metrics: Compilation Rate

SAFIM

Syntax-Aware Fill-in-the-Middle (SAFIM) is a benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. SAFIM has …

📊 15 results
📏 Metrics: Average, Algorithmic, Control, API

Code Documentation Generation

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 7 results
📏 Metrics: Smoothed BLEU-4

Code Generation

APPS

The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS …

📊 18 results
📏 Metrics: Introductory Pass@1, Interview Pass@1, Competition Pass@1, Competition Pass@any, Interview Pass@any, Introductory Pass@any, Competition Pass@5, Interview Pass@5, Introductory Pass@5, Competition Pass@1000, Interview Pass@1000, Introductory Pass@1000, Pass@1

CONCODE

A new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new …

📊 2 results
📏 Metrics: Exact Match, BLEU, CodeBLEU

CoNaLa

CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project of the Carnegie Mellon University NeuLab and Strudel.

📊 7 results
📏 Metrics: BLEU, Exact Match Accuracy

CoNaLa-Ext

The CoNaLa Extended With Question Text is an extension to the original CoNaLa Dataset proposed in …

📊 5 results
📏 Metrics: BLEU

CodeContests

CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode. It consists of programming problems, …

📊 8 results
📏 Metrics: Test Set pass@1, Test Set pass@5, Val Set pass@1, Val Set pass@5

DSEval-LeetCode

In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are …

📊 1 results
📏 Metrics: Pass Rate, w/o Intact, w/o PE

Django

The Django dataset is a dataset for code generation comprising 16000 training, 1000 development and 1805 test annotations. Each …

📊 5 results
📏 Metrics: Accuracy, BLEU Score

FloCo

The FloCo dataset contains 11,884 flowchart images and their corresponding Python code.

📊 1 results
📏 Metrics: BLEU, CodeBLEU

HumanEval

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained …

📊 7 results
📏 Metrics: Pass@1

HumanEval-ET

Extension test cases of HumanEval, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, …

📊 95 results
📏 Metrics: Accuracy

MBPP-ET

Extension test cases of MBPP, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

PECC

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving …

📊 8 results
📏 Metrics: Pass@3

RES-Q

RES-Q is a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks …

📊 9 results
📏 Metrics: pass@1

Shellcode_IA32

Shellcode_IA32, a dataset containing 20 years of shellcodes from a variety of sources, is the largest collection of shellcodes …

📊 3 results
📏 Metrics: BLEU-4, Exact Match Accuracy

TACO-BAAI

TACO (Topics in Algorithmic Code generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more …

📊 3 results
📏 Metrics: easy pass@1

Turbulence

Turbulence is a new benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code …

📊 5 results
📏 Metrics: CorrSc

Verified Smart Contract Code Comments

Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both …

📊 2 results
📏 Metrics: BLEU score

VerilogEval

The VerilogEval Dataset is a benchmark specifically designed to assess the ability of large language models (LLMs) …

📊 1 results
📏 Metrics: Pass Rate

WebApp1K-React

Test-driven benchmark to challenge LLMs to write JavaScript React applications.

📊 8 results
📏 Metrics: pass@1

WebApp1k-Duo-React

Test-driven benchmark to challenge LLMs to write long JavaScript React applications.

📊 6 results
📏 Metrics: pass@1

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 10 results
📏 Metrics: Execution Accuracy, Exact Match Accuracy

Code Search

CoDesc

CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and …

📊 3 results
📏 Metrics: Test MRR

CoIR

The CoIR (Code Information Retrieval) benchmark is designed to evaluate code retrieval capabilities. CoIR includes 10 curated code datasets, covering 8 …

📊 1 results
📏 Metrics: nDCG@10

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 6 results
📏 Metrics: Overall, Go, Ruby, Python, Java, JS, PHP

Collaborative Filtering

Amazon-Book

N/A

📊 5 results
📏 Metrics: Recall@20, NDCG@20

Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and …

📊 11 results
📏 Metrics: Recall@20, NDCG@20

Yelp2018

The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars …

📊 9 results
📏 Metrics: NDCG@20, Recall@20

College Medicine

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Collision Avoidance

A Ball Collision Dataset (ABCD)

A Ball-Collision Dataset (ABCD) serves as a comprehensive benchmark for investigating the interaction dynamics of moving objects within 3D environments. …

📊 1 results
📏 Metrics: Accuracy (L:R) - T1

Colorectal Gland Segmentation

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 3 results
📏 Metrics: AUC

Colorization

ImageNet ctest10k

Colorization validation set for unconditional/conditional colorization tasks. It is a subset of the ImageNet validation images that excludes any grayscale single-channel images.

📊 1 results
📏 Metrics: FID

Common Sense Reasoning

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 1 results
📏 Metrics: Accuracy

CommonsenseQA

CommonsenseQA is a dataset for the commonsense question answering task. The dataset consists of 12,247 questions with 5 choices each. …

📊 38 results
📏 Metrics: Accuracy

PARus

Choice of Plausible Alternatives for Russian language (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense …

📊 6 results
📏 Metrics: Accuracy

RWSD

A Winograd schema is a pair of sentences that differ in only one or two words and that contain an …

📊 6 results
📏 Metrics: Accuracy

ReCoRD

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of …

📊 33 results
📏 Metrics: EM, F1

RuCoS

Russian reading comprehension with Commonsense reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of …

📊 6 results
📏 Metrics: Average F1, EM

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 5 results
📏 Metrics: Test, Dev

WinoGAViL

This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …

📊 1 results
📏 Metrics: Jaccard Index

WinoGrande

WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …

📊 73 results
📏 Metrics: Accuracy

Community Detection

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …

📊 1 results
📏 Metrics: ACC, NMI

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 1 results
📏 Metrics: NMI, ACC

DBLP

The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …

📊 1 results
📏 Metrics: F1-Score

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 1 results
📏 Metrics: ACC, NMI

Composed Image Retrieval (CoIR)

CIRR

Composed Image Retrieval (or, Image Retrieval conditioned on Language Feedback) is a relatively new retrieval task, where an input query …

📊 1 results
📏 Metrics: R@1, R@5

Fashion IQ

Fashion IQ supports and advances research on interactive fashion image retrieval. Fashion IQ is the first fashion dataset to provide …

📊 1 results
📏 Metrics: (Recall@10+Recall@50)/2, R@10, R@50

Compressive Sensing

Set5

The Set5 dataset is a dataset consisting of 5 images (“baby”, “bird”, “butterfly”, “head”, “woman”) commonly used for testing performance …

📊 1 results
📏 Metrics: Average PSNR

Computer Security

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Concept-based Classification

AwA2

Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute-based classification and zero-shot learning. …

📊 1 results
📏 Metrics: Task Accuracy (%), Concept Accuracy (%)

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 1 results
📏 Metrics: Task Accuracy (%), Concept Accuracy (%)

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 1 results
📏 Metrics: Task Accuracy (%), Concept Accuracy (%)

Conditional Image Generation

ArtBench-10 (32x32)

We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images …

📊 6 results
📏 Metrics: FID

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 25 results
📏 Metrics: FID, Inception score, Intra-FID

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 7 results
📏 Metrics: FID, Inception Score, Intra-FID

CelebAMask-HQ

CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following …

📊 1 results
📏 Metrics: FID, LPIPS, mIoU

ImageNet-LT

ImageNet Long-Tailed is a subset of the ImageNet dataset consisting of 115.8K images from 1000 categories, with at most 1280 images per …

📊 1 results
📏 Metrics: FID

Tiny ImageNet

Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has …

📊 1 results
📏 Metrics: FID, Intra-FID

Conditional Text Generation

Lipogram-e

This is a dataset of 3 English books which do not contain the letter "e" in them. This dataset includes …

📊 4 results
📏 Metrics: Ignored Constraint Error Rate

Constituency Parsing

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 22 results
📏 Metrics: F1 score

Contact Detection

BEHAVE

BEHAVE is a full body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits along …

📊 4 results
📏 Metrics: Precision, Recall

Continual Learning

20Newsgroup (10 tasks)

This dataset has 20 classes and each class has about 1000 documents. The data split for train/validation/test is 1600/200/200. We …

📊 6 results
📏 Metrics: F1 - macro

AIDS

AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral …

📊 1 results
📏 Metrics: 1:3 Accuracy

DSC (10 tasks)

A set of 10 DSC datasets (reviews of 10 products) to produce sequences of tasks. The products are Sports, Toys, …

📊 6 results
📏 Metrics: F1 - macro

F-CelebA (10 tasks)

F-CelebA - This dataset is adapted from federated learning. Federated learning is an emerging machine learning paradigm with an emphasis …

📊 6 results
📏 Metrics: Acc

MLT17

📊 1 results
📏 Metrics: Acc

Permuted MNIST

Permuted MNIST is an MNIST variant that consists of 70,000 images of handwritten digits from 0 to 9, where 60,000 …

📊 3 results
📏 Metrics: Average Accuracy, MLP Hidden Layers-width, Pretrained/Transfer Learning, BWT

Continual Pretraining

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: F1 - macro

SciERC

The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts …

📊 1 results
📏 Metrics: F1 (macro)

Continual Semantic Segmentation

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 1 results
📏 Metrics: mIoU

Continuous Affect Estimation

AMIGOS

We present a database for research on affect, personality traits and mood by means of neuro-physiological signals. Different to other …

📊 1 results
📏 Metrics: CCC (Arousal), CCC (Valence), PCC (Arousal), PCC (Valence)

Contrastive Learning

10,000 People - Human Pose Recognition Data

10,000 People - Human Pose Recognition Data. This dataset includes indoor and outdoor scenes. This dataset covers males and females. …

📊 1 results
📏 Metrics: 0..5sec

Conversational Question Answering

ConvFinQA

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …

📊 2 results
📏 Metrics: Execution Accuracy, Program Accuracy

Conversational Response Selection

Advising Corpus

Advising Corpus is a dataset based on an entirely new collection of dialogues in which university students are being advised …

📊 1 results
📏 Metrics: R@1, R@10, R@50

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 10 results
📏 Metrics: MAP, MRR, P@1, R10@1, R10@2, R10@5

E-commerce

We release E-commerce Dialogue Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 11 results
📏 Metrics: R10@1, R10@2, R10@5

RRS

| | Train | Validation | Test | Ranking Test | | --------- | ----- | ---------- | ------- | …

📊 4 results
📏 Metrics: MAP, MRR, P@1, R10@1, R10@2, R10@5

RRS Ranking Test

| | Train | Validation | Test | Ranking Test | | --------- | ----- | ---------- | ------- | …

📊 3 results
📏 Metrics: NDCG@3, NDCG@5

Ubuntu IRC

The Ubuntu IRC dataset is a valuable resource for research in natural language understanding and dialogue systems. …

📊 3 results
📏 Metrics: Accuracy

Conversational Web Navigation

WebLINX

WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad …

📊 17 results
📏 Metrics: Overall score, Intent Match, Element (IoU), Text (F1)

Core Promoter Detection

GUE

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. Contains seven tasks: promoter prediction, core …

📊 1 results
📏 Metrics: MCC

Core set discovery

Abalone

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the …

📊 1 results
📏 Metrics: F1(10-fold)

Electricity

Abstract: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 …

📊 1 results
📏 Metrics: F1(10-fold)

Letter

Letter Recognition Data Set is a handwritten digit dataset. The task is to identify each of a large number of …

📊 1 results
📏 Metrics: F1(10-fold)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: F1(10-fold)

Coreference Resolution

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 3 results
📏 Metrics: Avg. F1

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Avg F1

GAP

GAP is a graph processing benchmark suite with the goal of helping to standardize graph processing evaluations. Fewer differences between …

📊 4 results
📏 Metrics: Overall F1, Masculine F1 (M), Feminine F1 (F), Bias (F/M), F1

LitBank

LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the …

📊 2 results
📏 Metrics: Avg F1, F1

OntoGUM

OntoGUM is an OntoNotes-like coreference dataset converted from GUM, an English corpus covering 12 genres using deterministic rules.

📊 2 results
📏 Metrics: Avg F1

PreCo

A large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as …

📊 2 results
📏 Metrics: F1

Quizbowl

Consists of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity …

📊 1 results
📏 Metrics: F1

WikiCoref

WikiCoref is an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Source: …

📊 3 results
📏 Metrics: F1

Covid Variant Prediction

GUE

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. Contains seven tasks: promoter prediction, core …

📊 1 results
📏 Metrics: Avg F1

Croatian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Crop Yield Prediction

SICKLE

The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation …

📊 1 results
📏 Metrics: MAPE (%)

Cross-Domain Few-Shot

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 11 results
📏 Metrics: 5 shot

Places

The Places dataset is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene …

📊 8 results
📏 Metrics: 5 shot

Cross-Lingual Transfer

XCOPA

The Cross-lingual Choice of Plausible Alternatives (XCOPA) dataset is a benchmark to evaluate the ability of machine learning models to …

📊 6 results
📏 Metrics: Accuracy

Cross-Modal Retrieval

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five …

📊 1 results
📏 Metrics: Text-to-image Medr

ChEBI-20

Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 5 results
📏 Metrics: Hits@1, Hits@10, Mean Rank, Test MRR

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 23 results
📏 Metrics: Image-to-text R@1, Image-to-text R@5, Image-to-text R@10, Text-to-image R@1, Text-to-image R@5, Text-to-image R@10

MSCOCO

📊 1 results
📏 Metrics: Image-to-text R@1

RSICD

The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for remote sensing image captioning task. It contains more than …

📊 7 results
📏 Metrics: Mean Recall, Image-to-text R@1, text-to-image R@1

RSITMD

📊 7 results
📏 Metrics: Image-to-text R@1, Mean Recall, text-to-image R@1

Recipe1M+

Recipe1M+ is a dataset which contains one million structured cooking recipes with 13M associated images. Source: [Recipe1M+: A Dataset for …

📊 2 results
📏 Metrics: Image-to-text R@1, Text-to-image R@1

SoundingEarth

SoundingEarth consists of co-located aerial imagery and audio samples all around the world.

📊 2 results
📏 Metrics: Median Rank, Image-to-sound R@100, Sound-to-image R@100

Cultural Vocal Bursts Intensity Prediction

HUME-VB

The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train …

📊 1 results
📏 Metrics: Concordance correlation coefficient (CCC)

Czech Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Data Augmentation

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 5 results
📏 Metrics: Percentage error

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 17 results
📏 Metrics: Accuracy (%)

Data-free Knowledge Distillation

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 4 results
📏 Metrics: Accuracy

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 4 results
📏 Metrics: Exact Match

Data-to-Text Generation

AMR3.0

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University …

📊 1 results
📏 Metrics: BLEU

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 1 results
📏 Metrics: BLEU, METEOR, FactSpotter

E2E

End-to-End NLG Challenge (E2E) aims to assess whether recent end-to-end NLG systems can generate more complex output by learning from …

📊 2 results
📏 Metrics: METEOR

GenWiki

GenWiki is a large-scale dataset for knowledge graph-to-text (G2T) and text-to-knowledge graph (T2G) conversion. It is introduced in the paper …

📊 1 results
📏 Metrics: BLEU

MLB Dataset

A new dataset on the baseball domain. Source: Data-to-text Generation with Entity Modeling

📊 4 results
📏 Metrics: BLEU

RotoWire

This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores. Summaries taken from rotowire.com …

📊 5 results
📏 Metrics: BLEU

ToTTo

ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a …

📊 6 results
📏 Metrics: BLEU, PARENT, METEOR

ViGGO

The ViGGO corpus is a set of 6,900 meaning representation to natural language utterance pairs in the video game domain. …

📊 2 results
📏 Metrics: BLEU

WebNLG

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in …

📊 15 results
📏 Metrics: BLEU, METEOR, Number of parameters (M), FactSpotter, BLEU-4, ROUGE-L

WikiOFGraph

We introduce WikiOFGraph, a novel large-scale, domain-diverse dataset synthesized by LLMs, ensuring …

📊 1 results
📏 Metrics: BLEU

Wikipedia Person and Animal Dataset

This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions based on a Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

📊 1 results
📏 Metrics: BLEU

XAlign

An extensive, high-quality cross-lingual fact-to-text dataset in 11 languages: Assamese (as), Bengali (bn), …

📊 6 results
📏 Metrics: BLEU4, METEOR

De novo molecule generation from MS/MS spectrum

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: De novo …

📊 3 results
📏 Metrics: Top-1 Accuracy, Top-1 MCES, Top-1 Tanimoto, Top-10 Accuracy, Top-10 MCES, Top-10 Tanimoto

De novo molecule generation from MS/MS spectrum (bonus chemical formulae)

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: De novo …

📊 7 results
📏 Metrics: Top-1 Accuracy, Top-1 MCES, Top-1 Tanimoto, Top-10 Accuracy, Top-10 MCES, Top-10 Tanimoto

Deblurring

Beam-Splitter Deblurring (BSD)

Using the proposed beam-splitter acquisition system, we have collected a new real-world video deblurring dataset (BSD). We collected blurry/sharp video …

📊 4 results
📏 Metrics: PSNR

GoPro

The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 …

📊 52 results
📏 Metrics: PSNR, SSIM

HIDE

Consists of 8,422 blurry and sharp image pairs with 65,784 densely annotated FG human bounding boxes. Source: Human-Aware Motion Deblurring

📊 1 results
📏 Metrics: PSNR

MSU BASED

Qualitative dataset with real blurred videos, created using a beam-splitter setup in a lab environment.

📊 11 results
📏 Metrics: Subjective, SSIM, PSNR, VMAF, LPIPS, ERQAv2.0

REDS

The realistic and dynamic scenes (REDS) dataset was proposed in the NTIRE19 Challenge. The dataset is composed of 300 video …

📊 3 results
📏 Metrics: Average PSNR

RSBlur

The RSBlur dataset provides pairs of real and synthetic blurred images with ground truth sharp images. The dataset enables the …

📊 8 results
📏 Metrics: Average PSNR, SSIM

Decision Making

NASA C-MAPSS

Engine degradation simulation was carried out using C-MAPSS. Four different sets were simulated under different combinations of operational conditions and …

📊 1 results
📏 Metrics: Average Remaining Cycles

Deep Clustering

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NMI

USPS

USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 …

📊 1 results
📏 Metrics: NMI

DeepFake Detection

1

111

📊 1 results
📏 Metrics: 0L

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: AUC, Validation Accuracy

DFDC

The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos. The DFDC dataset …

📊 3 results
📏 Metrics: AUC, LogLoss

FaceForensics

FaceForensics is a video dataset consisting of more than 500,000 frames containing faces from 1004 videos that can be used …

📊 1 results
📏 Metrics: DF, FS, FSF, NT, Real, Total Accuracy

FaceForensics++

FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation …

📊 6 results
📏 Metrics: AUC, LogLoss

FakeAVCeleb

FakeAVCeleb is a novel Audio-Video Deepfake dataset that not only contains deepfake videos but respective synthesized cloned audios as well. …

📊 10 results
📏 Metrics: ROC AUC, AP, Accuracy (%)

LAV-DF

Localized Audio Visual DeepFake Dataset (LAV-DF). Paper: Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method …

📊 1 results
📏 Metrics: AUC

Defocus Blur Detection

EBD

A large-scale benchmark with 1605 high-resolution, well-annotated images, featuring more complex scenes and a wider range of DOF settings.

📊 1 results
📏 Metrics: MAE

Denoising

DIV2K

DIV2K is a popular single-image super-resolution dataset which contains 1,000 images with different scenes and is split into 800 for …

📊 1 results
📏 Metrics: Average PSNR (dB)

DND

Benchmarking Denoising Algorithms with Real Photographs. This dataset consists of 50 pairs of noisy and (nearly) noise-free images captured with …

📊 1 results
📏 Metrics: Average PSNR, SSIM (sRGB)

Darmstadt Noise Dataset

the dataset contains data about hydrogen storage in metal hydrides

📊 10 results
📏 Metrics: PSNR

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 1 results
📏 Metrics: Average

Dense Captioning

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 4 results
📏 Metrics: mAP

Dense Pixel Correspondence Estimation

ETH3D

ETH3D is a multi-view stereo benchmark / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground …

📊 2 results
📏 Metrics: AEPE (rate=3), AEPE (rate=5)

HPatches

HPatches is a recent dataset for local patch descriptor evaluation that consists of 116 sequences of 6 images with …

📊 8 results
📏 Metrics: Viewpoint I AEPE, Viewpoint II AEPE, Viewpoint III AEPE, Viewpoint IV AEPE, Viewpoint V AEPE, PCK-5px, PCK-1px, PCK-3px

TSS

A dataset of 400 image pairs.

📊 1 results
📏 Metrics: Average PCK@0.05

Dense Video Captioning

ActivityNet Captions

The ActivityNet Captions dataset is built on ActivityNet v1.3 which includes 20k YouTube untrimmed videos with 100k caption annotations. The …

📊 11 results
📏 Metrics: METEOR, BLEU-3, BLEU-4, CIDEr, SODA, DIV-1, DIV-2, RE-4, BLEU4, F1, Precision, Recall

ViTT

The ViTT dataset consists of human produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, …

📊 3 results
📏 Metrics: SODA, CIDEr, METEOR

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 6 results
📏 Metrics: CIDEr, METEOR, SODA, BLEU4, ROUGE-L, F1, Precision, Recall

Density Estimation

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 15 results
📏 Metrics: NLL (bits/dim), Log-likelihood (nats)

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 3 results
📏 Metrics: Negative ELBO, NLL, MMD-L2, COV-L2

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 6 results
📏 Metrics: NLL (bits/dim), Log-likelihood (nats), MMD-L2, COV-L2, NLL

Dependency Parsing

Chinese Treebank

📊 1 results
📏 Metrics: LAS, UAS

CoNLL-2009

The task builds on the CoNLL-2008 task and extends it to multiple languages. The core of the task is to …

📊 2 results
📏 Metrics: LAS, UAS

DaNE

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme. …

📊 1 results
📏 Metrics: LAS, UAS

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 19 results
📏 Metrics: LAS, UAS, POS

Tweebank

📊 2 results
📏 Metrics: Labelled Attachment Score, Unlabeled Attachment Score

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The …

📊 4 results
📏 Metrics: LAS, UAS, BLEX

Depth Completion

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: RMSE

Matterport3D

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 2 results
📏 Metrics: RMSE

PLAD

PLAD is a dataset where sparse depth is provided by line-based visual SLAM to verify StructMDC.

📊 1 results
📏 Metrics: MAE, RMSE

VOID

The dataset was collected using the Intel RealSense D435i camera, which was configured to produce synchronized accelerometer and gyroscope measurements …

📊 6 results
📏 Metrics: MAE, RMSE, iMAE, iRMSE

Depth Estimation

DCM

The DCM dataset is composed of 772 annotated images from 27 golden age comic books. We freely collected them from …

📊 3 results
📏 Metrics: Abs Rel, RMSE, RMSE log, Sq Rel

DIODE

Diode Dense Indoor/Outdoor DEpth (DIODE) is the first standard dataset for monocular depth estimation comprising diverse indoor and outdoor scenes …

📊 2 results
📏 Metrics: Delta < 1.25, Delta < 1.25^2, Delta < 1.25^3

Matterport3D

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 1 results
📏 Metrics: Abs Rel

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 2 results
📏 Metrics: RMSE, absolute relative error

Taskonomy

Taskonomy provides a large and high-quality dataset of varied indoor scenes. - Complete pixel-level geometric information via aligned meshes. - …

📊 1 results
📏 Metrics: L1 error

eBDtheque

The eBDtheque database is a selection of one hundred comic pages from America, Japan (manga) and Europe. Image source: http://ebdtheque.univ-lr.fr/database/

📊 3 results
📏 Metrics: Abs Rel, RMSE, RMSE log, Sq Rel

Description-guided molecule generation

TOMG-Bench

In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation …

📊 25 results
📏 Metrics: wAcc

Diabetes Prediction

Diabetes

What do the instances in this dataset represent? The instances represent hospitalized patient records diagnosed with diabetes. Are there recommended …

📊 1 results
📏 Metrics: Accuracy

Dialogue Evaluation

USR-PersonaChat

This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper, USR: An Unsupervised and Reference …

📊 4 results
📏 Metrics: Spearman Correlation, Pearson Correlation

USR-TopicalChat

This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper, USR: An Unsupervised and Reference …

📊 5 results
📏 Metrics: Spearman Correlation, Pearson Correlation

Dialogue Generation

FusedChat

FusedChat is an inter-mode dialogue dataset. It contains dialogue sessions fusing task-oriented dialogues (TOD) and open-domain dialogues (ODD). Based on …

📊 2 results
📏 Metrics: Slot Accuracy, Joint SA, Inform, Inform_mct, Success, Success_mct, BLEU, PPL, Sensibleness, Specificity, SSA

Harry Potter Dialogue Dataset

Harry Potter Dialogue is the first dialogue dataset that integrates with scene, attributes and relations which are dynamically changed as …

📊 2 results
📏 Metrics: mauve

PG-19

A new open-vocabulary language modelling benchmark derived from books. Source: Compressive Transformers for Long-Range Sequence Modelling

📊 1 results
📏 Metrics: Perplexity

Dialogue Rewriting

CANARD

CANARD is a dataset for question-in-context rewriting that consists of questions each given in a dialog context together with a …

📊 1 results
📏 Metrics: BLEU

Dimensionality Reduction

EMNIST

EMNIST (extended MNIST) has 4 times more data than MNIST. It is a set of handwritten digits with a 28 …

📊 2 results
📏 Metrics: Classification Accuracy

Directional Hearing

VCTK

This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …

📊 1 results
📏 Metrics: SI-SDRi

Discourse Parsing

Instructional-DT (Instr-DT)

This discourse treebank includes annotated instructional texts originally assembled at the Information Technology Research Institute, University of Brighton. This dataset …

📊 12 results
📏 Metrics: Standard Parseval (Nuclearity), Standard Parseval (Span), Standard Parseval (Full), Standard Parseval (Relation)

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …

📊 4 results
📏 Metrics: Link & Rel F1, Link F1

RST-DT

The Rhetorical Structure Theory (RST) Discourse Treebank consists of 385 Wall Street Journal articles from the Penn Treebank annotated with …

📊 27 results
📏 Metrics: Standard Parseval (Full), Standard Parseval (Span), Standard Parseval (Nuclearity), Standard Parseval (Relation), RST-Parseval (Full), RST-Parseval (Span), RST-Parseval (Nuclearity), RST-Parseval (Relation)

Distance regression

CHILI-100K

The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined …

📊 8 results
📏 Metrics: MSE

CHILI-3K

The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from …

📊 8 results
📏 Metrics: MSE

Document AI

EPHOIE

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE …

📊 1 results
📏 Metrics: Average F1

Document Classification

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 6 results
📏 Metrics: Accuracy

HOC

The Hallmarks of Cancer (HOC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks …

📊 5 results
📏 Metrics: F1, Micro F1

Hyperpartisan News Detection

Hyperpartisan News Detection was a dataset created for PAN @ SemEval 2019 Task 4. Given a news article text, decide …

📊 1 results
📏 Metrics: Accuracy

LUN

LUN is used for unreliable news source classification. The dataset includes 17,250 articles from satire, propaganda, and hoaxes.

📊 1 results
📏 Metrics: Accuracy

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 6 results
📏 Metrics: Accuracy, F1

Document Image Classification

RVL-CDIP

The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. …

📊 29 results
📏 Metrics: Accuracy, Parameters

Tobacco-3482

The Tobacco-3482 dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The …

📊 9 results
📏 Metrics: Accuracy, Memory

Document Layout Analysis

D4LA

The D4LA dataset is a diverse benchmark for document layout analysis (DLA) derived from the RVL-CDIP dataset. It focuses on …

📊 3 results
📏 Metrics: mAP, Model Parameters

RVL-CDIP

The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. …

📊 1 results
📏 Metrics: FAR, WAR

U-DIADS-Bib

U-DIADS-Bib is a proprietary dataset developed through the collaboration of computer scientists and humanities at the University of Udine. It …

📊 1 results
📏 Metrics: Class Average IoU, Class Average IoU (Few-shot setting)

Document Ranking

DaReCzech

DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated …

📊 3 results
📏 Metrics: P@10

Document Summarization

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: ROUGE-1

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers. Source: [A Discourse-Aware Attention Model for Abstractive Summarization of …

📊 1 results
📏 Metrics: ROUGE-2

Document Text Classification

Tobacco-3482

The Tobacco-3482 dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The …

📊 3 results
📏 Metrics: Accuracy, Training time (hours)

Document-level Relation Extraction

Bc8

Bc8BioRED is built upon BioRED 2022 with the addition of directionality annotations. The training and development sets from the original …

📊 1 results
📏 Metrics: Evaluation Macro F1

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 1 results
📏 Metrics: F1

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Relation F1

Re-DocRED

The Re-DocRED Dataset resolved the following problems of DocRED: 1. Resolved the incompleteness problem by supplementing large amounts of relation …

📊 1 results
📏 Metrics: F1

Domain Adaptation

Canon RAW Low Light

The goal of this project is to present two new datasets that seek to expand the capability of the Learning …

📊 1 results
📏 Metrics: PSNR, SSIM

Comic2k

Comic2k is a dataset used for cross-domain object detection which contains 2k comic images with image and instance-level annotations. Image …

📊 2 results
📏 Metrics: mAP

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 4 results
📏 Metrics: Accuracy

Foggy Cityscapes

Foggy Cityscapes is a synthetic foggy dataset which simulates fog on real scenes. Each foggy image is rendered with a …

📊 1 results
📏 Metrics: mAP

ImageCLEF-DA

The ImageCLEF-DA dataset is a benchmark dataset for ImageCLEF 2014 domain adaptation challenge, which contains three domains: Caltech-256 (C), ImageNet …

📊 16 results
📏 Metrics: Accuracy

LeukemiaAttri

The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological …

📊 1 results
📏 Metrics: mAP

MSDA

Covers 5 domains: synthetic domain, document domain, street view domain, handwritten domain, and car license domain; over five million …

📊 1 results
📏 Metrics: Average Accuracy

Nikon RAW Low Light

Dataset release for the BMVC 2021 Paper "Few-Shot Domain Adaptation for Low Light RAW Image Enhancement" Abstract: Enhancing practical low …

📊 1 results
📏 Metrics: PSNR, SSIM

Office-31

The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …

📊 38 results
📏 Metrics: Average Accuracy

Office-Caltech-10

Office-Caltech-10 is a standard benchmark for domain adaptation, which consists of Office 10 and Caltech 10 datasets. It contains the 10 …

📊 1 results
📏 Metrics: Accuracy (%)

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 27 results
📏 Metrics: Accuracy

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 1 results
📏 Metrics: Accuracy

Sim10k

SIM10k is a synthetic dataset containing 10,000 images, which is rendered from the video game Grand Theft Auto V (GTA5). …

📊 1 results
📏 Metrics: mAP

Domain Generalization

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 34 results
📏 Metrics: Average Accuracy

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 39 results
📏 Metrics: Top-1 accuracy %, Number of params

ImageNet-C

ImageNet-C is an open source data set that consists of algorithmically generated corruptions (blur, noise) applied to the ImageNet test-set. …

📊 42 results
📏 Metrics: mean Corruption Error (mCE), Top 1 Accuracy, Number of params

ImageNet-R

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and …

📊 39 results
📏 Metrics: Top-1 Error Rate

ImageNet-Sketch

ImageNet-Sketch data set consists of 50,889 images, approximately 50 images for each of the 1000 ImageNet classes. The data set …

📊 20 results
📏 Metrics: Top-1 accuracy

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 41 results
📏 Metrics: Average Accuracy

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 118 results
📏 Metrics: Average Accuracy

VLCS

VLCS is a dataset to test for domain generalization.

📊 35 results
📏 Metrics: Average Accuracy

VizWiz-Classification

Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform …

📊 86 results
📏 Metrics: Accuracy - All Images, Accuracy - Corrupted Images, Accuracy - Clean Images

WildDash

WildDash is a benchmark evaluation method that uses meta-information to calculate the robustness of a given algorithm …

📊 1 results
📏 Metrics: Mean IoU

Downbeat Tracking

ASAP

ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical …

📊 1 results
📏 Metrics: F1

Ballroom

This data set includes beat and bar annotations of the ballroom dataset, introduced by Gouyon et al. [1]. [1] Gouyon …

📊 1 results
📏 Metrics: F1

Beatles

This dataset includes the beat and downbeat annotations for Beatles albums. The annotations are provided by M. E. P. Davies …

📊 1 results
📏 Metrics: F1

Candombe

35 recordings of Candombe music with beat and downbeat annotations.

📊 1 results
📏 Metrics: F1

Filosax

48 multitrack jazz recordings with many annotations.

📊 1 results
📏 Metrics: F1

GTZAN

The gtzan8 audio dataset contains 1000 tracks of 30 second length. There are 10 genres, each containing 100 tracks which …

📊 1 results
📏 Metrics: F1

Groove

The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive …

📊 1 results
📏 Metrics: F1

GuitarSet

GuitarSet is a dataset of high-quality guitar recordings and rich annotations. It contains 360 excerpts 30 seconds in length. The …

📊 1 results
📏 Metrics: F1

HJDB

J. Hockman, M. E. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and …

📊 1 results
📏 Metrics: F1

Hainsworth

S. W. Hainsworth and M. D. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP Journal on Advances in Signal …

📊 1 results
📏 Metrics: F1

Harmonix

Beats, downbeats, and functional structural annotations for 912 Pop tracks. Nieto, O., McCallum, M., Davies, M., Robertson, A., Stark, A., …

📊 1 results
📏 Metrics: F1

JAAH

Eremenko, E. Demirel, B. Bozkurt, and X. Serra, “Audio-aligned jazz harmony dataset for automatic chord transcription and corpus-based research,” in …

📊 1 results
📏 Metrics: F1

TapCorrect

J. Driedger, H. Schreiber, W. B. de Haas, and M. Müller, “Towards automatically correcting tapped beat annotations for music recordings.” …

📊 1 results
📏 Metrics: F1

Drug Discovery

DAVIS-DTA

Dataset Description: The interaction of 72 kinase inhibitors with 442 kinases covering >80% of the human catalytic protein kinome. Task …

📊 3 results
📏 Metrics: CI, MSE

KIBA

Dataset Description: Toward making use of the complementary information captured by the various bioactivity types, including IC50, K(i), and K(d), …

📊 3 results
📏 Metrics: CI, MSE

LIT-PCBA(ALDH1)

Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations …

📊 1 results
📏 Metrics: AUC

LIT-PCBA(KAT2A)

Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations …

📊 1 results
📏 Metrics: AUC

LIT-PCBA(MAPK1)

Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations …

📊 1 results
📏 Metrics: AUC

MUV

The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …

📊 4 results
📏 Metrics: AUC

PCBA

PCBA dataset 11 is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening …

📊 2 results
📏 Metrics: AUC

QED

QED is a linguistically principled framework for explanations in question answering. Given a question and a passage, QED represents an …

📊 1 results
📏 Metrics: Diversity, Success

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 10 results
📏 Metrics: Error ratio

SIDER

SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …

📊 4 results
📏 Metrics: AUC

Tox21

The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …

📊 10 results
📏 Metrics: AUC

clintox

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …

📊 3 results
📏 Metrics: AUC

Dynamic Reconstruction

iPhone (Monocular Dynamic View Synthesis)

The iPhone dataset is a challenging benchmark for dynamic reconstruction. This dataset consists of a collection of videos with realistic scenes …

📊 7 results
📏 Metrics: LPIPS

ECG Classification

PTB-XL

Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms as …

📊 1 results
📏 Metrics: AUROC

UCR Time Series Classification Archive

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining …

📊 1 results
📏 Metrics: Accuracy (Test)

ECG Digitization

ECG-Image-Database

The George B. Moody PhysioNet Challenges are annual competitions that invite participants to develop automated approaches for addressing important physiological …

📊 1 results
📏 Metrics: SNR

Econometrics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Edge Detection

BIPED

It contains 250 outdoor images of 1280×720 pixels each. These images have been carefully annotated by experts on the …

📊 4 results
📏 Metrics: ODS, Number of parameters (M)

BRIND

BRIND, short for BSDS-RIND, is the first public benchmark dedicated to simultaneously studying the four edge …

📊 2 results
📏 Metrics: ODS, Number of parameters (M)

BSDS500

Berkeley Segmentation Data Set 500 (BSDS500) is a standard benchmark for contour detection. This dataset is designed for evaluating natural …

📊 1 results
📏 Metrics: F1

CID

The CID (Campus Image Dataset) is a dataset captured in low-light environments with the help of Android programming. Its basic …

📊 1 results
📏 Metrics: ODS

MDBD

In order to study the interaction of several early visual cues (luminance, color, stereo, motion) during boundary detection in challenging …

📊 5 results
📏 Metrics: ODS, Number of parameters (M)

SBD

The Semantic Boundaries Dataset (SBD) is a dataset for predicting pixels on the boundary of the object (as opposed to …

📊 2 results
📏 Metrics: Maximum F-measure

UDED

This dataset is a collection of 1, 2, or 3 images from: BIPED, BSDS500, BSDS300, DIV2K, WIRE-FRAME, CID, CITYSCAPES, ADE20K, …

📊 3 results
📏 Metrics: ODS

EditCompletion

C# EditCompletion

We scraped the 53 most popular C# repositories from GitHub and extracted all commits since the beginning of the project’s …

📊 7 results
📏 Metrics: Accuracy

EEG Decoding

CWL EEG/fMRI Dataset

EEG/fMRI data from 8 subjects performing a simple eyes-open/eyes-closed task is provided on this webpage. The EEG/fMRI data …

📊 1 results
📏 Metrics: Pearson Correlation

Emotion Classification

CAER-Dynamic

13,201 clips from 79 TV shows. Each video clip was manually annotated with six emotion categories, including “anger”, “disgust”, “fear”, …

📊 1 results
📏 Metrics: Accuracy

CMU-MOSEI

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence-level sentiment analysis and emotion recognition in …

📊 4 results
📏 Metrics: Accuracy, Weighted Accuracy

MFA

The MFA (Many Faces of Anger) dataset includes 200 in-the-wild videos from North American and Persian cultures with fine-grained labels …

📊 2 results
📏 Metrics: F-F1 score (Comb.), F-F1 score (Persian), V-F1 score (Comb.), V-F1 score (NA), F-F1 score (NA), V-F1 score (Persian)

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

📊 2 results
📏 Metrics: F1

Emotion Interpretation

EIBench

For Emotion Interpretation task

📊 13 results
📏 Metrics: Recall

Emotion Recognition

EMOTIC

The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …

📊 2 results
📏 Metrics: Top-3 Accuracy (%)

Emomusic

1000 songs have been selected from the Free Music Archive (FMA). The excerpts which were annotated are available in the same …

📊 5 results
📏 Metrics: EmoA, EmoV

FER2013

FER2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 1 results
📏 Metrics: 5-class test accuracy

MSP-Podcast

The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …

📊 1 results
📏 Metrics: Concordance correlation coefficient (CCC)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 2 results
📏 Metrics: Accuracy, WAR

SEED

The SEED dataset contains subjects' EEG signals recorded while they were watching film clips. The film clips are carefully selected so …

📊 1 results
📏 Metrics: Accuracy

Emotional Intelligence

EQ-Bench

This dataset contains benchmark scores for EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language …

📊 24 results
📏 Metrics: EQ-Bench Score

Ensemble Learning

SMS Spam Collection Data Set

This corpus has been collected from free or free-for-research sources on the Internet: - A collection of 425 …

📊 1 results
📏 Metrics: Accuracy

Entity Alignment

DBP1M FR-EN

A large-scale cross-lingual dataset for entity alignment

📊 2 results
📏 Metrics: Hit@1

DBP2.0 zh-en

The DBP2.0 dataset can be downloaded from the figshare repository. It has three entity alignment settings, i.e., ZH-EN, JA-EN and …

📊 2 results
📏 Metrics: dangling entity detection F1, Entity Alignment (Consolidated) F1

Entity Disambiguation

AQUAINT

The AQUAINT Corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic …

📊 5 results
📏 Metrics: Micro-F1

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Avg F1

Mewsli-9

A large new multilingual dataset for multilingual entity linking. Source: Entity Linking in 100 Languages

📊 2 results
📏 Metrics: Micro Precision

Entity Linking

AIDA/testc

AIDA/testc is a new challenging test set for entity linking systems containing 131 Reuters news articles published between December 5th …

📊 2 results
📏 Metrics: Micro-F1 strong

EC-FUNSD

EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark of semantic entity recognition (SER) and entity linking (EL), designed for the …

📊 8 results
📏 Metrics: F1

FIGER

The FIGER dataset is an entity recognition dataset where entities are labelled using a fine-grained system of 112 tags, such as person/doctor, …

📊 1 results
📏 Metrics: Accuracy, Macro F1, Micro F1

FUNSD

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary …

📊 6 results
📏 Metrics: F1

GUM

GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include: * Multiple …

📊 1 results
📏 Metrics: F1

MedMentions

MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical …

📊 1 results
📏 Metrics: Accuracy, Recall@64

REBEL

Wikipedia abstracts automatically annotated with WikiData entities and relations that are entailed by the text. Over 9 million triplets.

📊 1 results
📏 Metrics: Micro-F1

Rare Diseases Mentions in MIMIC-III

The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv. The data …

📊 2 results
📏 Metrics: F1

WiC-TSV

WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense …

📊 6 results
📏 Metrics: Task 1 Accuracy: all, Task 1 Accuracy: general purpose, Task 1 Accuracy: domain specific, Task 2 Accuracy: all, Task 2 Accuracy: general purpose, Task 2 Accuracy: domain specific, Task 3 Accuracy: all, Task 3 Accuracy: general purpose, Task 3 Accuracy: domain specific

Entity Resolution

Abt-Buy

The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from …

📊 9 results
📏 Metrics: F1 (%)

Amazon-Google

The Amazon-Google dataset for entity resolution derives from the online retailers Amazon.com and the product search service of Google accessible …

📊 12 results
📏 Metrics: F1 (%)

WDC Products

WDC Products is an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three …

📊 1 results
📏 Metrics: F1 (%)

Entity Typing

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Avg F1

FIGER

The FIGER dataset is an entity recognition dataset where entities are labelled using a fine-grained system of 112 tags, such as person/doctor, …

📊 1 results
📏 Metrics: Macro F1, Micro F1

Open Entity

The Open Entity dataset is a collection of about 6,000 sentences with fine-grained entity types annotations. The entity types are …

📊 13 results
📏 Metrics: F1

Environmental Sound Classification

ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. …

📊 1 results
📏 Metrics: Accuracy

FSD50K

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally …

📊 1 results
📏 Metrics: mAP

UrbanSound8K

Urban Sound 8K is an audio dataset that contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: …

📊 3 results
📏 Metrics: Accuracy

Epigenetic Marks Prediction

GUE

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. Contains seven tasks: promoter prediction, core …

📊 1 results
📏 Metrics: MCC

Epilepsy Prediction

Epilepsy seizure prediction

The original dataset from the reference consists of 5 different folders, each with 100 files, with each file representing a …

📊 1 results
📏 Metrics: 1:1 Accuracy

Error Understanding

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of …

📊 4 results
📏 Metrics: Average highest confidence (ResNet-101), Insertion AUC score (ResNet-101), Average highest confidence (MobileNetV2), Insertion AUC score (MobileNetV2), Average highest confidence (EfficientNetV2-M), Insertion AUC score (EfficientNetV2-M)

Ethics

Ethics (per ethics)

Ethics (per ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to …

📊 4 results
📏 Metrics: Accuracy

Event Extraction

GENIA

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. …

📊 1 results
📏 Metrics: F1

Event data classification

CIFAR10-DVS

CIFAR10-DVS is an event-stream dataset for object classification. 10,000 frame-based images that come from CIFAR-10 dataset are converted into 10,000 …

📊 4 results
📏 Metrics: Accuracy

DVS128 Gesture

Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions. Source: [A Low Power, Fully Event-Based Gesture Recognition …

📊 1 results
📏 Metrics: Accuracy (%)

N-Caltech 101

The Neuromorphic-Caltech101 (N-Caltech101) dataset is a spiking version of the original frame-based Caltech101 dataset. The original dataset contained both a …

📊 1 results
📏 Metrics: Accuracy (%)

Event-based Object Segmentation

DDD17-SEG

Based on the DDD17 dataset, we select some image-event pairs to evaluate the segmentation performance, namely DDD17-SEG, which only serves …

📊 1 results
📏 Metrics: mIoU

DSEC-SEG

Based on the DSEC dataset, we select some image-event pairs to evaluate the segmentation performance, namely DSEC-SEG, which only serves …

📊 1 results
📏 Metrics: mIoU

MVSEC-SEG

Based on the MVSEC dataset, we select some image-event pairs to evaluate the segmentation performance, namely MVSEC-SEG, which only serves …

📊 7 results
📏 Metrics: mIoU

RGBE-SEG

To perform universal event stream segmentation, we collected a large-scale RGB-Event dataset for event-centric segmentation, from currently available pixel-level aligned …

📊 7 results
📏 Metrics: mIoU

Explainable Artificial Intelligence (XAI)

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

📊 1 results
📏 Metrics: AD-Related Brain Areas Identified

Explanation Generation

CLEVR-X

CLEVR-X is a dataset that extends the CLEVR dataset with natural language explanations in the context of VQA. It consists …

📊 2 results
📏 Metrics: B4, M, RL, C, Acc

VCR

Visual Commonsense Reasoning (VCR) is a large-scale dataset for cognition-level visual understanding. Given a challenging question about an image, machines …

📊 2 results
📏 Metrics: Human Explanation Rating

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 7 results
📏 Metrics: Human (%), Accuracy

e-SNLI-VE

e-SNLI-VE is a large VL (vision-language) dataset with NLEs (natural language explanations) with over 430k instances for which the explanations …

📊 2 results
📏 Metrics: Human Explanation Rating

Explanatory Visual Question Answering

GQA-REX

A GQA-based dataset with 1,040,830 multi-modal explanations of visual reasoning processes.

📊 4 results
📏 Metrics: BLEU-4, CIDEr, GQA-test, GQA-val, Grounding, METEOR, ROUGE-L, SPICE

Extractive Text Summarization

DebateSum

DebateSum consists of 187328 debate documents, arguments (also can be thought of as abstractive summaries, or queries), word-level extractive summaries, …

📊 3 results
📏 Metrics: ROUGE-L

GovReport

GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by …

📊 2 results
📏 Metrics: Avg. Test Rouge1, Avg. Test Rouge2, Avg. Test RougeLsum

Extreme Summarization

CiteSum

CiteSum is a large-scale scientific extreme summarization benchmark.

📊 9 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

TLDR9+

TLDR9+ is a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum. This dataset is …

📊 4 results
📏 Metrics: RG-1(%), RG-2(%), RG-L(%)

XSum

The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create …

📊 1 results
📏 Metrics: METEOR

Eyeblink detection

HUST-LEBW

An eyeblink detection in the wild dataset.

📊 1 results
📏 Metrics: Avg. F1

MPEblink

The pioneering eyeblink detection dataset is characterized by three key features: (1) Sample with multi-human instances. (2) Unconstrained in-the-wild scenarios. …

📊 1 results
📏 Metrics: Blink-AP50

Face Anonymization

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5749 identities with …

📊 1 results
📏 Metrics: ID retrieval, Temporal ID consistency, negated ID retrieval

Face Detection

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 1 results
📏 Metrics: mIoU

COCO-WholeBody

COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 2 results
📏 Metrics: AP, AP50, AP75, APL, APM

DCM

The DCM dataset is composed of 772 annotated images from 27 golden age comic books. We freely collected them from …

📊 2 results
📏 Metrics: Average Precision

FDDB

The Face Detection Dataset and Benchmark (FDDB) dataset is a collection of labeled faces from Faces in the Wild dataset. …

📊 11 results
📏 Metrics: AP, Accuracy

Manga109

Manga109 has been compiled by the Aizawa Yamasaki Matsui Laboratory, Department of Information and Communication Engineering, the Graduate School of …

📊 1 results
📏 Metrics: Average Precision

PASCAL Face

The PASCAL FACE dataset is a dataset for face detection and face recognition. It has a total of 851 images …

📊 6 results
📏 Metrics: AP

iCartoonFace

The iCartoonFace dataset is a large-scale dataset that can be used for two different tasks: cartoon face detection and cartoon …

📊 2 results
📏 Metrics: Average Precision

Face Quality Assessment

Color FERET

The color FERET database is a dataset for face recognition. It contains 11,338 color images of size 512×768 pixels captured …

📊 1 results
📏 Metrics: Pearson Correlation

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5749 identities with …

📊 1 results
📏 Metrics: Equal Error Rate

mebeblurf

📊 1 results
📏 Metrics: Equal Error Rate

Face Recognition

BTS3.1

Large, multimodal biometric dataset: It contains still images and videos of over 1,000 people captured at various ranges (up to …

📊 1 results
📏 Metrics: TAR @ FAR=0.01

CALFW

A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstrained face verification. Source: CALFW

📊 2 results
📏 Metrics: Accuracy

CASIA-WebFace+masks

The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical …

📊 5 results
📏 Metrics: Accuracy

CPLFW

A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstrained face verification. There are …

📊 1 results
📏 Metrics: Accuracy

CelebA+masks

The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical …

📊 5 results
📏 Metrics: Accuracy

Color FERET

The color FERET database is a dataset for face recognition. It contains 11,338 color images of size 512×768 pixels captured …

📊 4 results
📏 Metrics: FNMR [%] @ 10-3 FMR, 5-class test accuracy

IJB-B

The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …

📊 4 results
📏 Metrics: Rank-1, Rank-5, TAR @ FAR=0.0001, TAR @ FAR=1e-3, TAR @ FAR=1e-4, TAR @ FAR=1e-5

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5749 identities with …

📊 14 results
📏 Metrics: Accuracy, FNMR [%] @ 10-3 FMR, F1-score, Precision, Recall

MFR

During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to face recognition. Traditional …

📊 1 results
📏 Metrics: MFR-ALL, MFR-MASK, African, Caucasian, South Asian, East Asian

MLFW

The Masked LFW (MLFW), based on Cross-Age LFW (CALFW) database, is built using a simple but effective tool that generates …

📊 6 results
📏 Metrics: Accuracy

MORPH

MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 …

📊 3 results
📏 Metrics: FNMR [%] @ 10-3 FMR

XQLFW

An evaluation protocol for face verification focusing on a large intra-pair image quality difference. Real-world face recognition applications often deal …

📊 1 results
📏 Metrics: Accuracy

mebeblurf

📊 3 results
📏 Metrics: FNMR [%] @ 10-3 FMR

Face Verification

BTS3.1

Large, multimodal biometric dataset: It contains still images and videos of over 1,000 people captured at various ranges (up to …

📊 6 results
📏 Metrics: TAR @ FAR=0.01

CALFW

A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstrained face verification. Source: CALFW

📊 1 results
📏 Metrics: Accuracy

CK+

The Extended Cohn-Kanade (CK+) dataset contains 593 video sequences from a total of 123 different subjects, ranging from 18 to …

📊 1 results
📏 Metrics: Accuracy

CPLFW

A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstrained face verification. There are …

📊 1 results
📏 Metrics: Accuracy

IJB-A

The IARPA Janus Benchmark A (IJB-A) database was developed with the aim of adding more challenges to the face recognition …

📊 16 results
📏 Metrics: TAR @ FAR=0.01, TAR @ FAR=0.001, TAR @ FAR=0.1

IJB-B

The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …

📊 12 results
📏 Metrics: TAR @ FAR=0.01, TAR @ FAR=0.001, TAR@FAR=0.0001, TAR @ FAR=1e-5, TAR @ FAR=0.0001

IJB-C

The IJB-C dataset is a video-based face recognition dataset. It is an extension of the IJB-A dataset with about 138,000 …

📊 25 results
📏 Metrics: TAR @ FAR=1e-6, TAR @ FAR=1e-5, TAR @ FAR=1e-4, TAR @ FAR=1e-3, TAR @ FAR=1e-2, training dataset, model, Rank-1, Rank-5

IJB-S

We present the IJB-S dataset, an open-source IARPA Janus Surveillance Video Benchmark and associated protocols. The dataset consists of …

📊 2 results
📏 Metrics: Rank-1 (Video2Booking), Rank-1 (Video2Single), Rank-1 (Video2Video)

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5749 identities with …

📊 3 results
📏 Metrics: BFAR, BFRR, FRR@FAR(%)

MegaFace

MegaFace was a publicly available dataset which was used for evaluating the performance of face recognition algorithms with up to …

📊 10 results
📏 Metrics: Accuracy

Oulu-CASIA

The Oulu-CASIA NIR&VIS facial expression database consists of six expressions (surprise, happiness, sadness, anger, fear and disgust) from 80 people …

📊 1 results
📏 Metrics: Accuracy

Facial Expression Recognition

Aff-Wild2

Aff-Wild2 is a large-scale in-the-wild database and an extension of the Aff-Wild dataset for affect recognition. It approximately doubles the …

📊 1 results
📏 Metrics: Accuracy

AffectNet

AffectNet is a large facial expression dataset with around 0.4 million images manually labeled for the presence of eight (neutral, …

📊 1 results
📏 Metrics: Accuracy (7 emotion)

CMU-MOSEI

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence-level sentiment analysis and emotion recognition in …

📊 1 results
📏 Metrics: Weighted Accuracy

FER+

The FER+ dataset is an extension of the original FER dataset, where the images have been re-labelled into one of …

📊 1 results
📏 Metrics: Accuracy

FER2013

FER2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 3 results
📏 Metrics: Accuracy

MELD

Multimodal EmotionLines Dataset (MELD) has been created by enhancing and extending EmotionLines dataset. MELD contains the same dialogue instances available …

📊 1 results
📏 Metrics: Weighted Accuracy

RAF-DB

The Real-world Affective Faces Database (RAF-DB) is a dataset for facial expression. It contains 29672 facial images tagged with basic …

📊 1 results
📏 Metrics: Overall Accuracy

Facial Expression Recognition (FER)

Acted Facial Expressions In The Wild (AFEW)

Acted Facial Expressions In The Wild (AFEW) is a dynamic temporal facial expressions data corpus consisting of close to real …

📊 8 results
📏 Metrics: Accuracy(on validation set)

Aff-Wild2

Aff-Wild2 is a large-scale in-the-wild database and an extension of the Aff-Wild dataset for affect recognition. It approximately doubles the …

📊 1 results
📏 Metrics: Accuracy, UAR

AffectNet

AffectNet is a large facial expression dataset with around 0.4 million images manually labeled for the presence of eight (neutral, …

📊 36 results
📏 Metrics: Accuracy (8 emotion), Accuracy (7 emotion)

BP4D

The BP4D-Spontaneous dataset is a 3D video database of spontaneous facial expressions in a diverse group of young adults. Well-validated …

📊 2 results
📏 Metrics: ICC

CK+

The Extended Cohn-Kanade (CK+) dataset contains 593 video sequences from a total of 123 different subjects, ranging from 18 to …

📊 6 results
📏 Metrics: Accuracy (8 emotion), Accuracy (7 emotion), Accuracy (6 emotion)

DISFA

The Denver Intensity of Spontaneous Facial Action (DISFA) dataset consists of 27 videos of 4844 frames each, with 130,788 images …

📊 2 results
📏 Metrics: ICC

ExpW

The Expression in-the-Wild (ExpW) dataset is for facial expression recognition and contains 91,793 faces manually labeled with expressions. Each of …

📊 1 results
📏 Metrics: Accuracy

FER+

The FER+ dataset is an extension of the original FER dataset, where the images have been re-labelled into one of …

📊 11 results
📏 Metrics: Accuracy

FER2013

FER2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 9 results
📏 Metrics: Accuracy

FERG

FERG is a database of cartoon characters with annotated facial expressions containing 55,769 annotated face images of six characters. The …

📊 2 results
📏 Metrics: Accuracy

JAFFE

The JAFFE dataset consists of 213 images of different facial expressions from 10 different Japanese female subjects. Each subject was …

📊 3 results
📏 Metrics: Accuracy

MMI

The MMI Facial Expression Database consists of over 2900 videos and high-resolution still images of 75 subjects. It is fully …

📊 2 results
📏 Metrics: Accuracy

Oulu-CASIA

The Oulu-CASIA NIR&VIS facial expression database consists of six expressions (surprise, happiness, sadness, anger, fear and disgust) from 80 people …

📊 2 results
📏 Metrics: Accuracy (10-fold)

RAF-DB

The Real-world Affective Faces Database (RAF-DB) is a dataset for facial expression. It contains 29672 facial images tagged with basic …

📊 22 results
📏 Metrics: Overall Accuracy, Avg. Accuracy

RaFD

The Radboud Faces Database (RaFD) is a set of pictures of 67 models (both adults and children, males and females) …

📊 1 results
📏 Metrics: Accuracy

SFEW

The Static Facial Expressions in the Wild (SFEW) dataset is a dataset for facial expression recognition. It was created by …

📊 3 results
📏 Metrics: Accuracy

Facial Landmark Detection

300W

300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large …

📊 13 results
📏 Metrics: NME, Mean Error Rate

AFLW2000-3D

AFLW2000-3D is a dataset of 2000 images that have been annotated with image-level 68-point 3D facial landmarks. This dataset is …

📊 1 results
📏 Metrics: GTE

COCO-WholeBody

COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 2 results
📏 Metrics: keypoint AP

COFW

The Caltech Occluded Faces in the Wild (COFW) dataset is designed to present faces in real-world conditions. Faces show large …

📊 2 results
📏 Metrics: NME (inter-pupil), NME (inter-ocular), NME

CatFLW

The Cat Facial Landmarks in the Wild (CatFLW) dataset contains 2079 images of cats' faces in various environments and conditions, …

📊 3 results
📏 Metrics: NME

WFLW

The Wider Facial Landmarks in the Wild or WFLW database contains 10000 faces (7500 for training and 2500 for testing) …

📊 3 results
📏 Metrics: NME, NME (inter-ocular), AUC@10 (inter-ocular), FR@10 (inter-ocular)

Fact Checking

AVeriTeC

AVeriTeC (Automated Verification of Textual Claims) is a dataset of 4568 real-world claims covering fact-checks by 50 different organizations. Each …

📊 2 results
📏 Metrics: Question Only score, Question + Answer score, AveriTeC

Fact Selection

ArgSciChat

ArgSciChat is an argumentative dialogue dataset. It consists of 498 messages collected from 41 dialogues on 20 scientific papers. It …

📊 4 results
📏 Metrics: Fact-F1

Fact Verification

FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …

📊 7 results
📏 Metrics: Accuracy, FEVER

Factual Inconsistency Detection in Chart Captioning

CHOCOLATE

CHOCOLATE is a benchmark for detecting and correcting factual inconsistency in generated chart captions. It consists of captions produced by …

📊 1 results
📏 Metrics: Kendall's Tau-c

Fairness

DiveFace

A new face annotation dataset with balanced distribution between genders and ethnic origins. Source: [SensitiveNets: Learning Agnostic Representations with Application …

📊 1 results
📏 Metrics: Degree of Bias (DoB)

MORPH

MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 …

📊 1 results
📏 Metrics: Degree of Bias (DoB)

UTKFace

The UTKFace dataset is a large-scale face dataset with long age span (range from 0 to 116 years old). The …

📊 1 results
📏 Metrics: Degree of Bias (DoB)

Fake News Detection

COVID-19 Fake News Dataset

Along with the COVID-19 pandemic, we are also fighting an 'infodemic'. Fake news and rumors are rampant on social media. Believing …

📊 1 results
📏 Metrics: F1

FNC-1

FNC-1 was designed as a stance detection dataset and it contains 75,385 labeled headline and article pairs. The pairs are …

📊 9 results
📏 Metrics: Weighted Accuracy, Per-class Accuracy (Unrelated), Per-class Accuracy (Agree), Per-class Accuracy (Disagree), Per-class Accuracy (Discuss)

LIAR

LIAR is a publicly available dataset for fake news detection. A decade's worth of 12.8K manually labeled short statements were collected …

📊 4 results
📏 Metrics: Test Accuracy, Validation Accuracy

RAWFC

For RAWFC, we constructed it from scratch by collecting the claims from Snopes and relevant raw reports by retrieving claim …

📊 6 results
📏 Metrics: F1

Weibo NER

The Weibo NER dataset is a Chinese Named Entity Recognition dataset drawn from the social media website Sina Weibo. Source: …

📊 1 results
📏 Metrics: Accuracy

Fault Diagnosis

Digital twin-supported deep learning for fault diagnosis

This is a dataset used to test digital twin-supported deep learning for fault diagnosis: - A digital twin model for …

📊 2 results
📏 Metrics: Accuracy

Few-Shot Image Classification

Bongard-HOI

Bongard-HOI tests to what extent your few-shot visual learner can quickly induce the true HOI concept from a handful of …

📊 9 results
📏 Metrics: Avg. Accuracy

Meta-Dataset

The Meta-Dataset benchmark is a large few-shot learning benchmark and consists of multiple datasets of different data distributions. It does …

📊 22 results
📏 Metrics: Accuracy

Oxford 102 Flower

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are ones commonly …

📊 1 results
📏 Metrics: ACCURACY

UT Zappos50K

UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into …

📊 1 results
📏 Metrics: Top 1 Accuracy

Few-Shot Learning

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple-choice questions to identify the …

📊 1 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 4 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy, Harmonic mean

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Harmonic mean

Large COVID-19 CT scan slice dataset

"We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the …

📊 1 results
📏 Metrics: AUC-ROC, Accuracy, Macro F1, Macro Precision, Macro Recall, Micro Precision, Specificity

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 1 results
📏 Metrics: Acc

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 results
📏 Metrics: F1-score

MedConceptsQA

MedConceptsQA: Open Source Medical Concepts QA Benchmark. The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA

📊 12 results
📏 Metrics: Accuracy

MedNLI

The MedNLI dataset consists of sentence pairs developed by physicians from the Past Medical History section of MIMIC-III clinical …

📊 1 results
📏 Metrics: Accuracy

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 2 results
📏 Metrics: Accuracy

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 1 results
📏 Metrics: Harmonic mean

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 3 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 1 results
📏 Metrics: Harmonic mean

Few-Shot Object Detection

CAMO-FS

The CAMO-FS dataset accompanies the paper "The Art of Camouflage: Few-shot Learning for Animal Detection and Segmentation". DOI: https://doi.org/10.1109/ACCESS.2024.3432873 …

📊 26 results
📏 Metrics: box AP

Few-Shot Semantic Segmentation

FSS-1000

FSS-1000 is a 1000-class dataset for few-shot segmentation. The dataset contains a significant number of objects that have never been …

📊 3 results
📏 Metrics: Mean IoU

Few-Shot Text Classification

RAFT

The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. …

📊 9 results
📏 Metrics: Avg, ADE, B77, NIS, OSE, Over, SOT, SRI, TAI, ToS, TEH, TC

SST-5

The SST-5, also known as the Stanford Sentiment Treebank with 5 labels, is a dataset used for sentiment analysis. The …

📊 1 results
📏 Metrics: Accuracy

Fine-Grained Image Classification

Birdsnap

Birdsnap is a large bird dataset consisting of 49,829 images from 500 bird species with 47,386 images used for training …

📊 5 results
📏 Metrics: Accuracy

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization tasks. It contains 11,788 images of …

📊 23 results
📏 Metrics: Accuracy

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 18 results
📏 Metrics: Top-1 Error Rate, Accuracy

CompCars

The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains …

📊 5 results
📏 Metrics: Accuracy

DIB-10K

DIB-10K is a challenging image dataset with more than 10 thousand different types of birds. It was created to enable …

📊 1 results
📏 Metrics: Accuracy

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 12 results
📏 Metrics: Accuracy, FLOPS, PARAMS, Top 1 Accuracy

Herbarium 2021 Half–Earth

The Herbarium Half-Earth dataset is a large and diverse dataset of herbarium specimens to date for automatic taxon recognition. The …

📊 1 results
📏 Metrics: Test F1 score

Herbarium 2022

The Herbarium 2022: Flora of North America is a part of a project of the New York Botanical Garden funded …

📊 1 results
📏 Metrics: Test F1 score (private)

Kuzushiji-MNIST

Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images). Since MNIST restricts us to 10 classes, …

📊 1 results
📏 Metrics: Accuracy

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: Accuracy

NABirds

NABirds V1 is a collection of 48,000 annotated photographs of the 400 species of birds that are commonly observed in …

📊 20 results
📏 Metrics: Accuracy

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 13 results
📏 Metrics: Accuracy, Top-1 Error Rate, FLOPS, PARAMS

Oxford-IIIT Pets

The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large …

📊 19 results
📏 Metrics: Accuracy, Top-1 Error Rate, FLOPS, PARAMS

QMNIST

The exact pre-processing steps used to construct the MNIST dataset have long been lost. This leaves us with no reliable …

📊 1 results
📏 Metrics: Accuracy

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: Accuracy

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 5 results
📏 Metrics: Accuracy

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 69 results
📏 Metrics: Accuracy, FLOPS, PARAMS

Stanford Dogs

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into …

📊 18 results
📏 Metrics: Accuracy

iNaturalist

The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. Those categories belong to …

📊 1 results
📏 Metrics: Top 1 Accuracy

Flood Inundation Mapping

Coastal Inundation Maps with Floodwater Depth Values

This dataset provides simulated flood inundation maps of Abu Dhabi's coast under 174 different shoreline protection scenarios. The maps were …

📊 1 results
📏 Metrics: Average MAE, Zero detection rate

Font Recognition

AdobeVFR real

Subset of AdobeVFR. The dataset contains "real-world text images". > We collected 201,780 text images from various typography forums, where …

📊 2 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy, Top 5 Error Rate, Top-1 Error Rate

AdobeVFR syn

Subset of AdobeVFR. The dataset contains images depicting English text and consists of 1000 synthetic images for training and 100 …

📊 5 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy, Top-1 Error Rate, Top 5 Error Rate

Explor_all

Explor_all font image dataset https://drive.google.com/file/d/1P2DbNbVw4Q__WcV1YdzE7zsDKilmd3pO/view

📊 1 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy

Persian Font Recognition (PFR)

Persian Font Recognition (PFR) is a dataset for font recognition in the Persian language. This dataset is part …

📊 1 results
📏 Metrics: Top 5 Accuracy

Persian Text Image Segmentation (PTI SEG)

Persian Text Image Segmentation (PTI SEG): This dataset is part of a paper titled "Persis: A Persian Font Recognition Pipeline …

📊 1 results
📏 Metrics: IOU50

VFR-Wild

325 word images intended for font recognition, whose fonts are included in [VFR-447] (and [VFR-2420]). > (...) 325 real world …

📊 1 results
📏 Metrics: Top 1 Accuracy, Top 5 Error Rate, Top-1 Error Rate, Top 10 Accuracy, Top 5 Accuracy

Food Recommendation

Oktoberfest Food Dataset

A realistic, diverse, and challenging dataset for object detection on images. The data was recorded at a beer tent in …

📊 1 results
📏 Metrics: 10 fold Cross validation

Formation Energy

JARVIS-DFT

JARVIS-DFT is a repository of density functional theory based calculation data for materials.

📊 4 results
📏 Metrics: MAE

Materials Project

The Materials Project is a collection of chemical compounds labelled with different attributes. The labelling is performed by different simulations, …

📊 6 results
📏 Metrics: MAE

OQM9HK

This is a large-scale dataset of quantum-mechanically calculated properties (DFT level) of crystalline materials for graph representation learning that contains …

📊 1 results
📏 Metrics: MAE

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 11 results
📏 Metrics: MAE

Fovea Detection

ADAM

ADAM is organized as a half day Challenge, a Satellite Event of the ISBI 2020 conference in Iowa City, Iowa, …

📊 1 results
📏 Metrics: Euclidean Distance (ED)

IDRiD

The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a …

📊 1 results
📏 Metrics: Euclidean Distance (ED)

Fraud Detection

Amazon-Fraud

Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node …

📊 3 results
📏 Metrics: AUC-ROC, Averaged Precision, F1 Macro, G-mean

Elliptic Dataset

📊 6 results
📏 Metrics: AUC, AUPRC

Kaggle-Credit Card Fraud Dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …

📊 2 results
📏 Metrics: AUC, Accuracy, Average Precision

Yelp-Fraud

Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based …

📊 6 results
📏 Metrics: AUC-ROC, Averaged Precision, F1 Macro, G-mean

French Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Future Hand Prediction

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 1 results
📏 Metrics: Disp(Total), M.Disp(Left), C.Disp(Left), M.Disp(Right), C.Disp(Right)

GSM8K

GSM8K

GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. …

📊 3 results
📏 Metrics: Accuracy, 0-shot MRR

Gait Recognition

Gait3D

Gait3D is a large-scale 3D representation-based gait recognition dataset. It contains 4,000 subjects and over 25,000 sequences extracted from 39 …

📊 2 results
📏 Metrics: Rank-1, Rank-5, mAP, mINP

OUMVLP

The OU-ISIR Gait Database, Multi-View Large Population Dataset (OU-MVLP) is meant to aid research efforts in the general area of …

📊 6 results
📏 Metrics: Averaged rank-1 acc(%)

Gaze Estimation

ETH-XGaze

Consists of over one million high-resolution images of varying gaze under extreme head poses. The dataset is collected from 110 …

📊 1 results
📏 Metrics: Angular Error

Gaze360

Understanding where people are looking is an informative social cue. In this work, we present Gaze360, a large-scale gaze-tracking dataset …

📊 4 results
📏 Metrics: Angular Error

GazeCapture

From scientific research to commercial applications, eye tracking is an important tool across many domains. Despite its range of applications, …

📊 2 results
📏 Metrics: Euclidean Mean Error (EME), FPS

MPSGaze

This is a synthetic dataset containing full images (instead of only cropped faces) that provides ground truth 3D gaze directions …

📊 1 results
📏 Metrics: Angular Error

Gaze Target Estimation

GazeFollow

GazeFollow is a large-scale dataset annotated with the location of where people in images are looking. It uses several major …

📊 1 results
📏 Metrics: AUC, Average Distance

VideoAttentionTarget

A dataset with fully annotated attention targets in video for attention target estimation.

📊 1 results
📏 Metrics: AUC, AP, Average Distance

General Classification

CVR

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified …

📊 1 results
📏 Metrics: Test error

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: Accuracy

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: Accuracy

Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …

📊 1 results
📏 Metrics: Accuracy

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 2 results
📏 Metrics: Accuracy

General Knowledge

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Generalized Few-Shot Learning

AwA2

Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute-based classification and zero-shot learning. …

📊 6 results
📏 Metrics: Per-Class Accuracy (1-shot), Per-Class Accuracy (2-shots), Per-Class Accuracy (5-shots), Per-Class Accuracy (10-shots), Per-Class Accuracy (20-shots)

SUN

When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow …

📊 5 results
📏 Metrics: Per-Class Accuracy (1-shot), Per-Class Accuracy (2-shots), Per-Class Accuracy (5-shots), Per-Class Accuracy (10-shots)

Generalized Referring Expression Comprehension

gRefCOCO

gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset that contains multi-target, no-target, and single-target expressions.

📊 5 results
📏 Metrics: Precision@(F1=1, IoU≥0.5), N-acc.

Generative 3D Object Classification

Objaverse

Objaverse is a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse …

📊 7 results
📏 Metrics: Objaverse (Average), Objaverse (I), Objaverse (C)

Generative Visual Question Answering

PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset that contains 227k VQA pairs of 149k images that cover various modalities …

📊 3 results
📏 Metrics: BLEU-1

Geometric Matching

HPatches

The HPatches is a recent dataset for local patch descriptor evaluation that consists of 116 sequences of 6 images with …

📊 1 results
📏 Metrics: Average End-Point Error

GermEval2024 Shared Task 1 Subtask 1

GerMS-AT

This dataset contains 7984 user comments from an Austrian online newspaper. The comments have been annotated by 4 or more …

📊 1 results
📏 Metrics: Macro F1

GermEval2024 Shared Task 1 Subtask 2

GerMS-AT

This dataset contains 7984 user comments from an Austrian online newspaper. The comments have been annotated by 4 or more …

📊 1 results
📏 Metrics: Jensen-Shannon distance

Gesture Recognition

DVS128 Gesture

Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions. Source: [A Low Power, Fully Event-Based Gesture Recognition …

📊 12 results
📏 Metrics: Accuracy (%)

MSRC-12

The Microsoft Research Cambridge-12 Kinect gesture data set consists of sequences of human movements, represented as body-part locations, and the …

📊 1 results
📏 Metrics: Accuracy

Grammatical Error Correction

FCGEC

FCGEC is a fine-grained corpus to detect, identify and correct Chinese grammatical errors, collected mainly from multi-choice questions in …

📊 1 results
📏 Metrics: exact match, F0.5

JFLEG

JFLEG is for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language …

📊 6 results
📏 Metrics: GLEU

MuCGEC

MuCGEC is a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three …

📊 1 results
📏 Metrics: F0.5

UA-GEC

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

📊 3 results
📏 Metrics: F0.5

WI-LOCNESS

WI-LOCNESS is part of the Building Educational Applications 2019 Shared Task for Grammatical Error Correction. It consists of two datasets: …

📊 1 results
📏 Metrics: F0.5

Graph Classification

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

📊 1 results
📏 Metrics: Accuracy

AIDS

AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral …

📊 2 results
📏 Metrics: Accuracy, Inference Time (ms)

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: Accuracy

COLLAB

COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators …

📊 35 results
📏 Metrics: Accuracy, Accuracy (10-fold)

CSL

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …

📊 1 results
📏 Metrics: Acc

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …

📊 1 results
📏 Metrics: Accuracy

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 1 results
📏 Metrics: Accuracy

Digits

The DIGITS dataset consists of 1797 8×8 grayscale images (1439 for training and 360 for testing) of handwritten digits. Source: …

📊 1 results
📏 Metrics: Accuracy

ENZYMES

ENZYMES is a dataset of 600 protein tertiary structures obtained from the BRENDA enzyme database. The ENZYMES dataset contains 6 …

📊 49 results
📏 Metrics: Accuracy, Accuracy (10-fold)

HCP Aging

Lifespan HCP Release 2.0 includes cross-sectional visit 1 (V1) preprocessed structural and functional imaging data, unprocessed V1 imaging data for …

📊 1 results
📏 Metrics: Accuracy

IMDB-BINARY

IMDB-BINARY is a movie collaboration dataset that consists of the ego-networks of 1,000 actors/actresses who played roles in movies in …

📊 8 results
📏 Metrics: Accuracy, Accuracy (10-fold)

IMDB-MULTI

IMDB-MULTI is a relational dataset that consists of a network of 1000 actors or actresses who played roles in movies …

📊 1 results
📏 Metrics: Accuracy, Accuracy (10-fold)

IPC-grounded

📊 2 results
📏 Metrics: Accuracy

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 13 results
📏 Metrics: Accuracy

MUTAG

MUTAG is a collection of nitroaromatic compounds, and the goal is to predict their mutagenicity on Salmonella typhimurium. …

📊 62 results
📏 Metrics: Accuracy, Accuracy (10-fold), Mean Accuracy, Accuracy (10 fold)

MUV

The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …

📊 2 results
📏 Metrics: ROC-AUC

Mutagenicity

Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen. Source: [Hierarchical …

📊 5 results
📏 Metrics: Accuracy

NCI1

The NCI1 dataset comes from the cheminformatics domain, where each input graph is used as a representation of a chemical compound: …

📊 59 results
📏 Metrics: Accuracy, Accuracy (10-fold)

NCI109

TUDataset: A collection of benchmark datasets for learning with graphs

📊 33 results
📏 Metrics: Accuracy

OASIS

A dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. Source: [OASIS: …

📊 1 results
📏 Metrics: Accuracy

PROTEINS

PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids and two …

📊 88 results
📏 Metrics: Accuracy, Accuracy (10 fold), Inference Time (ms)

PTC

PTC is a collection of 344 chemical compounds represented as graphs which report the carcinogenicity for rats. There are 19 …

📊 34 results
📏 Metrics: Accuracy

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 1 results
📏 Metrics: Test Accuracy

REDDIT-12K

Reddit12k contains 11929 graphs each corresponding to an online discussion thread where nodes represent users, and an edge represents the …

📊 1 results
📏 Metrics: Accuracy (10 fold)

REDDIT-BINARY

REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an …

📊 9 results
📏 Metrics: Accuracy, Accuracy (10-fold)

SIDER

SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …

📊 2 results
📏 Metrics: ROC-AUC

Synthetic Dynamic Networks

This dataset accompanies the paper 'Learning the mechanisms of network growth' by the same authors. The dataset contains 6733 networks …

📊 3 results
📏 Metrics: Accuracy

Tox21

The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …

📊 3 results
📏 Metrics: ROC-AUC

UK Biobank Brain MRI

UK Biobank participants have generously provided a very wide range of information about their health and well-being since recruitment began …

📊 1 results
📏 Metrics: Accuracy

UPFD-GOS

The Gossipcop variant of the UPFD dataset for benchmarking. Please refer to the UPFD dataset for more details of the …

📊 8 results
📏 Metrics: Accuracy (%)

UPFD-POL

The PolitiFact variant of the UPFD dataset for benchmarking. Please refer to the UPFD dataset for more details of the …

📊 8 results
📏 Metrics: Accuracy (%)

Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …

📊 1 results
📏 Metrics: Accuracy

clintox

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …

📊 2 results
📏 Metrics: ROC-AUC

Graph Clustering

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …

📊 8 results
📏 Metrics: ACC, NMI, ARI, F1, Precision, F score

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 8 results
📏 Metrics: ACC, NMI, ARI, F1, Precision, F score

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 6 results
📏 Metrics: ACC, NMI, ARI, F score

Graph Matching

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 13 results
📏 Metrics: F1 score, matching accuracy

RARE

RARE consists of English AMR pairs with similarity scores that reflect the structural differences between them. Given that AMRs are …

📊 4 results
📏 Metrics: Spearman Correlation

SPair-71k

SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger …

📊 6 results
📏 Metrics: matching accuracy

Graph Property Prediction

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 5 results
📏 Metrics: Standardized MAE, logMAE, alpha (ma), gap (meV)

Graph Question Answering

GQA

The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced …

📊 1 results
📏 Metrics: Accuracy

Graph Ranking

ZINC

ZINC is a free database of commercially-available compounds for virtual screening. ZINC contains over 230 million purchasable compounds in ready-to-dock, …

📊 4 results
📏 Metrics: Kendall's Tau

Graph Regression

GlassTemp

The GlassTemp dataset is collected from Polyinfo. It uses monomers as polymer graphs to predict the property of glass transition …

📊 1 results
📏 Metrics: RMSE

PCQM4Mv2-LSC

PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. Based on the PubChemQC, we define a meaningful …

📊 20 results
📏 Metrics: Validation MAE, Test MAE

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 1 results
📏 Metrics: Inference Time (ms)

ZINC

ZINC is a free database of commercially-available compounds for virtual screening. ZINC contains over 230 million purchasable compounds in ready-to-dock, …

📊 25 results
📏 Metrics: MAE

Graph Representation Learning

COMA

CoMA contains 17,794 meshes of the human face in various expressions. Source: DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects

📊 1 results
📏 Metrics: Error (mm)

Graph-to-Sequence

WebNLG

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in …

📊 1 results
📏 Metrics: BLEU

HD Semantic Map Learning

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 2 results
📏 Metrics: Chamfer AP

Hand Pose Estimation

3DPW

The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. …

📊 1 results
📏 Metrics: MPJPE

COCO-WholeBody

COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 2 results
📏 Metrics: keypoint AP

Custom FINNgers

A dataset with 3200 images (200 for each number quantity on each hand).

📊 1 results
📏 Metrics: 1:1 Accuracy

ICVL

📊 1 results
📏 Metrics: Error (mm)

K2HPD

Includes 100K depth images under challenging scenarios. Source: Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

📊 1 results
📏 Metrics: PDJ@5mm

Handwriting Recognition

An extensive dataset of handwritten central Kurdish isolated characters

Data collection: Finding a suitable source of data is considered a first step toward building a database. The first step …

📊 1 results
📏 Metrics: 1:1 Accuracy

KOHTD

Kazakh offline Handwritten Text dataset (KOHTD) has 3000 handwritten exam papers and more than 140335 segmented images and there are …

📊 4 results
📏 Metrics: CER

Handwriting Verification

AND Dataset

The AND Dataset contains 13700 handwritten samples and 15 corresponding expert examined features for each sample. The dataset is released …

📊 1 results
📏 Metrics: Average F1

CEDAR Signature

CEDAR Signature is a database of off-line signatures for signature verification. Each of 55 individuals contributed 24 signatures thereby creating …

📊 2 results
📏 Metrics: FAR

Handwritten Mathematical Expression Recognition

CROHME 2014

📊 14 results
📏 Metrics: ExpRate

HME100K

Source: HME100K

📊 11 results
📏 Metrics: ExpRate

Handwritten Text Recognition

Belfort

The Belfort dataset includes minutes of Belfort municipal council drawn up between 1790 and 1946. Documents include …

📊 4 results
📏 Metrics: CER (%), WER (%)

Bentham

Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy …

📊 1 results
📏 Metrics: CER

Digital Peter

Digital Peter is a dataset of Peter the Great's manuscripts annotated for segmentation and text recognition. The dataset may be …

📊 1 results
📏 Metrics: CER

HKR

The database is written in Cyrillic and shares the same 33 characters. Besides these characters, the Kazakh alphabet also contains …

📊 1 results
📏 Metrics: CER

IAM

The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts those writers transcribed …

📊 16 results
📏 Metrics: CER, WER

IAM(line-level)

The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The texts those writers transcribed …

📊 5 results
📏 Metrics: Test CER, Test WER

LAM(line-level)

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main …

📊 6 results
📏 Metrics: Test CER, Test WER

READ 2016

This dataset arises from the READ project (Horizon 2020). The dataset consists of a subset of documents from the Ratsprotokolle …

📊 2 results
📏 Metrics: CER (%), WER (%)

READ2016(line-level)

This dataset arises from the READ project (Horizon 2020). The dataset consists of a subset of documents from the Ratsprotokolle …

📊 5 results
📏 Metrics: Test CER, Test WER

SIMARA

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids …

📊 1 results
📏 Metrics: CER (%), WER (%)

Saint Gall

Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of …

📊 1 results
📏 Metrics: CER

Hate Speech Detection

DKhate

A corpus of Offensive Language and Hate Speech Detection for Danish. This DKhate dataset contains 3600 comments from the web …

📊 1 results
📏 Metrics: F1

HatEval

Hate Speech is commonly defined as any communication that disparages a person or a group on the basis of some …

📊 2 results
📏 Metrics: Macro F1

HateMM

Hate speech has become one of the most significant issues in modern society, with implications in both the online and …

📊 2 results
📏 Metrics: TEST F1 (macro)

HateXplain

Covers multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly …

📊 11 results
📏 Metrics: AUROC, Macro F1, Accuracy, Macro-F1

OLID

The OLID is a hierarchical dataset to identify the type and the target of offensive texts in social media. The …

📊 1 results
📏 Metrics: Macro F1

SHAJ

This is an abusive/offensive language detection dataset for Albanian. The data is formatted following the OffensEval convention. Data is from …

📊 1 results
📏 Metrics: F1

ToLD-Br

The Toxic Language Detection for Brazilian Portuguese (ToLD-Br) is a dataset with tweets in Brazilian Portuguese annotated according to different …

📊 2 results
📏 Metrics: F1-score

Hierarchical Text Segmentation

HierText

HierText is the first dataset featuring hierarchical annotations of text in natural scenes and documents. The dataset contains 11639 images …

📊 1 results
📏 Metrics: F-score (average), F-score (stroke), F-score (word), F-score (text-line), F-score (para., layout)

High School European History

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Geography

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Government and Politics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Macroeconomics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Microeconomics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School Psychology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School US History

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

High School World History

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Highlight Detection

QVHighlights

The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language …

📊 20 results
📏 Metrics: mAP, Hit@1

TvSum

Introduced by Song et al. in TVSum: Summarizing web videos using titles. The TVSum dataset comprises 50 videos, with durations …

📊 7 results
📏 Metrics: mAP

Holdout Set

xView3-SAR

Unsustainable fishing practices worldwide pose a major threat to marine resources and ecosystems. Identifying vessels that do not show up …

📊 5 results
📏 Metrics: Aggregate xView3 Score

Hope Speech Detection

KanHope

KanHope is a code mixed hope speech dataset for equality, diversity, and inclusion in Kannada, an under-resourced Dravidian language. The …

📊 1 results
📏 Metrics: F1-score (Weighted)

Human Activity Recognition

HAR

The Human Activity Recognition Dataset has been collected from 30 subjects performing six different activities (Walking, Walking Upstairs, Walking Downstairs, …

📊 1 results
📏 Metrics: Accuracy, F1 Macro, Macro-F1

PAMAP2

The PAMAP2 Physical Activity Monitoring dataset contains data of 18 different physical activities (such as walking, cycling, playing soccer, etc.), …

📊 2 results
📏 Metrics: NMI, ARI, Accuracy, Macro F1

Human Aging

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Human Instance Segmentation

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses and instance masks. This dataset contains …

📊 14 results
📏 Metrics: AP

Human Interaction Recognition

EPIC-SOUNDS

EPIC-SOUNDS is a large scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of …

📊 1 results
📏 Metrics: Top-1 accuracy %

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 4 results
📏 Metrics: Accuracy (Cross-Subject), Accuracy (Cross-View)

NTU RGB+D 120

NTU RGB+D 120 is a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and …

📊 5 results
📏 Metrics: Accuracy (Cross-Setup), Accuracy (Cross-Subject)

SBU / SBU-Refine

SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities that are recorded using the Microsoft …

📊 2 results
📏 Metrics: Accuracy

UT-Interaction

The UT-Interaction dataset contains videos of continuous executions of 6 classes of human-human interactions: shake-hands, point, hug, push, kick and …

📊 1 results
📏 Metrics: Accuracy (Set 1), Accuracy (Set 2)

Human Mesh Recovery

BEDLAM

BEDLAM is a large-scale synthetic video dataset designed to train and test algorithms on the task of 3D human pose …

📊 3 results
📏 Metrics: PVE-All

Human Organs Senses Multiple Choice

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Human Parsing

4D-DRESS

4D-DRESS is the first real-world 4D dataset of human clothing, capturing 64 human outfits in more than 520 motion sequences. …

📊 6 results
📏 Metrics: mAcc, mIoU

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results
📏 Metrics: mIoU

Human Part Segmentation

CIHP

The Crowd Instance-level Human Parsing (CIHP) dataset has 38,280 diverse human images. Each image in CIHP is labeled with pixel-wise …

📊 6 results
📏 Metrics: Mean IoU

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 3 results
📏 Metrics: mIoU

PASCAL-Part

PASCAL-Part is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL object detection task …

📊 7 results
📏 Metrics: mIoU

Human Sexuality

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Human action generation

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 5 results
📏 Metrics: MMDa, MMDs

HumanAct12

HumanAct12 is a new 3D human motion dataset adopted from the polar image and 3D pose dataset PHSPD, with proper …

📊 1 results
📏 Metrics: Accuracy, Diversity, FID, Multimodality

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 2 results
📏 Metrics: FID (CS), FID (CV)

NTU RGB+D 120

NTU RGB+D 120 is a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and …

📊 2 results
📏 Metrics: FID (CS), FID (CV)

NTU RGB+D 2D

NTU RGB+D 2D is a curated version of NTU RGB+D often used for skeleton-based action prediction and synthesis. It contains …

📊 5 results
📏 Metrics: MMDa (CS), MMDs (CS), MMDa (CV), MMDs (CV)

UESTC RGB-D

UESTC RGB-D Varying-view action database contains 40 categories of aerobic exercise. We utilized 2 Kinect V2 cameras in 8 fixed …

📊 1 results
📏 Metrics: Accuracy, Diversity, FID, Test

Human-Object Interaction Anticipation

VidHOI

VidHOI is a video-based human-object interaction detection benchmark. VidHOI is based on VidOR which is densely annotated with all humans …

📊 3 results
📏 Metrics: Person-wise Top5: t=1(mAP@0.5), Person-wise Top5: t=3(mAP@0.5), Person-wise Top5: t=5(mAP@0.5)

Human-Object Interaction Concept Discovery

HICO-DET

HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and …

📊 3 results
📏 Metrics: Unknown (AP)

Human-Object Interaction Detection

HICO

HICO is a benchmark for recognizing human-object interactions (HOI). Key features: - A diverse set of interactions with common object …

📊 8 results
📏 Metrics: mAP

HICO-DET

HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and …

📊 54 results
📏 Metrics: mAP, Time Per Frame (ms), Detection: Full (mAP@0.5), Detection: Non-Rare (mAP@0.5), Detection: Rare (mAP@0.5)

MECCANO

The MECCANO dataset is the first dataset of egocentric videos to study human-object interactions in industrial-like settings. The MECCANO dataset …

📊 1 results
📏 Metrics: mAP@0.5 role

V-COCO

Verbs in COCO (V-COCO) is a dataset that builds off COCO for human-object interaction detection. V-COCO provides 10,346 images (2,533 …

📊 34 results
📏 Metrics: AP(S1), AP(S2), Time Per Frame(ms), MAP

VidHOI

VidHOI is a video-based human-object interaction detection benchmark. VidHOI is based on VidOR which is densely annotated with all humans …

📊 3 results
📏 Metrics: Detection: Full (mAP@0.5), Detection: Non-Rare (mAP@0.5), Detection: Rare (mAP@0.5), Oracle: Full (mAP@0.5), Oracle: Non-Rare (mAP@0.5), Oracle: Rare (mAP@0.5)

Hungarian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Identify Odd Metaphor

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Image Attribution

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 8 results
📏 Metrics: Insertion AUC score (ResNet-101), Deletion AUC score (ResNet-101)

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 8 results
📏 Metrics: Insertion AUC score (ArcFace ResNet-101), Deletion AUC score (ArcFace ResNet-101)

VGGFace2

VGGFace2 is a large-scale face recognition dataset. Images are downloaded from Google Image Search and have large variations in pose, …

📊 8 results
📏 Metrics: Insertion AUC score (ArcFace ResNet-101), Deletion AUC score (ArcFace ResNet-101)

Image Captioning

BanglaLekhaImageCaptions

This dataset consists of images and annotations in Bengali. The images are human annotated in Bengali by two adult native …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr, METEOR, ROUGE-L, SPICE

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 16 results
📏 Metrics: CIDEr, BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE, ROUGE-L

COCO Captions

COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …

📊 40 results
📏 Metrics: BLEU-4, CIDER, METEOR, SPICE, ROUGE-L, BLEU-1, BLEU-2, BLEU-3, CLIPScore

ChEBI-20

Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 1 results
📏 Metrics: BLEU, Exact, Levenshtein, MACCS FTS, Morgan FTS, RDK FTS, Validity

Conceptual Captions

Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content …

📊 2 results
📏 Metrics: CIDEr, ROUGE-L, SPICE

FlickrStyle10K

FlickrStyle10K is collected and built on Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and …

📊 1 results
📏 Metrics: BLEU-1 (Romantic)

IU X-Ray

IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The …

📊 1 results
📏 Metrics: CIDEr

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe …

📊 1 results
📏 Metrics: CIDEr

MSCOCO

📊 1 results
📏 Metrics: BLEU-4

Object HalBench

Object HalBench is a benchmark used to evaluate the performance of Language Models, particularly those that are multimodal (i.e., they …

📊 3 results
📏 Metrics: chair_i, chair_s

Peir Gross

Peir Gross (Jing et al., 2018) was collected with descriptions in the Gross sub-collection from PEIR digital library, resulting in …

📊 1 results
📏 Metrics: CIDEr, METEOR, ROUGE-L

SCICAP

SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than …

📊 9 results
📏 Metrics: BLEU-4

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 6 results
📏 Metrics: BLEU-4, CIDEr

Image Classification

AIDER

Dataset aimed to do automated aerial scene classification of disaster events from on-board a UAV. Source: [Deep-Learning-Based Aerial Image Classification …

📊 1 results
📏 Metrics: Test F1 score

AIDERV2

The dataset contains aerial images covering three commonly occurring natural disasters (earthquake/collapsed buildings, flood, wildfire/fire) and a normal class; do …

📊 1 results
📏 Metrics: Test F1 score

AmsterTime

AmsterTime dataset offers a collection of 2,500 well-curated images matching the same scene from a street view matched to historical …

📊 1 results
📏 Metrics: Accuracy

ArtDL

ArtDL is a novel painting data set for iconography classification composed of images collected from online sources. Most of the …

📊 1 results
📏 Metrics: Average Precision, F1

BreakHis

The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 …

📊 2 results
📏 Metrics: Average Test Accuracy over all magnifications

CARS196

CARS196 is composed of 16,185 car images of 196 classes.

📊 1 results
📏 Metrics: Accuracy

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 241 results
📏 Metrics: Percentage correct, Top-1 Accuracy, Accuracy, Parameters, Top 1 Accuracy, F1, Cross Entropy Loss

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 197 results
📏 Metrics: Percentage correct, PARAMS, Accuracy, Top 1 Accuracy

CINIC-10

CINIC-10 is a dataset for image classification. It has a total of 270,000 images, 4.5 times that of CIFAR-10. It …

📊 9 results
📏 Metrics: Accuracy, FLOPS, PARAMS

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 1 results
📏 Metrics: Accuracy

Caltech-256

Caltech-256 is an object recognition dataset containing 30,607 real-world images, of different sizes, spanning 257 classes (256 object classes and …

📊 4 results
📏 Metrics: Accuracy

Causal3DIdent

Update on 3DIdent, where we introduce six additional object classes (Hare, Dragon, Cow, Armadillo, Horse, and Head), and impose a …

📊 2 results
📏 Metrics: Accuracy

Chaoyang

Chaoyang dataset contains 1111 normal, 842 serrated, 1404 adenocarcinoma, 664 adenoma, and 705 normal, 321 serrated, 840 adenocarcinoma, 273 adenoma …

📊 1 results
📏 Metrics: Accuracy

Clothing1M

Clothing1M contains 1M clothing images in 14 classes. It is a dataset with noisy labels, since the data is collected …

📊 49 results
📏 Metrics: Accuracy

ColonINST-v1 (Seen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy …

📊 17 results
📏 Metrics: Accuracy

ColonINST-v1 (Unseen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy …

📊 17 results
📏 Metrics: Accuracy

Colored-MNIST(with spurious correlation)

This is a dataset with spurious correlations which can be used to evaluate machine learning methods for out-of-distribution generalization, causal …

📊 6 results
📏 Metrics: Accuracy

DF20

Danish Fungi 2020 (DF20) is a fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish Fungal …

📊 19 results
📏 Metrics: Top-1, Top-3, F1 - macro

DF20 - Mini

Danish Fungi 2020 (DF20) is a novel fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish …

📊 19 results
📏 Metrics: Top-1, Top-3, F1 - macro

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 11 results
📏 Metrics: Accuracy

DVS128 Gesture

Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions. Source: [A Low Power, Fully Event-Based Gesture Recognition …

📊 1 results
📏 Metrics: Accuracy

ESC-50

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. …

📊 1 results
📏 Metrics: Top 1 Accuracy

EuroSAT

Eurosat is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 14 results
📏 Metrics: Accuracy (%)

EuroSAT-SAR

A SAR version of the EuroSAT dataset. The images were collected from Sentinel-1 GRD products (two bands VV and VH) …

📊 3 results
📏 Metrics: Overall Accuracy

FEMNIST

See paper: Caldas, Sebastian, et al. "Leaf: A benchmark for federated settings." arXiv preprint arXiv:1812.01097 (2018).

📊 1 results
📏 Metrics: Accuracy

FMD (materials)

Sharan, Lavanya, Ruth Rosenholtz, and Edward Adelson. "Material perception: What can you see in a brief glance?." Journal of Vision …

📊 1 results
📏 Metrics: Accuracy (%)

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 33 results
📏 Metrics: Percentage error, Accuracy, Trainable Parameters, NMI, Power consumption

FlickrLogos-32

Object detection benchmark for logo detection. Images are natural scenes. Each image contains multiple objects, and each image has a …

📊 3 results
📏 Metrics: Accuracy

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 10 results
📏 Metrics: Accuracy (%)

Food-101N

The Food-101N dataset is introduced in "CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise" (CVPR'18). It is an …

📊 4 results
📏 Metrics: Accuracy

GTSRB

The German Traffic Sign Recognition Benchmark (GTSRB) contains 43 classes of traffic signs, split into 39,209 training images and 12,630 …

📊 1 results
📏 Metrics: F1

GasHisSDB

Four pathologists from Longhua Hospital Shanghai University of Traditional Chinese Medicine provided 600 gastric cancer pathology images at …

📊 8 results
📏 Metrics: Accuracy, Precision, F1-Score

Gaze-CIFAR-10

We construct Gaze-CIFAR-10, a gaze-augmented image dataset based on the standard CIFAR-10 benchmark, enhanced with human eye-tracking annotations collected using …

📊 2 results
📏 Metrics: 1:1 Accuracy

HErlev

📊 1 results
📏 Metrics: Accuracy

Id Pattern Dataset

After defining a taxonomy of the main stone deterioration patterns and anomalies, we selected 354 highly representative images of stone-built …

📊 3 results
📏 Metrics: Percentage correct

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 1020 results
📏 Metrics: Top 1 Accuracy, Number of params, GFLOPs, Hardware Burden, Top 5 Accuracy, Operations per network pass

ImageNet-100 (TEMI Split)

This split was introduced in TEMI (BMVC 2023) Adaloglou, Nikolas, Felix Michels, Hamza Kalisch, and Markus Kollmann. "Exploring the Limits …

📊 2 results
📏 Metrics: Percentage correct, Params

ImageNet-32

Imagenet32 is a huge dataset made up of small images called the down-sampled version of Imagenet. Imagenet32 is composed of …

📊 1 results
📏 Metrics: Top 1 Error

ImageNet-64

Imagenet64 is a massive dataset of small images called the down-sampled version of Imagenet. Imagenet64 comprises 1,281,167 training data and …

📊 1 results
📏 Metrics: Top 1 Error

ImageNet-9

ImageNet-9 consists of images with different amounts of background and foreground signal, which you can use to measure the extent …

📊 1 results
📏 Metrics: Top 1 Accuracy

ImageNet-P

ImageNet-P consists of noise, blur, weather, and digital distortions. The dataset has validation perturbations; has difficulty levels; has CIFAR-10, Tiny …

📊 1 results
📏 Metrics: Top 5 Accuracy

ImageNet-Sketch

ImageNet-Sketch data set consists of 50,889 images, approximately 50 images for each of the 1000 ImageNet classes. The data set …

📊 1 results
📏 Metrics: Accuracy

Imagenette

Imagenette is a subset of 10 easily classified classes from Imagenet (bench, English springer, cassette player, chain saw, church, French …

📊 2 results
📏 Metrics: Accuracy

Intel Image Classification

Context: This is image data of natural scenes around the world. Content: This data contains around 25k images of size …

📊 2 results
📏 Metrics: Accuracy

JFT-300M

JFT-300M is an internal Google dataset used for training image classification models. Images are labeled using an algorithm that uses …

📊 4 results
📏 Metrics: prec@1

KMNIST

📊 1 results
📏 Metrics: Accuracy

KTH-TIPS2

The KTH-TIPS (Textures under varying Illumination, Pose and Scale) image database was created to extend the CUReT database in two …

📊 1 results
📏 Metrics: Accuracy (%)

Kuzushiji-MNIST

Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images). Since MNIST restricts us to 10 classes, …

📊 14 results
📏 Metrics: Accuracy, Error, Trainable Parameters

Kvasir

The KVASIR Dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images …

📊 3 results
📏 Metrics: Accuracy, F1

LIMUC

The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset that comprises 11276 images from 564 patients and …

📊 1 results
📏 Metrics: Quadratic Weighted Kappa

LabelMe

LabelMe database is a large collection of images with ground truth labels for object detection and recognition. The annotations come …

📊 1 results
📏 Metrics: Test Accuracy

Large Labelled Logo Dataset (L3D)

It is composed of around 770k color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) …

📊 2 results
📏 Metrics: Eval F1

LeafNet

The PlantVillage dataset, with over 54,000 images spanning 14 plant species and 26 disease types, has been widely used for …

📊 1 results
📏 Metrics: Accuracy (Top-1)

MAMe

The MAMe dataset contains images of high-resolution and variable shape of artworks from 3 different museums: - The Metropolitan Museum …

📊 4 results
📏 Metrics: Acc

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 74 results
📏 Metrics: Percentage error, Accuracy, Trainable Parameters, Cross Entropy Loss, Epochs, Top 1 Accuracy

Malaria Dataset

The dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells. Source: Malaria Dataset

📊 1 results
📏 Metrics: Acc. (test), PARAMS

MultiMNIST

The MultiMNIST dataset is generated from MNIST. The training and tests are generated by overlaying a digit on top of …

📊 1 results
📏 Metrics: Percentage error

N-Caltech 101

The Neuromorphic-Caltech101 (N-Caltech101) dataset is a spiking version of the original frame-based Caltech101 dataset. The original dataset contained both a …

📊 1 results
📏 Metrics: Accuracy

N-MNIST

The Neuromorphic-MNIST (N-MNIST) dataset is a spiking version of the original frame-based MNIST dataset. It consists of the …

📊 4 results
📏 Metrics: Accuracy

NCT-CRC-HE-100K

The NCT-CRC-HE-100K dataset is a set of 100,000 non-overlapping image patches extracted from 86 H&E stained human cancer tissue slides …

📊 1 results
📏 Metrics: F1

ObjectNet

ObjectNet is a test set of images collected directly using crowd-sourcing. ObjectNet is unique as the objects are captured at …

📊 94 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

OmniBenchmark

Omni-Realm Benchmark (OmniBenchmark) is a diverse (21 semantic realm-wise datasets) and concise (realm-wise datasets have no concepts overlapping) benchmark for …

📊 22 results
📏 Metrics: Average Top-1 Accuracy

Oracle-MNIST

We introduce the Oracle-MNIST dataset, comprising 28×28 grayscale images of 30,222 ancient characters from 10 categories, for benchmarking pattern …

📊 4 results
📏 Metrics: Accuracy, Trainable Parameters

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 3 results
📏 Metrics: Accuracy, PARAMS, FLOPS

Oxford-IIIT Pets

The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large …

📊 6 results
📏 Metrics: Accuracy, Per-Class Accuracy

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 1 results
📏 Metrics: Accuracy

PRImA

The Prima head pose dataset consists of 2790 images of 15 persons recorded twice. Pitch values lie in the interval …

📊 1 results
📏 Metrics: Percentage correct

Places205

The Places205 dataset is a large-scale scene-centric dataset with 205 common scene categories. The training dataset contains around 2,500,000 images …

📊 15 results
📏 Metrics: Top 1 Accuracy

Places365

The Places365 dataset is a scene recognition dataset. It is composed of 10 million images comprising 434 scene classes. There …

📊 6 results
📏 Metrics: Top 1 Accuracy

PlantDoc

PlantDoc is a dataset for visual plant disease detection. The dataset contains 2,598 data points in total across 13 plant …

📊 1 results
📏 Metrics: PARAMS, Accuracy

PlantVillage

The PlantVillage dataset consists of 54303 healthy and unhealthy leaf images divided into 38 categories by species and disease.

📊 1 results
📏 Metrics: Accuracy, F1, Testing Ratio

QMNIST

The exact pre-processing steps used to construct the MNIST dataset have long been lost. This leaves us with no reliable …

📊 1 results
📏 Metrics: Accuracy (%)

RESISC45

RESISC45 dataset is a dataset for Remote Sensing Image Scene Classification (RESISC). It contains 31,500 RGB images of size 256×256 …

📊 16 results
📏 Metrics: Top 1 Accuracy, F1, zero-shot Acc

Red MiniImageNet 20% label noise

Part of the Controlled Noisy Web Labels Dataset.

📊 5 results
📏 Metrics: Accuracy

Red MiniImageNet 40% label noise

Part of the Controlled Noisy Web Labels Dataset.

📊 5 results
📏 Metrics: Accuracy

Red MiniImageNet 80% label noise

Part of the Controlled Noisy Web Labels Dataset.

📊 5 results
📏 Metrics: Accuracy

SIPaKMeD

📊 1 results
📏 Metrics: Accuracy

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 97 results
📏 Metrics: Percentage correct, FLOPS, PARAMS

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 1 results
📏 Metrics: Accuracy

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 57 results
📏 Metrics: Percentage error, Percentage correct

So2Sat LCZ42

So2Sat LCZ42 consists of local climate zone (LCZ) labels of about half a million Sentinel-1 and Sentinel-2 image patches in …

📊 1 results
📏 Metrics: Accuracy

Sports10

Games dataset containing 100,000 gameplay images of 175 video games across 10 sports genres: AMERICAN FOOTBALL, BASKETBALL, BIKE …

📊 1 results
📏 Metrics: Validation Accuracy

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 24 results
📏 Metrics: Accuracy

Stanford Online Products

Stanford Online Products (SOP) dataset has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are split …

📊 1 results
📏 Metrics: Accuracy

Visual Wake Words

Visual Wake Words represents a common microcontroller vision use-case of identifying whether a person is present in the image or …

📊 4 results
📏 Metrics: Accuracy

VizWiz-Classification

Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform …

📊 1 results
📏 Metrics: Accuracy

WebVision

The WebVision dataset is designed to facilitate the research on learning visual representation from noisy web data. It is a …

📊 2 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy

iNaturalist

The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. Those categories belong to …

📊 18 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy, Top 3 Error, Overall

iWildCam2020-WILDS

The iWildCam2020-WILDS dataset is a variant of the iWildCam 2020 dataset. iWildCam2020-WILDS is a benchmark dataset designed to test OOD …

📊 6 results
📏 Metrics: Accuracy (Top-1)

smallNORB

The smallNORB dataset is a dataset for 3D object recognition from shape. It contains images of 50 toys belonging to …

📊 6 results
📏 Metrics: Classification Error

Image Clustering

Birdsnap

Birdsnap is a large bird dataset consisting of 49,829 images from 500 bird species with 47,386 images used for training …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 35 results
📏 Metrics: Accuracy, NMI, ARI, Train set, Backbone

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 26 results
📏 Metrics: Accuracy, NMI, ARI, Train Set, Backbone

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 1 results
📏 Metrics: Accuracy

Country211

Country211 is a dataset released by OpenAI, designed to assess the geolocation capability of visual representations. It filters the YFCC100m …

📊 1 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 2 results
📏 Metrics: Accuracy, ARI, NMI

EuroSAT

Eurosat is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Accuracy

FER2013

Fer2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 1 results
📏 Metrics: Accuracy

FRGC

The data for FRGC consists of 50,000 recordings divided into training and validation partitions. The training partition is designed for …

📊 3 results
📏 Metrics: NMI, Accuracy

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 12 results
📏 Metrics: Accuracy, NMI

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 1 results
📏 Metrics: Accuracy

GTSRB

The German Traffic Sign Recognition Benchmark (GTSRB) contains 43 classes of traffic signs, split into 39,209 training images and 12,630 …

📊 1 results
📏 Metrics: Accuracy

HAR

The Human Activity Recognition Dataset has been collected from 30 subjects performing six different activities (Walking, Walking Upstairs, Walking Downstairs, …

📊 3 results
📏 Metrics: Accuracy, NMI

Hateful Memes

The Hateful Memes data set is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new …

📊 1 results
📏 Metrics: Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 12 results
📏 Metrics: Accuracy, NMI, ARI

ImageNet-100 (TEMI Split)

This split was introduced in TEMI (BMVC 2023) Adaloglou, Nikolas, Felix Michels, Hamza Kalisch, and Markus Kollmann. "Exploring the Limits …

📊 5 results
📏 Metrics: NMI, ACCURACY, ARI

ImageNet-50 (TEMI Split)

The ImageNet-50 dataset split as introduced in TEMI. Adaloglou, Nikolas, Felix Michels, Hamza Kalisch, and Markus Kollmann. "Exploring the Limits …

📊 5 results
📏 Metrics: NMI, ACCURACY, ARI

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: Accuracy

Kinetics-700

Kinetics-700 is a video dataset of 650,000 clips that covers 700 human action classes. The videos include human-object interactions such …

📊 1 results
📏 Metrics: Accuracy

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 3 results
📏 Metrics: Accuracy, NMI

Oxford-IIIT Pets

The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large …

📊 1 results
📏 Metrics: Accuracy

PCam

PatchCamelyon is an image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of …

📊 1 results
📏 Metrics: Accuracy

RESISC45

RESISC45 dataset is a dataset for Remote Sensing Image Scene Classification (RESISC). It contains 31,500 RGB images of size 256×256 …

📊 1 results
📏 Metrics: Accuracy

Rendered SST2

The Rendered SST2 dataset, released by OpenAI, measures the optical character recognition capability of visual representations. …

📊 1 results
📏 Metrics: Accuracy

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 26 results
📏 Metrics: Accuracy, NMI, ARI, Train Split, Backbone

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 1 results
📏 Metrics: Accuracy

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 5 results
📏 Metrics: Accuracy, NMI

Stanford Dogs

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into …

📊 4 results
📏 Metrics: Accuracy, NMI

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 2 results
📏 Metrics: Accuracy, ARI, NMI

USPS

USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 …

📊 15 results
📏 Metrics: NMI, Accuracy

Image Colorization

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 3 results
📏 Metrics: Consistency, FID

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 4 results
📏 Metrics: Consistency, FID

NIR2RGB VCIP Challange Dataset

This dataset provides the VCIP 2020 Grand Challenge on the NIR Image Colorization dataset. You can refer to https://jchenhkg.github.io/projects/NIR2RGB_VCIP_Challenge/ for …

📊 3 results
📏 Metrics: PSNR

Image Compression

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: Bit rate

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 1 results
📏 Metrics: Bit rate

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 1 results
📏 Metrics: Bit rate

PCam

PatchCamelyon is an image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of …

📊 1 results
📏 Metrics: Bit rate

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: Bit rate

Image Deblurring

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 3 results
📏 Metrics: FID, PSNR, SSIM

GoPro

The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 …

📊 44 results
📏 Metrics: PSNR, SSIM, Params (M), FID, LPIPS

HIDE

Consists of 8,422 blurry and sharp image pairs with 65,784 densely annotated foreground (FG) human bounding boxes. Source: Human-Aware Motion Deblurring

📊 5 results
📏 Metrics: PSNR, SSIM

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 3 results
📏 Metrics: FID, PSNR, SSIM

Image Dehazing

Haze4k

Haze4k is a synthesized dataset with 4,000 hazy images, in which each hazy image has the associated ground truths of …

📊 9 results
📏 Metrics: PSNR, SSIM

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 3 results
📏 Metrics: PSNR

NH-HAZE

NH-HAZE is an image dehazing dataset. Since in many real cases haze is not uniformly distributed, NH-HAZE, a non-homogeneous realistic …

📊 3 results
📏 Metrics: PSNR

RESIDE

A new large-scale benchmark consisting of both synthetic and real-world hazy images, called REalistic Single Image DEhazing (RESIDE). RESIDE highlights …

📊 1 results
📏 Metrics: PSNR

RS-Haze

A large-scale non-homogeneous remote sensing image dehazing dataset

📊 7 results
📏 Metrics: PSNR, SSIM

Image Denoising

DND

Benchmarking Denoising Algorithms with Real Photographs. This dataset consists of 50 pairs of noisy and (nearly) noise-free images captured with …

📊 15 results
📏 Metrics: PSNR (sRGB), SSIM (sRGB)

FFHQ

Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity …

📊 1 results
📏 Metrics: LPIPS

FMD

The Fluorescence Microscopy Denoising (FMD) dataset is dedicated to Poisson-Gaussian denoising. The dataset consists of 12,000 real fluorescence microscopy images …

📊 1 results
📏 Metrics: PSNR

Nam

A holistic approach to cross-channel image noise modeling and its application to image denoising

📊 1 results
📏 Metrics: PSNR, SSIM

PolyU

PolyU Dataset is a large dataset of real-world noisy images with reasonably obtained corresponding “ground truth” images. The basic idea …

📊 1 results
📏 Metrics: PSNR, SSIM

SIDD

SIDD is an image denoising dataset containing 30,000 noisy images from 10 scenes under different lighting conditions using five representative …

📊 20 results
📏 Metrics: PSNR (sRGB), SSIM (sRGB), Average PSNR

Image Editing

GEdit-Bench-EN

This dataset is a new benchmark, grounded in real-world usage, developed to support more authentic and comprehensive evaluation of …

📊 3 results
📏 Metrics: Overall, Perceptual Quality, Semantic Consistency

ImgEdit-Data

ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex …

📊 9 results
📏 Metrics: Overall, Add, Adjust, Extract, Replace, Remove, Background, Style, Hybrid, Action

Image Enhancement

Exposure-Errors

A dataset of over 24,000 images exhibiting the broadest range of exposure values to date with a corresponding properly exposed …

📊 3 results
📏 Metrics: PSNR, SSIM

MIT-Adobe FiveK

The MIT-Adobe FiveK dataset consists of 5,000 photographs taken with SLR cameras by a set of different photographers. They are …

📊 1 results
📏 Metrics: DeltaE, LPIPS, PSNR, SSIM

SICE-Grad

SICE_Grad is a test image dataset representing complex mixed over-/under-exposed scenes.

📊 1 results
📏 Metrics: Average PSNR, LPIPS, SSIM

SICE-Mix

SICE_Mix is a test image dataset representing complex mixed over-/under-exposed scenes.

📊 1 results
📏 Metrics: Average PSNR, LPIPS, SSIM

TIP 2018

The first large demoire dataset. The dataset contains 135,000 image pairs, each containing an image contaminated with moire patterns and …

📊 5 results
📏 Metrics: PSNR, SSIM, FSIM

Image Generation

ARKitScenes

ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide …

📊 4 results
📏 Metrics: FID, FID (SwAV)

Binarized MNIST

A binarized version of MNIST. Source: Binarized MNIST

📊 10 results
📏 Metrics: nats, bits/dimension

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 72 results
📏 Metrics: FID, IS, NFE

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 6 results
📏 Metrics: FID, Inception Score, Model Size (MB)

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 6 results
📏 Metrics: FID-5k-training-steps

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 1 results
📏 Metrics: bpd (8-bits)

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 1 results
📏 Metrics: FLD

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 6 results
📏 Metrics: FID-10k-training-steps

FFHQ

Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity …

📊 12 results
📏 Metrics: FID, Clean-FID (70k), FID-10k-training-steps

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 5 results
📏 Metrics: FID, Precision, Recall

KMNIST

📊 1 results
📏 Metrics: FID

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30,976 images (15,488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 1 results
📏 Metrics: PSNR, SSIM

LSUN

The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN …

📊 1 results
📏 Metrics: Average FID

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 11 results
📏 Metrics: bits/dimension, FID, Precision, Recall, PSNR, SSIM

MetFaces

MetFaces is an image dataset of human faces extracted from works of art. The dataset consists of 1336 high-quality PNG …

📊 3 results
📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

Multi-dSprites

📊 1 results
📏 Metrics: FID

NASA Perseverance

Samples from NASA Perseverance and set of GAN generated synthetic images from Neural Mars.

📊 1 results
📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

ObjectsRoom

The ObjectsRoom dataset is based on the MuJoCo environment used by the Generative Query Network [4] and is a multi-object …

📊 3 results
📏 Metrics: FID

RC-49

RC-49 is a benchmark dataset for generating images conditional on a continuous scalar variable. It is made by rendering 49 …

📊 2 results
📏 Metrics: Intra-FID

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 4 results
📏 Metrics: FID, FID (SwAV)

SDSS Galaxies

This is a dataset of 306,006 galaxies whose coordinates are taken from the Sloan Digital Sky Survey Data Release 7 …

📊 1 results
📏 Metrics: FID

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 25 results
📏 Metrics: FID, Inception score, Model Size (MB), Recall, NFE

ShapeStacks

A simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives richly annotated regarding semantics and …

📊 3 results
📏 Metrics: FID

Stacked MNIST

The Stacked MNIST dataset is derived from the standard MNIST dataset with an increased number of discrete modes. 240,000 RGB …

📊 2 results
📏 Metrics: FID, Inception score

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 4 results
📏 Metrics: FID, Inception score

Stanford Dogs

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into …

📊 4 results
📏 Metrics: FID, Inception score

TextAtlasEval

A dense-text image benchmark for evaluating large generation models' ability to generate text.

📊 4 results
📏 Metrics: TextVisionBlend OCR (F1 Score), TextVisionBlend OCR (Accuracy), TextVisionBlend OCR (Cer), TextVisionBlend FID, TextVisionBlend Clip Score, StyledTextSynth OCR (F1 Score), StyledTextSynth OCR (Accuracy), StyledTextSynth OCR (Cer), StyledTextSynth FID, StyledTextSynth Clip Score, TextScenesHQ OCR (F1 Score), TextScenesHQ OCR (Accuracy), TextScenesHQ OCR (Cer), TextScenesHQ FID, TextScenesHQ Clip Score

VLN-CE

Vision and Language Navigation in Continuous Environments (VLN-CE) is an instruction-guided navigation task with crowdsourced instructions, realistic environments, and unconstrained …

📊 4 results
📏 Metrics: FID, FID (SwAV)

VizDoom

ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode …

📊 4 results
📏 Metrics: FID, FID (SwAV)

WISE

WISE is the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models …

📊 13 results
📏 Metrics: Overall, Cultural, Time, Space, Biology, Physics, Chemistry

Image Inpainting

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 1 results
📏 Metrics: MAE, PSNR, RMSE, SSIM

Apolloscape Inpainting

The Inpainting dataset consists of synchronized labeled images and LiDAR-scanned point clouds. It is captured by HESAI Pandora All-in-One Sensing …

📊 1 results
📏 Metrics: RMSE

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 5 results
📏 Metrics: FID, PSNR, SSIM, LPIPS

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 6 results
📏 Metrics: FID, P-IDS, U-IDS

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 5 results
📏 Metrics: FID, PSNR, SSIM

Image Manipulation

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 2 results
📏 Metrics: LPIPS (S1), LPIPS (S2), LPIPS (S3), LPIPS (S4), LPIPS (S5), SIFID (S1), SIFID (S2), SIFID (S3), SIFID (S4), SIFID (S5)

Image Manipulation Detection

CASIA (OSN-transmitted - Facebook)

This dataset is an OSN-transmitted (OSN = Online Social Network) version of the CASIA dataset. The dataset is available here: …

📊 2 results
📏 Metrics: AUC, F-score, Intersection over Union

CASIA (OSN-transmitted - Wechat)

This dataset is an OSN-transmitted (OSN = Online Social Network) version of the CASIA dataset. The dataset is available here: …

📊 2 results
📏 Metrics: AUC, Intersection over Union, f-Score

CASIA (OSN-transmitted - Weibo)

This dataset is an OSN-transmitted (OSN = Online Social Network) version of the CASIA dataset. The dataset is available here: …

📊 2 results
📏 Metrics: AUC, Intersection over Union, f-Score

CASIA (OSN-transmitted - Whatsapp)

This dataset is an OSN-transmitted (OSN = Online Social Network) version of the CASIA dataset. The dataset is available here: …

📊 2 results
📏 Metrics: AUC, Intersection over Union, f-Score

COVERAGE

COVERAGE contains copy-move forged (CMFD) images and their originals with similar but genuine objects (SGOs). COVERAGE is designed to highlight …

📊 6 results
📏 Metrics: Balanced Accuracy, AUC

Casia V1+

Casia V1 is a dataset for forgery classification. Casia V1+ is a modification of the Casia V1 dataset proposed by …

📊 7 results
📏 Metrics: Balanced Accuracy, AUC

Columbia (OSN-transmitted - Facebook)

This dataset is an OSN-transmitted (Online Social Network) version of the Columbia dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

Columbia (OSN-transmitted - Wechat)

This dataset is an OSN-transmitted (Online Social Network) version of the Columbia dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

Columbia (OSN-transmitted - Weibo)

This dataset is an OSN-transmitted (Online Social Network) version of the Columbia dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

Columbia (OSN-transmitted - Whatsapp)

This dataset is an OSN-transmitted (Online Social Network) version of the Columbia dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

DSO (OSN-transmitted - Facebook)

This dataset is an OSN-transmitted (Online Social Network) version of the DSO dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

DSO (OSN-transmitted - Wechat)

This dataset is an OSN-transmitted (Online Social Network) version of the DSO dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

DSO (OSN-transmitted - Weibo)

This dataset is an OSN-transmitted (Online Social Network) version of the DSO dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

DSO (OSN-transmitted - Whatsapp)

This dataset is an OSN-transmitted (Online Social Network) version of the DSO dataset. Unfortunately, OSNs automatically apply operations like compression …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

NIST (OSN-transmitted - Facebook)

This dataset is an OSN-transmitted (Online Social Network) version of the NIST dataset (https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation). Unfortunately, OSNs automatically apply operations like …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

NIST (OSN-transmitted - Wechat)

This dataset is an OSN-transmitted (Online Social Network) version of the NIST dataset (https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation). Unfortunately, OSNs automatically apply operations like …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

NIST (OSN-transmitted - Weibo)

This dataset is an OSN-transmitted (Online Social Network) version of the NIST dataset (https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation). Unfortunately, OSNs automatically apply operations like …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

NIST (OSN-transmitted - Whatsapp)

This dataset is an OSN-transmitted (Online Social Network) version of the NIST dataset (https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation). Unfortunately, OSNs automatically apply operations like …

📊 2 results
📏 Metrics: AUC, f-Score, Intersection over Union

Image Manipulation Localization

COVERAGE

COVERAGE contains copy-move forged (CMFD) images and their originals with similar but genuine objects (SGOs). COVERAGE is designed to highlight …

📊 9 results
📏 Metrics: Average Pixel F1(Fixed threshold)

Casia V1+

Casia V1 is a dataset for forgery classification. Casia V1+ is a modification of the Casia V1 dataset proposed by …

📊 9 results
📏 Metrics: Average Pixel F1(Fixed threshold)

Image Matching

IMC PhotoTourism

Dataset provided by the Image Matching Workshop: https://www.cs.ubc.ca/research/image-matching-challenge/current/

📊 6 results
📏 Metrics: mean average accuracy @ 10

ZEB

ZEB is an evaluation benchmark for image matching, built by merging 8 real-world datasets and 4 simulated datasets with diverse image resolutions, …

📊 9 results
📏 Metrics: Mean AUC@5°

Image Matting

AIM-500

AIM-500 is the first natural image matting test set; it contains 500 high-resolution real-world natural images from three types of images …

📊 5 results
📏 Metrics: SAD, MSE, MAD, Conn., Grad.

AM-2K

AM-2k (Animal Matting 2,000 Dataset) consists of 2,000 high-resolution images collected and carefully selected from websites with open licenses. AM-2k …

📊 7 results
📏 Metrics: SAD, MSE, MAD

Composition-1K

Composition-1K is a large-scale image matting dataset including 49300 training images and 1000 testing images. Image source: https://arxiv.org/pdf/1703.03872v3.pdf

📊 13 results
📏 Metrics: MSE, SAD, Conn, Grad

Distinctions-646

Distinctions-646 is composed of 646 foreground images with manually annotated alpha mattes.

📊 4 results
📏 Metrics: SAD, MSE, Grad, Conn, Trimap

P3M-10k

P3M-10k contains 10,421 high-resolution real-world face-blurred portrait images, along with their manually labeled alpha mattes. The dataset is aimed to …

📊 5 results
📏 Metrics: SAD, MSE, MAD

PPM-100

PPM is a portrait matting benchmark with the following characteristics: - Fine Annotation - All images are labeled and checked …

📊 1 results
📏 Metrics: MAD, MSE

Image Outpainting

MSCOCO

📊 1 results
📏 Metrics: CLIP Similarity, FID, Inception score

Image Paragraph Captioning

Image Paragraph Captioning

The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an …

📊 4 results
📏 Metrics: BLEU-4, METEOR, CIDEr

Image Quality Assessment

KonIQ-10k

KonIQ-10k is a large-scale IQA dataset consisting of 10,073 quality scored images. This is the first in-the-wild database aiming for …

📊 4 results
📏 Metrics: SRCC, PLCC

MSU FR VQA Database

The dataset was created for the video quality assessment problem. It was formed with 36 clips from Vimeo, which were selected …

📊 3 results
📏 Metrics: SRCC

MSU NR VQA Database

The dataset was created for the video quality assessment problem. It was formed with 36 clips from Vimeo, which were selected …

📊 6 results
📏 Metrics: SRCC, PLCC, KLCC

Image Reconstruction

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 15 results
📏 Metrics: FID, LPIPS, PSNR, SSIM

Spike-X4K

The Spike-X4K Dataset is a high-resolution image reconstruction resource tailored for the latest advancements in spike camera technology. …

📊 1 results
📏 Metrics: Average PSNR

Ultra-High Resolution Image Reconstruction Benchmark

Ultra-high definition benchmark (UHDBench) includes 2293 images at 2k resolution sourced from the ground-truth test sets of HRSOD, LIU4k, UAVid, …

📊 6 results
📏 Metrics: rFID, PSNR, SSIM, LPIPS

Image Registration

DIR-LAB COPDgene

Inspiratory and expiratory breath-hold CT image pairs acquired from the National Heart Lung Blood Institute COPDgene study archive.

📊 1 results
📏 Metrics: landmarks

FIRE

Fundus Image Registration Dataset (FIRE) is a dataset consisting of 129 retinal images forming 134 image pairs. These image pairs …

📊 5 results
📏 Metrics: mAUC

Image Restoration

CDD-11

An image restoration dataset

📊 11 results
📏 Metrics: Average PSNR (dB), SSIM

UHDM

The first ultra-high-definition image demoireing dataset, consisting of 4,500 4K resolution training pairs and 500 standard 4K resolution validation pairs.

📊 2 results
📏 Metrics: PSNR

Image Retrieval

AmsterTime

AmsterTime dataset offers a collection of 2,500 well-curated images matching the same scene from a street view matched to historical …

📊 5 results
📏 Metrics: mAP

CARS196

CARS196 is composed of 16,185 car images of 196 classes.

📊 7 results
📏 Metrics: R@1, R@8

CBVS

A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenario

📊 2 results
📏 Metrics: PNR, Recall@1

CIRR

Composed Image Retrieval (or, Image Retrieval conditioned on Language Feedback) is a relatively new retrieval task, where an input query …

📊 14 results
📏 Metrics: (Recall@5+Recall_subset@1)/2, Recall@10

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 6 results
📏 Metrics: recall@1, recall@5, Recall@10, QPS

COCO-CN

COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can …

📊 9 results
📏 Metrics: R@1, R@10, R@5

COFAR

The COFAR (COmmonsense and FActual Reasoning) dataset is a collection of images and text queries specifically designed to challenge and …

📊 1 results
📏 Metrics: Recall@1, Recall@5

CREPE (Compositional REPresentation Evaluation)

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains …

📊 22 results
📏 Metrics: Recall@1 (HN-Atom + HN-Comp, SC), Recall@1 (HN-Atom + HN-Comp, UC), Recall@1 (HN-Atom, UC), Recall@1 (HN-Comp, UC)

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of …

📊 7 results
📏 Metrics: R@1, R@2, R@4, R@8

DeepFashion

DeepFashion is a dataset containing around 800K diverse fashion images with their rich annotations (46 categories, 1,000 descriptive attributes, bounding …

📊 1 results
📏 Metrics: Recall@20

Exact Street2Shop

A dataset containing 404,683 shop photos collected from 25 different online retailers and 20,357 street photos, providing a total of …

📊 3 results
📏 Metrics: Rank-1, Rank-10, Rank-20, mAP, Rank-50

FETA Car-Manuals

FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA Car-Manuals dataset …

📊 1 results
📏 Metrics: R@1, R@10, R@5

Fashion IQ

Fashion IQ supports and advances research on interactive fashion image retrieval. Fashion IQ is the first fashion dataset to provide …

📊 17 results
📏 Metrics: (Recall@10+Recall@50)/2, Recall@10

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 9 results
📏 Metrics: Recall@10, Recall@5, Recall@1, Recall@Sum, Image-to-text R@1, Image-to-text R@10, Image-to-text R@5, QPS

ICFG-PEDES

One large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval. Compared with existing databases, ICFG-PEDES has three key advantages. …

📊 1 results
📏 Metrics: rank-1

INSTRE

INSTRE is a benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE). INSTRE has the following major properties: (1) balanced …

📊 1 results
📏 Metrics: MAP

ImageCoDe

Given 10 minimally contrastive (highly similar) images and a complex description for one of them, the task is to retrieve …

📊 1 results
📏 Metrics: Accuracy

In-Shop

In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop Clothes Retrieval. This is a large subset of DeepFashion, containing large …

📊 7 results
📏 Metrics: R@1

LaSCo

Large Scale Composed Image Retrieval (LaSCo) is a new dataset for Composed Image Retrieval (CoIR), 10 times larger than current …

📊 2 results
📏 Metrics: Recall@1 (%)

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe …

📊 1 results
📏 Metrics: Text-to-image R@1, Text-to-image R@10, Text-to-image R@5

MSCOCO

📊 3 results
📏 Metrics: Recall@1, Recall@10, Recall@5

NUS-WIDE

The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated …

📊 1 results
📏 Metrics: MAP

Oxford5k

Oxford5K is the Oxford Buildings Dataset, which contains 5062 images collected from Flickr. It offers a set of 55 queries …

📊 2 results
📏 Metrics: mAP

PKU SketchRe-ID Dataset

The PKU Sketch Re-ID dataset is constructed by the National Engineering Laboratory for Video Technology (NELVT), Peking University. The dataset contains …

📊 1 results
📏 Metrics: R1

PKU-Reid

This dataset contains 114 individuals including 1824 images captured from two disjoint camera views. For each person, eight images are …

📊 1 results
📏 Metrics: R1

Paris6k

📊 2 results
📏 Metrics: mAP

PhotoChat

PhotoChat is the first dataset that casts light on photo-sharing behavior in online messaging. PhotoChat contains 12k dialogues, each …

📊 5 results
📏 Metrics: R1, R@10, R@5, Sum(R@1,5,10)

WIT

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 …

📊 2 results
📏 Metrics: R@1, R@5

iNaturalist

The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. Those categories belong to …

📊 10 results
📏 Metrics: R@1, R@16, R@32, R@5

Image Retrieval with Multi-Modal Query

MIT-States

The MIT-States dataset has 245 object classes, 115 attribute classes and ∼53K images. There is a wide range of objects …

📊 5 results
📏 Metrics: Recall@1, Recall@5, Recall@10

Image Segmentation

EVD4UAV

EVD4UAV is an altitude-sensitive benchmark dataset designed to evade vehicle detection in Unmanned Aerial Vehicle (UAV) imagery. This dataset is …

📊 1 results
📏 Metrics: Detection: Full

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 1 results
📏 Metrics: GFLOPs

MARIDA

MARIDA (Marine Debris Archive) is the first dataset based on the multispectral Sentinel-2 (S2) satellite data, which distinguishes Marine Debris …

📊 1 results
📏 Metrics: IoU, F1, F1@M

MAS3K

MAS3K contains a total of 3,103 images, where 1,588 are for camouflaged cases, 1,322 are for common cases, and 193 …

📊 3 results
📏 Metrics: mIoU, S-measure, E-measure, MAE

MSD (Mirror Segmentation Dataset)

We construct the first large-scale mirror dataset, named MSD. It includes 4,018 pairs of images containing mirrors and their …

📊 3 results
📏 Metrics: IoU, F-measure, MAE

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 3 results
📏 Metrics: mIoU, mAP0.5

PMD

We propose a large-scale benchmark here, which contains a total of 6,461 mirror images with ground truth annotations.

📊 3 results
📏 Metrics: IoU, F-measure, MAE

Pascal Panoptic Parts

The Pascal Panoptic Parts dataset consists of annotations for the part-aware panoptic segmentation task on the PASCAL VOC 2010 dataset. …

📊 4 results
📏 Metrics: mIoUPartS

RMAS

We construct a new large-scale real-world MAS data set for conducting extensive experiments. It consists of over 3000 images with …

📊 3 results
📏 Metrics: mIoU, S-measure, E-measure, MAE

Image Shadow Removal

INS Dataset

A significant challenge in removing shadows from indoor scenes is obtaining shadow-free images. To overcome this challenge, we propose a …

📊 1 results
📏 Metrics: Average PSNR (dB)

Image Stitching

HPatches

HPatches is a recent dataset for local patch descriptor evaluation that consists of 116 sequences of 6 images with …

📊 1 results
📏 Metrics: 0..5sec

Image Super-Resolution

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 5 results
📏 Metrics: FID, PSNR, SSIM

Chikusei Dataset

The airborne hyperspectral dataset was taken by Headwall Hyperspec-VNIR-C imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan, …

📊 1 results
📏 Metrics: PSNR

IXI

IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for …

📊 8 results
📏 Metrics: PSNR 2x T2w, PSNR 4x T2w, SSIM 4x T2w, SSIM for 2x T2w

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 6 results
📏 Metrics: FID, PSNR, SSIM

Set14

The Set14 dataset is a dataset consisting of 14 images commonly used for testing performance of Image Super-Resolution models. Image …

📊 1 results
📏 Metrics: PSNR

ShipSpotting

To construct such a dataset, a straightforward approach was scraping images from the web. The main source for our dataset …

📊 1 results
📏 Metrics: Frechet Inception Distance

TextZoom

TextZoom is a super-resolution dataset that consists of paired Low Resolution – High Resolution scene text images. The images are …

📊 1 results
📏 Metrics: Average Accuracy, ASTER Overall Accuracy, MORAN Overall Accuracy, CRNN Overall Accuracy

Image-to-Image Translation

AFHQ

Animal FacesHQ (AFHQ) is a dataset of animal faces consisting of 15,000 high-quality images at 512 × 512 resolution. The …

📊 2 results
📏 Metrics: LPIPS, FID

BCI

The evaluation of human epidermal growth factor receptor 2 (HER2) expression is essential to formulate a precise treatment for breast …

📊 4 results
📏 Metrics: Average PSNR, SSIM

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 6 results
📏 Metrics: FID, LPIPS

IXI

IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for …

📊 7 results
📏 Metrics: PSNR

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30,976 images (15,488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 4 results
📏 Metrics: PSNR, SSIM

RaFD

The Radboud Faces Database (RaFD) is a set of pictures of 67 models (both adult and children, males and females) …

📊 4 results
📏 Metrics: Classification Error

selfie2anime

The selfie dataset contains 46,836 selfie images annotated with 36 different attributes. We only use photos of females as training …

📊 4 results
📏 Metrics: DFID, FID, LPIPS

Image-to-Text Retrieval

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 9 results
📏 Metrics: Recall@1, Recall@5, Recall@10

FETA Car-Manuals

FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA Car-Manuals dataset …

📊 1 results
📏 Metrics: R@1, R@10, R@5

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 11 results
📏 Metrics: Recall@1, Recall@5, Recall@10, Recall@Sum

RSICD

The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for remote sensing image captioning task. It contains more than …

📊 1 results
📏 Metrics: Image to Text Recall@1

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 7 results
📏 Metrics: Specificity

Imputation

Adult

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: Test error

PhysioNet Challenge 2012

The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units …

📊 1 results
📏 Metrics: AUROC

Sprites

The Sprites dataset contains 60 pixel color images of animated characters (sprites). There are 672 sprites, 500 for training, 100 …

📊 1 results
📏 Metrics: MSE

Incremental Learning

MLT17

📊 1 results
📏 Metrics: Acc

Inductive Logic Programming

RuDaS

Logical rules are a popular knowledge representation language in many domains. Recently, neural networks have been proposed to support the …

📊 4 results
📏 Metrics: H-Score, R-Score

Information Extraction

SemTabNet

This dataset accompanies the following paper: Statements: Universal Information Extraction from Tables with …

📊 1 results
📏 Metrics: average Tree Similarity Score

Information Retrieval

BSARD

The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native corpus for studying statutory article retrieval. BSARD consists of …

📊 3 results
📏 Metrics: Recall@100, Recall@200, Recall@500

CQADupStack

CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question …

📊 2 results
📏 Metrics: mAP@100

MS MARCO

MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 3 results
📏 Metrics: Time (ms), MRR@10

MSLR-WEB30K

The MSLR-WEB30K dataset consists of 30,000 search queries over the documents from search results. The data also contains the values …

📊 1 results
📏 Metrics: nDCG@10

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 1 results
📏 Metrics: nDCG@10

Initial Structure to Relaxed Energy (IS2RE)

OC20

Open Catalyst 2020 is a dataset for catalysis in chemical engineering. Focusing on molecules that are important in renewable energy …

📊 4 results
📏 Metrics: Energy MAE

Instance Segmentation

ARMBench

ARMBench is a large-scale, object-centric benchmark dataset for robotic manipulation in the context of a warehouse. ARMBench contains images, videos, …

📊 7 results
📏 Metrics: AP50, AP75

Box-IS

RGB-D instance segmentation box dataset. The Box-IS dataset was created to support research on human-robot collaboration with a focus on …

📊 1 results
📏 Metrics: mask AP

COCO-N Medium

COCO-N Medium introduces a stochastic benchmark that simulates common real-world scenarios with noticeable label inaccuracies in the COCO dataset. This …

📊 1 results
📏 Metrics: mIOU

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 1 results
📏 Metrics: AP

KINS

Augments KITTI with additional instance-level pixel annotations for 8 categories. Source: Amodal Instance Segmentation with KINS Dataset

📊 1 results
📏 Metrics: mAP

LDD

The Instance Segmentation task, an extension of the well-known Object Detection task, is of great help in many areas, such …

📊 1 results
📏 Metrics: mask AP

NYUDv2-IS

An RGB-D dataset converted from NYUDv2 into COCO-style instance segmentation format. To construct NYUDv2-IS, specifically tailored for instance segmentation, we …

📊 2 results
📏 Metrics: mask AP

Occluded COCO

Occluded COCO is an automatically generated subset of the COCO val dataset, collecting partially occluded objects for a large variety of categories …

📊 6 results
📏 Metrics: Mean Recall

OoDIS

OoDIS is a benchmark dataset for anomaly instance segmentation, crucial for autonomous vehicle safety. It extends existing anomaly segmentation benchmarks …

📊 3 results
📏 Metrics: AP, AP50

PartNet

PartNet is a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. The dataset …

📊 1 results
📏 Metrics: mAP50

SUN-RGBD-IS

An RGB-D dataset converted from SUN-RGBD into COCO-style instance segmentation format. To transform SUN-RGBD into an instance segmentation benchmark (i.e., …

📊 2 results
📏 Metrics: mask AP

Separated COCO

Separated COCO is an automatically generated subset of the COCO val dataset, collecting separated objects for a large variety of categories in …

📊 6 results
📏 Metrics: Mean Recall

UFBA-425

We introduce a set of 425 panoramic X-rays with human-annotated bounding boxes and polygons. The 425 images are a …

📊 1 results
📏 Metrics: Dice Coef

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 5 results
📏 Metrics: Average Precision

iShape

iShape is an irregular shape dataset for instance segmentation. iShape contains six sub-datasets with one real and five synthetics, each …

📊 1 results
📏 Metrics: mask AP

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 1 results
📏 Metrics: MOTA

Instance Shadow Detection

SOBA

A new dataset called SOBA, named after Shadow-OBject Association, with 3,623 pairs of shadow and object instances in 1,000 photos, …

📊 2 results
📏 Metrics: mask SOAP, Bounding Box SOAP, Asso. AP_segm, Asso. AP_bbox, Instance AP_segm, Instance AP_bbox

Instruction Following

IFEval

This dataset evaluates the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an …

📊 4 results
📏 Metrics: Inst-level loose-accuracy, Inst-level strict-accuracy, Prompt-level loose-accuracy, Prompt-level strict-accuracy

Instrument Recognition

NSynth

NSynth is a dataset of one shot instrumental notes, containing 305,979 musical notes with unique pitch, timbre and envelope. The …

📊 7 results
📏 Metrics: Accuracy

OpenMIC-2018

OpenMIC-2018 is an instrument recognition dataset containing 20,000 examples of Creative Commons-licensed music available on the Free Music Archive. Each …

📊 5 results
📏 Metrics: mean average precision

Intent Classification

KUAKE-QIC

KUAKE Query Intent Classification, a dataset for intent classification, is used for the KUAKE-QIC task. Given the queries of search …

📊 1 results
📏 Metrics: Accuracy

MASSIVE

MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks …

📊 3 results
📏 Metrics: Intent Accuracy

ORCAS-I

A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct …

📊 1 results
📏 Metrics: F1-score, Precision, Recall

SLURP

A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets. …

📊 5 results
📏 Metrics: Accuracy (%)

Intent Detection

ATIS

ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 13 results
📏 Metrics: Accuracy

BANKING77

Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents …

📊 2 results
📏 Metrics: Accuracy (%)

CAIS

We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS), and annotate them with slot tags and intent labels. The …

📊 1 results
📏 Metrics: Acc

CLINC150

This dataset is for evaluating the performance of intent classification systems in the presence of "out-of-scope" queries, i.e., queries that …

📊 1 results
📏 Metrics: Accuracy (%)

Dialogue State Tracking Challenge

The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art …

📊 1 results
📏 Metrics: Accuracy

HWU64

This project contains natural language data for human-robot interaction in the home domain, which we collected and annotated for evaluating NLU …

📊 1 results
📏 Metrics: Accuracy (%)

MixATIS

The dataset is constructed from the single-intent dataset ATIS. This is a publicly available multi-intent dataset, which can be downloaded …

📊 10 results
📏 Metrics: Accuracy

MixSNIPS

The dataset is constructed from the single-intent dataset SNIPS. This is a publicly available multi-intent dataset, which can be downloaded …

📊 11 results
📏 Metrics: Accuracy, f1 macro

ProSLU

In the paper, to bridge the research gap, we propose a new and important task, Profile-based Spoken Language Understanding (ProSLU), …

📊 1 results
📏 Metrics: Accuracy

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 8 results
📏 Metrics: Accuracy

Intent Discovery

ATIS

ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 1 results
📏 Metrics: ARI

Persian-ATIS

PATIS is a Persian-language dataset for intent detection and slot filling.

📊 1 results
📏 Metrics: ARI

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 1 results
📏 Metrics: ARI

Intent Recognition

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Interactive Segmentation

DAVIS

The Densely Annotated Video Segmentation dataset (DAVIS) is a high-quality, high-resolution densely annotated video segmentation dataset under …

📊 13 results
📏 Metrics: NoC@90, NoC@85, NoC@95

DAVIS-585

A dataset for interactive segmentation with simulated initial masks.

📊 2 results
📏 Metrics: NoC@90, NoC@85

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 2 results
📏 Metrics: NoC@95, NoC@90, NoC@85

SBD

The Semantic Boundaries Dataset (SBD) is a dataset for predicting pixels on the boundary of the object (as opposed to …

📊 10 results
📏 Metrics: NoC@90, NoC@85, NoC@95

International Law

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Interpretability Techniques for Deep Learning

CausalGym

SyntaxGym, adapted for interventional interpretability.

📊 7 results
📏 Metrics: Log odds-ratio (pythia-6.9b)

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 7 results
📏 Metrics: Insertion AUC score

Interpretable Machine Learning

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of …

📊 2 results
📏 Metrics: Top 1 Accuracy

Intrusion Detection

20NewsGroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 1 results
📏 Metrics: Actions Top-1 (S2)

UNSW-NB15

UNSW-NB15 is a network intrusion dataset. It contains nine different attack types, including DoS, worms, backdoors, and fuzzers. The dataset contains …

📊 1 results
📏 Metrics: AUC

Inverse Rendering

Stanford-ORB

We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide …

📊 7 results
📏 Metrics: HDR-PSNR

Inverse-Tone-Mapping

MSU HDR Video Reconstruction Benchmark

This is a dataset for a video inverse-tone-mapping task. The dataset contains various contents for the task of restoring HDR …

📊 7 results
📏 Metrics: HDR-PSNR, HDR-SSIM, HDR-VQM

Irish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

JPEG Decompression

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 6 results
📏 Metrics: FID-5K, IS, CA, PD

Jurisprudence

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

KG-to-Text Generation

AGENDA

Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper …

📊 6 results
📏 Metrics: BLEU

ENT-DESC

ENT-DESC involves retrieving abundant knowledge of various types of main entities from a large knowledge graph (KG), which makes the …

📊 1 results
📏 Metrics: BLEU

EventNarrative

EventNarrative is a knowledge graph-to-text dataset from publicly available open-world knowledge graphs. EventNarrative consists of approximately 230,000 graphs and their …

📊 8 results
📏 Metrics: BLEU, METEOR, ROUGE, BertScore, CIDEr, ChrF++

PathQuestion

Adopts two subsets of Freebase (Bollacker et al., 2008) as Knowledge Bases to construct the PathQuestion (PQ) and the PathQuestion-Large …

📊 5 results
📏 Metrics: BLEU, METEOR, ROUGE

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 5 results
📏 Metrics: BLEU, METEOR, ROUGE

WikiGraphs

WikiGraphs is a dataset of Wikipedia articles, each paired with a knowledge graph, to facilitate research in conditional text …

📊 4 results
📏 Metrics: Test perplexity, rBLEU (Test), rBLEU (Valid), rBLEU(w/title)(Test), rBLEU(w/title)(Valid)

Key Information Extraction

CORD

OCR is inevitably linked to NLP since its final output is in text. Advances in document intelligence are driving the …

📊 9 results
📏 Metrics: F1

EPHOIE

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE …

📊 1 results
📏 Metrics: Average F1

ETD500

The paper used 500 scanned Electronic Theses and Dissertation cover pages (i.e., front pages). The dataset contains several intermediate datasets, …

📊 1 results
📏 Metrics: F1 (%)

Kleister NDA

Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long …

📊 3 results
📏 Metrics: F1

SIMARA

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids …

📊 1 results
📏 Metrics: F1 (%)

SROIE

Consists of a dataset with 1000 whole scanned receipt images and annotations for the competition on scanned receipts OCR and …

📊 5 results
📏 Metrics: F1, Accuracy

Keyphrase Extraction

Inspec

Paper: Improved automatic keyword extraction given more linguistic knowledge. DOI: 10.3115/1119355.1119383

📊 3 results
📏 Metrics: F1@10

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for …

📊 7 results
📏 Metrics: Recall, F1@10

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases. Source: [KPTimes: A Large-Scale Dataset for Keyphrase Generation …

📊 4 results
📏 Metrics: Recall, F1@10

Krapivin

A dataset for benchmarking keyphrase extraction and generation techniques from long document English scientific papers. The dataset has high quality …

📊 2 results
📏 Metrics: F1@10

NUS

The dataset was constructed by first finding suitable publications and then collecting keyphrases from manual annotators. Google SOAP API was …

📊 1 results
📏 Metrics: F1@10

Keyword Extraction

Inspec

Paper: Improved automatic keyword extraction given more linguistic knowledge. DOI: 10.3115/1119355.1119383

📊 5 results
📏 Metrics: F1 score, Precision@10, Recall@10

SemEval-2017 Task-10

We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding …

📊 5 results
📏 Metrics: F1 score, Precision@10, Recall@10

Keyword Spotting

FKD

The football keyword dataset (FKD), as a new keyword spotting dataset in Persian, is collected with crowdsourcing. This dataset contains …

📊 2 results
📏 Metrics: Accuracy

TAU Urban Acoustic Scenes 2019

TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …

📊 2 results
📏 Metrics: Accuracy

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …

📊 2 results
📏 Metrics: Accuracy (%)

Kidney Function

HiRID

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand patient admissions to the Department …

📊 6 results
📏 Metrics: MAE

Kinematic Based Workflow Recognition

PETRAW

The PETRAW dataset is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 6 results
📏 Metrics: Average AD-Accuracy

Kinship Verification

KinFaceW-I

KinFaceW-I dataset contains 533 pairs of facial images of persons with a kin relation. Four different kin relations are considered …

📊 3 results
📏 Metrics: Mean Accuracy

KinFaceW-II

KinFaceW-II Dataset consists of 1000 pairs of facial images of individuals with a kin relation. This database also considers four …

📊 3 results
📏 Metrics: Mean Accuracy

Knowledge Base Population

LM-KBC 2023

A diverse set of 21 relations, each covering a different set of subject-entities and a complete list of ground truth …

📊 1 results
📏 Metrics: F1

Knowledge Distillation

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 27 results
📏 Metrics: Top-1 Accuracy (%)

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 4 results
📏 Metrics: box AP, mask AP, mAP

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 1 results
📏 Metrics: AP

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 50 results
📏 Metrics: Top-1 accuracy %, model size, CRD training setting

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: RMSE, model size

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 2 results
📏 Metrics: mAP

Knowledge Graph Completion

DBP-5L (English)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

📊 3 results
📏 Metrics: MRR

DBP-5L (Greek)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

📊 3 results
📏 Metrics: MRR

DBP-5L (French)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

📊 3 results
📏 Metrics: MRR

FB15k-237

FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …

📊 3 results
📏 Metrics: Hits@10, Hits@1, Hits@3, MR, MRR

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …

📊 2 results
📏 Metrics: Hits@3, Hits@1, Hits@10

Knowledge Graph Embedding

FB15k

The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of …

📊 1 results
📏 Metrics: MRR

Knowledge Graphs

JerichoWorld

JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive …

📊 5 results
📏 Metrics: Set accuracy

MARS (Multimodal Analogical Reasoning dataSet)

Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus …

📊 8 results
📏 Metrics: MRR

Knowledge Tracing

EdNet

A large-scale hierarchical dataset of diverse student activities collected by Santa, a multi-platform self-study solution equipped with artificial intelligence tutoring …

📊 8 results
📏 Metrics: AUC, Acc

LIDAR Semantic Segmentation

Paris-Lille-3D

Paris-Lille-3D is a benchmark for point cloud classification. The point cloud has been labeled entirely by hand with 50 …

📊 7 results
📏 Metrics: mIOU

S.MID

SeMantic InDustry (S.MID) is a dataset designed to advance the field of LiDAR semantic segmentation, specifically for robotic applications and …

📊 4 results
📏 Metrics: val mIoU

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 3 results
📏 Metrics: mIOU, val mIoU

SemanticSTF

SemanticSTF is an adverse-weather point cloud dataset that provides dense point-level annotations and allows studying 3DSS under various adverse …

📊 2 results
📏 Metrics: Mean IoU

ULS labeled data

UAV Laser Scanning data collected over neotropical forest (Paracou, French Guiana). Four flights conducted over a one-hectare plot in 2021 …

📊 1 results
📏 Metrics: Binary Accuracy, G-mean, Specificity

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 26 results
📏 Metrics: test mIoU, val mIoU

Lane Detection

CULane

CULane is a large scale challenging dataset for academic research on traffic lane detection. It is collected by cameras mounted …

📊 54 results
📏 Metrics: F1 score, mF1

CurveLanes

CurveLanes is a new benchmark lane detection dataset with 150K lane images for difficult scenarios such as curves and multi-lanes …

📊 15 results
📏 Metrics: F1 score, GFLOPs, Precision, Recall, FPS

DET

DET is a lane detection dataset that consists of the raw event data, accumulated images over 30ms and corresponding lane …

📊 1 results
📏 Metrics: Average IOU, event-based F1 score

K-Lane

KAIST-Lane (K-Lane) is the world’s first and the largest public urban road and highway lane dataset for Lidar. K-Lane has …

📊 1 results
📏 Metrics: F1

LLAMAS

The unsupervised Labeled Lane MArkerS dataset (LLAMAS) is a dataset for lane detection and segmentation. It contains over 100,000 annotated …

📊 10 results
📏 Metrics: F1, mF1

OpenLane-V2 val

OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the …

📊 1 results
📏 Metrics: mAP

TuSimple

The TuSimple dataset consists of 6,408 road images on US highways. The image resolution is 1280×720. The dataset is …

📊 38 results
📏 Metrics: Accuracy, F1 score

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 3 results
📏 Metrics: IoU, F1 score

Language Identification

Nordic Language Identification

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine-learning …

📊 1 results
📏 Metrics: Accuracy

OpenSubtitles

OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles …

📊 1 results
📏 Metrics: Accuracy

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The …

📊 1 results
📏 Metrics: Accuracy

VOXLINGUA107

Language Identification Dataset

📊 2 results
📏 Metrics: Error rate

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …

📊 1 results
📏 Metrics: Accuracy

Language Modelling

2000 HUB5 English

2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English …

📊 1 results
📏 Metrics: 10-stage average accuracy

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: BPB

Books3

The Books3 dataset emerged as part of a broader effort to train AI models for natural language understanding and generation. …

📊 1 results
📏 Metrics: BPB

C4

C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. …

📊 9 results
📏 Metrics: Perplexity, TPUv3 Hours, Steps

Curation Corpus

The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. Source: …

📊 1 results
📏 Metrics: BPB

FreeLaw

Free Law Project is a leading nonprofit organization that aims to make the legal ecosystem more equitable and competitive through …

📊 1 results
📏 Metrics: BPB

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwiki8, is a byte-level dataset consisting of the first 100 million bytes …

📊 18 results
📏 Metrics: Bit per Character (BPC), Number of params

LAMBADA

The LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) benchmark is an open-ended cloze task which consists of about …

📊 34 results
📏 Metrics: Accuracy, Perplexity

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 12 results
📏 Metrics: eval_perplexity, eval_loss, parameters

PhilPapers

PhilPapers is a remarkable resource for the philosophical community. 1. PhilPapers: It's an …

📊 1 results
📏 Metrics: BPB

PubMed Cognitive Control Abstracts

A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.

📊 1 results
📏 Metrics: BPB

SALMon

The SALMon dataset and benchmark was introduced in the paper "A Suite for Acoustic Language Model Evaluation", with the goal …

📊 8 results
📏 Metrics: Sentiment Consistency, Speaker Consistency, Gender Consistency, Background (Domain) Consistency, Background (Random) Consistency, Room Consistency, Sentiment Alignment, Background Alignment

Text8

📊 22 results
📏 Metrics: Bit per Character (BPC), Number of params

The Pile

The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets …

📊 39 results
📏 Metrics: Bits per byte, Test perplexity

VietMed

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …

📊 2 results
📏 Metrics: PPL

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …

📊 3 results
📏 Metrics: Perplexity

WikiText-103

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good …

📊 83 results
📏 Metrics: Test perplexity, Validation perplexity, Number of params

WikiText-2

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good …

📊 34 results
📏 Metrics: Test perplexity, Validation perplexity, Number of params

language-modeling-recommendation

This is the Big-Bench version of our language-based movie recommendation dataset (https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/movie_recommendation). GPT-2 has 48.8% accuracy; chance is 25%.

📊 1 results
📏 Metrics: 1:1 Accuracy

Language-Based Temporal Localization

VidChapters-7M

VidChapters-7M is a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online …

📊 1 results
📏 Metrics: [email protected], R@10s

Latvian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Length-of-Stay prediction

Clinical Admission Notes from MIMIC-III

This dataset is created from MIMIC-III (Medical Information Mart for Intensive Care III) and contains simulated patient admission notes. The …

📊 3 results
📏 Metrics: AUROC

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 5 results
📏 Metrics: Accuracy (LOS>3 Days), Accuracy (LOS>7 Days)

Li-ion State of Health Estimation

NASA Li-ion Dataset

Experiments on Li-Ion batteries. Charging and discharging at different temperatures. Records the impedance as the damage criterion. The data set …

📊 1 results
📏 Metrics: mean absolute error

Lidar Scene Completion

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 4 results
📏 Metrics: Chamfer Distance, JSD 3D, JSD BEV, Voxel IoU 0.1m, Voxel IoU 0.2m, Voxel IoU 0.5m

Lifelike 3D Human Generation

THuman2.0 Dataset

THuman2.0 Dataset contains 500 high-quality human scans captured by a dense DSLR rig. For each scan, we provide the 3D …

📊 6 results
📏 Metrics: CLIP Similarity, SSIM, LPIPS, PSNR

Line Detection

NKL

NKL (short for NanKai Lines) is a dataset for semantic line detection. Semantic lines are meaningful line structures that outline …

📊 2 results
📏 Metrics: F_measure (EA)

SEL

The semantic line (SEL) dataset contains 1,750 outdoor images in total, which are split into 1,575 training and 175 testing …

📊 2 results
📏 Metrics: AUC_F, HIoU

Linguistic Acceptability

CoLA

The Corpus of Linguistic Acceptability (CoLA) consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by …

📊 42 results
📏 Metrics: Accuracy, MCC

DaLAJ

DaLAJ 1.0, a dataset for Linguistic Acceptability Judgments for Swedish, comprising 9,596 sentences in its first version; and the initial …

📊 1 results
📏 Metrics: Accuracy, MCC

ItaCoLA

ItaCoLA is a corpus for monolingual and cross-lingual acceptability judgments which contains almost 10,000 sentences with acceptability judgments.

📊 4 results
📏 Metrics: MCC, Accuracy

RuCoLA

The Russian Corpus of Linguistic Acceptability (RuCoLA) is built from the ground up under the well-established binary LA approach. RuCoLA …

📊 9 results
📏 Metrics: MCC, Accuracy

Link Prediction

ACM

The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB, which are divided into three classes (Database, …

📊 1 results
📏 Metrics: AP, AUC

AbstRCT - Neoplasm

The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated …

📊 1 results
📏 Metrics: F1

Aristo-v4

The Aristo Tuple KB contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, …

📊 1 results
📏 Metrics: Hits@1, Hits@10, Hits@3, MRR

CDCP

The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of …

📊 1 results
📏 Metrics: F1

COLLAB

COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators …

📊 1 results
📏 Metrics: Hits

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …

📊 12 results
📏 Metrics: AUC, AP, Accuracy, ACC

CoDEx Large

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …

📊 6 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

CoDEx Medium

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …

📊 7 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

CoDEx Small

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …

📊 6 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 11 results
📏 Metrics: AUC, AP, Accuracy, ACC

DBLP

The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …

📊 3 results
📏 Metrics: AUC, AP

DRI Corpus

The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of …

📊 1 results
📏 Metrics: F1

Decagon

Bio-decagon is a dataset for polypharmacy side effect identification problem framed as a multirelational link prediction problem in a two-layer …

📊 2 results
📏 Metrics: AUROC, AUPRC, mAP@50

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 2 results
📏 Metrics: AUC

FB122

📊 4 results
📏 Metrics: HITS@3, Hits@5, Hits@10, MRR

FB15k

The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of …

📊 10 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10, MR, MRR raw, Hits@5

FB15k-237

FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …

📊 70 results
📏 Metrics: Hits@1, Hits@3, Hits@10, MRR, MR, training time (s), Hit@1, Hit@10

GDELT

The GDELT Project is a remarkable initiative that monitors our world by analyzing global news from various sources. Here are …

📊 10 results
📏 Metrics: MRR

GO21

GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate …

📊 1 results
📏 Metrics: Hit@1, Hits@10, Hits@3, MRR

KG20C

KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a …

📊 1 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

NELL-995

NELL-995 KG Completion Dataset

📊 3 results
📏 Metrics: Hits@1, Hits@10, MRR, Mean AP, HITS@3

PPI

protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to …

📊 1 results
📏 Metrics: AP, AUC, Accuracy

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 13 results
📏 Metrics: AUC, AP, Accuracy, ACC

SINS

SINS is a database of continuous real-life audio recordings in a home environment. The home is a vacation home and …

📊 1 results
📏 Metrics: Scaled time-delay embeddings

TSP/HCP Benchmark set

This is a benchmark set for Traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. …

📊 4 results
📏 Metrics: F1

UMLS

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding …

📊 9 results
📏 Metrics: Hits@10, MR

WN18

The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triplets. It was found …

📊 33 results
📏 Metrics: Hits@10, Hits@3, Hits@1, MRR, MR, training time (s)

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …

📊 69 results
📏 Metrics: Hits@10, Hits@3, Hits@1, MRR, MR

Wiki

📊 1 results
📏 Metrics: AUC

Wikidata5M

Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. …

📊 12 results
📏 Metrics: MRR, Hits@10, Hits@1, Hits@3

YAGO3-10

YAGO3-10 is a benchmark dataset for knowledge base completion. It is a subset of YAGO3 (which itself is an extension of …

📊 17 results
📏 Metrics: Hits@1, Hits@3, Hits@10, MRR, MR

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

📊 8 results
📏 Metrics: HR@10, AUC, nDCG@10

Link Sign Prediction

Epinions

The Epinions dataset is built from a who-trust-whom online social network of a general consumer review site Epinions.com. Members of …

📊 1 results
📏 Metrics: AUC, Accuracy, Macro-F1

Slashdot

The Slashdot dataset is a relational dataset obtained from Slashdot. Slashdot is a technology-related news website known for its specific …

📊 1 results
📏 Metrics: AUC, Accuracy, Macro-F1

Lip Reading

LRW

The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual database that contains 500 different words from over 1,000 …

📊 1 results
📏 Metrics: WER

Lip to Speech Synthesis

LRW

The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual database that contains 500 different words from over 1,000 …

📊 1 results
📏 Metrics: ESTOI, PESQ, STOI

Lipreading

CAS-VSR-W1k (LRW-1000)

LRW-1000 has been renamed as CAS-VSR-W1k. It is a naturally-distributed large-scale benchmark for word-level lipreading in the wild, including 1000 …

📊 9 results
📏 Metrics: Top-1 Accuracy

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 18 results
📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 20 results
📏 Metrics: Word Error Rate (WER)

Local Distortion

DocUNet

Various documents dataset. Each of the 65 documents includes scanned ground truth images, both hard and easy distorted photos, and …

📊 3 results
📏 Metrics: LD

Logical Fallacies

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Logical Reasoning

LingOly

This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems …

📊 11 results
📏 Metrics: Delta_NoContext, Exact Match Accuracy

RuWorldTree

RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts. Motivation The …

📊 4 results
📏 Metrics: Accuracy

Winograd Automatic

The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning. Motivation The dataset …

📊 4 results
📏 Metrics: Accuracy

Long Video Retrieval (Background Removed)

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 6 results
📏 Metrics: Cap. Avg. R@1, Cap. Avg. R@5, Cap. Avg. R@10, DTW R@1, DTW R@5, DTW R@10, OTAM R@1, OTAM R@5, OTAM R@10

Long-Context Understanding

L-Eval

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a …

📊 4 results
📏 Metrics: Average Score

LongBench

📊 3 results
📏 Metrics: Average Score

MMNeedle

We introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we …

📊 11 results
📏 Metrics: 1 Image, 4*4 Stitching, Exact Accuracy; 1 Image, 8*8 Stitching, Exact Accuracy; 1 Image, 2*2 Stitching, Exact Accuracy; 10 Images, 1*1 Stitching, Exact Accuracy; 10 Images, 2*2 Stitching, Exact Accuracy; 10 Images, 4*4 Stitching, Exact Accuracy; 10 Images, 8*8 Stitching, Exact Accuracy

Long-tail Learning

COCO-MLT

The COCO-MLT is created from MS COCO-2017, containing 1,909 images from 80 classes. The maximum of training number per class …

📊 11 results
📏 Metrics: Average mAP

EGTEA

Extended GTEA Gaze+ (EGTEA Gaze+) is a large-scale dataset for FPV actions and gaze. It subsumes GTEA Gaze+ and comes …

📊 3 results
📏 Metrics: Average Precision, Average Recall

ImageNet-LT

ImageNet Long-Tailed is a subset of the ImageNet dataset consisting of 115.8K images from 1000 categories, with maximally 1280 images per …

📊 65 results
📏 Metrics: Top-1 Accuracy

Lot-insts

LoT-insts contains over 25k classes whose frequencies are naturally long-tail distributed. Its test set from four different subsets: many-, medium-, …

📊 1 results
📏 Metrics: Macro-F1

MIMIC-CXR-LT

MIMIC-CXR-LT. We construct a single-label, long-tailed version of MIMIC-CXR in a similar manner. MIMIC-CXR is a multi-label classification dataset with …

📊 15 results
📏 Metrics: Balanced Accuracy

NIH-CXR-LT

NIH-CXR-LT. NIH ChestXRay14 contains over 100,000 chest X-rays labeled with 14 pathologies, plus a “No Findings” class. We construct a …

📊 15 results
📏 Metrics: Balanced Accuracy

Places-LT

Places-LT has an imbalanced training set with 62,500 images for 365 classes from Places-2. The class frequencies follow a natural …

📊 28 results
📏 Metrics: Top-1 Accuracy, Top 1 Accuracy

VOC-MLT

We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with …

📊 11 results
📏 Metrics: Average mAP

mini-ImageNet-LT

mini-ImageNet was proposed in "Matching Networks for One Shot Learning" for few-shot learning evaluation, in an attempt to have a dataset …

📊 1 results
📏 Metrics: Error Rate

Lung Nodule Classification

LIDC-IDRI

The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1010 lung …

📊 7 results
📏 Metrics: Accuracy, Acc, AUC, Accuracy(10-fold), Recall/ Sensitivity, Precision, F1 Score

Lung Sound Classification

ICBHI Respiratory Sound Database

The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics …

📊 1 results
📏 Metrics: Accuracy

MMLU

MMLU-Pro

The MMLU-Pro dataset is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark. It's designed to be more …

📊 1 results
📏 Metrics: 0-shot MRR

MMR total

MRR-Benchmark

Multi-Modal Reading (MMR) Benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, …

📊 13 results
📏 Metrics: Total Column Score

MS/MS spectrum simulation

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: De novo …

📊 4 results
📏 Metrics: Cosine Similarity, Jensen-Shannon Similarity, Hit Rate @ 1, Hit Rate @ 5, Hit Rate @ 20

MS/MS spectrum simulation (bonus chemical formulae)

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: De novo …

📊 4 results
📏 Metrics: Hit Rate @ 1, Hit Rate @ 5, Hit Rate @ 20

Machine Translation

ACES

ACES is a dataset consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based …

📊 21 results
📏 Metrics: Score

Alexa Point of View

The Alexa Point of View dataset is a point-of-view conversion dataset, a parallel corpus of messages spoken to a …

📊 1 results
📏 Metrics: BLEU

FLoRes-200

FLoRes-200 doubles the existing language coverage of FLoRes-101. Given the nature of the new languages, which have less standardization and …

📊 5 results
📏 Metrics: BLEU

IWSLT 2017

The IWSLT 2017 translation dataset.

📊 1 results
📏 Metrics: BLEU score

Itihasa

Itihasa is a large-scale corpus for Sanskrit to English translation containing 93,000 pairs of Sanskrit shlokas and their English translations. …

📊 2 results
📏 Metrics: SacreBLEU

Multi Lingual Bug Reports

The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, …

📊 1 results
📏 Metrics: BERTScore

OpenSubtitles

OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles …

📊 1 results
📏 Metrics: BLEU score, METEOR

Malware Classification

Microsoft Malware Classification Challenge

The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 …

📊 3 results
📏 Metrics: Accuracy (10-fold), LogLoss, Macro F1 (10-fold), Accuracy (5-fold), F1 score (5-fold), Accuracy

Management

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Marketing

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Math Information Retrieval

ARQMath

The goal of ARQMath is to advance techniques for mathematical information retrieval, in particular, retrieving answers to mathematical questions (Task …

📊 2 results
📏 Metrics: P@10, MAP, NDCG, bpref

Math Word Problem Solving

ALG514

514 algebra word problems and associated equation systems gathered from Algebra.com.

📊 1 results
📏 Metrics: Accuracy (%)

GSM-Plus

By perturbing the widely used GSM8K dataset, an adversarial dataset for grade-school math called GSM-Plus is created. Motivated by the …

📊 1 results
📏 Metrics: 1:1 Accuracy

MATH

MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution …

📊 132 results
📏 Metrics: Accuracy, Parameters (Billions)

MAWPS

MAWPS is an online repository of Math Word Problems, to provide a unified testbed to evaluate different algorithms. MAWPS allows …

📊 15 results
📏 Metrics: Accuracy (%)

Math23K

Math23K is a dataset created for math word problem solving that contains 23,162 Chinese problems crawled from the Internet. Refer …

📊 12 results
📏 Metrics: Accuracy (5-fold), Accuracy (training-test), weakly-supervised

MathQA

MathQA significantly enhances the AQuA dataset with fully-specified operational programs. Source: [MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based …

📊 5 results
📏 Metrics: Answer Accuracy

ParaMAWPS

This repository contains the code, data, and models of the paper titled "Math Word Problem Solving by Generating Linguistic Variants …

📊 6 results
📏 Metrics: Accuracy (%)

SVAMP

A challenge set for elementary-level Math Word Problems (MWP). An MWP consists of a short Natural Language narrative that describes …

📊 23 results
📏 Metrics: Execution Accuracy, Accuracy

Mathematical Question Answering

GeoS

GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every …

📊 1 results
📏 Metrics: Accuracy (%)

Geometry3K

A new large-scale geometry problem-solving dataset: 3,002 multi-choice geometry problems, dense annotations in formal language for the diagrams …

📊 8 results
📏 Metrics: Accuracy (%)

Mathematical Reasoning

GeoQA

GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the …

📊 2 results
📏 Metrics: Accuracy (%)

PGPS9K

A new large-scale plane geometry problem solving dataset called PGPS9K, labeled with both fine-grained diagram annotations and interpretable solution programs.

📊 6 results
📏 Metrics: Completion accuracy

Medical Code Prediction

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 15 results
📏 Metrics: Micro-F1, Macro-F1, Micro-AUC, Macro-AUC, Precision@5, Precision@8, Precision@15, mAP

MIMIC-IV ICD-10

MIMIC-IV ICD-10 contains 122,279 discharge summaries—free-text medical documents—annotated with ICD-10 diagnosis and procedure codes. It contains data for patients admitted …

📊 6 results
📏 Metrics: Precision@8, F1 Macro, F1 Micro, Precision@15, R-Prec, mAP, Exact Match Ratio, AUC Macro, AUC Micro

MIMIC-IV ICD-9

MIMIC-IV ICD-9 contains 209,326 discharge summaries—free-text medical documents—annotated with ICD-9 diagnosis and procedure codes. It contains data for patients admitted …

📊 6 results
📏 Metrics: AUC Macro, AUC Micro, Exact Match Ratio, F1 Macro, F1 Micro, Precision@15, Precision@8, R-Prec, mAP

MIMIC-IV-ICD-10-full

The MIMIC-IV-ICD10-full dataset, including all occurring labels.

📊 5 results
📏 Metrics: Macro-AUC, Micro-AUC, Macro-F1, Micro-F1, Precision@8

MIMIC-IV-ICD10-top50

The MIMIC-IV-ICD10 dataset, featuring the top 50 most frequently occurring labels.

📊 5 results
📏 Metrics: F1 (micro), F1 (macro), AUC (Micro), AUC (Macro), Precision@5

MIMIC-IV-ICD9-full

The MIMIC-IV-ICD9 dataset, including all occurring labels.

📊 5 results
📏 Metrics: Macro AUC, Micro AUC, F1 Macro, F1 Micro, Precision@8

MIMIC-IV-ICD9-top50

The MIMIC-IV-ICD9 dataset, featuring the top 50 most frequently occurring labels.

📊 5 results
📏 Metrics: AUC Macro, AUC Micro, F1 Macro, F1 Micro, Precision @5

Medical Diagnosis

BreastDICOM4

Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the …

📊 1 results
📏 Metrics: Average Precision, Average Recall

Clinical Admission Notes from MIMIC-III

This dataset is created from MIMIC-III (Medical Information Mart for Intensive Care III) and contains simulated patient admission notes. The …

📊 2 results
📏 Metrics: AUROC

Medical Genetics

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Medical Image Classification

COVIDGR

Under a close collaboration with an expert radiologist team of the Hospital Universitario San Cecilio, the COVIDGR-1.0 dataset of patients' …

📊 1 results
📏 Metrics: Accuracy

CheXphoto

CheXphoto is a competition for x-ray interpretation based on a new dataset of naturally and synthetically perturbed chest x-rays hosted …

📊 1 results
📏 Metrics: Mean AUC

IDRiD

Indian Diabetic Retinopathy Image Dataset (IDRiD) dataset consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a …

📊 1 results
📏 Metrics: Accuracy, Accuracy (%)

ISIC 2020 Challenge Dataset

The dataset contains 33,126 dermoscopic training images of unique benign and malignant skin lesions from over 2,000 patients. Each image …

📊 1 results
📏 Metrics: AUC

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 2 results
📏 Metrics: GFLOPs, Top 1 Accuracy

NCT-CRC-HE-100K

The NCT-CRC-HE-100K dataset is a set of 100,000 non-overlapping image patches extracted from 86 H&E stained human cancer tissue slides …

📊 7 results
📏 Metrics: Accuracy (%), F1-Score, Precision, Specificity

Medical Image Enhancement

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, SARTAJ dataset, and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: Average PSNR

LoDoPaB-CT

LoDoPaB-CT is a dataset of computed tomography images and simulated low-dose measurements. It contains over 40,000 scan slices from around …

📊 1 results
📏 Metrics: SSIM

Medical Image Generation

ACDC

The goal of the Automated Cardiac Diagnosis Challenge (ACDC) challenge is to: - compare the performance of automatic methods on …

📊 3 results
📏 Metrics: FID

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Frechet Inception Distance

ChestX-ray14

ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …

📊 1 results
📏 Metrics: FID

Medical Image Registration

IXI

IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for …

📊 7 results
📏 Metrics: DSC

OASIS

A dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. Source: [OASIS: …

📊 8 results
📏 Metrics: DSC, val dsc

SR-Reg

SR-Reg is a brain MR-CT registration dataset, deriving from SynthRAD 2023 (https://synthrad2023.grand-challenge.org/). This dataset contains 180 subjects preprocessed images, and …

📊 1 results
📏 Metrics: Dice (Average)

Medical Image Segmentation

2018 Data Science Bowl

This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and …

📊 10 results
📏 Metrics: Dice, mIoU, Recall, Precision, AHD95, ASD

ACDC

The goal of the Automated Cardiac Diagnosis Challenge (ACDC) challenge is to: - compare the performance of automatic methods on …

📊 6 results
📏 Metrics: Dice Score

AMOS

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the …

📊 1 results
📏 Metrics: Average Dice

BKAI-IGH NeoPolyp-Small

This dataset contains 1200 images (1000 WLI images and 200 FICE images) with fine-grained segmentation annotations. The training set consists …

📊 9 results
📏 Metrics: Average Dice, mIoU, Average Dice (5-folds), MAE (5-folds), mIoU (5-folds)

Brain US

This brain anatomy segmentation dataset has 1300 2D US scans for training and 329 for testing. A total of 1629 …

📊 3 results
📏 Metrics: F1, IoU

CHASE_DB1

CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels …

📊 3 results
📏 Metrics: DSC

CVC-ClinicDB

CVC-ClinicDB is an open-access dataset of 612 images with a resolution of 384×288 from 31 colonoscopy sequences. It is used for …

📊 40 results
📏 Metrics: mean Dice, Average MAE, S-Measure, mIoU, max E-Measure, F-measure

Cell

The CELL benchmark consists of fluorescence microscopy images of cells. Source: Multi-Domain Adversarial Learning. Image source: https://arxiv.org/pdf/1903.09239v1.pdf

📊 1 results
📏 Metrics: IoU

DRIVE

The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a …

📊 4 results
📏 Metrics: mIoU, F1 score, Recall, Specificity, Precision

Electron Microscopy Dataset

The dataset available for download on this webpage represents a 5x5x5µm section taken from the CA1 hippocampus region of the …

📊 1 results
📏 Metrics: AHD95, ASD, Dice, IoU

Endotect Polyp Segmentation Challenge Dataset

A challenge that consists of three tasks, each targeting a different requirement for in-clinic use. The first task involves classifying …

📊 2 results
📏 Metrics: DSC, mIoU, FPS

Extended Task10_Colon Medical Decathlon

A dataset of abdominal CT studies in NIfTI format from the open-source medical data repository Medical Decathlon was utilized. To …

📊 1 results
📏 Metrics: Average Dice

GlaS

The dataset used in this challenge consists of 165 images derived from 16 H&E stained histological sections of stage T3 …

📊 9 results
📏 Metrics: F1, IoU, Dice

Kvasir-Instrument

Consists of annotated frames containing GI procedure tools such as snares, balloons and biopsy forceps. Besides the images, …

📊 2 results
📏 Metrics: DSC, Dice Score, Intersection over Union

Kvasir-SEG

Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and …

📊 51 results
📏 Metrics: mean Dice, Average MAE, S-Measure, max E-Measure, mIoU, FPS, F-measure, Precision, Recall

KvasirCapsule-SEG

A video capsule endoscopy dataset for polyp segmentation. The dataset can be downloaded from: https://www.kaggle.com/debeshjha1/kvasircapsuleseg https://www.dropbox.com/home/KvasirCapsule-SEG …

📊 2 results
📏 Metrics: DSC, mIoU

MICCAI 2015 Head and Neck Challenge

This database is provided and maintained by Dr. Gregory C Sharp (Harvard Medical School – MGH, Boston) and his group. …

📊 1 results
📏 Metrics: Dice

MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge

Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans were randomly selected from a combination of an ongoing …

📊 6 results
📏 Metrics: Avg DSC, Avg HD

Medical Segmentation Decathlon

The Medical Segmentation Decathlon is a collection of medical image segmentation datasets. It contains a total of 2,633 three-dimensional images …

📊 5 results
📏 Metrics: Dice (Average), NSD

Medico automatic polyp segmentation challenge (dataset)

The “Medico automatic polyp segmentation challenge” aims to develop computer-aided diagnosis systems for automatic polyp segmentation to detect all types …

📊 2 results
📏 Metrics: DSC, mIoU, Recall, Precision, FPS

MoNuSAC

Different types of cells play a vital role in the initiation, development, invasion, metastasis and therapeutic response of tumors of …

📊 1 results
📏 Metrics: Dice, IoU

MoNuSeg

The dataset for this challenge was obtained by carefully annotating tissue images of several patients with tumors of different organs …

📊 14 results
📏 Metrics: F1, IoU, AHD95, ASD, mIoU

MosMedData

MosMedData contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A …

📊 1 results
📏 Metrics: Average Dice

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 3 results
📏 Metrics: Dice, Jaccard Index

ROBUST-MIS

The ROBUST-MIS dataset was made available to support the Robust Medical Instrument Segmentation (ROBUST-MIS) Challenge 2019, part of the Endoscopic …

📊 3 results
📏 Metrics: DSC, mIoU, FPS

TNBC

Involves a large number of annotated cells, including normal epithelial and myoepithelial breast cells (localized in ducts and lobules), …

📊 1 results
📏 Metrics: AHD95, Dice, IoU

Medical Procedure

Clinical Admission Notes from MIMIC-III

This dataset is created from MIMIC-III (Medical Information Mart for Intensive Care III) and contains simulated patient admission notes. The …

📊 3 results
📏 Metrics: AUROC

Medical Relation Extraction

CMeIE

Chinese Medical Information Extraction, a dataset that is also released in CHIP2020, is used for CMeIE task. The task is …

📊 1 results
📏 Metrics: Micro F1

Medical Report Generation

HistGen WSI-Report Dataset

This dataset is composed of 7,753 pairs of whole slide images and their corresponding diagnostic reports, extracted from the TCGA …

📊 1 results
📏 Metrics: BLEU-4

IU X-Ray

IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The …

📊 1 results
📏 Metrics: BLEU-4, BLEU-1, BLEU-2, BLEU-3, CIDEr, METEOR, ROUGE

MIMIC-CXR

MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …

📊 2 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr, Example-F1-14, Example-Precision-14, Example-Recall-14, METEOR, Micro-F1-5, Micro-Precision-5, Micro-Recall-5, ROUGE-L, F1 RadGraph

Meeting Summarization

AMI Meeting Corpus

The AMI Meeting Corpus is a multi-modal data set comprising 100 hours of meeting recordings. It has been meticulously curated …

📊 1 results
📏 Metrics: ROUGE-1 F1

ICSI Meeting Corpus

ICSI Meeting Corpus in JSON format.

📊 1 results
📏 Metrics: ROUGE-1 F1

Meme Classification

Hateful Memes

The Hateful Memes data set is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new …

📊 17 results
📏 Metrics: ROC-AUC, Accuracy

MultiOFF

Introduced in Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text.

📊 4 results
📏 Metrics: Accuracy, F1

Tamil Memes

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among …

📊 2 results
📏 Metrics: Micro-F1

Memex Question Answering

MemexQA

A large, realistic multimodal dataset consisting of real personal photos and crowd-sourced questions/answers. Source: MemexQA: Visual Memex Question Answering

📊 1 results
📏 Metrics: Accuracy

Meter Reading

Copel-AMR

This dataset contains 12,500 meter images acquired in the field by the employees of the Energy Company of Paraná (Copel), …

📊 2 results
📏 Metrics: Rank-1 Recognition Rate

UFPR-ADMR-v1

This dataset contains 2,000 dial meter images obtained on-site by employees of the Energy Company of Paraná (Copel), which serves …

📊 11 results
📏 Metrics: Rank-1 Recognition Rate

UFPR-AMR

This dataset contains 2,000 images taken from inside a warehouse of the Energy Company of Paraná (Copel), which directly serves …

📊 3 results
📏 Metrics: Rank-1 Recognition Rate

Metric Learning

CARS196

CARS196 is composed of 16,185 car images of 196 classes.

📊 36 results
📏 Metrics: R@1

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 2 results
📏 Metrics: R@1

DyML-Animal

DyML-Animal is based on animal images selected from ImageNet-5K [1]. It has 5 semantic scales (i.e., classes, order, family, genus, …

📊 2 results
📏 Metrics: Average-mAP

DyML-Product

DyML-Product is derived from iMaterialist-2019, a hierarchical online product dataset. The original iMaterialist-2019 offers up to 4 levels of hierarchical …

📊 2 results
📏 Metrics: Average-mAP

DyML-Vehicle

DyML-Vehicle merges two vehicle re-ID datasets, PKU VehicleID [1] and VERI-Wild [1]. Since these two datasets have only annotations on the …

📊 2 results
📏 Metrics: Average-mAP

In-Shop

In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop Clothes Retrieval. This is a large subset of DeepFashion, containing large …

📊 15 results
📏 Metrics: R@1

Stanford Online Products

Stanford Online Products (SOP) dataset has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are split …

📊 33 results
📏 Metrics: R@1

Micro-gesture Recognition

iMiGUE

iMiGUE is a dataset for emotional artificial intelligence research: identity-free video dataset for Micro-Gesture Understanding and Emotion analysis (iMiGUE). Different …

📊 1 results
📏 Metrics: Top 1 Accuracy, Top 5 Accuracy

Model Compression

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 12 results
📏 Metrics: Top-1

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 2 results
📏 Metrics: Accuracy

Model extraction

UML Classes With Specs

Repository for UML-English data. This repository contains the data used for "Extraction of UML Class Diagrams from Natural Language …

📊 1 results
📏 Metrics: Exact Match

Molecular Property Prediction

MUV

The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …

📊 2 results
📏 Metrics: ROC-AUC

MoleculeNet

MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and …

📊 5 results
📏 Metrics: AUC

PCBA

PCBA dataset 11 is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening …

📊 1 results
📏 Metrics: ROC-AUC

QM7

QM7 dataset is a subset of the GDB-13 database. GDB-13 contains nearly 1 billion stable and synthetically accessible organic molecules. …

📊 7 results
📏 Metrics: MAE

QM8

QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited state …

📊 7 results
📏 Metrics: MAE

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 7 results
📏 Metrics: MAE

SIDER

SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …

📊 16 results
📏 Metrics: ROC-AUC

Tox21

The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …

📊 17 results
📏 Metrics: ROC-AUC

clintox

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …

📊 18 results
📏 Metrics: ROC-AUC, Molecules (M)

Molecule Captioning

ChEBI-20

The dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 28 results
📏 Metrics: BLEU-2, BLEU-4, METEOR, ROUGE-1, ROUGE-2, ROUGE-L, Text2Mol

L+M-24

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due …

📊 3 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

Molecule retrieval from MS/MS spectrum

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: - De novo …

📊 8 results
📏 Metrics: Hit rate @ 1, Hit rate @ 5, Hit rate @ 20, MCES @ 1

Molecule retrieval from MS/MS spectrum (bonus chemical formulae)

MassSpecGym

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: - De novo …

📊 8 results
📏 Metrics: Hit rate @ 1, Hit rate @ 5, Hit rate @ 20, MCES @ 1

Moment Retrieval

Charades-STA

Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations. Source: [TALL: Temporal Activity Localization …

📊 25 results
📏 Metrics: R@1 IoU=0.5, R@1 IoU=0.7, R@5 IoU=0.5, R@5 IoU=0.7, R@1 IoU=0.3, mIoU

QVHighlights

The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language …

📊 31 results
📏 Metrics: mAP, R@1 IoU=0.5, R@1 IoU=0.7, mAP@0.5, mAP@0.75

Morpheme Segmentation

UniMorph 4.0

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology in the world’s languages. …

📊 3 results
📏 Metrics: macro avg (subtask 1), f1 macro avg (subtask 2), lev dist (subtask 2)

Mortality Prediction

Clinical Admission Notes from MIMIC-III

This dataset is created from MIMIC-III (Medical Information Mart for Intensive Care III) and contains simulated patient admission notes. The …

📊 3 results
📏 Metrics: AUROC

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 13 results
📏 Metrics: F1 score, Precision, Recall, Accuracy

Motion Captioning

HumanML3D

HumanML3D is a 3D human motion-language dataset that originates from a combination of the HumanAct12 and AMASS datasets. It covers a …

📊 4 results
📏 Metrics: BLEU-4, BERTScore

KIT Motion-Language

The KIT Motion-Language is a dataset linking human motion and natural language. Source: The KIT Motion-Language Dataset

📊 3 results
📏 Metrics: BLEU-4, BERTScore

Motion Detection

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 2 results
📏 Metrics: F1 (%)

Motion Planning

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 1 results
📏 Metrics: Collision, L2

Motion Segmentation

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 5 results
📏 Metrics: Accuracy

Hopkins155

The Hopkins 155 dataset consists of 156 video sequences of two or three motions. Each video sequence motion corresponds to …

📊 4 results
📏 Metrics: Classification Error

KT3DMoSeg

Please find more details of this dataset at https://alex-xun-xu.github.io/ProjectPage/CVPR_18/index.html . 3D motion segmentation has been the key problem in computer vision …

📊 1 results
📏 Metrics: Error

Motion Synthesis

AIOZ-GDANCE

AIOZ-GDANCE comprises 16.7 hours of whole-body motion and music audio of group dancing. The duration of each video in our …

📊 4 results
📏 Metrics: FID, MMC, GenDiv, PFC, GMR, GMC, TIF

AIST++

AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance …

📊 12 results
📏 Metrics: FID, Beat alignment score

BRACE

BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task: - strong music-dance correlation - …

📊 3 results
📏 Metrics: Frechet Inception Distance, Beat alignment score, Beat DTW cost, Footwork average, Powermove average, Toprock average

FineDance

📊 7 results
📏 Metrics: fid_k, BAS

HumanAct12

HumanAct12 is a new 3D human motion dataset adopted from the polar image and 3D pose dataset PHSPD, with proper …

📊 2 results
📏 Metrics: Accuracy, FID, Multimodality

HumanML3D

HumanML3D is a 3D human motion-language dataset that originates from a combination of the HumanAct12 and AMASS datasets. It covers a …

📊 35 results
📏 Metrics: FID, R Precision Top3, Diversity, Multimodality

Inter-X

Inter-X is a large-scale dataset containing ~11K interaction sequences, more than 8.1M frames and 34K fine-grained human textual descriptions.

📊 5 results
📏 Metrics: FID, R-Precision Top3, MMDist, MModality

InterHuman

InterHuman is a multimodal dataset. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal …

📊 8 results
📏 Metrics: FID, R-Precision Top3, MMDist, MModality

KIT Motion-Language

The KIT Motion-Language is a dataset linking human motion and natural language. Source: The KIT Motion-Language Dataset

📊 29 results
📏 Metrics: FID, R Precision Top3, Diversity, Multimodality

LaFAN1

Ubisoft La Forge Animation Dataset ("LAFAN1"): the dataset and accompanying code for the SIGGRAPH 2020 paper …

📊 4 results
📏 Metrics: L2Q@5, L2Q@15, L2Q@30, L2P@5, L2P@15, L2P@30, NPSS@5, NPSS@15, NPSS@30

Motion-X

Motion-X is a large-scale 3D expressive whole-body motion dataset, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering …

📊 4 results
📏 Metrics: FID, TMR-R-Precision Top3, TMR-Matching Score, MModality, Diversity

TMD

The Text-Music-Dance (TMD) dataset establishes a pioneering benchmark comprising 2,153 text-music-motion pairs. Dance motions and corresponding text annotations are sourced …

📊 1 results
📏 Metrics: FID, BAS, MModality, MMDist

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

📊 1 results
📏 Metrics: Mean Opinion Score

Multi-Label Classification

CheXpert

The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is …

📊 11 results
📏 Metrics: AVERAGE AUC ON 14 LABEL, NUM RADS BELOW CURVE

ChestX-ray14

ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …

📊 4 results
📏 Metrics: Average AUC on 14 label, Macro F1

MIMIC-CXR

MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …

📊 1 results
📏 Metrics: Average AUC on 14 label

MLRSNet

MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It provides different perspectives of …

📊 2 results
📏 Metrics: F1-score

MRNet

The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. The dataset contains 1,104 (80.6%) …

📊 1 results
📏 Metrics: Average AUC, AUC on Abnormality (ABN), AUC on ACL Tear (ACL), AUC on Meniscus Tear (MEN), Average Accuracy, Accuracy on Abnormality (ABN), Accuracy on ACL Tear (ACL), Accuracy on Meniscus Tear (MEN)

NUS-WIDE

The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated …

📊 9 results
📏 Metrics: MAP

OpenImages-v6

OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. …

📊 4 results
📏 Metrics: mAP

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 16 results
📏 Metrics: mAP

Multi-Label Classification Of Biomedical Texts

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 1 results
📏 Metrics: Micro F1

Multi-Label Image Classification

BigEarthNet

BigEarthNet consists of 590,326 Sentinel-2 image patches, each of which is a section of i) 120x120 pixels for 10m bands; …

📊 10 results
📏 Metrics: mAP (micro), mAP (macro), FScore, official split

VizWiz-Classification

Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform …

📊 1 results
📏 Metrics: Accuracy

Multi-Label Text Classification

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …

📊 4 results
📏 Metrics: Precision, Recall, F1, Accuracy, mAP

Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China

This data is for the Mis2-KDD 2021 under review paper: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of …

📊 1 results
📏 Metrics: 1:1 Accuracy, F1 - macro, Micro F1

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 3 results
📏 Metrics: AUC, Macro F1, Macro Precision, Macro Recall, Micro Precision, Micro Recall, Micro-F1, P@5, Precision, Recall

RCV1

The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles produced by Reuters …

📊 1 results
📏 Metrics: Macro-F1, Micro-F1

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 5 results
📏 Metrics: Micro-F1

Multi-Object Tracking

2024 AI City Challenge

The AI City Challenge, hosted at CVPR 2024, focuses on harnessing AI to enhance operational efficiency in physical settings such …

📊 1 results
📏 Metrics: HOTA, DetA, AssA, LocA

BDD100K

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study …

📊 1 results
📏 Metrics: TETA, AssocA, ClsA, LocA

DanceTrack

A large-scale multi-object tracking dataset for human tracking in occlusion, frequent crossover, uniform appearance and diverse body gestures. It is …

📊 35 results
📏 Metrics: HOTA, MOTA, IDF1, AssA, DetA

HiEve

A new large-scale dataset for understanding human motions, poses, and actions in a variety of realistic events, especially crowd & …

📊 4 results
📏 Metrics: MOTA, IDF1

JRDB

A novel egocentric dataset collected from social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data …

📊 1 results
📏 Metrics: HOTA

MOT15

MOT2015 is a dataset for multiple object tracking. It contains 11 different indoor and outdoor scenes of public places with …

📊 2 results
📏 Metrics: MOTA, MOTP

MOT16

The MOT16 dataset is a dataset for multiple object tracking. It is a collection of existing and new data (part of …

📊 23 results
📏 Metrics: MOTA, IDF1, IDs

MOT17

The Multiple Object Tracking 17 (MOT17) dataset is a dataset for multiple object tracking. Similar to its previous version MOT16, …

📊 42 results
📏 Metrics: HOTA, MOTA, IDF1, AssA, DetA, e2e-MOT, Speed (FPS)

MOT20

MOT20 is a dataset for multiple object tracking. The dataset contains 8 challenging video sequences (4 train, 4 test) in …

📊 22 results
📏 Metrics: HOTA, MOTA, IDF1, AssA, Speed (FPS)

MultiviewX

MultiviewX is a synthetic multiview pedestrian detection dataset. It is built in Unity using pedestrian models from PersonX. The MultiviewX …

📊 2 results
📏 Metrics: IDF1, MOTA

PersonPath22

PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we …

📊 4 results
📏 Metrics: IDF1, MOTA

SeaDronesSee

SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles …

📊 3 results
📏 Metrics: MOTA

SportsMOT

Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate objects (e.g., pedestrians and vehicles) …

📊 22 results
📏 Metrics: HOTA, IDF1, AssA, MOTA, DetA

Synthehicle

Synthehicle is a massive CARLA-based synthetic multi-vehicle multi-camera tracking dataset and includes ground truth for 2D detection and tracking, 3D …

📊 1 results
📏 Metrics: MOTA

TAO

TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are …

📊 9 results
📏 Metrics: TETA, LocA, AssocA, ClsA, Track mAP

UAVDT

UAVDT is a large-scale, challenging UAV Detection and Tracking benchmark (i.e., about 80,000 representative frames from 10 hours …

📊 2 results
📏 Metrics: IDF1, MOTA

Wildtrack

Wildtrack is a large-scale and high-resolution dataset. It has been captured with seven static cameras in a public open area, …

📊 9 results
📏 Metrics: IDF1, MOTA

Multi-Object Tracking and Segmentation

KITTI MOTS

The Multi-Object Tracking and Segmentation (MOTS) benchmark [2] consists of 21 training sequences and 29 test sequences. It is based on …

📊 1 results
📏 Metrics: AssA, DetA, HOTA

Multi-Person Pose Estimation

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 15 results
📏 Metrics: AP, Test AP, Validation AP

COCO-WholeBody

COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 2 results
📏 Metrics: keypoint AP

CrowdPose

The CrowdPose dataset contains about 20,000 images and a total of 80,000 human poses with 14 labeled keypoints. The test …

📊 28 results
📏 Metrics: mAP @0.5:0.95, AP Easy, AP Medium, AP Hard, FPS

OCHuman

This dataset focuses on heavily occluded humans, with comprehensive annotations including bounding boxes, human poses and instance masks. This dataset contains …

📊 8 results
📏 Metrics: AP50, AP75, Validation AP

Multi-Task Learning

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 1 results
📏 Metrics: Error

ChestX-ray14

ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …

📊 1 results
📏 Metrics: delta_m

NYUv2

The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both …

📊 2 results
📏 Metrics: Mean IoU

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 2 results
📏 Metrics: ∆m%

UTKFace

The UTKFace dataset is a large-scale face dataset with a long age span (ranging from 0 to 116 years old). The …

📊 1 results
📏 Metrics: delta_m

Multi-View 3D Reconstruction

ETH3D

ETH3D is a multi-view stereo benchmark / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground …

📊 4 results
📏 Metrics: F1 score

Multi-agent Integration

BBAI Dataset

This dataset is for evaluating the task of Black-box Multi-agent Integration which focuses on combining the capabilities of multiple black-box …

📊 1 results
📏 Metrics: P@1

Multi-agent Reinforcement Learning

SMAC-Exp

The StarCraft Multi-Agent Challenges+ requires agents to learn completion of multi-stage tasks and usage of environmental factors without precise reward …

📊 1 results
📏 Metrics: Median Win Rate

Multi-class Classification

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results
📏 Metrics: F1-Score

Multi-hop Question Answering

MuSiQue-Ans

MuSiQue-Ans is a new multihop QA dataset with ~25K 2-4 hop questions using seed questions from 5 existing single-hop datasets.

📊 1 results
📏 Metrics: An, Sp

Multi-label Condescension Detection

DPM

Don’t Patronize Me! (DPM) is an annotated dataset with Patronizing and Condescending Language towards vulnerable communities.

📊 2 results
📏 Metrics: Macro-F1

Multi-modal Classification

AudioSet

Audioset is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 2 results
📏 Metrics: Average mAP

VGG-Sound

Consists of more than 210k videos for 310 audio classes. Source: VGGSound: A Large-scale Audio-Visual Dataset

📊 2 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

Multi-modal Entity Alignment

MMKG

MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph …

📊 1 results
📏 Metrics: H@1

Multi-step retrosynthesis

USPTO-190

A chemical synthesis route dataset constructed from the USPTO reaction dataset (1976-Sep2016) and a list of commercially available building blocks …

📊 5 results
📏 Metrics: Success Rate (100 model calls), Success Rate (500 model calls)

Multi-target Domain Adaptation

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 4 results
📏 Metrics: Accuracy

OBJ-MDA

The dataset contains images of 16 artworks included in the cultural site “Galleria Regionale di Palazzo Bellomo”. The collection covers …

📊 1 results
📏 Metrics: [email protected]

Office-31

The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …

📊 5 results
📏 Metrics: Accuracy

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 4 results
📏 Metrics: Accuracy

Multi-tissue Nucleus Segmentation

CoNSeP

The colorectal nuclear segmentation and phenotypes (CoNSeP) dataset consists of 41 H&E stained image tiles, each of size 1,000×1,000 pixels …

📊 2 results
📏 Metrics: Dice, Jaccard Index, PQ

Kumar

The Kumar dataset contains 30 1,000×1,000 image tiles from seven organs (6 breast, 6 liver, 6 kidney, 6 prostate, 2 …

📊 18 results
📏 Metrics: Dice, Hausdorff Distance (mm), Jaccard Index, PQ

Multimodal Emotion Recognition

IEMOCAP

The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for …

📊 2 results
📏 Metrics: Weighted F1, Accuracy

MELD

Multimodal EmotionLines Dataset (MELD) has been created by enhancing and extending EmotionLines dataset. MELD contains the same dialogue instances available …

📊 3 results
📏 Metrics: Weighted F1, Accuracy

Multimodal Machine Translation

Multi30K

Multi30K is a large-scale multilingual multimodal dataset for interdisciplinary machine learning research. It extends the Flickr30K dataset with German translations …

📊 14 results
📏 Metrics: BLEU (EN-DE), BLEU (DE-EN), Meteor (EN-DE), Meteor (EN-FR)

Multimodal Reasoning

AlgoPuzzleVQA

We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new …

📊 1 results
📏 Metrics: Acc

MATH-V

Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math …

📊 4 results
📏 Metrics: Accuracy

REBUS

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data …

📊 8 results
📏 Metrics: Accuracy

Multimodal Text Prediction

MultiSubs

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. …

📊 1 results
📏 Metrics: Accuracy, Word similarity

Multimodal Text and Image Classification

CD18

📊 1 results
📏 Metrics: Accuracy, F-measure (%)

Multiple Instance Learning

CAMELYON16

The dataset consists of 400 whole-slide images (WSIs) of lymph node sections stained with hematoxylin and eosin (H&E), collected from …

📊 14 results
📏 Metrics: AUC, ACC, Expected Calibration Error, FROC, Patch AUC

Elephant

The Elephant MIL dataset is a benchmark used in multiple instance learning (MIL), which falls under the broader categories of …

📊 2 results
📏 Metrics: AUC, ACC

Musk v1

The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …

📊 2 results
📏 Metrics: ACC, AUC

Musk v2

The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks …

📊 2 results
📏 Metrics: ACC, AUC

TCGA

📊 8 results
📏 Metrics: ACC, AUC

Multiple Object Tracking

GMOT-40

GMOT-40 is the first public dense dataset for Generic Multiple Object Tracking (GMOT). It contains 40 carefully annotated sequences evenly …

📊 1 results
📏 Metrics: HOTA, IDF1, MOTA

RADIATE

RADIATE (RAdar Dataset In Adverse weaThEr) is a new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera …

📊 2 results
📏 Metrics: MOTA

SportsMOT

Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate objects (e.g., pedestrians and vehicles) …

📊 19 results
📏 Metrics: HOTA, IDF1, AssA, MOTA, DetA

UA-DETRAC

Consists of 100 challenging video sequences captured from real-world traffic scenes (over 140,000 frames with rich annotations, including occlusion, weather, …

📊 1 results
📏 Metrics: MOTA

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 2 results
📏 Metrics: Category, MOTA, mAP

Multispectral Object Detection

KAIST Multispectral Pedestrian Detection Benchmark

The KAIST Multispectral Pedestrian Dataset was captured with imaging hardware consisting of a color camera, a thermal camera …

📊 13 results
📏 Metrics: All Miss Rate, Reasonable Miss Rate

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30976 images (15488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 1 results
📏 Metrics: mAP50

Multivariate Time Series Forecasting

Electricity

Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 …

📊 1 results
📏 Metrics: MSE

ExtMarker

Three-dimensional position of external markers placed on the chest and abdomen of healthy individuals breathing during intervals from 73s to …

📊 1 results
📏 Metrics: Jitter, MAE, Maximum error, RMSE, normalized RMSE

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 4 results
📏 Metrics: MSE, NegLL

MuJoCo

MuJoCo (multi-joint dynamics with contact) is a physics engine used to implement environments to benchmark Reinforcement Learning methods.

📊 5 results
📏 Metrics: MSE (10^-2, 50% missing)

PhysioNet Challenge 2012

The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units …

📊 4 results
📏 Metrics: MSE (10^-3), MSE stdev

Traffic

The task for this dataset is to forecast the spatio-temporal traffic volume based on the historical traffic volume and …

📊 1 results
📏 Metrics: MSE

Weather

Weather is recorded every 10 minutes for the whole year of 2020, and contains 21 meteorological indicators, such as air temperature, …

📊 1 results
📏 Metrics: MSE

Music Auto-Tagging

MagnaTagATune

The MagnaTagATune dataset contains 25,863 music clips. Each clip is a 29-second-long excerpt belonging to one of the 5223 songs, 445 …

📊 3 results
📏 Metrics: PR-AUC, ROC AUC

TimeTravel

TimeTravel contains 29,849 counterfactual rewritings, each with the original story, a counterfactual event, and human-generated revision of the original story …

📊 1 results
📏 Metrics: 0..5sec

Music Generation

Song Describer Dataset

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in …

📊 1 results
📏 Metrics: FAD VGG

Music Modeling

JSB Chorales

The JSB chorales are a set of short, four-voice pieces of music well-noted for their stylistic homogeneity. The chorales were …

📊 9 results
📏 Metrics: NLL, Parameters

Nottingham

The Nottingham Dataset is a collection of 1200 American and British folk songs. Source: [Rethinking Recurrent Latent Variable Model for …

📊 8 results
📏 Metrics: NLL, Parameters

Music Question Answering

MusicQA

We propose the MusicQA dataset for training music-enabled question-answering models; it is used for training and evaluating our MU-LLaMA model. …

📊 3 results
📏 Metrics: BLEU, METEOR, ROUGE, BERT Score

Music Source Separation

MUSDB18

MUSDB18 is a dataset of 150 full-length music tracks (~10h duration) of different genres along with their isolated …

📊 20 results
📏 Metrics: SDR (avg), SDR (vocals), SDR (drums), SDR (bass), SDR (other)

MUSDB18-HQ

MUSDB18-HQ is a high-quality version of the MUSDB18 music tracks dataset. The high-quality dataset consists of the same 150 songs, …

📊 12 results
📏 Metrics: SDR (avg), SDR (bass), SDR (drums), SDR (others), SDR (vocals)

Slakh2100

The Synthesized Lakh (Slakh) Dataset is a dataset for audio source separation that is synthesized from the Lakh MIDI Dataset …

📊 1 results
📏 Metrics: SDR (bass), SDR (drums), SI-SDRi (Bass), SI-SDRi (Drums), SI-SDRi (Guitar), SI-SDRi (Piano)

Music Transcription

MAESTRO

The MAESTRO dataset contains over 200 hours of paired audio and MIDI recordings from ten years of International Piano-e-Competition. The …

📊 6 results
📏 Metrics: Onset F1

MAPS

MAPS – standing for MIDI Aligned Piano Sounds – is a database of MIDI-annotated piano recordings. MAPS has been designed …

📊 6 results
📏 Metrics: Onset F1

MusicNet

MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise …

📊 6 results
📏 Metrics: APS, Number of params

Slakh2100

The Synthesized Lakh (Slakh) Dataset is a dataset for audio source separation that is synthesized from the Lakh MIDI Dataset …

📊 6 results
📏 Metrics: note-level F-measure-no-offset (Fno), Onset F1

URMP

URMP (University of Rochester Multi-Modal Musical Performance) is a dataset for facilitating audio-visual analysis of musical performances. The dataset comprises …

📊 3 results
📏 Metrics: Onset F1

Named Entity Recognition (NER)

ACE 2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic …

📊 9 results
📏 Metrics: F1, Multi-Task Supervision

ACE 2005

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic …

📊 19 results
📏 Metrics: F1

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. A significant …

📊 1 results
📏 Metrics: NER Macro F1

BC2GM

Created by Smith et al. in 2008, the BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are …

📊 11 results
📏 Metrics: F1

BC4CHEMD

Introduced by Krallinger et al. in "The CHEMDNER corpus of chemicals and drugs and its annotation principles". BC4CHEMD is a …

📊 3 results
📏 Metrics: F1

BC5CDR

BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Source: https://www.ncbi.nlm.nih.gov/research/bionlp/Data/ Image …

📊 14 results
📏 Metrics: F1

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. …

📊 3 results
📏 Metrics: F1

CMeEE

Chinese Medical Named Entity Recognition, a dataset first released in CHIP2020, is used for the CMeEE task. Given a pre-defined schema, …

📊 2 results
📏 Metrics: F1, Micro F1

CORD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER …

📊 4 results
📏 Metrics: F1

CoNLL++

CoNLL++ is a corrected version of the CoNLL03 NER dataset where 5.38% of the test sentences have been fixed. Source: …

📊 11 results
📏 Metrics: F1

CoNLL-2020

A test dataset of articles from 2020, annotated following the CoNLL-2003 NER task.

📊 2 results
📏 Metrics: F1

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 3 results
📏 Metrics: F1-Hard

DaNE

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme. …

📊 1 results
📏 Metrics: Micro-average F1

FUNSD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER …

📊 4 results
📏 Metrics: F1

FindVehicle

The first NER dataset in the traffic domain, designed for extracting the characteristics and attributes of the vehicle …

📊 3 results
📏 Metrics: F1 Score, F1

GENIA

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. …

📊 12 results
📏 Metrics: F1

HiNER-collapsed

This dataset releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed …

📊 2 results
📏 Metrics: F1-score (Weighted)

HiNER-original

This dataset releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.

📊 2 results
📏 Metrics: F1-score (Weighted)

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created …

📊 15 results
📏 Metrics: F1

LINNAEUS

LINNAEUS is a general-purpose dictionary matching software, capable of processing multiple types of document formats in the biomedical domain (MEDLINE, …

📊 4 results
📏 Metrics: F1

NCBI Disease

The NCBI Disease corpus consists of 793 PubMed abstracts, which are separated into training (593), development (100) and test (100) …

📊 1 results
📏 Metrics: F1

NEMO-Corpus

Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, …

📊 1 results
📏 Metrics: F1

OntoNotes 5.0

OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk …

📊 2 results
📏 Metrics: Average F1, Micro F1

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 13 results
📏 Metrics: F1 (%), label-F1 (%), Text model

SciERC

The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts …

📊 6 results
📏 Metrics: F1

Species-800

Species-800 is a corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that …

📊 3 results
📏 Metrics: F1

WNUT 2017

This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis …

📊 20 results
📏 Metrics: F1, F1 (surface form), Precision, Recall

WNUT 2020

The training and development dataset for our task was taken from previous work on wet lab corpus (Kulkarni et al., …

📊 2 results
📏 Metrics: F1, Precision, Recall

i2b2 De-identification Dataset

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

📊 1 results
📏 Metrics: F1, Precision

Natural Language Inference

BioNLI

BioNLI is a dataset in biomedical natural language inference. This dataset contains abstracts from biomedical literature and mechanistic premises generated …

📊 1 results
📏 Metrics: Macro F1

CommitmentBank

The CommitmentBank is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment …

📊 20 results
📏 Metrics: Accuracy, F1

FarsTail

Natural Language Inference (NLI), also called Textual Entailment, is an important task in NLP with the goal of determining the …

📊 10 results
📏 Metrics: % Test Accuracy

HANS

The HANS (Heuristic Analysis for NLI Systems) dataset contains many examples where the heuristics fail. Source: [Right for the …

📊 1 results
📏 Metrics: 1:1 Accuracy

JamPatoisNLI

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource …

📊 2 results
📏 Metrics: Accuracy

KUAKE-QQR

KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for …

📊 1 results
📏 Metrics: Accuracy

KUAKE-QTR

KUAKE Query Title Relevance, a dataset used to estimate the relevance of the title of a query document, is used …

📊 1 results
📏 Metrics: Accuracy

LiDiRus

LiDiRus is a diagnostic dataset that covers a large volume of linguistic phenomena, while allowing you to evaluate information systems …

📊 6 results
📏 Metrics: MCC

MED

MED is a new evaluation dataset that covers a wide range of monotonicity reasoning that was created by crowdsourcing and …

📊 1 results
📏 Metrics: 1:1 Accuracy

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 results
📏 Metrics: Acc

MedNLI

The MedNLI dataset consists of sentence pairs developed by physicians from the Past Medical History section of MIMIC-III clinical …

📊 5 results
📏 Metrics: Accuracy, Params (M)

MultiNLI

The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433K sentence pairs. Its size and mode of collection are modeled closely …

📊 63 results
📏 Metrics: Matched, Mismatched, Accuracy, Dev Matched, Dev Mismatched

Probability words NLI

This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP), e.g. words …

📊 1 results
📏 Metrics: 1:1 Accuracy

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 42 results
📏 Metrics: Accuracy

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 1 results
📏 Metrics: Accuracy

RCB

The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an …

📊 6 results
📏 Metrics: Average F1, Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 89 results
📏 Metrics: Accuracy

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 1 results
📏 Metrics: 1:1 Accuracy

SNLI

The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence-pairs manually labeled as entailment, contradiction, and neutral. Premises are …

📊 88 results
📏 Metrics: % Test Accuracy, % Train Accuracy, Parameters, Dev Accuracy, % Dev Accuracy, Accuracy

SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct …

📊 11 results
📏 Metrics: Accuracy, Dev Accuracy, % Dev Accuracy, % Test Accuracy

TERRa

Textual Entailment Recognition has been proposed recently as a generic task that captures major semantic inference needs across many NLP …

📊 6 results
📏 Metrics: Accuracy

TabFact

TabFact is a large-scale dataset which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables, their relations …

📊 1 results
📏 Metrics: Accuracy

WNLI

The WNLI dataset is a part of the GLUE benchmark used for Natural Language Inference (NLI). It contains pairs of …

📊 22 results
📏 Metrics: Accuracy

XWINO

XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense …

📊 1 results
📏 Metrics: Accuracy

e-SNLI

e-SNLI is used for various goals, such as obtaining full sentence justifications of a model's decisions, improving universal sentence representations …

📊 3 results
📏 Metrics: BLEU, Accuracy

Natural Language Queries

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 10 results
📏 Metrics: R@1 Mean(0.3 and 0.5), R@1 IoU=0.3, R@1 IoU=0.5, R@5 IoU=0.3, R@5 IoU=0.5

Natural Language Understanding

GLUE

General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and …

📊 2 results
📏 Metrics: Average

LexGLUE

Legal General Language Understanding Evaluation (LexGLUE) benchmark is a collection of datasets for evaluating model performance across a diverse set …

📊 8 results
📏 Metrics: ECtHR Task A, ECtHR Task B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, CaseHOLD

STREUSLE

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web …

📊 10 results
📏 Metrics: Tags (Full) Acc, Role F1 (Preps), Function F1 (Preps), Full F1 (Preps)

Natural Language Visual Grounding

ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from various environments, including …

📊 18 results
📏 Metrics: Accuracy (%)

Nature-Inspired Optimization Algorithm

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: training time (s)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 2 results
📏 Metrics: training time (s)

Nested Mention Recognition

ACE 2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic …

📊 7 results
📏 Metrics: F1

ACE 2005

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic …

📊 9 results
📏 Metrics: F1

Network Pruning

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 4 results
📏 Metrics: Accuracy, GFLOPs, Inference Time (ms)

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 5 results
📏 Metrics: Accuracy, GFLOPs, Inference Time (ms)

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 16 results
📏 Metrics: Accuracy, GFLOPs, MParams

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: Avg #Steps

Neural Architecture Search

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 36 results
📏 Metrics: Top-1 Error Rate, Parameters, FLOPS, Search Time (GPU days), Accuracy (%)

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 12 results
📏 Metrics: Percentage Error, FLOPS, PARAMS, Search Time (GPU days), Accuracy (%)

CINIC-10

CINIC-10 is a dataset for image classification. It has a total of 270,000 images, 4.5 times that of CIFAR-10. It …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 5 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 121 results
📏 Metrics: Top-1 Error Rate, Accuracy, Params, MACs, FLOPs

LIDC-IDRI

The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1010 lung …

📊 1 results
📏 Metrics: F1 score, Specificity (VEB+)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: R2

NAS-Bench-101

NAS-Bench-101 is the first public architecture dataset for NAS research. To build NASBench-101, the authors carefully constructed a compact, yet …

📊 3 results
📏 Metrics: Accuracy (%), Spearman Correlation

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 4 results
📏 Metrics: Accuracy (%), FLOPS, PARAMS

No-Reference Image Quality Assessment

CSIQ

The CSIQ database consists of 30 original images, each distorted using six different types of distortions at four to …

📊 6 results
📏 Metrics: SRCC, PLCC

KADID-10k

Konstanz artificially distorted image quality database (KADID-10k) contains 81 pristine images, each degraded by 25 distortions in 5 levels.

📊 6 results
📏 Metrics: SRCC, PLCC

LIVE

📊 1 results
📏 Metrics: SRCC, PLCC

TID2013

TID2013 is a dataset for image quality assessment that contains 25 reference images and 3000 distorted images (25 reference images …

📊 6 results
📏 Metrics: SRCC, PLCC

UHD-IQA

We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of …

📊 6 results
📏 Metrics: SRCC, PLCC

Node Classification

AMZ Computers

AMZ Computers is a co-purchase graph extracted from Amazon, where nodes represent products, edges represent the co-purchased relations of products, …

📊 5 results
📏 Metrics: Accuracy

AVA

AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the …

📊 4 results
📏 Metrics: mAP

Amazon Photo

📊 10 results
📏 Metrics: Accuracy

Amazon-Fraud

Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node …

📊 3 results
📏 Metrics: AUC-ROC

Brazil Air-Traffic

📊 7 results
📏 Metrics: Accuracy

CLUSTER

CLUSTER is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social …

📊 12 results
📏 Metrics: Accuracy

CellTypeGraph Benchmark

Classifying all cells in an organ is a relevant and difficult problem from plant developmental biology. We here abstract the …

📊 1 results
📏 Metrics: Top-1 accuracy, class-average Accuracy

Chameleon (48%/32%/20% fixed splits)

Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 4 results
📏 Metrics: Accuracy

Chameleon (60%/20%/20% random splits)

Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.

📊 2 results
📏 Metrics: Accuracy
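
The "60%/20%/20% random splits" named in entries like the one above can be illustrated with a minimal sketch. It assumes nodes are identified by integer indices and that a seeded shuffle is an acceptable way to draw the split. The function name `random_split`, the seed, and the node count of 1000 are hypothetical and are not taken from any benchmark's official protocol.

```python
import random


def random_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node indices and cut them into train/validation/test parts.

    Hypothetical helper: the remaining nodes after the train and
    validation cuts form the test set.
    """
    rng = random.Random(seed)      # seeded so the split is reproducible
    indices = list(range(num_nodes))
    rng.shuffle(indices)

    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Illustrative graph size only; not the size of any dataset listed here.
train_idx, val_idx, test_idx = random_split(1000)
```

Results reported under random splits are usually averaged over several such draws, so a single seed as used here would only give one sample of the protocol.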

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …

📊 65 results
📏 Metrics: Accuracy, Training Split, Validation, 1:1 Accuracy, Accuracy (%), Inference Time (ms)

Citeseer (48%/32%/20% fixed splits)

Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 26 results
📏 Metrics: 1:1 Accuracy, Accuracy

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 69 results
📏 Metrics: Accuracy, Training Split, Validation, 1:1 Accuracy, Inference Time (ms)

Cora (48%/32%/20% fixed splits)

Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 26 results
📏 Metrics: 1:1 Accuracy, Accuracy

Cornell

📊 58 results
📏 Metrics: Accuracy, Accuracy (%)

Cornell (48%/32%/20% fixed splits)

Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 3 results
📏 Metrics: Accuracy

Cornell (60%/20%/20% random splits)

Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.

📊 36 results
📏 Metrics: 1:1 Accuracy

DBLP

The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …

📊 6 results
📏 Metrics: Accuracy, Micro F1, Inference Time (ms), Macro F1

Film (60%/20%/20% random splits)

Node classification on Film with 60%/20%/20% random splits for training/validation/test.

📊 35 results
📏 Metrics: 1:1 Accuracy

Film (48%/32%/20% fixed splits)

Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 1 results
📏 Metrics: Accuracy

MUTAG

In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. …

📊 4 results
📏 Metrics: Accuracy

MuMiN-large

This is the large version of the MuMiN dataset.

📊 4 results
📏 Metrics: Claim Classification Macro-F1, Tweet Classification Macro-F1

MuMiN-medium

This is the medium version of the MuMiN dataset.

📊 4 results
📏 Metrics: Claim Classification Macro-F1, Tweet Classification Macro-F1

MuMiN-small

This is the small version of the MuMiN dataset.

📊 4 results
📏 Metrics: Claim Classification Macro-F1, Tweet Classification Macro-F1

NELL

NELL is a dataset built from the Web via an intelligent agent called Never-Ending Language Learner. This agent attempts to …

📊 4 results
📏 Metrics: Accuracy

PATTERN

PATTERN is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social …

📊 11 results
📏 Metrics: Accuracy

PPI

protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to …

📊 23 results
📏 Metrics: F1, Micro-F1, Micro F1, Macro-F1

Penn94

Node classification on Penn94

📊 31 results
📏 Metrics: Accuracy

Placenta

Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in …

📊 5 results
📏 Metrics: Accuracy (%)

PubMed (48%/32%/20% fixed splits)

Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 26 results
📏 Metrics: 1:1 Accuracy, Accuracy

PubMed (60%/20%/20% random splits)

Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.

📊 35 results
📏 Metrics: 1:1 Accuracy

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 63 results
📏 Metrics: Accuracy, Training Split, F1, Validation, Accuracy (%), F1-Score, Inference Time (ms)

Reddit

The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label …

📊 15 results
📏 Metrics: Accuracy, Micro-F1

Squirrel (48%/32%/20% fixed splits)

Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 4 results
📏 Metrics: Accuracy

Squirrel (60%/20%/20% random splits)

Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

📊 36 results
📏 Metrics: 1:1 Accuracy

Texas (48%/32%/20% fixed splits)

Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 2 results
📏 Metrics: Accuracy

USA Air-Traffic

Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity.

📊 7 results
📏 Metrics: Accuracy

Wiki

📊 1 results
📏 Metrics: AUC, Macro F1, Micro F1

Wiki-CS

Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes …

📊 6 results
📏 Metrics: Accuracy

Wisconsin (48%/32%/20% fixed splits)

Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.

📊 2 results
📏 Metrics: Accuracy

Yelp-Fraud

Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based …

📊 5 results
📏 Metrics: AUC-ROC

amazon-ratings

amazon-ratings is a product co-purchasing network based on data from SNAP datasets

📊 4 results
📏 Metrics: Accuracy (%)

genius

Node classification on genius

📊 25 results
📏 Metrics: Accuracy, 1:1 Accuracy

minesweeper

minesweeper is a synthetic graph emulating the eponymous game.

📊 4 results
📏 Metrics: AUCROC

questions

Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.

📊 3 results
📏 Metrics: AUCROC

roman-empire

Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.

📊 7 results
📏 Metrics: Accuracy (%)

tolokers

Tolokers is a network of crowdsourcing platform workers based on data provided by Toloka.

📊 4 results
📏 Metrics: AUCROC

twitch-gamers

Node classification on twitch-gamers

📊 1 results
📏 Metrics: Accuracy

Noise Estimation

SIDD

SIDD is an image denoising dataset containing 30,000 noisy images from 10 scenes under different lighting conditions using five representative …

📊 5 results
📏 Metrics: PSNR Gap, Average KL Divergence

Novel Class Discovery

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 1 results
📏 Metrics: Clustering Accuracy

Novel View Synthesis

ACID

ACID consists of thousands of aerial drone videos of different coastline and nature scenes on YouTube. Structure-from-motion is used to …

📊 3 results
📏 Metrics: FID, NLL, PSIM, PSNR, SSIM

BLEFF

Synthetic (Blender) dataset of forward-facing scenes for evaluating NVS quality and camera parameter accuracy.

📊 3 results
📏 Metrics: PSNR/SSIM

DONeRF: Evaluation Dataset

This is the dataset for the CGF 2021 paper "DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth …

📊 6 results
📏 Metrics: PSNR

Deep Blending

The Deep Blending Dataset comprises 19 diverse scenes, offering comprehensive resources for free-viewpoint image-based rendering (IBR). Each scene includes input …

📊 1 results
📏 Metrics: LPIPS, PSNR, SSIM, Size (MB)

HDR-GS

This is a dataset for high dynamic range novel view synthesis. It is collected by HDR-NeRF and recalibrated by HDR-GS for …

📊 2 results
📏 Metrics: Average PSNR, SSIM, LPIPS

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: Average PSNR

LLFF

Local Light Field Fusion (LLFF) is a practical and robust deep learning solution for capturing and rendering novel views of …

📊 15 results
📏 Metrics: PSNR, LPIPS, SSIM

Mip-NeRF 360

Mip-NeRF 360 is an extension to the Mip-NeRF that uses a non-linear parameterization, online distillation, and a novel distortion-based regularize …

📊 12 results
📏 Metrics: LPIPS, PSNR, SSIM, Size (MB)

NeRF

Neural Radiance Fields (NeRF) is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric …

📊 12 results
📏 Metrics: PSNR, SSIM, LPIPS, Average PSNR, Size (MB)

PhotoShape

The PhotoShape dataset consists of photorealistic, relightable, 3D shapes produced by the work proposed in the work of [Park et …

📊 1 results
📏 Metrics: LPIPS, PSNR

RTMV

RTMV is a large-scale synthetic dataset for novel view synthesis consisting of ∼300k images rendered from nearly 2000 complex scenes …

📊 4 results
📏 Metrics: PSNR, SSIM

RealEstate10K

RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered …

📊 2 results
📏 Metrics: FID, NLL, PSIM, PSNR, SSIM

RefRef

RefRef is a synthetic dataset and benchmark designed for the task of reconstructing scenes with complex refractive and reflective objects. …

📊 5 results
📏 Metrics: Average PSNR (dB)

SWORD

The new dataset contains around 1,500 train videos and 290 test videos, with 50 frames per video on average. The …

📊 3 results
📏 Metrics: LPIPS, PSNR, SSIM

ScanNet++

ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, …

📊 4 results
📏 Metrics: PSNR, SSIM, LPIPS

Tanks and Temples

We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth …

📊 10 results
📏 Metrics: PSNR, SSIM, LPIPS, Size (MB)

X3D

X3D is a dataset containing 15 scenes and covering 4 applications for X-ray 3D reconstruction. More specifically, the X3D dataset …

📊 5 results
📏 Metrics: PSNR, SSIM

iFF

Real-world dataset of forward-facing scenes with different camera intrinsic parameters.

📊 2 results
📏 Metrics: Average PSNR, SSIM, Focal Error

Nutrition

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Object Categorization

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 4 results
📏 Metrics: Categorization (ablation), Categorization (test)

Object Counting

CARPK

The Car Parking Lot Dataset (CARPK) contains nearly 90,000 cars from 4 different parking lots collected by means of drone …

📊 14 results
📏 Metrics: MAE, RMSE

FSC147

We introduce a dataset of 147 object categories containing over 6000 images that are suitable for the few-shot counting task. …

📊 18 results
📏 Metrics: MAE(test), MAE(val), RMSE(test), RMSE(val)

HowMany-QA

HowMany-QA is an object counting dataset. It is taken from the counting-specific union of VQA 2.0 (Goyal et al., 2017) …

📊 3 results
📏 Metrics: Accuracy, RMSE

Omnicount-191

To effectively evaluate OmniCount across open-vocabulary, supervised, and few-shot counting tasks, a dataset catering to a broad spectrum of visual …

📊 1 results
📏 Metrics: mRMSE

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 3 results
📏 Metrics: mRMSE

TRANCOS

📊 1 results
📏 Metrics: MAE, MSE

Object Detection

10,000 People - Human Pose Recognition Data

Description: 10,000 People - Human Pose Recognition Data. This dataset includes indoor and outdoor scenes. This dataset covers males and females. …

📊 1 results
📏 Metrics: 0-shot MRR

A Dataset of Multispectral Potato Plants Images

The dataset contains aerial agricultural images of a potato field with manual labels of healthy and stressed plant regions. The …

📊 1 results
📏 Metrics: Average IOU, Dice Score

A2D

A2D (Actor-Action Dataset) is a dataset for simultaneously inferring actors and actions in videos. A2D has seven actor classes (adult, …

📊 1 results
📏 Metrics: Mean IoU

AI-TOD

AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in …

📊 6 results
📏 Metrics: AP, AP50, AP75, APvt, APt, APs, APm, mAP50, mAP@50-95

AODRaw

We introduce the AODRaw dataset, which offers 7,785 high-resolution real RAW images with 135,601 annotated instances spanning 62 categories, capturing …

📊 1 results
📏 Metrics: box AP

BDD100K

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study …

📊 1 results
📏 Metrics: MAP

C2A: Human Detection in Disaster Scenarios

C2A: Combination to Application Dataset. This repository contains the code and information for the paper "UAV-Enhanced Combination …

📊 1 results
📏 Metrics: Average mAP

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 3 results
📏 Metrics: box AP, GFlops

COCO-O

COCO-O(ut-of-distribution) contains 6 domains (sketch, cartoon, painting, weather, handmake, tattoo) of COCO objects which are hard to be detected by …

📊 45 results
📏 Metrics: Average mAP, Effective Robustness

CPPE-5

CPPE-5 (Medical Personal Protective Equipment) is a new challenging dataset with the goal to allow the study of …

📊 16 results
📏 Metrics: box AP, AP50, AP75, APS, APM, APL

CityPersons

The CityPersons dataset is a subset of Cityscapes which only consists of person annotations. There are 2975 images for training, …

📊 1 results
📏 Metrics: mMR

Clipart1k

In Clipart1k, the target domain classes to be detected are the same as those in the source domain. All the …

📊 1 results
📏 Metrics: MAP

Comic2k

Comic2k is a dataset used for cross-domain object detection which contains 2k comic images with image and instance-level annotations. Image …

📊 1 results
📏 Metrics: mAP

CrowdHuman

CrowdHuman is a large and rich-annotated human detection dataset, which contains 15,000, 4,370 and 5,000 images collected from the Internet …

📊 1 results
📏 Metrics: AP, MR^-2

DSEC

DSEC is a stereo camera dataset in driving scenarios that contains data from two monochrome event cameras and two global …

📊 10 results
📏 Metrics: mAP

DeepTrash

📊 1 results
📏 Metrics: mAP

Drinking Waste Classification

About the Dataset: 4 classes of drinking waste: Aluminium Cans, Glass bottles, PET (plastic) bottles and HDPE (plastic) Milk …

📊 1 results
📏 Metrics: AP50

Drone vs Bird

For the Drone-vs-Bird Detection Challenge 2021, 77 different video sequences have been made available as training data. These video sequences …

📊 2 results
📏 Metrics: AP50, AP50l, AP50m, AP50s

ELEVATER

The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object …

📊 1 results
📏 Metrics: AP

EVD4UAV

EVD4UAV is an altitude-sensitive benchmark dataset designed to evade vehicle detection in Unmanned Aerial Vehicle (UAV) imagery. This dataset is …

📊 1 results
📏 Metrics: Detection: Full ([email protected])

FlickrLogos-32

Object detection benchmark for logo detection. Images are natural scenes. Each image contains multiple objects, and each image has a …

📊 3 results
📏 Metrics: MAP

GEN1 Detection

Prophesee’s GEN1 Automotive Detection Dataset is the largest Event-Based Dataset to date. The dataset was recorded using a PROPHESEE GEN1 …

📊 11 results
📏 Metrics: mAP, Params

GMOT-40

GMOT-40 is the first public dense dataset for Generic Multiple Object Tracking (GMOT). It contains 40 carefully annotated sequences evenly …

📊 1 results
📏 Metrics: [email protected]

GRAZPEDWRI-DX

GRAZPEDWRI-DX is a public dataset of 20,327 pediatric wrist trauma X-ray images released by the University of Medicine of Graz. …

📊 6 results
📏 Metrics: mAP

IndustReal

IndustReal is an ego-centric, multi-modal dataset where 27 participants are challenged to perform assembly and maintenance procedures on a construction-toy …

📊 2 results
📏 Metrics: mAP

LDD

The Instance Segmentation task, an extension of the well-known Object Detection task, is of great help in many areas, such …

📊 1 results
📏 Metrics: box mAP

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30976 images (15488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 1 results
📏 Metrics: AP

LeukemiaAttri

The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological …

📊 1 results
📏 Metrics: mAP 50-95

MJU-Waste

MJU-Waste is an RGBD waste object segmentation dataset that is made public to facilitate future research in this area. Source: …

📊 1 results
📏 Metrics: AP50

MSCOCO

📊 3 results
📏 Metrics: Average mAP

MUSES: MUlti-SEnsor Semantic perception dataset

MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types …

📊 1 results
📏 Metrics: AP

Manga109

Manga109 has been compiled by the Aizawa Yamasaki Matsui Laboratory, Department of Information and Communication Engineering, the Graduate School of …

📊 2 results
📏 Metrics: Average Precision

NAO

Natural Adversarial Objects (NAO) is a new dataset to evaluate the robustness of object detection models. NAO contains 7,934 images …

📊 7 results
📏 Metrics: mAP, mAP w/o OOD, mAR

Objects365

Objects365 is a large-scale object detection dataset with 365 object categories over 600K training images. More than 10 …

📊 1 results
📏 Metrics: AP

OoDIS

OoDIS is a benchmark dataset for anomaly instance segmentation, crucial for autonomous vehicle safety. It extends existing anomaly segmentation benchmarks …

📊 3 results
📏 Metrics: AP, AP50

OpenImages-v6

OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. …

📊 2 results
📏 Metrics: box AP

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 1 results
📏 Metrics: Parameters(K)

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 28 results
📏 Metrics: MAP, AP50, mAP@50, mAP@50-95, box AP

PASCAL VOC 2012 test

SCC Data Set

📊 1 results
📏 Metrics: Bounding Box AP

PeopleArt

People-Art is an object detection dataset which consists of people in 43 different styles. People contained in this dataset are …

📊 6 results

SA-Det-100k

SA-Det-100k is a large-scale class-agnostic object detection dataset for Research Purposes only. The dataset is based on a subset of …

📊 2 results
📏 Metrics: AP, AP50, AP75, APS, APM, APL

SAR-AIRcraft-1.0

📊 1 results
📏 Metrics: Average mAP

SFCHD

This work contributes a large, complex, and realistic high-quality safety clothing and helmet detection (SFCHD) dataset. The dataset comprises 12,373 …

📊 12 results

SIXray

The SIXray dataset is constructed by the Pattern Recognition and Intelligent System Development Laboratory, University of Chinese Academy of Sciences. …

📊 2 results
📏 Metrics: 1 in 10 R@5

STN PLAD

STN PLAD is a high-resolution and real-world image dataset of multiple high-voltage power line components. It has 2,409 annotated objects …

📊 1 results
📏 Metrics: mAP

SeaDronesSee

SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles …

📊 10 results

Songdo Vision

The Songdo Vision dataset provides high-resolution (4K, 3840×2160 pixels) RGB images annotated with categorized axis-aligned bounding boxes (BBs) for vehicle …

📊 1 results
📏 Metrics: Precision, Recall, mAP@50, mAP@50-95

SpaceNet 1

SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data is comprised of 382,534 building footprints, …

📊 1 results
📏 Metrics: F1 Score

SpaceNet 2

SpaceNet 2: Building Detection v2 - is a dataset for building footprint detection in geographically diverse settings from very high …

📊 2 results
📏 Metrics: F1 Score (Avg. over Cities)

UA-DETRAC

Consists of 100 challenging video sequences captured from real-world traffic scenes (over 140,000 frames with rich annotations, including occlusion, weather, …

📊 8 results
📏 Metrics: mAP

UAVDT

UAVDT is a large scale challenging UAV Detection and Tracking benchmark (i.e., about 80,000 representative frames from 10 hours …

📊 8 results
📏 Metrics: mAP

UAVVaste

The UAVVaste dataset consists to date of 772 images and 3716 annotations. The main motivation for creation of the dataset …

📊 1 results
📏 Metrics: AP50

VEDAI

VEDAI is a dataset for Vehicle Detection in Aerial Imagery, provided as a tool to benchmark automatic target recognition algorithms …

📊 3 results
📏 Metrics: mAP50

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 2 results
📏 Metrics: MAP

WaterScenes

A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces. WaterScenes, the first …

📊 4 results
📏 Metrics: mAP@50-95

Watercolor2k

Watercolor2k is a dataset used for cross-domain object detection which contains 2k watercolor images with image and instance-level annotations.

📊 1 results
📏 Metrics: MAP

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 5 results
📏 Metrics: AP/L2, Latency, ms

WiderPerson

WiderPerson contains a total of 13,382 images with 399,786 annotations, i.e., 29.87 annotations per image, which means this dataset contains …

📊 4 results
📏 Metrics: AP, mMR

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 5 results
📏 Metrics: Average Precision

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 2 results
📏 Metrics: AP(l), AP(m), AP(s), AP50, AP75, AP85, AR, AR(l), AR(m), AR(s), MAP

Object Localization

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 3 results
📏 Metrics: Localization (ablation), Localization (test)

IllusionVQA

IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in …

📊 9 results
📏 Metrics: Accuracy

Mall

The Mall is a dataset for crowd counting and profiling research. Its images are collected from a publicly accessible webcam. It …

📊 1 results
📏 Metrics: Precision

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 1 results
📏 Metrics: CorLoc

Object Rearrangement

Open6DOR V2

We introduce a challenging and comprehensive benchmark for open-instruction 6-DoF object rearrangement tasks, termed Open6DOR.

📊 4 results
📏 Metrics: 6-DoF, pos-level1, pos-level0, rot-level0, rot-level1, rot-level2

Object Recognition

CIFAR10-DVS

CIFAR10-DVS is an event-stream dataset for object classification. 10,000 frame-based images that come from CIFAR-10 dataset are converted into 10,000 …

📊 2 results
📏 Metrics: Accuracy (% )

DVS128 Gesture

Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions. Source: [A Low Power, Fully Event-Based Gesture Recognition …

📊 1 results
📏 Metrics: Accuracy (% )

MECCANO

The MECCANO dataset is the first dataset of egocentric videos to study human-object interactions in industrial-like settings. The MECCANO dataset …

📊 1 results
📏 Metrics: mAP

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 1 results
📏 Metrics: Accuracy (% )

N-Caltech 101

The Neuromorphic-Caltech101 (N-Caltech101) dataset is a spiking version of the original frame-based Caltech101 dataset. The original dataset contained both a …

📊 2 results
📏 Metrics: Accuracy (% )

shape bias

The 'shape bias' dataset was introduced in Geirhos et al. (ICLR 2019) and consists of 224x224 images with conflicting texture …

📊 18 results
📏 Metrics: shape bias

Object Segmentation

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 2 results
📏 Metrics: Segmentation (ablation), Segmentation (test)

Object Tracking

COESOT

In this work, we propose a general dataset for Color-Event camera based Single Object Tracking, termed COESOT. It contains 1354 …

📊 11 results
📏 Metrics: Success Rate, Precision Rate

FE108

Large-scale single-object tracking dataset, containing 108 sequences with a total length of 1.5 hours. FE108 provides ground truth annotations on …

📊 7 results
📏 Metrics: Success Rate, Averaged Precision

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 2 results
📏 Metrics: mean precision, mean success

MMPTRACK

Multi-camera Multiple People Tracking (MMPTRACK) dataset has about 9.6 hours of videos, with over half a million frame-wise annotations. The …

📊 2 results
📏 Metrics: 3DMOTA

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos …

📊 1 results
📏 Metrics: Average IOU

QuadTrack

Most existing MOT datasets are captured using pinhole cameras, which are characterized by a narrow-FoV and linear sensor motion. However, …

📊 8 results
📏 Metrics: HOTA

SeaDronesSee

SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles …

📊 5 results
📏 Metrics: Success Rate, Precision Score

VisEvent

VisEvent (Visible-Event benchmark) is a dataset constructed for the evaluation of tracking by combining visible and event cameras. VisEvent is …

📊 1 results
📏 Metrics: Precision Plot

Occluded 3D Object Symmetry Detection

YCB-Video

The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects …

📊 1 results
📏 Metrics: PR AUC

Odd One Out

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Offline RL

D4RL

D4RL is a collection of environments for offline reinforcement learning. These environments include Maze2D, AntMaze, Adroit, Gym, Flow, FrankaKitchen and …

📊 3 results
📏 Metrics: Average Reward

One-Shot Segmentation

Cluttered Omniglot

Dataset for one-shot segmentation. Source: One-Shot Segmentation in Clutter

📊 2 results
📏 Metrics: IoU [32 distractors], IoU [4 distractors], IoU [256 distractors]

Online Beat Tracking

Ballroom

This data set includes beat and bar annotations of the ballroom dataset, introduced by Gouyon et al. [1]. [1] Gouyon …

📊 3 results
📏 Metrics: F1

GTZAN

The gtzan8 audio dataset contains 1000 tracks of 30 second length. There are 10 genres, each containing 100 tracks which …

📊 7 results
📏 Metrics: F1

Rock Corpus

This dataset contains 200 famous songs in different genres (mostly in rock) and the beats and downbeat annotations are provided …

📊 3 results
📏 Metrics: F1

Open Information Extraction

BenchIE

BenchIE: a benchmark and evaluation framework for comprehensive evaluation of OIE systems for English, Chinese and German. In contrast to …

📊 11 results
📏 Metrics: Precision, F1, Recall

CaRB

CaRB [Bhardwaj et al., 2019] is developed by re-annotating the dev and test splits of OIE2016 via crowd-sourcing. Besides improving …

📊 25 results
📏 Metrics: F1

LSOIE

LSOIE is a large-scale OpenIE dataset converted from QA-SRL 2.0 in two domains, i.e., Wikipedia and Science. It is 20 …

📊 9 results
📏 Metrics: F1

OIE2016

OIE2016 is the first large-scale OpenIE benchmark. It is created by automatic conversion from QA-SRL [He et al., 2015], a …

📊 12 results
📏 Metrics: F1, AUC

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 4 results
📏 Metrics: F1, AUC

WiRe57

We manually performed the task of Open Information Extraction on 5 short documents, elaborating tentative guidelines for the task, and …

📊 18 results
📏 Metrics: F1

Open Intent Discovery

ATIS

ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 1 results
📏 Metrics: ACC, ARI, NMI

BANKING77

Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents …

📊 1 results
📏 Metrics: ACC, ARI, NMI

CLINC150

This dataset is for evaluating the performance of intent classification systems in the presence of "out-of-scope" queries, i.e., queries that …

📊 1 results
📏 Metrics: ACC, ARI, NMI

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 1 results
📏 Metrics: ACC, ARI, NMI

Open Vocabulary Action Detection

JHMDB

JHMDB is an action recognition dataset that consists of 960 video sequences belonging to 21 actions. It is a subset …

📊 1 results
📏 Metrics: val mAP

MultiSports

Spatio-temporal action detection is an important and challenging problem in video understanding. The existing action detection benchmarks are limited in …

📊 1 results
📏 Metrics: val mAP

UCF101-24

📊 1 results
📏 Metrics: val mAP

Open Vocabulary Object Detection

MSCOCO

📊 32 results
📏 Metrics: AP 0.5

Objects365

Objects365 is a large-scale object detection dataset with 365 object categories over 600K training images. More than 10 …

📊 2 results
📏 Metrics: mask AP50

Open Vocabulary Panoptic Segmentation

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 10 results
📏 Metrics: PQ

Open Vocabulary Semantic Segmentation

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 5 results
📏 Metrics: mIoU

ISPRS Potsdam

The data set contains 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a …

📊 1 results
📏 Metrics: mIoU

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 2 results
📏 Metrics: mIoU-

Open-Domain Question Answering

DuReader

DuReader is a large-scale open-domain Chinese machine reading comprehension dataset. The dataset consists of 200K questions, 420K answers and 1M …

📊 2 results
📏 Metrics: EM

ELI5

ELI5 is a dataset for long-form question answering. It contains 270K complex, diverse questions that require explanatory multi-sentence answers. Web …

📊 6 results
📏 Metrics: Rouge-L, Rouge-1, Rouge-2

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 5 results
📏 Metrics: Exact Match

SearchQA

SearchQA was built using an in-production, commercial search engine. It closely reflects the full pipeline of a (hypothetical) general question-answering …

📊 12 results
📏 Metrics: EM, N-gram F1, Unigram Acc, F1

TQA

The TextbookQuestionAnswering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth …

📊 2 results
📏 Metrics: Exact Match

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 1 results
📏 Metrics: Exact Match

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 4 results
📏 Metrics: Exact Match

OpenAPI code completion

OpenAPI completion refined

A human-refined dataset of OpenAPI definitions based on the APIs.guru OpenAPI directory. The dataset was collected from the APIs.guru OpenAPI …

📊 4 results
📏 Metrics: Correctness, avg., %, Correctness, max., %, Validness, avg., %, Validness, max., %

Optic Cup Segmentation

REFUGE Challenge

REFUGE Challenge provides a data set of 1200 fundus images with ground truth segmentations and clinical glaucoma labels, currently the …

📊 1 results
📏 Metrics: Dice

Optic Disc Detection

IDRiD

The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a …

📊 1 results
📏 Metrics: Euclidean Distance (ED)

Optical Character Recognition (OCR)

FSNS - Test

Arabic handwriting dataset.

📊 3 results
📏 Metrics: Sequence error

I2L-140K

Introduced by Singh, Sumeet S. “Teaching Machines to Code: Neural Markup Generation with Visual Attention.” ArXiv abs/1802.05415 (2018): n. pag. …

📊 2 results
📏 Metrics: BLEU

VideoDB's OCR Benchmark Public Collection

This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It …

📊 5 results
📏 Metrics: Average Accuracy, Character Error Rate (CER), Word Error Rate (WER)

im2latex-100k

A prebuilt dataset for OpenAI's image-to-LaTeX task. It includes a total of ~100k formulas and images split into train, validation …

📊 1 results
📏 Metrics: BLEU

Optical Flow Estimation

Spring

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes …

📊 10 results
📏 Metrics: 1px total

Out-of-Distribution Detection

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 2 results
📏 Metrics: AUROC, FPR95

ADE-OoD

ADE-OoD is a public benchmark for dense out-of-distribution detection in general natural images. It measures the ability to detect and …

📊 4 results
📏 Metrics: AP, FPR@95

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 9 results
📏 Metrics: AUROC, FPR95

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 4 results
📏 Metrics: FPR95, AUROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 2 results
📏 Metrics: AUROC

ImageNet-1k vs NINCO

The NINCO (No ImageNet Class Objects) dataset is introduced in the ICML 2023 paper In or Out? Fixing ImageNet Out-of-Distribution …

📊 5 results
📏 Metrics: AUROC, FPR@95, Latency, ms

ImageNet-1k vs OpenImage-O

OpenImage-O is built for the ID dataset ImageNet-1k. It is manually annotated, comes with a naturally diverse distribution, and has …

📊 6 results
📏 Metrics: AUROC, FPR95, Latency, ms

ImageNet-1k vs Places

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while Places is out-of-distribution.

📊 22 results
📏 Metrics: FPR95, AUROC

ImageNet-1k vs SUN

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while SUN is out-of-distribution.

📊 19 results
📏 Metrics: FPR95, AUROC

ImageNet-1k vs Textures

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while Textures is out-of-distribution.

📊 30 results
📏 Metrics: AUROC, FPR95, Latency, ms

ImageNet-1k vs iNaturalist

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while iNaturalist is out-of-distribution.

📊 24 results
📏 Metrics: AUROC, FPR95, Latency, ms

SST

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 1 results
📏 Metrics: AUROC, FPR95

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 6 results
📏 Metrics: Percentage correct

Outlier Detection

ECG5000

The original dataset for "ECG5000" is a 20-hour long ECG downloaded from Physionet. The name is BIDMC Congestive Heart Failure …

📊 2 results
📏 Metrics: Accuracy

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUROC

SKAB

SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ …

📊 1 results
📏 Metrics: Average F1

Panoptic Segmentation

DALES

We present the Dayton Annotated LiDAR Earth Scan (DALES) data set, a new large-scale aerial LiDAR data set with over …

📊 1 results
📏 Metrics: PQ, RQ, SQ, Params (M)

Hypersim

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. …

📊 1 results
📏 Metrics: PQ, PQ (test), mIoU, mIoU (test)

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 1 results
📏 Metrics: PQ, RQ, SQ, Params (M)

LaRS

LaRS is the largest and most diverse panoptic maritime obstacle detection dataset. Highlights: * Diverse scenes from manual capture, public …

📊 8 results
📏 Metrics: PQ

MUSES: MUlti-SEnsor Semantic perception dataset

MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types …

📊 2 results
📏 Metrics: PQ

PASTIS

PASTIS is a benchmark dataset for panoptic and semantic segmentation of agricultural parcels from satellite image time series. It is …

📊 3 results
📏 Metrics: PQ, RQ, SQ

PASTIS-R

Extension of the PASTIS benchmark with radar and optical image time series.

📊 1 results
📏 Metrics: PQ, RQ, SQ

PanNuke

PanNuke is a semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue …

📊 3 results
📏 Metrics: PQ

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 1 results
📏 Metrics: PQ, RQ, SQ, PQ (with stuff), RQ (with stuff), SQ (with stuff), Params (M)

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 4 results
📏 Metrics: PQ, PQ_th, PQ_st

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 2 results
📏 Metrics: PQ, PQ_dagger, PQst, PQth, RQ, RQst, RQth, SQ, SQst, SQth, mIoU

Paraphrase Generation

MSCOCO

📊 1 results
📏 Metrics: BLEU, iBLEU

Paralex

Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

📊 2 results
📏 Metrics: iBLEU, BLEU

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 2 results
📏 Metrics: iBLEU, BLEU

Paraphrase Identification

AP

This is a paraphrasing dataset created using the adversarial paradigm. A task was designed called the Adversarial Paraphrasing Task (APT) …

📊 1 results
📏 Metrics: MCC

PIT

Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs. Source: [SemEval-2015 …

📊 1 results
📏 Metrics: AP

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 29 results
📏 Metrics: F1, Accuracy, Direct Intrinsic Dimension, Structure Aware Intrinsic Dimension, Dev Accuracy, Dev F1

TURL

Twitter News URL Corpus is a human-labeled paraphrase corpus of 51,524 sentence pairs and the first cross-domain benchmarking …

📊 1 results
📏 Metrics: AP

Translated SNLI Dataset in Marathi

A translated version of the SNLI dataset in Marathi, designed for Semantic Textual Similarity …

📊 1 results
📏 Metrics: 1:1 Accuracy

WikiHop

WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …

📊 1 results
📏 Metrics: Accuracy

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

📊 1 results
📏 Metrics: Accuracy

Parking Space Occupancy

Action-Camera Parking

The Action-Camera Parking Dataset contains 293 images captured at a roughly 10-meter height using a GoPro Hero 6 camera. It …

📊 7 results
📏 Metrics: F1-score, F1

PKLot

The PKLot dataset contains 12,417 images of parking lots and 695,899 images of parking spaces segmented from them, which were …

📊 2 results
📏 Metrics: Average-mAP, F1-score

SPKL

The SPKL dataset contains 1203 images of parking lots divided into 11 categories regarding vision conditions (including the 'winter' category …

📊 7 results
📏 Metrics: F1-score

Part-Of-Speech Tagging

DaNE

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme. …

📊 1 results
📏 Metrics: Accuracy (%)

Morphosyntactic-analysis-dataset

This dataset is for evaluation of morphosyntactic analyzers.

📊 1 results
📏 Metrics: BLEX

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 16 results
📏 Metrics: Accuracy

Tweebank

📊 2 results
📏 Metrics: Acc

XGLUE

XGLUE is an evaluation benchmark composed of 11 tasks that span 19 languages. For each task, the training …

📊 1 results
📏 Metrics: Avg. F1

Part-aware Panoptic Segmentation

Cityscapes Panoptic Parts

The Cityscapes Panoptic Parts dataset introduces part-aware panoptic segmentation annotations for the Cityscapes dataset. It extends the original panoptic annotations …

📊 4 results
📏 Metrics: PartPQ

Pascal Panoptic Parts

The Pascal Panoptic Parts dataset consists of annotations for the part-aware panoptic segmentation task on the PASCAL VOC 2010 dataset. …

📊 4 results
📏 Metrics: PartPQ

Partial Label Learning

ISIC 2019

The goal of ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

M-VAD Names

The dataset contains the annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and the …

📊 1 results
📏 Metrics: Accuracy

Partial Point Cloud Matching

4DMatch

A benchmark for matching and registration of partial point clouds with time-varying geometry. It is constructed using 1761 randomly selected …

📊 9 results
📏 Metrics: NFMR, IR

Participant Intervention Comparison Outcome Extraction

EBM-NLP

EBM-NLP annotates PICO (Participants, Interventions, Comparisons and Outcomes) spans in clinical trial abstracts. The corresponding PICO Extraction task aims to …

📊 5 results
📏 Metrics: F1

Passage Ranking

MS MARCO

MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: MRR@10

Passage Re-Ranking

MS MARCO

MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: MRR

Patient Phenotyping

HiRID

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand patient admissions to the Department …

📊 7 results
📏 Metrics: Balanced Accuracy

Pedestrian Attribute Recognition

PA-100K

PA-100K is a recently proposed large pedestrian attribute dataset, with 100,000 images in total collected from outdoor surveillance cameras. It is …

📊 10 results
📏 Metrics: Accuracy, F1 score

PETA

The PEdesTrian Attribute dataset (PETA) is a dataset for recognizing pedestrian attributes, such as gender and clothing style, at a …

📊 5 results
📏 Metrics: Accuracy

RAP

The Richly Annotated Pedestrian (RAP) dataset is a dataset for pedestrian attribute recognition. It contains 41,585 images collected from indoor …

📊 2 results
📏 Metrics: Accuracy

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects …

📊 2 results
📏 Metrics: Backpack, Gender, Hat, LCC, LCS, UCC, UCS

Pedestrian Detection

CityPersons

The CityPersons dataset is a subset of Cityscapes which only consists of person annotations. There are 2975 images for training, …

📊 18 results
📏 Metrics: Reasonable MR^-2, Heavy MR^-2, Partial MR^-2, Bare MR^-2, Small MR^-2, Medium MR^-2, Large MR^-2, Test Time

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30976 images (15488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 9 results
📏 Metrics: AP, log average miss rate

MMPD-Dataset

MMPD Dataset is proposed in ECCV'2024 "When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset".

📊 1 results
📏 Metrics: box mAP

Period Estimation

OmniArt

Presents half a million samples and structured meta-data to encourage further research and societal engagement. Source: [OmniArt: Multi-task Deep Learning …

📊 2 results
📏 Metrics: Mean absolute error

Perpetual View Generation

LHQ

A dataset of 90,000 high-resolution nature landscape images, crawled from Unsplash and Flickr and preprocessed with Mask R-CNN and Inception …

📊 2 results
📏 Metrics: FID (first 20 steps), IS (first 20 steps), KID (first 20 steps), FID (full 100 steps), IS (full 100 steps), KID (full 100 steps)

Person Identification

EEG Motor Movement/Imagery Dataset

This data set consists of over 1500 one- and two-minute EEG recordings, obtained from 109 volunteers.

📊 3 results
📏 Metrics: Accuracy

WiGesture

The WiGesture dataset contains data related to gesture recognition and person identification in a meeting room scenario. The dataset provides …

📊 3 results
📏 Metrics: Accuracy (%)

Person Re-Identification

AG-ReID

Person re-ID matches persons across multiple non-overlapping cameras. Despite the increasing deployment of airborne platforms in surveillance, current existing person …

📊 2 results
📏 Metrics: Averaged rank-1 acc(%)

AG-ReID.v2

Aerial-ground person re-identification (Re-ID) presents unique challenges in computer vision, stemming from the distinct differences in viewpoints, poses, and resolutions …

📊 1 results
📏 Metrics: Average mAP

CCVID

Clothes-Changing Video person re-ID (CCVID) is a dataset constructed from the raw data of a gait recognition dataset, i.e. FVG. …

📊 3 results
📏 Metrics: Rank-1, mAP

CUHK-SYSU

The CUHK-SYSU dataset is a large-scale benchmark for person search, containing 18,184 images and 8,432 identities. Different from previous …

📊 3 results
📏 Metrics: MAP, Rank-1

CUHK03

CUHK03 consists of 14,097 images of 1,467 different identities, where 6 campus cameras were deployed for image collection and …

📊 19 results
📏 Metrics: MAP, Rank-1, Rank-5, Rank-10

CUHK03-C

CUHK03-C is an evaluation set that consists of algorithmically generated corruptions applied to the CUHK03 test-set. These corruptions consist of …

📊 8 results
📏 Metrics: Rank-1, mAP, mINP

ClonedPerson

The ClonedPerson dataset is a large-scale synthetic person re-identification dataset introduced in the paper "Cloning Outfits from Real-World Images to …

📊 1 results
📏 Metrics: mAP, Rank-1

DukeMTMC-VideoReID

The DukeMTMC-VideoReID (Duke Multi-Tracking Multi-Camera Video-based ReIDentification) dataset is a subset of the DukeMTMC for video-based person re-ID. The dataset …

📊 1 results
📏 Metrics: mAP

DukeMTMC-reID

The DukeMTMC-reID (Duke Multi-Tracking Multi-Camera ReIDentification) dataset is a subset of the DukeMTMC for image-based person re-ID. The dataset is …

📊 89 results
📏 Metrics: mAP, Rank-1, Rank-5, Rank-10

ENTIRe-ID

The growing importance of person re-identification in computer vision has highlighted the need for more extensive and diverse datasets. In …

📊 1 results
📏 Metrics: mAP

IUST_PersonReID

The IUST_PersonReID dataset was developed to address limitations in existing person re-identification datasets by including cultural and environmental contexts unique …

📊 2 results
📏 Metrics: Rank-1, Rank-5, Rank-10, mAP

LTCC

LTCC contains 17,119 person images of 152 identities, and each identity is captured by at least two cameras. The dataset …

📊 8 results
📏 Metrics: Rank-1, mAP

MARS

MARS (Motion Analysis and Re-identification Set) is a large-scale video-based person re-identification dataset, an extension of the Market-1501 …

📊 20 results
📏 Metrics: mAP, Rank-1, Rank-5, Rank-10, Rank-20

MSMT17

MSMT17 is a multi-scene multi-time person re-identification dataset. The dataset consists of 180 hours of videos, captured by 12 outdoor …

📊 43 results
📏 Metrics: mAP, Rank-1, Rank-10, Rank-5

MSMT17-C

MSMT17-C is an evaluation set that consists of algorithmically generated corruptions applied to the MSMT17 test-set. These corruptions consist of …

📊 5 results
📏 Metrics: Rank-1, mAP, mINP

Market-1501

Market-1501 is a large-scale public benchmark dataset for person re-identification. It contains 1501 identities which are captured by six different …

📊 125 results
📏 Metrics: Rank-1, mAP, Rank-5, mINP

Market-1501-C

Market-1501-C is an evaluation set that consists of algorithmically generated corruptions applied to the Market-1501 test-set. These corruptions consist of …

📊 22 results
📏 Metrics: Rank-1, mAP, mINP

Occluded REID

Occluded REID is an occluded person dataset captured by mobile cameras, consisting of 2,000 images of 200 occluded persons (see …

📊 5 results
📏 Metrics: mAP, Rank-1

Occluded-DukeMTMC

Occluded-DukeMTMC contains 15,618 training images, 17,661 gallery images, and 2,210 occluded query images. The experiment results on Occluded-DukeMTMC will demonstrate …

📊 25 results
📏 Metrics: Rank-1, mAP

Occluded-PoseTrack-ReID

We introduce Occluded PoseTrack-ReID (or simply Occ-PTrack), a new ReID dataset we built out of the annotation available with PoseTrack21, …

📊 1 results
📏 Metrics: MAP, Rank-1

P-DukeMTMC-reID

P-DukeMTMC-reID is a modified version of the DukeMTMC-reID dataset. There are 12,927 images (665 identities) in the training set, 2,163 images …

📊 2 results
📏 Metrics: mAP, Rank-1, Rank-5, Rank-10

PRCC

This dataset consists of 33698 images from 221 identities. Each person in Cameras A and B is wearing the same …

📊 10 results
📏 Metrics: mAP, Rank-1

PRID2011

PRID 2011 is a person reidentification dataset that provides multiple person trajectories recorded from two different static surveillance cameras, monitoring …

📊 10 results
📏 Metrics: Rank-1, Rank-20, Rank-5, Rank-10

Partial-REID

Partial REID is a specially designed partial person reidentification dataset that includes 600 images from 60 people, with 5 full-body …

📊 2 results
📏 Metrics: Rank-1

RegDB

RegDB is used for Visible-Infrared Re-ID which handles the cross-modality matching between the daytime visible and night-time infrared images. The …

📊 1 results
📏 Metrics: Rank-1

SYSU-30k

SYSU-30k contains 30k categories of persons, which is about 20 times larger than CUHK03 (1.3k categories) and Market1501 (1.5k categories), …

📊 10 results
📏 Metrics: Rank-1

SYSU-MM01

SYSU-MM01 is a dataset collected for the Visible-Infrared Re-identification problem. The images in the dataset were obtained from 491 …

📊 1 results
📏 Metrics: Rank-1

SYSU-MM01-C

SYSU-MM01-C is an evaluation set that consists of algorithmically generated corruptions applied to the SYSU-MM01 test-set. These corruptions consist of …

📊 2 results
📏 Metrics: Rank-1 (All Search), mAP (All Search), mINP (All Search), Rank-1 (Indoor Search), mAP (Indoor Search), mINP (Indoor Search)

SenseReID

SenseReID is a person re-identification dataset for evaluating ReID models. It is captured from real surveillance cameras and the person …

📊 1 results
📏 Metrics: Top-1

SoccerNet-v2

A novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research …

📊 1 results
📏 Metrics: Rank-1, mAP

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects …

📊 4 results
📏 Metrics: Rank-1, Rank-5, mAP

VC-Clothes

Person re-identification (Reid) is now an active research topic for AI-based video surveillance applications such as specific person search, but …

📊 4 results
📏 Metrics: Rank-1, mAP

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor …

📊 5 results
📏 Metrics: Accuracy, LogLoss, ROC AUC

iLIDS-VID

The iLIDS-VID dataset is a person re-identification dataset which involves 300 different pedestrians observed across two disjoint camera views in …

📊 8 results
📏 Metrics: Rank-1, Rank-5, Rank-10, Rank-20

Person Search

CUHK-SYSU

The CUHK-SYSU dataset is a large-scale benchmark for person search, containing 18,184 images and 8,432 identities. Different from previous …

📊 14 results
📏 Metrics: MAP, Top-1

PRW

PRW is a large-scale dataset for end-to-end pedestrian detection and person recognition in raw video frames. PRW is introduced to …

📊 13 results
📏 Metrics: mAP, Top-1

Personality Recognition in Conversation

CPED

We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multisource knowledge related to empathy and …

📊 4 results
📏 Metrics: Accuracy (%), Macro-F1, Accuracy of Neuroticism, Accuracy of Extraversion, Accuracy of Openness, Accuracy of Agreeableness, Accuracy of Conscientiousness

Personality Trait Recognition

Essays

J. W. Pennebaker and L. A. King, “Linguistic styles: Language use as an individual difference,” J. Pers. Soc. Psychol., vol. …

📊 2 results
📏 Metrics: Accuracy, F-Measure, Precision, Recall

SynthPAI

SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of …

📊 18 results
📏 Metrics: Average accuracy in %

Personalized Image Generation

DreamBooth

The DreamBooth dataset is a collection of images used for fine-tuning text-to-image diffusion models for subject-driven generation. Here are some …

📊 7 results
📏 Metrics: Overall (CP * PF), Concept Preservation (CP), Prompt Following (PF)

Personalized Segmentation

PerSeg

PerSeg is a dataset for personalized segmentation. The raw images are collected from the training data of subject driven diffusion …

📊 5 results
📏 Metrics: mIoU

Philosophy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Phrase Grounding

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 3 results
📏 Metrics: Pointing Game Accuracy

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 3 results
📏 Metrics: Pointing Game Accuracy

Phrase Ranking

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for …

📊 3 results
📏 Metrics: P@5K, P@50K

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases. Source: [KPTimes: A Large-Scale Dataset for Keyphrase Generation …

📊 4 results
📏 Metrics: P@5K, P@50K

Phrase Tagging

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for …

📊 4 results
📏 Metrics: Precision, Recall, F1

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases. Source: [KPTimes: A Large-Scale Dataset for Keyphrase Generation …

📊 4 results
📏 Metrics: Precision, Recall, F1

Physical Attribute Prediction

Sound of Water 50

We collect a dataset of 805 clean videos that show the action of pouring water in a container. Our dataset …

📊 1 results
📏 Metrics: Mean Squared Error

Physical Simulations

4D-DRESS

4D-DRESS is the first real-world 4D dataset of human clothing, capturing 64 human outfits in more than 520 motion sequences. …

📊 12 results
📏 Metrics: Chamfer (cm), Stretching Energy

Playing the Game of 2048

The Game of 2048

The 2048 game task involves training an agent to achieve high scores in the game 2048.

📊 2 results
📏 Metrics: Average Score

Pneumonia Detection

Chest X-ray images

Chest X-ray images for pneumonia detection.

📊 3 results
📏 Metrics: Accuracy

ChestX-ray14

ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 unique patients (collected from the year of 1992 …

📊 4 results
📏 Metrics: AUROC, Params, FLOPS

Poem meters classification

PCD

The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total number of verses is …

📊 1 results
📏 Metrics: Accuracy

Point Cloud Classification

PointCloud-C

PointCloud-C is the very first test-suite for point cloud robustness analysis under corruptions. - Two sets: ModelNet-C for point cloud …

📊 23 results
📏 Metrics: mean Corruption Error (mCE)

Point Cloud Completion

Completion3D

The Completion3D benchmark is a dataset for evaluating state-of-the-art 3D Object Point Cloud Completion methods. Given a partial 3D object …

📊 6 results
📏 Metrics: Chamfer Distance

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 9 results
📏 Metrics: Chamfer Distance, F-Score@1%, Earth Mover's Distance, Frechet Point cloud Distance, Chamfer Distance L2

ShapeNet-ViPC

A large-scale dataset for the point cloud completion task on the ShapeNet dataset.

📊 3 results
📏 Metrics: Chamfer Distance

Point Cloud Generation

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 1 results
📏 Metrics: CD, EMD, 1-NNA-CD, 1-NNA-EMD

Point Cloud Quality Assessment

M-PCCD

The emerging MPEG point cloud codecs (V-PCC and G-PCC variants) are assessed, and best practices for rate allocation are investigated …

📊 1 results
📏 Metrics: Pearson Correlation Coefficient

WPC

The WPC (Waterloo Point Cloud) database is a dataset for subjective and objective quality assessment of point clouds.

📊 5 results
📏 Metrics: PLCC, KROCC, RMSE, SROCC

Point Cloud Registration

3RScan

A novel dataset and benchmark, which features 1482 RGB-D scans of 478 environments across multiple time steps. Each scene includes …

📊 2 results
📏 Metrics: CD, RRE, RTE

FPv1

FPv1 (formerly FAUST-partial) is a 3D registration benchmark dataset created to address the lack of data variability in the …

📊 7 results
📏 Metrics: Recall (3cm, 10 degrees), RRE (degrees), RTE (cm)

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 5 results
📏 Metrics: Success Rate

Point Cloud Segmentation

PointCloud-C

PointCloud-C is the very first test-suite for point cloud robustness analysis under corruptions. - Two sets: ModelNet-C for point cloud …

📊 11 results
📏 Metrics: mean Corruption Error (mCE)

Point Clouds

DTU

DTU MVS 2014 is a multi-view stereo dataset, which is an order of magnitude larger in number of scenes and …

📊 1 results
📏 Metrics: Overall

Tanks and Temples

We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth …

📊 17 results
📏 Metrics: Mean F1 (Advanced), Mean F1 (Intermediate)

Point Processes

AgeGroup Transactions MTPP

The dataset contains historical financial transactions, including time, category and cost fields. There are 50000 clients, 205 categories and 43.7M …

📊 5 results
📏 Metrics: T-mAP, MAE, OTD, Accuracy (%)

Amazon MTPP

The dataset includes time-stamped user product review behavior from January 2008 to October 2018. Each user has a sequence of …

📊 4 results
📏 Metrics: T-mAP, OTD, Accuracy (%), MAE

MemeTracker

The Memetracker corpus contains articles from mainstream media and blogs from August 1 to October 31, 2008 with about 1 …

📊 1 results
📏 Metrics: Accuracy, RMSE

RETWEET

RETWEET is a dataset of tweets and the overall predominant sentiment of their replies. The task is message-level polarity classification. …

📊 1 results
📏 Metrics: Accuracy, RMSE

Retweet MTPP

This dataset contains time-stamped user retweet event sequences. The events are categorized into 3 types: retweets by “small,” “medium” and …

📊 6 results
📏 Metrics: T-mAP, OTD, Accuracy (%), MAE

StackOverflow MTPP

The dataset has two years of user awards on a question-answering website: each user received a sequence of badges and …

📊 3 results
📏 Metrics: OTD, T-mAP, Accuracy (%), MAE

Point Tracking

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos …

📊 1 results
📏 Metrics: Average Jaccard

PointOdyssey

PointOdyssey is a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. …

📊 2 results
📏 Metrics: Survival, δ, MTE

TAP-Vid

TAP-Vid is a benchmark which contains both real-world videos with accurate human annotations of point tracks, and synthetic videos with …

📊 1 results
📏 Metrics: MTE, Survival, δ

Polish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Polyphone disambiguation

CPP

A benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation. Source: [g2pM: A Neural Grapheme-to-Phoneme Conversion Package for …

📊 3 results
📏 Metrics: Accuracy

Pose Estimation

3DPW

The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. …

📊 1 results
📏 Metrics: Acceleration Error, [email protected]

AIC

A large-scale dataset named AIC (AI Challenger) with three sub-datasets, human keypoint detection (HKD), large-scale attribute dataset (LAD) and image …

📊 10 results
📏 Metrics: AP, AP75, AR, AR50, AP50

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

BRACE

BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task: - strong music-dance correlation - …

📊 2 results
📏 Metrics: Average Precision, Average Recall

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 10 results
📏 Metrics: AP, AR, AP50, AP75, APL, APM

CrowdPose

The CrowdPose dataset contains about 20,000 images and a total of 80,000 human poses with 14 labeled keypoints. The test …

📊 12 results
📏 Metrics: AP, AP50, AP75, APM, Test, AP Hard, AP Easy, AP Medium

InLoc

InLoc is a dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at …

MERL-RAV

The MERL-RAV (MERL Reannotation of AFLW with Visibility) Dataset contains over 19,000 face images in a full range of head …

📊 1 results
📏 Metrics: MAE mean (°), MAE yaw (°), MAE pitch (°), MAE roll (°)

MPII

The MPII Human Pose Dataset for single person pose estimation is composed of about 25K images of which 15K are …

📊 1 results
📏 Metrics: PCKh@0.5

MPII Human Pose

MPII Human Pose Dataset is a dataset for human pose estimation. It consists of around 25k images extracted from online …

📊 45 results
📏 Metrics: PCKh-0.5

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses and instance masks. This dataset contains …

📊 18 results
📏 Metrics: Test AP, Validation AP

Pix3D

The Pix3D dataset is a large-scale benchmark of diverse image-shape pairs with pixel-level 2D-3D alignment. Pix3D has wide applications in …

📊 1 results
📏 Metrics: Percentage correct

SALSA

A novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis. Source: SALSA: A Novel Dataset for Multimodal Group Behavior Analysis

📊 4 results
📏 Metrics: Accuracy

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects …

📊 2 results
📏 Metrics: mAP

Pose Retrieval

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 1 results
📏 Metrics: Hit@1, Hit@10

MPI-INF-3DHP

MPI-INF-3DHP is a 3D human body pose estimation dataset consisting of both constrained indoor and complex outdoor scenes. It records …

📊 1 results
📏 Metrics: Hit@1, Hit@10

Precipitation Forecasting

SEVIR

SEVIR is an annotated, curated and spatio-temporally aligned dataset containing over 10,000 weather events that each consist of 384 km …

📊 1 results
📏 Metrics: CSI-pool16, CSI-pool4

Prediction

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 4 results
📏 Metrics: Edit Distance

Prediction Of Occupancy Grid Maps

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 1 results
📏 Metrics: mIoU

Prehistory

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Procedure Step Recognition

IndustReal

IndustReal is an ego-centric, multi-modal dataset where 27 participants are challenged to perform assembly and maintenance procedures on a construction-toy …

📊 2 results
📏 Metrics: Delay (seconds), F1, POS

Product Recommendation

Coveo Data Challenge Dataset

The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". …

📊 1 results
📏 Metrics: F1, MRR

Professional Law

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Professional Medicine

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Professional Psychology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Program Repair

DeepFix

DeepFix consists of a program repair dataset (fix compiler errors in C programs). It enables research around automatically fixing programming …

📊 4 results
📏 Metrics: Average Success Rate

GitHub-Python

Repair AST parse (syntax) errors in Python code

📊 2 results
📏 Metrics: Accuracy (%)

HumanEvalPack

HumanEvalPack is an extension of OpenAI's HumanEval to cover 6 total languages across 3 tasks. The evaluation suite is fully …

📊 1 results
📏 Metrics: Pass@1

Promoter Detection

GUE

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. Contains seven tasks: promoter prediction, core …

📊 1 results
📏 Metrics: MCC

Prompt Engineering

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 14 results
📏 Metrics: Harmonic mean

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 14 results
📏 Metrics: Harmonic mean

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 14 results
📏 Metrics: Harmonic mean

FGVC-Aircraft

FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which …

📊 14 results
📏 Metrics: Harmonic mean

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 13 results
📏 Metrics: Harmonic mean

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 15 results
📏 Metrics: Harmonic mean

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 9 results
📏 Metrics: Top-1 accuracy %

ImageNet-R

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and …

📊 9 results
📏 Metrics: Top-1 accuracy %

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 9 results
📏 Metrics: Top-1 accuracy %

Oxford 102 Flower

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers were chosen to be flowers commonly …

📊 14 results
📏 Metrics: Harmonic mean

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 14 results
📏 Metrics: Harmonic mean

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 14 results
📏 Metrics: Harmonic mean

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 14 results
📏 Metrics: Harmonic mean

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 14 results
📏 Metrics: Harmonic mean

Protein Design

CATH 4.2

The CATH (Class, Architecture, Topology, Homology) [65] database is a comprehensive resource for protein structure classification that hierarchically groups proteins …

📊 8 results
📏 Metrics: Perplexity, Sequence Recovery %(All)

CATH 4.3

The CATH (Class, Architecture, Topology, Homology) [65] database is a comprehensive resource for protein structure classification that hierarchically groups proteins …

📊 2 results
📏 Metrics: Perplexity, Sequence Recovery %(All)

Public Relations

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Quantization

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: MAP

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 1 results
📏 Metrics: MAP

IJB-B

The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …

📊 1 results
📏 Metrics: TAR @ FAR=1e-4

IJB-C

The IJB-C dataset is a video-based face recognition dataset. It is an extension of the IJB-A dataset with about 138,000 …

📊 1 results
📏 Metrics: TAR @ FAR=1e-4

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 27 results
📏 Metrics: Top-1 Accuracy (%), Weight bits, Activation bits

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5749 identities with …

📊 1 results
📏 Metrics: Accuracy

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing around …

📊 1 results
📏 Metrics: Perplexity

Quantum Machine Learning

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 1 results
📏 Metrics: Average F1

Question Answering

AviationQA

AviationQA was introduced in the paper titled: There is No Big Brother or Small Brother: Knowledge Infusion in Language Models …

📊 1 results
📏 Metrics: Hits@1

BBH

BIG-Bench Hard (BBH) is a subset of BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a …

📊 1 results
📏 Metrics: Accuracy

BLURB

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …

📊 3 results
📏 Metrics: Accuracy

Bamboogle

The Bamboogle dataset is a collection of questions that was constructed to investigate the ability of language models to perform …

📊 9 results
📏 Metrics: Accuracy

BioASQ

BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), …

📊 6 results
📏 Metrics: Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 65 results
📏 Metrics: Accuracy

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 2 results
📏 Metrics: Accuracy

COPA

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. …

📊 55 results
📏 Metrics: Accuracy

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple-choice questions to identify the …

📊 3 results
📏 Metrics: Macro F1 (10-fold)

ChAII - Hindi and Tamil Question Answering

The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions …

📊 1 results
📏 Metrics: Jaccard

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK. Motivation: The task can be …

📊 4 results
📏 Metrics: Accuracy

Children's Book Test

📊 8 results
📏 Metrics: Accuracy-CN, Accuracy-NE

CliCR

CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case …

📊 2 results
📏 Metrics: F1

CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure …

📊 9 results
📏 Metrics: In-domain, Out-of-domain, Overall

Complex-CronQuestions

A filtered version of CronQuestions that better demonstrates a model’s inference ability on complex temporal questions.

📊 3 results
📏 Metrics: Hits@1

ComplexWebQuestions

ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set …

📊 1 results
📏 Metrics: EM

ConditionalQA

ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable …

📊 3 results
📏 Metrics: Conditional (answers), Conditional (w/ conditions), Overall (answers), Overall (w/ conditions)

ConvFinQA

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …

📊 3 results
📏 Metrics: Execution Accuracy

CronQuestions

CRONQUESTIONS, a temporal KGQA dataset, consists of two parts: a KG with temporal annotations, and a set of natural language …

📊 10 results
📏 Metrics: Hits@1

DROP

Discrete Reasoning Over Paragraphs (DROP) is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a …

📊 6 results
📏 Metrics: Accuracy

DaNetQA

DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring: they are generated in unprompted and …

📊 6 results
📏 Metrics: Accuracy

DuoRC

DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in …

📊 3 results
📏 Metrics: Accuracy

EgoTaskQA

The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. It …

📊 4 results
📏 Metrics: Direct

FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …

📊 7 results
📏 Metrics: EM

FQuAD

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ …

📊 6 results
📏 Metrics: EM, F1

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …

📊 4 results
📏 Metrics: F1, Rouge-L

FinQA

FinQA is a new large-scale dataset with Question-Answering pairs over Financial reports, written by financial experts. The dataset contains 8,281 …

📊 6 results
📏 Metrics: Execution Accuracy, Program Accuracy

GraphQuestions

GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …

📊 1 results
📏 Metrics: Accuracy

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 1 results
📏 Metrics: Accuracy

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 22 results
📏 Metrics: JOINT-F1, ANS-EM, ANS-F1, SUP-EM, SUP-F1, JOINT-EM

HybridQA

A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and …

📊 3 results
📏 Metrics: ANS-EM

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on …

📊 1 results
📏 Metrics: Exact Match, F1

KQA Pro

A large-scale dataset for Complex KBQA. Source: [KQA Pro: A Large-Scale Dataset with Interpretable Programs and Accurate SPARQLs for Complex …

📊 1 results
📏 Metrics: Accuracy

MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively …

📊 1 results
📏 Metrics: Accuracy

MRQA

The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems. …

📊 2 results
📏 Metrics: Average F1

MS MARCO

MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: Rouge-L, BLEU-1

MapEval-API

MapEval-API contains 300 question-answer pairs. The task is to answer questions by fetching the necessary information using external map APIs.

📊 2 results
📏 Metrics: Accuracy (%)

MapEval-Textual

MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer question …

📊 1 results
📏 Metrics: Accuracy (%)

Mathematics Dataset

This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This …

📊 3 results
📏 Metrics: Accuracy

MedQA

Multiple choice question answering based on the United States Medical License Exams (USMLE). The dataset is collected from the professional …

📊 27 results
📏 Metrics: Accuracy

MetaQA

The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written …

📊 1 results
📏 Metrics: AnswerExactMatch (Question Answering)

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …

📊 4 results
📏 Metrics: EM, F1

MultiQ

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. …

📊 4 results
📏 Metrics: Accuracy

MultiRC

MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions, i.e., questions that can be answered by …

📊 30 results
📏 Metrics: F1, EM

MultiTQ

MULTITQ is a large-scale dataset featuring ample relevant facts and multiple temporal granularities.

📊 9 results
📏 Metrics: Hits@1, Hits@10

NExT-QA (Open-ended VideoQA)

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …

📊 6 results
📏 Metrics: Accuracy, Confidence Score

NarrativeQA

The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers. Source: …

📊 8 results
📏 Metrics: Rouge-L, BLEU-1, BLEU-4, METEOR

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 46 results
📏 Metrics: EM

NewsQA

The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs. Documents are CNN news articles. …

📊 16 results
📏 Metrics: EM, F1

OTT-QA

The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to …

📊 3 results
📏 Metrics: ANS-EM

OpenBookQA

OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. …

📊 40 results
📏 Metrics: Accuracy

PIQA

PIQA is a dataset for commonsense reasoning, and was created to investigate the physical knowledge of existing models in NLP. …

📊 67 results
📏 Metrics: Accuracy

PeerQA

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …

📊 5 results
📏 Metrics: Prometheus-2 Answer Correctness, Rouge-L, AlignScore

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship …

📊 2 results
📏 Metrics: Accuracy

PubChemQA

PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 26 results
📏 Metrics: Accuracy

QASPER

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language …

📊 1 results
📏 Metrics: Token F1

QuAC

Question Answering in Context (QuAC) is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer …

📊 2 results
📏 Metrics: F1, HEQD, HEQQ

QuALITY

QuALITY (Question Answering with Long Input Texts, Yes!) is a multiple-choice question answering dataset for long document comprehension. The dataset …

📊 1 results
📏 Metrics: Accuracy

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 19 results
📏 Metrics: Accuracy

RACE

The ReAding Comprehension dataset from Examinations (RACE) is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …

📊 6 results
📏 Metrics: RACE-m, RACE-h, RACE

ReClor

Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …

📊 3 results
📏 Metrics: Accuracy, Accuracy (easy), Accuracy (hard)

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from …

📊 1 results
📏 Metrics: Accuracy

RuOpenBookQA

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts. Motivation: RuOpenBookQA …

📊 4 results
📏 Metrics: Accuracy

SCDE

SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model …

📊 1 results
📏 Metrics: BA, PA, DE

SIQA

Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus …

📊 24 results
📏 Metrics: Accuracy

SQA3D

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from an …

📊 7 results
📏 Metrics: AnswerExactMatch (Question Answering)

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 2 results
📏 Metrics: Exact Match, F1

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 1 results
📏 Metrics: Accuracy

SberQuAD

A large-scale analogue of Stanford SQuAD in the Russian language is a valuable resource that has not been …

📊 3 results
📏 Metrics: EM, F1

SchizzoSQUAD

The “Mental Health” forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts …

📊 1 results
📏 Metrics: Average F1, Averaged Precision

SimpleQuestions

SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding …

📊 1 results
📏 Metrics: F1

StepGame

A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

📊 1 results
📏 Metrics: 1-of-100 Accuracy

StoryCloze

Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. …

📊 20 results
📏 Metrics: Accuracy

StrategyQA

StrategyQA is a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred …

📊 11 results
📏 Metrics: Accuracy, EM

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress of QA research …

📊 1 results
📏 Metrics: Exact Match (EM)

TIQ

Existing benchmarks for temporal QA focus on a single information source (either a KB or a text corpus), and include …

📊 9 results
📏 Metrics: P@1

TempQA-WD

TempQA-WD is a benchmark dataset for temporal reasoning designed to encourage research in extending the present approaches to target a …

📊 1 results
📏 Metrics: F1

TempQuestions

Here, we take a key step in this direction and release a new benchmark, TempQuestions, containing 1,271 questions, that are …

📊 4 results
📏 Metrics: Hits@1, F1

TimeQuestions

Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class …

📊 16 results
📏 Metrics: P@1

Torque

Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Source: …

📊 2 results
📏 Metrics: F1, EM, C

TrecQA

Text Retrieval Conference Question Answering (TrecQA) is a dataset created from the TREC-8 (1999) to TREC-13 (2004) Question Answering tracks. …

📊 12 results
📏 Metrics: MAP, MRR

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 51 results
📏 Metrics: EM, F1

TruthfulQA

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises …

📊 30 results
📏 Metrics: MC1, MC2, % true, % info, % true (GPT-judge), BLEURT, ROUGE, BLEU, EM, Accuracy

TweetQA

With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering …

📊 3 results
📏 Metrics: BLEU-1, ROUGE-L

UniProtQA

UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 36 results
📏 Metrics: EM, F1

WebQuestionsSP

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …

📊 1 results
📏 Metrics: Accuracy

WebSRC

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …

📊 1 results
📏 Metrics: F1

WikiHop

WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …

📊 9 results
📏 Metrics: Test

WikiQA

The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain …

📊 23 results
📏 Metrics: MAP, MRR

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 2 results
📏 Metrics: Exact Match (EM)

WikiTableQuestions

WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was …

📊 2 results
📏 Metrics: Accuracy, Accuracy (Test)

catbAbI LM-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: Accuracy (mean)

catbAbI QA-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: 1:1 Accuracy

Question Generation

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …

📊 3 results
📏 Metrics: ROUGE-L

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 2 results
📏 Metrics: QAE, R-QAE

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 2 results
📏 Metrics: QAE, R-QAE

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 2 results
📏 Metrics: QAE, R-QAE

WeiboPolls

The dataset consists of social media polls collected from Weibo, a …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-L, BLEU-1, BLEU-3

Question-Answer categorization

QC-Science

QC-Science contains 47832 question-answer pairs belonging to the science domain tagged with labels of the form subject - chapter - …

📊 6 results
📏 Metrics: R@5, R@10, R@15, R@20

RGB Salient Object Detection

DAVIS-S

To enrich the diversity, we also collect 92 images which are suitable for saliency detection from DAVIS [27], a densely …

📊 12 results
📏 Metrics: S-measure, F-measure, MAE, mBA

DUT-OMRON

The DUT-OMRON dataset is used for evaluating salient object detection and contains 5,168 high-quality images. The …

📊 17 results
📏 Metrics: S-Measure, F-measure, mean E-Measure, MAE, mean F-Measure, Weighted F-Measure

ECSSD

The Extended Complex Scene Saliency Dataset (ECSSD) is comprised of complex scenes, presenting textures and structures common to real-world images. …

📊 13 results
📏 Metrics: S-Measure, F-measure, MAE, mean F-Measure, mean E-Measure, F-Score, Weighted F-Measure

HKU-IS

HKU-IS is a visual saliency prediction dataset which contains 4447 challenging images, most of which have either low contrast or …

📊 13 results
📏 Metrics: S-Measure, F-measure, MAE, mean F-Measure, mean E-Measure, Weighted F-Measure, F-Score

HRSOD

There exist several datasets for saliency detection, but none of them is specifically designed for high-resolution salient object detection. High-Resolution …

📊 14 results
📏 Metrics: S-Measure, max F-Measure, MAE, mBA

ISTD

The Image Shadow Triplets dataset (ISTD) is a dataset for shadow understanding that contains 1870 image triplets of shadow image, …

📊 4 results
📏 Metrics: Balanced Error Rate

PASCAL-S

PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation …

📊 11 results
📏 Metrics: S-Measure, F-measure, MAE, mean F-Measure, mean E-Measure, F-Score, Weighted F-Measure

SBU / SBU-Refine

The SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities that are recorded using the Microsoft …

📊 4 results
📏 Metrics: Balanced Error Rate

SOC

SOC (Salient Objects in Clutter) is a dataset for Salient Object Detection (SOD). It includes images with salient and non-salient …

📊 3 results
📏 Metrics: Average MAE, S-Measure, mean E-Measure

SOD

Aims to detect small obstacles, similar to Lost and Found. Frames: 3000+ pictures; 3000+ claimed labelled, 1600 actually labelled.

📊 1 results
📏 Metrics: MAE, F-measure

UHRSD

Recent salient object detection (SOD) methods based on deep neural networks have achieved remarkable performance. However, most existing SOD …

📊 12 results
📏 Metrics: S-Measure, max F-Measure, MAE, mBA

Rain Removal

Nightrain

Synthetically Generated Night-time Weather Degraded Database

📊 4 results
📏 Metrics: PSNR

Reading Comprehension

AdversarialQA

We have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop. We use three different models: BiDAF (Seo …

📊 3 results
📏 Metrics: Overall: F1, D(BiDAF): F1, D(BERT): F1, D(RoBERTa): F1

MuSeRC

We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple …

📊 6 results
📏 Metrics: Average F1, EM

RACE

The ReAding Comprehension dataset from Examinations (RACE) is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …

📊 24 results
📏 Metrics: Accuracy, Accuracy (Middle), Accuracy (High)

ReCAM

Our shared task has three subtasks. Subtasks 1 and 2 focus on evaluating machine learning models' performance with regard …

📊 1 results
📏 Metrics: Accuracy

ReClor

Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …

📊 10 results
📏 Metrics: Test

Reading Order Detection

ROOR

ROOR is a reading order prediction (ROP) benchmark which annotates layout reading order as ordering relations. Layout reading order is …

📊 4 results
📏 Metrics: Segment-level F1

ReadingBank

ReadingBank is a benchmark dataset for reading order detection built with weak supervision from WORD documents, which contains 500K document …

📊 2 results
📏 Metrics: Average Relative Distance (ARD), Average Page-level BLEU

Recognizing Emotion Cause in Conversations

EmoCause

EmoCause is a dataset of annotated emotion cause words in emotional situations from the EmpatheticDialogues valid and test set. The …

📊 6 results
📏 Metrics: Top-1 Recall, Top-3 Recall, Top-5 Recall

RECCON

RECCON is a dataset for the task of recognizing emotion cause in conversations. Source: Recognizing Emotion Cause in Conversations

📊 2 results
📏 Metrics: F1, Exact Span F1, F1(Pos), F1(Neg)

Recommendation Systems

Amazon Beauty

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links …

📊 5 results
📏 Metrics: Hit@10, nDCG@10, NDCG

Amazon Fashion

This dataset is a subset of the Amazon reviews dataset that contains fashion-related products.

📊 4 results
📏 Metrics: HitRatio@10 (100 Neg. Samples), nDCG@10 (100 Neg. Samples), AUC, nDCG@10 (500 Neg. Samples), Hit@10, NDCG

Amazon Men

This dataset is a subset of the Amazon reviews dataset that contains men's products.

📊 3 results
📏 Metrics: Hit@10, nDCG@10, NDCG

Amazon Product Data

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This …

📊 1 results
📏 Metrics: AUC, F1

Amazon-Book

N/A

📊 15 results
📏 Metrics: nDCG@20, Recall@20, HR@10, NDCG@10, HR@50, NDCG@50

Ciao

The Ciao dataset contains rating information given by users to items, as well as item category information. The data comes …

📊 1 results
📏 Metrics: Hits@10, Hits@20, nDCG@10, nDCG@20

Delicious

Delicious: This data set contains tagged web pages retrieved from the website delicious.com. Source: [Text segmentation on multilabel documents: …

📊 1 results
📏 Metrics: NDCG, Recall@20

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 5 results
📏 Metrics: RMSE, NDCG, Recall@20, AUC, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Epinions

The Epinions dataset is built from a who-trusts-whom online social network of a general consumer review site Epinions.com. Members of …

📊 4 results
📏 Metrics: MAE, RMSE, MAP@20, MRR@20, NDCG@20

Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and …

📊 13 results
📏 Metrics: nDCG@20, Recall@20, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Pinterest

The Pinterest dataset contains more than 1 million images associated with Pinterest users who have “pinned” them. Source: https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf

📊 1 results
📏 Metrics: nDCG@10, Hits@10, Hits@20, nDCG@20

PixelRec

An image cover dataset for short video recommendation.

📊 1 results
📏 Metrics: Hit@10

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 3 results
📏 Metrics: AUC, Accuracy

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

📊 7 results
📏 Metrics: Recall@1, Recall@10, Recall@50

WeChat

The WeChat dataset for fake news detection contains more than 20k news items labelled as fake news or not.

📊 2 results
📏 Metrics: AUC, P@10

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

📊 2 results
📏 Metrics: NDCG, NDCG@20, Recall@20

Yelp2018

The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars …

📊 11 results
📏 Metrics: NDCG@20, Recall@20, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Reconstruction

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 1 results
📏 Metrics: PSNR

CelebAMask-HQ

CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following …

📊 1 results
📏 Metrics: PSNR, R-FID

PPMI

The Parkinson’s Progression Markers Initiative (PPMI) dataset originates from an observational clinical and longitudinal study comprising evaluations of people with …

📊 1 results
📏 Metrics: runtime (s)

iDesigner

Fashion trends are constantly evolving, but a trained eye can estimate with some accuracy the signature elements of a particular …

📊 1 results
📏 Metrics: PSNR, R-FID

Red Teaming

SUDO Dataset

SUDO is a benchmark of 50 real-world malicious tasks designed to evaluate LLM-based computer agents in live desktop and web …

📊 1 results
📏 Metrics: Attack Success Rate

Referring Expression

SQA3D

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from an …

📊 1 results
📏 Metrics: Acc@0.5m, Acc@1.0m, Acc@15°, Acc@30°

Referring Expression Segmentation

A2D Sentences

The Actor-Action Dataset (A2D) by Xu et al. [29] serves as the largest video dataset for the general actor and …

📊 20 results

CLEVR-Ref+

CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily …

📊 1 results
📏 Metrics: IoU

PhraseCut

PhraseCut is a dataset consisting of 77,262 images and 345,486 phrase-region pairs. The dataset is collected on top of the …

📊 6 results

RefCOCO

The RefCOCO dataset is a referring expression generation (REG) dataset used for tasks related to understanding natural language expressions that …

📊 4 results
📏 Metrics: IoU, IoU (%)

Refer-YouTube-VOS

There exist previous works [6, 10] that constructed referring segmentation datasets for videos. Gavrilyuk et al. [6] extended the A2D …

📊 2 results
📏 Metrics: Mean IoU, [email protected], [email protected]

Referring Expressions for DAVIS 2016 & 2017

Our task is to localize and provide a pixel-level mask of an object on all video frames given a language …

📊 1 results
📏 Metrics: F, J, J&F 1st frame

Referring expression generation

ColonINST-v1 (Seen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy …

📊 17 results
📏 Metrics: Accuracy

ColonINST-v1 (Unseen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy …

📊 17 results
📏 Metrics: Accuracy

Reinforcement Learning

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 1 results
📏 Metrics: 10 Images, 4*4 Stitching, Exact Accuracy

Reinforcement Learning (Atari Games)

Seaquest - OpenAI Gym

Dataset: The experiments are conducted using the Seaquest environment from the OpenAI Gym framework, which simulates the Atari 2600 game …

📊 1 results
📏 Metrics: Average Return

Reinforcement Learning (RL)

ProcGen

Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns …

📊 2 results
📏 Metrics: Mean Normalized Performance

Relation Classification

AbstRCT - Neoplasm

The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated …

📊 1 results
📏 Metrics: Macro F1

CDCP

The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of …

📊 1 results
📏 Metrics: Macro F1

DRI Corpus

The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of …

📊 1 results
📏 Metrics: Macro F1

Discovery

The Discovery dataset consists of adjacent sentence pairs (s1,s2) with a discourse marker (y) that occurred at the beginning of …

📊 1 results
📏 Metrics: 1:1 Accuracy

FewRel

The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three …

📊 5 results
📏 Metrics: F1 (10-way 1-shot), F1 (10-way 5-shot), F1 (5-way 1-shot), F1 (5-way 5-shot), F1

TACRED

TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used …

📊 17 results
📏 Metrics: F1

Relation Extraction

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

📊 1 results
📏 Metrics: Macro F1

2012 i2b2 Temporal Relations

The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on the …

📊 1 results
📏 Metrics: Macro F1

ACE 2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic …

📊 9 results
📏 Metrics: RE+ Micro F1, RE Micro F1, NER Micro F1, Cross Sentence

ACE 2005

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic …

📊 19 results
📏 Metrics: RE Micro F1, RE+ Micro F1, NER Micro F1, Sentence Encoder, Relation classification F1, Cross Sentence, Relation F1

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. A significant …

📊 10 results
📏 Metrics: RE+ Macro F1, RE Macro F1, NER Macro F1

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. …

📊 2 results
📏 Metrics: F1

CDR

The BioCreative V CDR task corpus is manually annotated for chemicals, diseases and chemical-induced disease (CID) relations. It contains the …

📊 9 results
📏 Metrics: F1

ChemProt

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI …

📊 12 results
📏 Metrics: F1, Micro F1

CoNLL04

The CoNLL04 dataset is a benchmark dataset used for relation extraction tasks. It contains 1,437 sentences, each of which has …

📊 13 results
📏 Metrics: RE+ Macro F1, RE+ Micro F1, NER Macro F1, NER Micro F1

DDI

The DDIExtraction 2013 task relies on the DDI corpus which contains MedLine abstracts on drug-drug interactions as well as documents …

📊 3 results
📏 Metrics: F1, Micro F1

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 3 results
📏 Metrics: F1-Hard

Dataset: Relationship extraction for knowledge graph creation from biomedical literature (Gene-Disease relationships)

This is the dataset used for classifying Gene-Disease relationship types from sentences. The dataset consists of 3 files: * manually_annotated_set.xlsx …

📊 2 results
📏 Metrics: F1

DocRED

DocRED (Document-Level Relation Extraction Dataset) is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset …

📊 56 results
📏 Metrics: F1, Ign F1

FUNSD

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary …

📊 9 results
📏 Metrics: F1

FewRel

The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three …

📊 2 results
📏 Metrics: F1, Precision, Recall

GAD

GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.

📊 3 results
📏 Metrics: F1, Micro F1

GDA

The gene-disease associations corpus contains 30,192 titles and abstracts from PubMed articles that have been automatically labelled for genes, diseases …

📊 9 results
📏 Metrics: F1

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created …

📊 1 results
📏 Metrics: F1

NYT10-HRL

A dataset from the paper "A Hierarchical Framework for Relation Extraction with Reinforcement Learning".

📊 10 results
📏 Metrics: F1

NYT11-HRL

Preprocessed version of NYT11. Each relational triple is formatted as follows: rtext: relation type; em1: source entity mention; …

📊 11 results
📏 Metrics: F1

PGR

Phenotype-Gene Relations (PGR) is a corpus that consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 …

📊 1 results
📏 Metrics: Macro F1

REBEL

Wikipedia abstracts automatically annotated with WikiData entities and relations that are entailed by the text. Over 9 million triplets.

📊 2 results
📏 Metrics: Triplet F1 (strict EL)

Re-TACRED

The Re-TACRED dataset is a significantly improved version of the TACRED dataset for relation extraction. Using new crowd-sourced labels, Re-TACRED …

📊 7 results
📏 Metrics: F1

SciERC

The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts …

📊 3 results
📏 Metrics: F1, NER Micro F1, RE+ Micro F1

SemEval-2010 Task-8

The dataset for the SemEval-2010 Task 8 is a dataset for multi-way classification of mutually exclusive semantic relations between pairs …

📊 22 results
📏 Metrics: F1

TACRED

TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used …

📊 36 results
📏 Metrics: F1, F1 (10% Few-Shot), F1 (5% Few-Shot), F1 (1% Few-Shot), F1 (Zero-Shot)

TACRED-Revisited

The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation extraction by relabeling the dev and test sets using expert …

📊 3 results
📏 Metrics: F1

WNUT 2020

The training and development dataset for our task was taken from previous work on wet lab corpus (Kulkarni et al., …

📊 1 results
📏 Metrics: F1, Precision, Recall

WebNLG

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in …

📊 10 results
📏 Metrics: F1, NER Micro F1

Remaining Useful Lifetime Estimation

NASA C-MAPSS-2

The generation of data-driven prognostics models requires the availability of datasets with run-to-failure trajectories. In order to contribute to the …

📊 1 results
📏 Metrics: Score

Remote Sensing Image Classification

FireRisk

In this work, we propose a novel remote sensing dataset, FireRisk, consisting of 7 fire risk classes with a total …

📊 4 results
📏 Metrics: Accuracy (%)

Repetitive Action Counting

Countix

Countix is a real-world dataset of repetition videos collected in the wild (i.e. YouTube) covering a wide range of semantic …

📊 3 results
📏 Metrics: OBO, MAE, OBZ, RMSE

RepCount

Counting repetitive actions is widely seen in human activities such as physical exercise. Existing methods focus on performing repetitive action …

📊 6 results
📏 Metrics: OBO, MAE, OBZ, RMSE

UCFRep

The UCFRep dataset contains 526 annotated repetitive action videos. This dataset is built from the action recognition dataset UCF101. Source: …

📊 2 results
📏 Metrics: MAE, OBO, OBZ, RMSE

Representation Learning

Animals-10

It contains about 28K medium-quality animal images belonging to 10 categories: dog, cat, horse, spider, butterfly, chicken, sheep, cow, …

📊 1 results
📏 Metrics: 1:1 Accuracy

SciDocs

SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks. Source: Allen Institute for AI

📊 7 results
📏 Metrics: Avg.

Sports10

Games dataset containing 100,000 gameplay images of 175 video games across 10 sports genres: American football, basketball, bike …

📊 1 results
📏 Metrics: Silhouette Score

Respiratory Failure

HiRID

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand patient admissions to the Department …

📊 8 results
📏 Metrics: AUPRC, Recall@50

Response Generation

ArgSciChat

ArgSciChat is an argumentative dialogue dataset. It consists of 498 messages collected from 41 dialogues on 20 scientific papers. It …

📊 3 results
📏 Metrics: Message-F1, BScore, Mover

MMConv

The main goal of the data collection is to acquire highly natural conversations that cover a wide variety of styles …

📊 2 results
📏 Metrics: BLEU, Comb., Inform, Success

SIMMC2.0

Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the …

📊 2 results
📏 Metrics: BLEU

Retinal Vessel Segmentation

CHASE_DB1

CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels …

📊 15 results
📏 Metrics: AUC, F1 score, mIOU, Sensitivity, MCC, 1:1 Accuracy, Acc, Average IOU, DSC

DRIVE

The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a …

📊 19 results
📏 Metrics: AUC, F1 score, Accuracy, mIoU, sensitivity, Specificity, MCC, 1:1 Accuracy, Average IOU, DSC

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 4 results
📏 Metrics: AUC, F1 score, MCC, mIoU, 1:1 Accuracy, Acc, Average IOU, DSC, Sensitivity

INSPIRE-AVR (LUNet subset)

This dataset contains 65 DFIs acquired from patients with POAG at the University of Iowa Hospitals and Clinics. DFIs were …

📊 1 results
📏 Metrics: Average Dice

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 9 results
📏 Metrics: AUC, F1 score, mIOU, Sensitivity, Acc, MCC, 1:1 Accuracy, Average IOU, DSC

UZLF

The Leuven-Haifa dataset contains 240 disc-centered fundus images of 224 unique patients (75 patients with normal tension glaucoma, 63 patients …

📊 5 results
📏 Metrics: Average Dice (0.5*Dice_a + 0.5*Dice_v)

Retrieval

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 3 results
📏 Metrics: Queries per second

InfoSeek

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with …

📊 1 results
📏 Metrics: Recall@5

MVK

The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of a new Marine Video …

📊 1 results
📏 Metrics: text-to-video Mean Rank

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 3 results
📏 Metrics: Queries per second

OK-VQA

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer. Source: [OK-VQA: A …

📊 2 results
📏 Metrics: Recall@5

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 1 results
📏 Metrics: Recall@5

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 1 results
📏 Metrics: Accuracy (Top-1)

PubMedQA corpus with metadata

PubMedQA-MetaGen is a metadata-enriched version of the PubMedQA biomedical question-answering dataset, created using the …

📊 1 results
📏 Metrics: Accuracy (Top-1)

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 4 results
📏 Metrics: Queries per second

ToolLens

The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out …

📊 1 results
📏 Metrics: COMP@

Road Segmentation

ChesapeakeRSC

A novel remote sensing dataset for evaluating a geospatial machine learning model's ability to learn long range dependencies and spatial …

📊 4 results
📏 Metrics: DWR

DeepGlobe

We observe that satellite imagery is a powerful source of information as it contains more structured and uniform data, compared …

📊 2 results
📏 Metrics: APLS, IoU, mIoU

Massachusetts Roads Dataset

The datasets introduced in Chapter 6 of my PhD thesis are below. See the thesis for more details. If you …

📊 2 results
📏 Metrics: IoU, F1, APLS

Robot Manipulation

CALVIN

CALVIN (Composing Actions from Language and Vision) is an open-source simulated benchmark to learn long-horizon language-conditioned robot manipulation tasks.

📊 19 results
📏 Metrics: avg. sequence length (D to D)

RLBench

RLBench is an ambitious large-scale benchmark and learning environment designed to facilitate research in a number of vision-guided manipulation research …

📊 16 results
📏 Metrics: Succ. Rate (18 tasks, 100 demo/task), Succ. Rate (18 tasks, 10 demo/task), Training Time (V100 x 8 x day), Training Time (A100 x hour), Succ. Rate (10 tasks, 100 demos/task), Succ. Rate (74 tasks, 100 demos/task), Inference Speed (fps), Input Image Size

SimplerEnv-Google Robot

Significant progress has been made in building generalist robot manipulation policies, yet their scalable and reproducible evaluation remains challenging, as …

📊 9 results
📏 Metrics: Visual Matching, Visual Matching-Pick Coke Can, Visual Matching-Move Near, Visual Matching-Open/Close Drawer, Variant Aggregation, Variant Aggregation-Pick Coke Can, Variant Aggregation-Move Near, Variant Aggregation-Open/Close Drawer

SimplerEnv-Widow X

Significant progress has been made in building generalist robot manipulation policies, yet their scalable and reproducible evaluation remains challenging, as …

📊 7 results
📏 Metrics: Average, Put Spoon on Towel, Put Carrot on Plate, Stack Green Block on Yellow Block, Put Eggplant in Yellow Basket

Robot Task Planning

PackIt

The ability to jointly understand the geometry of objects and plan actions for manipulating them is crucial for intelligent agents. …

📊 4 results
📏 Metrics: Average Reward

SheetCopilot

The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet manipulation tasks that are applied to these workbooks. These tasks …

📊 2 results
📏 Metrics: Pass@1

Robotic Grasping

GraspNet-1Billion

GraspNet-1Billion provides large-scale training data and a standard evaluation platform for the task of general robotic grasping. The dataset contains …

📊 5 results
📏 Metrics: mAP, AP_seen, AP_similar, AP_novel

NBMOD

NBMOD is a dataset created for researching the task of specific object grasp detection by robots in noisy …

📊 1 results
📏 Metrics: Acc

Role-filler Entity Extraction

MUC-4

A dataset for evaluating a system's understanding of given passages.

📊 1 results
📏 Metrics: Avg. F1

Rolling Shutter Correction

BS-RSC

BS-RSC is a real-world rolling shutter (RS) correction dataset and a corresponding model to correct the RS frames in a …

📊 6 results
📏 Metrics: Average PSNR (dB)

Romanian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Room Layout Estimation

SUN RGB-D

The SUN RGBD dataset contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and …

📊 6 results
📏 Metrics: IoU, Camera Pitch, Camera Roll

Rumour Detection

1

111

📊 1 results
📏 Metrics: 0..5sec

Sepehr_RumTel01

The expansion of social networks has accelerated the transmission of information and news in every community. Over the past few …

📊 2 results
📏 Metrics: F-Measure

SQL-to-Text

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 2 results
📏 Metrics: BLEU-4

SSIM

DocUNet

Various documents dataset. Each of the 65 documents includes scanned ground truth images, both hard and easy distorted photos, and …

📊 2 results
📏 Metrics: SSIM

Saliency Detection

CAT2000

Includes 4000 images; 200 from each of 20 categories covering different types of scenes such as Cartoons, Art, Objects, Low …

📊 1 results
📏 Metrics: AUC, NSS

DUT-OMRON

The DUT-OMRON dataset is used for evaluation of Salient Object Detection task and it contains 5,168 high quality images. The …

📊 5 results
📏 Metrics: MAE, Fwβ, Sm, relaxFbβ, {max}Fβ

ECSSD

The Extended Complex Scene Saliency Dataset (ECSSD) is comprised of complex scenes, presenting textures and structures common to real-world images. …

📊 1 results
📏 Metrics: MAE

HKU-IS

HKU-IS is a visual saliency prediction dataset which contains 4447 challenging images, most of which have either low contrast or …

📊 3 results
📏 Metrics: MAE, Fwβ, Sm, relaxFbβ, {max}Fβ

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results
📏 Metrics: max_F1

PASCAL-S

PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation …

📊 1 results
📏 Metrics: MAE

Saliency Prediction

CAT2000

Includes 4000 images; 200 from each of 20 categories covering different types of scenes such as Cartoons, Art, Objects, Low …

📊 1 results
📏 Metrics: KL

SALICON

The SALIency in CONtext (SALICON) dataset contains 10,000 training images, 5,000 validation images and 5,000 test images for saliency prediction. …

📊 5 results
📏 Metrics: AUC, CC, KLD, NSS, SIM, sAUC, IG

Salient Object Detection

DUT-OMRON

The DUT-OMRON dataset is used for evaluation of Salient Object Detection task and it contains 5,168 high quality images. The …

📊 8 results
📏 Metrics: S-measure, E-measure, MAE, max_F1

ECSSD

The Extended Complex Scene Saliency Dataset (ECSSD) is comprised of complex scenes, presenting textures and structures common to real-world images. …

📊 10 results
📏 Metrics: S-measure, E-measure, MAE, max_F1

HKU-IS

HKU-IS is a visual saliency prediction dataset which contains 4447 challenging images, most of which have either low contrast or …

📊 9 results
📏 Metrics: S-measure, E-measure, MAE, max_F1

PASCAL-S

PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation …

📊 10 results
📏 Metrics: S-measure, E-measure, MAE, max_F1

SOD

Aims to detect small obstacles, such as lost-and-found items. Frames: 3000+ pictures; 3000+ claimed labelled, 1600 actually labelled.

📊 1 results
📏 Metrics: Fwβ, MAE, Sm, relaxFbβ, {max}Fβ

Sarcasm Detection

MUStARD++

MUStARD++ is a multimodal sarcasm detection dataset (MUStARD) pre-annotated with 9 emotions. It can be used for the task of …

📊 1 results
📏 Metrics: Precision, Recall, F1

WITS

This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV …

📊 1 results
📏 Metrics: R1

iSarcasm

iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic. Each sarcastic tweet is further labelled for …

📊 1 results
📏 Metrics: F1-Score

Scanpath prediction

CapMIT1003

The CapMIT1003 database contains captions and clicks collected for images from the MIT1003 database, for which reference eye scanpaths are …

📊 2 results
📏 Metrics: SBTDE

Scene Change Detection

ChangeSim

ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation …

📊 3 results
📏 Metrics: Category mIoU, macro F1

ChangeVPR

Scene change detection (SCD) dataset tailored for generalizable SCD algorithms. It consists of change-labeled images from SF-XL, St Lucia, Nordland …

📊 1 results
📏 Metrics: F1 score

PCD

The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total number of verses is …

📊 2 results
📏 Metrics: F1-score

Unaligned-VL-CMU-CD (neighbor distance 2)

Street-View images captured at different timestamps often undergo geometric transformations. To make the VL-CMU-CD dataset more challenging and closer to …

📊 2 results
📏 Metrics: F1-score

Scene Classification

UC Merced Land Use Dataset

This is a 21 class land use image dataset meant for research purposes. There are 100 images for each of …

📊 4 results
📏 Metrics: Accuracy (%)

Scene Flow Estimation

Argoverse 2

Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated …

📊 6 results
📏 Metrics: EPE 3-Way, EPE Foreground Dynamic, EPE Foreground Static, EPE Background Static

Spring

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes …

📊 6 results
📏 Metrics: 1px total

Scene Generation

AVD

AVD focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images …

📊 3 results
📏 Metrics: FID, SwAV-FID

GoogleEarth

The GoogleEarth dataset is collected from Google Earth Studio, including 400 orbit trajectories in Manhattan and Brooklyn. Each trajectory consists …

📊 4 results
📏 Metrics: Depth Error, KID, Camera Error, FID

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: FID, KID

OSM

The OSM dataset, sourced from OpenStreetMap, is composed of the rasterized semantic maps and height fields of 80 cities worldwide, …

📊 1 results
📏 Metrics: Average FID, KID

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 3 results
📏 Metrics: FID, SwAV-FID

VizDoom

ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode …

📊 3 results
📏 Metrics: FID, SwAV-FID

Scene Graph Generation

4D-OR

4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted to the ceiling of …

📊 5 results
📏 Metrics: F1

MM-OR

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing …

📊 1 results
📏 Metrics: Macro F1

VRD

The Visual Relationship Dataset (VRD) contains 4000 images for training and 1000 for testing annotated with visual relationships. Bounding boxes …

📊 2 results
📏 Metrics: Recall@50

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 16 results
📏 Metrics: Recall@50, mean Recall @20, Recall@100, Recall@20, mean Recall @100, R@100, mR@100, mR@50, zR@100, zR@20, zR@50

Scene Parsing

PGDP5K

PGDP5K is a dataset consisting of 5000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types …

📊 2 results
📏 Metrics: Total Accuracy

Scene Segmentation

MovieNet

MovieNet is a holistic dataset for movie understanding. MovieNet contains 1,100 movies with a large amount of multi-modal data, e.g. …

📊 1 results
📏 Metrics: AP

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 3 results
📏 Metrics: Average Accuracy, 3DIoU

StreetHazards

StreetHazards is a synthetic dataset for anomaly detection, created by inserting a diverse array of foreign objects into driving scenes …

📊 3 results
📏 Metrics: Open-mIoU

UAVid

UAVid is a high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving …

📊 1 results
📏 Metrics: Category mIoU

Scene Text Detection

COCO-Text

The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which …

📊 6 results
📏 Metrics: F-Measure, Precision, Recall

ICDAR 2013

The ICDAR 2013 dataset consists of 229 training images and 233 testing images, with word-level annotations provided. It is the …

📊 14 results
📏 Metrics: F-Measure, Precision, Recall, H-Mean

ICDAR 2015

ICDAR 2015 is a scene text detection dataset used for the ICDAR 2015 conference.

📊 41 results
📏 Metrics: F-Measure, Precision, Recall, Accuracy, FPS

MSRA-TD500

The MSRA-TD500 dataset is a text detection dataset that contains 300 training images and 200 test images. Text regions are …

📊 18 results
📏 Metrics: F-Measure, Precision, Recall, FPS

SCUT-CTW1500

The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text …

📊 16 results
📏 Metrics: F-Measure, Precision, Recall, FPS

Total-Text

Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, …

📊 26 results
📏 Metrics: F-Measure, Precision, Recall, FPS

Scene Text Recognition

COCO-Text

The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which …

📊 4 results
📏 Metrics: 1:1 Accuracy

CUTE80

The CUTE80 dataset is a lightweight collection of images specifically designed for text detection in natural scene images. It contains …

📊 17 results
📏 Metrics: Accuracy

HOST

The heavily occluded scene text (HOST) dataset is a dataset that contains images of text with occlusions. It is used …

📊 3 results
📏 Metrics: 1:1 Accuracy

IC13

The IC13 dataset contains 561 images: 420 for training and 141 for testing. It inherits data from the IC03 dataset …

📊 1 results
📏 Metrics: Accuracy

ICDAR 2003

The ICDAR2003 dataset is a dataset for scene text recognition. It contains 507 natural scene images (including 258 training images …

📊 11 results
📏 Metrics: Accuracy

IIIT5k

The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street …

📊 16 results
📏 Metrics: Accuracy

MSDA

5 domains: synthetic domain, document domain, street view domain, handwritten domain, and car license domain; over five million …

📊 2 results
📏 Metrics: Average Accuracy

SVT

The Street View Text (SVT) dataset was harvested from Google Street View. Image text in this data exhibits high variability …

📊 34 results
📏 Metrics: Accuracy

SVTP

SVTP dataset stands for Scene Text Recognition Datasets. It is a collection of 4 popular Latin/English scene text recognition datasets, …

📊 16 results
📏 Metrics: Accuracy

WOST

The Weakly Occluded Scene Text (WOST) dataset is a public dataset for scene text segmentation. It is used to generate …

📊 5 results
📏 Metrics: 1:1 Accuracy

Scene-Aware Dialogue

AVSD

The Audio Visual Scene-Aware Dialog (AVSD) dataset, or DSTC7 Track 3, is an audio-visual dataset for dialogue understanding. The goal …

📊 1 results
📏 Metrics: CIDEr

Scientific Document Summarization

CL-SciSumm

📊 1 results
📏 Metrics: ROUGE-2

Security Studies

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Seeing Beyond the Visible

KITTI360-EX

KITTI360-EX is a dataset for outer- and inner FoV expansion. It contains 76k pinhole images as well as 76k spherical …

📊 6 results
📏 Metrics: Average PSNR

Segmentation

SA-1B

SA-1B consists of 11M diverse, high resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks. Source: Segment Anything

📊 2 results
📏 Metrics: Average Precision, AR-small, AR-medium, AR-large

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: IoU, Precision, Recall

Segmentation Based Workflow Recognition

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 4 results
📏 Metrics: Average AD-Accuracy

Segmented Multimodal Named Entity Recognition

Twitter-SMNER

This task aims to extract named entities and entity types while further predicting segmentation masks of visual objects.

📊 1 results
📏 Metrics: F1

Seizure Detection

CHB-MIT

The CHB-MIT dataset is a dataset of EEG recordings from pediatric subjects with intractable seizures. Subjects were monitored for up …

📊 1 results
📏 Metrics: Accuracy

TUH EEG Seizure Corpus

Our goal is to enable deep learning research in neuroscience by releasing the largest publicly available unencumbered database of EEG …

📊 2 results
📏 Metrics: AUROC

Self-Supervised Learning

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: Top-1 Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Top-1 Accuracy

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …

📊 1 results
📏 Metrics: Accuracy

DABS

DABS is a domain-agnostic benchmark for self-supervised learning to encourage research and progress towards domain-agnostic methods.

📊 3 results
📏 Metrics: Images & Text, Med. Imaging, Natural Images, Sensors, Speech, Text

ImageNet-100 (TEMI Split)

This split was introduced in TEMI (BMVC 2023): Adaloglou, Nikolas, Felix Michels, Hamza Kalisch, and Markus Kollmann. "Exploring the Limits …

📊 2 results
📏 Metrics: Top-1 Accuracy

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 3 results
📏 Metrics: Accuracy

Tiny ImageNet

Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has …

📊 1 results
📏 Metrics: Top-1 Accuracy

Semantic Communication

Europarl

A corpus of parallel text in 21 European languages from the proceedings of the European Parliament. The Europarl parallel corpus …

📊 1 results
📏 Metrics: 0..5sec

Semantic Parsing

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 3 results
📏 Metrics: Accuracy

CFQ

A large and realistic natural language question answering dataset. Source: Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

📊 5 results
📏 Metrics: Exact Match

GraphQuestions

GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …

📊 1 results
📏 Metrics: F1 Score

SParC

SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces …

📊 1 results
📏 Metrics: Exact

SQA

The SQA dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has …

📊 2 results
📏 Metrics: Denotation Accuracy, Accuracy

WebQuestionsSP

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …

📊 4 results
📏 Metrics: Accuracy

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 5 results
📏 Metrics: Accuracy, Denotation accuracy (test)

WikiTableQuestions

WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was …

📊 22 results
📏 Metrics: Accuracy (Test), Accuracy (Dev), Accuracy, Test Accuracy

Semantic Retrieval

Contract Discovery

A new shared task of semantic retrieval from legal texts, in which a so-called contract discovery is to be performed, …

📊 6 results
📏 Metrics: Soft-F1

Semantic Role Labeling

CoNLL-2009

The task builds on the CoNLL-2008 task and extends it to multiple languages. The core of the task is to …

📊 1 results
📏 Metrics: F1 (Arg.), F1 (Prd.)

Semantic Segmentation

ACDC Scribbles

We release expert-made scribble annotations for the medical ACDC dataset [1]. The released data must be considered as extending the …

📊 6 results
📏 Metrics: Dice (Average)

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 229 results
📏 Metrics: Validation mIoU, Test Score, Params (M), GFLOPs (512 x 512), GFLOPs, Mean IoU (class)

AI-TOD

AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in …

📊 2 results
📏 Metrics: Dice

AIRS

The AIRS (Aerial Imagery for Roof Segmentation) dataset provides a wide coverage of aerial imagery with 7.5 cm resolution and …

📊 1 results
📏 Metrics: IoU

ATLANTIS

ATLANTIS is a benchmark for semantic segmentation of waterbody images. This dataset covers a wide range of natural waterbodies such …

📊 1 results
📏 Metrics: A-acc, A-mIoU, Accuracy, mIoU

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 2 results
📏 Metrics: mIoU

BIG

A high-resolution semantic segmentation dataset with 50 validation and 100 test objects. Image resolution in BIG ranges from 2048×1600 to …

📊 4 results
📏 Metrics: mBA, IoU

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …

📊 4 results
📏 Metrics: mIoU

CEMS-W

The dataset includes annotations for burned area delineation and land cover segmentation, with a focus on European soil. The dataset …

📊 3 results
📏 Metrics: mIoU

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 9 results
📏 Metrics: mIoU

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 1 results
📏 Metrics: F.W. IU, Per-Class Accuracy, Pixel Accuracy, mIoU

Cam2BEV

The dataset contains two subsets of synthetic, semantically segmented road-scene images, which have been created for developing and applying the …

📊 1 results
📏 Metrics: Mean IoU

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 20 results
📏 Metrics: Mean IoU, Global Accuracy

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 2 results
📏 Metrics: mIoU, Pixel Accuracy

Cityscapes 3D

Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. …

📊 1 results
📏 Metrics: mIoU

Cityscapes VIPriors subset

The training and validation data are subsets of the training split of the Cityscapes dataset. The test set is taken …

📊 1 results
📏 Metrics: Accuracy, mIoU

DADA-seg

DADA-seg is a pixel-wise annotated accident dataset, which contains a variety of critical scenarios from traffic accidents. It is used …

📊 27 results
📏 Metrics: mIoU

DDD17

DDD17 has over 12 h of a 346x260 pixel DAVIS sensor recording highway and city driving in daytime, evening, night, …

📊 9 results
📏 Metrics: mIoU

DELIVER

DELIVER is an arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, the dataset is …

📊 9 results
📏 Metrics: mIoU, test mIoU

DIVA-HisDB

The database consists of 150 annotated pages of three different medieval manuscripts with challenging layouts. Furthermore, we provide a layout …

📊 2 results
📏 Metrics: Mean IoU (class)

DSEC

DSEC is a stereo camera dataset in driving scenarios that contains data from two monochrome event cameras and two global …

📊 9 results
📏 Metrics: mIoU

Dark Zurich

Dark Zurich is an image dataset containing a total of 8779 images captured at nighttime, twilight, and daytime, along with …

📊 14 results
📏 Metrics: mIoU

DensePASS

DensePASS - a novel densely annotated dataset for panoramic segmentation under cross-domain conditions, specifically built to study the Pinhole-to-Panoramic transfer …

📊 35 results
📏 Metrics: mIoU

DroneDeploy

From DroneDeploy: We’ve collected a dataset of aerial orthomosaics and elevation images. These have been annotated into 6 different classes: …

📊 1 results
📏 Metrics: Mean IoU (test), Mean IoU (val)

Endoscapes

Cholecystectomy is a very common abdominal surgical procedure almost ubiquitously performed with a laparoscopic approach, hence guided by an endoscopic …

📊 2 results
📏 Metrics: Mean F1

FLAIR (French Land cover from Aerospace ImageRy)

The French National Institute of Geographical and Forest Information (IGN) has the mission to document and measure land-cover on French …

📊 4 results
📏 Metrics: mIoU

FMB Dataset

FMB contains 1500 well-registered infrared and visible image pairs with 14 annotated pixel-level categories. Also, it covers a wide range …

📊 13 results
📏 Metrics: mIoU

Fine-Grained Cloud Segmentation Dataset

The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat 8 OLI and TIRS, covering diverse biomes. This variety supports …

📊 3 results
📏 Metrics: mIoU

Fine-Grained Grass Segmentation Dataset

The dataset was created using high-resolution (8 m) satellite imagery from the Gaofen series (Gaofen-2 and Gaofen-6), captured in 2019 …

📊 9 results
📏 Metrics: mIoU

FoodSeg103

FoodSeg103 is a new food image dataset containing 7,118 images. Images are annotated with 104 ingredient classes and each image …

📊 7 results
📏 Metrics: mIoU

Forward-Looking Sonar Marine Debris Datasets

This dataset is made up of forward-looking sonar images containing ten classes of underwater debris. The dataset can be used …

📊 1 results
📏 Metrics: mIOU

Freiburg Forest

The Freiburg Forest dataset was collected using a Viona autonomous mobile robot platform equipped with cameras for capturing multi-spectral and …

📊 2 results
📏 Metrics: Mean IoU

HAM10000

HAM10000 is a dataset of 10000 training images for detecting pigmented skin lesions. The authors collected dermatoscopic images from different …

📊 1 results
📏 Metrics: Average Dice, Average IOU

HERA RFI Detection

This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South …

📊 2 results
📏 Metrics: AUPRC, AUROC, F1

Hypersim

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. …

📊 5 results
📏 Metrics: mIoU, mIoU (test)

INRIA Aerial Image Labeling

The INRIA Aerial Image Labeling dataset is comprised of 360 RGB tiles of 5000×5000px with a spatial resolution of 30cm/px …

📊 6 results
📏 Metrics: IoU, mIOU

ISPRS Potsdam

The data set contains 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a …

📊 17 results
📏 Metrics: Overall Accuracy, Mean F1, Mean IoU

ISPRS Vaihingen

The data set contains 33 patches (of different sizes), each consisting of a true orthophoto (TOP) extracted from a larger …

📊 10 results
📏 Metrics: Overall Accuracy, Average F1, Category mIoU

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 20 results
📏 Metrics: mIoU (val), mIoU (test)

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 14 results
📏 Metrics: mIoU

Kvasir-Instrument

Consists of annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, …

📊 2 results
📏 Metrics: DSC, mIoU

LOFAR RFI Detection

This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South …

📊 2 results
📏 Metrics: AUPRC, AUROC, F1

LaRS

LaRS is the largest and most diverse panoptic maritime obstacle detection dataset. Highlights: * Diverse scenes from manual capture, public …

📊 20 results
📏 Metrics: Q, F1, μ, mIoU

LoveDA

(1) 5987 high spatial resolution (0.3 m) remote sensing images from Nanjing, Changzhou, and Wuhan; (2) Focus on different geographical …

📊 16 results
📏 Metrics: Category mIoU

MCubeS

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four …

📊 21 results
📏 Metrics: mIoU

MCubeS (P)

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four …

📊 8 results
📏 Metrics: mIoU

MUSES: MUlti-SEnsor Semantic perception dataset

MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types …

📊 2 results
📏 Metrics: mIoU

Matterport3D

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 4 results
📏 Metrics: Test mIoU, Validation mIoU

Mila Simulated Floods

Mila Simulated Floods Dataset is a 1.5 square km virtual world built with the Unity3D game engine, including urban, suburban and …

📊 1 results
📏 Metrics: mIoU

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Dice, Mean IoU

Montgomery County X-ray Set

X-ray images in this data set have been acquired from the tuberculosis control program of the Department of Health and Human …

📊 3 results
📏 Metrics: F1-score

Nighttime Driving

Nighttime Driving is a dataset of road scenes consisting of 35,000 images ranging from daytime to twilight time and to …

📊 12 results
📏 Metrics: mIoU

OpenEDS

OpenEDS (Open Eye Dataset) is a large scale data set of eye-images captured using a virtual-reality (VR) head mounted display …

📊 1 results
📏 Metrics: mIOU

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 62 results
📏 Metrics: mIoU, Mean Accuracy, Pixel Accuracy

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 1 results
📏 Metrics: mIoU

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 2 results
📏 Metrics: Mean IoU

PASCAL VOC 2011

PASCAL VOC 2011 is an image segmentation dataset. It contains around 2,223 images for training, consisting of 5,034 objects. Testing …

📊 1 results
📏 Metrics: Mean IoU

PASCAL VOC 2012 test

SCC Data Set

📊 51 results
📏 Metrics: Mean IoU, FLOPS, Params

PASTIS

PASTIS is a benchmark dataset for panoptic and semantic segmentation of agricultural parcels from satellite image time series. It is …

📊 3 results
📏 Metrics: Mean IoU (test), Number of Params, Overall Accuracy

PASTIS-R

Extension of the PASTIS benchmark with radar and optical image time series.

📊 1 results
📏 Metrics: IoU

PETRAW

PETRAW data set was composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 4 results
📏 Metrics: Mean IoU (class)

PH2

The increasing incidence of melanoma has recently promoted the development of computer-aided diagnosis systems for the classification of dermoscopic images. …

📊 2 results
📏 Metrics: Average Dice, Average IOU

Pothole Mix

This dataset for the semantic segmentation of potholes and cracks on the road surface was assembled from 5 other datasets …

📊 7 results
📏 Metrics: Test Dice Multiclass, Test mIoU, Validation Dice Multiclass, Validation mIoU

Potsdam

https://paperswithcode.com/sota/semantic-segmentation-on-isprs-potsdam

📊 3 results
📏 Metrics: mIoU

RUGD

A Video Dataset for Visual Perception and Autonomous Navigation in Unstructured Environments. Website: http://rugd.vision/ The RUGD dataset focuses on semantic …

📊 1 results
📏 Metrics: AIOU, mIoU

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 5 results
📏 Metrics: mIoU

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 50 results
📏 Metrics: Mean IoU, mAcc, oAcc, FLOPs, Number of params, mIoU, Params (M)

SBCoseg

The SBCoseg dataset includes 889 groups of images and each group consists of 18 images with a common object, leading …

📊 1 results
📏 Metrics: Jaccard

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 1 results
📏 Metrics: AUC

SWIMSEG

The SWIMSEG dataset contains 1013 images of sky/cloud patches, along with their corresponding binary segmentation maps. The ground truth annotation …

📊 1 results
📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SWINSEG

The SWINSEG dataset contains 115 nighttime images of sky/cloud patches along with their corresponding binary ground truth maps. The ground …

📊 1 results
📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SWINySEG

The SWINySEG dataset contains 6768 daytime- and nighttime-images of sky/cloud patches along with their corresponding binary ground truth maps. The …

📊 1 results
📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SYNTHIA

The SYNTHIA dataset is a synthetic dataset that consists of 9400 multi-viewpoint photo-realistic frames rendered from a virtual city and …

📊 2 results
📏 Metrics: mIoU

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 44 results
📏 Metrics: val mIoU, test mIoU

Semantic3D

Semantic3D is a point cloud dataset of scanned outdoor scenes with over 3 billion points. It contains 15 training and …

📊 13 results
📏 Metrics: mIoU, oAcc

SemanticPOSS

The SemanticPOSS dataset for 3D semantic segmentation contains 2988 varied and complicated LiDAR scans with a large quantity of dynamic instances. …

📊 1 results
📏 Metrics: Mean IoU

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 4 results
📏 Metrics: Mean IoU

SpaceNet 1

SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data is comprised of 382,534 building footprints, …

📊 10 results
📏 Metrics: Mean IoU

Structured3D

Structured3D is a large-scale photo-realistic dataset containing 3.5K house designs (a) created by professional designers with a variety of ground …

📊 4 results
📏 Metrics: Test mIoU, Validation mIoU

Trans10K

A large-scale dataset for transparent object segmentation, named Trans10K, consisting of 10,428 images of real scenarios with careful manual annotations, …

📊 14 results
📏 Metrics: mIoU, GFLOPs

UAVid

UAVid is a high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving …

📊 6 results
📏 Metrics: Mean IoU

UPLight

UPLight is an underwater RGB-Polarization multimodal semantic segmentation dataset with 12 typical underwater semantic classes.

📊 6 results
📏 Metrics: mIoU

VDD

Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential semantic details to …

📊 7 results
📏 Metrics: mIoU

WildDash

WildDash is a benchmark evaluation method that uses meta-information to calculate the robustness of a given algorithm …

📊 1 results
📏 Metrics: Mean IoU

ZJU-RGB-P

Research on semantic segmentation of traffic scenes using color and polarization information (including training and testing sets).

📊 13 results
📏 Metrics: mIoU, Frame (fps)

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 15 results
📏 Metrics: mIoU

Semantic Similarity

BIOSSES

The BIOSSES data set comprises a total of 100 sentence pairs, all of which were selected from the "[TAC2 Biomedical Summarization Track …

📊 3 results
📏 Metrics: Pearson Correlation

CHIP-STS

CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for …

📊 1 results
📏 Metrics: Macro F1

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 5 results
📏 Metrics: MSE, Pearson Correlation, Spearman Correlation

Semantic Textual Similarity

CxC

Crisscrossed Captions (CxC) contains 247,315 human-labeled annotations including positive and negative associations between image pairs, caption pairs and image-caption pairs. …

📊 4 results
📏 Metrics: avg ± std

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 43 results
📏 Metrics: Accuracy, F1

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 30 results
📏 Metrics: Spearman Correlation

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 22 results
📏 Metrics: Spearman Correlation

STS Benchmark

STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval …

📊 62 results
📏 Metrics: Pearson Correlation, Spearman Correlation, Accuracy, Dev Pearson Correlation, Dev Spearman Correlation

SentEval

SentEval is a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary …

📊 5 results
📏 Metrics: MRPC, SICK-R, SICK-E, STS

Semantic correspondence

AP-10K

AP-10K is the first large-scale benchmark for general animal pose estimation, to facilitate the research in animal pose estimation. AP-10K …

📊 1 results
📏 Metrics: PCK

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for fine-grained visual categorization tasks. It contains 11,788 images of …

📊 1 results
📏 Metrics: Mean PCK@…, Mean PCK@…

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 2 results
📏 Metrics: IoU, LT-ACC, IoU (weak), LT-ACC (weak)

PF-PASCAL

📊 14 results
📏 Metrics: PCK, PCK (weak)

PF-WILLOW

📊 7 results
📏 Metrics: PCK, PCK (weak)

SPair-71k

SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger …

📊 21 results
📏 Metrics: PCK

Semantic entity labeling

EC-FUNSD

EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark of semantic entity recognition (SER) and entity linking (EL), designed for the …

📊 8 results
📏 Metrics: F1

FUNSD

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary …

📊 15 results
📏 Metrics: F1

Semi Supervised Learning for Image Captioning

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 1 results
📏 Metrics: CIDEr

FlickrStyle10K

FlickrStyle10K is collected and built on the Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and …

📊 1 results
📏 Metrics: CIDEr

Semi-Supervised Image Classification

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 1 results
📏 Metrics: Accuracy

Caltech-256

Caltech-256 is an object recognition dataset containing 30,607 real-world images, of different sizes, spanning 257 classes (256 object classes and …

📊 1 results
📏 Metrics: Accuracy

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 3 results
📏 Metrics: Accuracy

Semi-Supervised Instance Segmentation

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 1 results
📏 Metrics: AP

COCO 10% labeled data

Semi-Supervised Object Detection on COCO 10% labeled data

📊 3 results
📏 Metrics: mask AP

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 1 results
📏 Metrics: AP

Semi-Supervised Object Detection

COCO 10% labeled data

Semi-Supervised Object Detection on COCO 10% labeled data

📊 27 results
📏 Metrics: mAP, detector

Semi-Supervised Video Object Segmentation

DAVIS 2016

DAVIS16 is a dataset for video object segmentation which consists of 50 videos in total (30 videos for training and …

📊 72 results
📏 Metrics: J&F, Jaccard (Mean), Jaccard (Recall), Jaccard (Decay), F-measure (Mean), F-measure (Recall), F-measure (Decay), Speed (FPS)

DAVIS 2017

DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos - 60 for training, 30 …

📊 1 results
📏 Metrics: F-measure (Decay), F-measure (Mean), F-measure (Recall), J&F, Jaccard (Decay), Jaccard (Mean), Jaccard (Recall)

Long Video Dataset

We randomly selected three videos from the Internet, that are longer than 1.5K frames and have their main objects continuously …

📊 9 results
📏 Metrics: J&F, J, F

Long Video Dataset (3X)

We randomly selected three videos from the Internet, that are longer than 1.5K frames and have their main objects continuously …

📊 2 results
📏 Metrics: J&F, J, F

MOSE

CoMplex video Object SEgmentation (MOSE) is a dataset for studying the tracking and segmentation of objects in complex environments. MOSE contains …

📊 17 results
📏 Metrics: J&F, J, F, FPS

VOT2020

VOT2020 is a Visual Object Tracking benchmark for short-term tracking in RGB.

📊 20 results
📏 Metrics: EAO, EAO (real-time)

YouTube-VOS 2018

Youtube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 52 results
📏 Metrics: Overall, Jaccard (Seen), Jaccard (Unseen), F-Measure (Seen), F-Measure (Unseen), Speed (FPS), Params (M)

Semi-supervised Anomaly Detection

UBI-Fights

UBI-Fights - Concerning a specific anomaly detection and still providing a wide diversity in fighting scenarios, the UBI-Fights dataset is …

📊 4 results
📏 Metrics: AUC, Decidability, EER

Sentence Classification

CHIP-CTC

CHIP Clinical Trial Classification, a dataset aimed at classifying clinical trial eligibility criteria, which are fundamental guidelines of clinical trials …

📊 1 results
📏 Metrics: Macro F1

Paper Field

Paper Field is built from the Microsoft Academic Graph and maps paper titles to one of 7 fields of study. …

📊 2 results
📏 Metrics: F1

SciCite

SciCite is a dataset of citation intents that addresses multiple scientific domains and is more than five times larger than …

📊 2 results
📏 Metrics: F1

Sentence Completion

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 86 results
📏 Metrics: Accuracy

Sentence Ordering

EconLogicQA

EconLogicQA is a benchmark designed to test the sequential reasoning skills of large language models (LLMs) in economics, business, and …

📊 18 results
📏 Metrics: Accuracy

Sentiment Analysis

BanglaBook

This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis …

📊 13 results
📏 Metrics: Weighted Average F1-score

DBRD

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly …

📊 3 results
📏 Metrics: Accuracy, F1

DynaSent

DynaSent is an English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using …

📊 12 results
📏 Metrics: Macro F1, 10 fold Cross validation

HARD

The Hotel Arabic-Reviews Dataset (HARD) contains 93700 hotel reviews in the Arabic language. The hotel reviews were collected from the Booking.com website …

📊 1 results
📏 Metrics: Accuracy

IMDb Movie Reviews

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database …

📊 2 results
📏 Metrics: Accuracy (2 classes), F1 Macro

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 18 results
📏 Metrics: Accuracy, Training Time

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 8 results
📏 Metrics: Recall (%), F1 (%), Text model

SST-3

SST-5 is the Stanford Sentiment Treebank 5-way classification dataset (positive, somewhat positive, neutral, somewhat negative, negative). To create SST-3 (positive, …

📊 11 results
📏 Metrics: Macro F1

Sentiment Merged

This is a dataset for 3-way sentiment classification of reviews (negative, neutral, positive). It is a merge of [Stanford Sentiment …

📊 10 results
📏 Metrics: Macro F1

TweetEval

TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks. Source: [TweetEval: Unified Benchmark and Comparative Evaluation for …

📊 7 results
📏 Metrics: Emoji, Emotion, Hate, Irony, Offensive, Sentiment, Stance, ALL

Shadow Detection

CUHK-Shadow

Collects shadow images from multiple scenarios and compiles a new dataset of 10,500 shadow images, each with a labeled ground-truth mask, …

📊 6 results
📏 Metrics: BER

SBU / SBU-Refine

SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities that are recorded using the Microsoft …

📊 6 results
📏 Metrics: BER

Shadow Removal

INS Dataset

A significant challenge in removing shadows from indoor scenes is obtaining shadow-free images. To overcome this challenge, we propose a …

📊 1 results
📏 Metrics: Average PSNR (dB)

ISTD

The Image Shadow Triplets dataset (ISTD) is a dataset for shadow understanding that contains 1870 image triplets of shadow image, …

📊 9 results
📏 Metrics: MAE

ISTD+

ISTD+ consists of shadow images, shadow-free images, and shadow masks, with 1,330 training images and 540 testing images from 135 …

📊 20 results
📏 Metrics: RMSE, PSNR, SSIM, LPIPS

SRD

SRD is a dataset for shadow removal that contains 3088 shadow and shadow-free image pairs.

📊 19 results
📏 Metrics: RMSE, PSNR, SSIM, LPIPS

WSRD+

A version of the WSRD Dataset will be used as a benchmark for the NTIRE24 Challenge on Image Shadow Removal.

📊 1 results
📏 Metrics: LPIPS, PSNR, SSIM

Short-term Object Interaction Anticipation

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 4 results
📏 Metrics: Overall (Top5 mAP), Noun (Top5 mAP), Noun+Verb (Top5 mAP), Noun+TTC (Top5 mAP)

Sign Language Recognition

AUTSL

The Ankara University Turkish Sign Language Dataset (AUTSL) is a large-scale, multimode dataset that contains isolated Turkish sign videos. It …

📊 4 results
📏 Metrics: Rank-1 Recognition Rate

BOBSL

BOBSL is a large-scale dataset of British Sign Language (BSL). It comprises 1,962 episodes (approximately 1,400 hours) of BSL-interpreted BBC …

📊 1 results
📏 Metrics: Actions Top-1

Bukva

We introduce a video dataset, Bukva, for the Russian Dactyl Recognition task. Bukva dataset size is about 27 GB, and it …

📊 1 results
📏 Metrics: Accuracy (Top-1)

CSL-Daily

CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous SLT dataset. It provides both spoken language translations and gloss-level annotations. …

📊 12 results
📏 Metrics: Word Error Rate (WER)

ChicagoFSWild

This is the home of a collaborative data collection effort by U. Chicago and TTI-Chicago researchers. This is to our …

📊 3 results
📏 Metrics: CER (%)

ChicagoFSWild+

This is the home of a collaborative data collection effort by U. Chicago and TTI-Chicago researchers. This is to our …

📊 3 results
📏 Metrics: CER (%)

FDMSE-ISL

A large-scale isolated Indian sign language dataset. It contains 2002 common words used in daily communication among the Indian deaf community. …

📊 1 results
📏 Metrics: Top-1 Accuracy

LIBRAS-UFOP

A multimodal LIBRAS-UFOP Brazilian sign language dataset of minimal pairs using a Microsoft Kinect sensor. The dataset is based on …

📊 1 results
📏 Metrics: Accuracy, F1-score, Precision, Recall

LSA64

The sign database for the Argentinian Sign Language, created with the goal of producing a dictionary for LSA and training …

📊 2 results
📏 Metrics: Accuracy (%)

MINDS-Libras

Brazilian Sign Language (Libras) data set with 20 signs for sign language and gesture recognition benchmark: - Acontecer (To happen) …

📊 1 results
📏 Metrics: Accuracy, F1-score, Precision, Recall

MSASL-1000

MSASL is a real-life large-scale sign language data set comprising over 25,000 annotated videos. Source: [MS-ASL: A Large-Scale Data Set …

📊 2 results
📏 Metrics: P-I Top-1 Accuracy, P-C Top-1 Accuracy

RWTH-PHOENIX-Weather 2014

The signing is recorded by a stationary color camera placed in front of the sign language interpreters. Interpreters wear dark …

📊 10 results
📏 Metrics: Word Error Rate (WER)

RWTH-PHOENIX-Weather 2014 T

Over a period of three years (2009 - 2011) the daily news and weather forecast airings of the German public …

📊 9 results
📏 Metrics: Word Error Rate (WER)

Slovo: Russian Sign Language Dataset

We introduce Slovo, a large-scale video dataset for the Russian Sign Language task. The Slovo dataset is about 16 GB in size, and …

📊 1 results
📏 Metrics: Mean Accuracy

WLASL

WLASL is a large video dataset for Word-Level American Sign Language (ASL) recognition, which features 2,000 common different words in …

📊 2 results
📏 Metrics: Top-1 Accuracy

Znaki

The first and only open dataset for Russian fingerspelling, containing 1,593 annotated phrases and over 37 thousand HD+ …

📊 3 results
📏 Metrics: CER (%)

Sign Language Translation

ASLG-PC12

An artificial corpus built using grammatical dependencies rules due to the lack of resources for Sign Language. Source: ASLG-PC12

📊 1 results
📏 Metrics: BLEU-4

CSL-Daily

CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous SLT dataset. It provides both spoken language translations and gloss-level annotations. …

📊 8 results
📏 Metrics: BLEU-4

How2Sign

The How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more …

📊 1 results
📏 Metrics: BLEU

LSA-T

LSA-T is the first continuous Argentinian Sign Language (LSA) dataset. It contains 14,880 sentence level videos of LSA extracted from …

📊 1 results
📏 Metrics: Word Error Rate (WER)

RWTH-PHOENIX-Weather 2014 T

Over a period of three years (2009 - 2011) the daily news and weather forecast airings of the German public …

📊 8 results
📏 Metrics: BLEU-4, ROUGE

Single-Source Domain Generalization

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 8 results
📏 Metrics: Accuracy

Single-View 3D Reconstruction

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 1 results
📏 Metrics: FID

Common Objects in 3D

Common Objects in 3D is a large-scale dataset with real multi-view images of object categories annotated with camera poses and …

📊 3 results
📏 Metrics: Avg. F1

GSO

Scanned Objects by Google Research is a dataset of common household objects that have been 3D scanned for use in …

📊 3 results
📏 Metrics: Chamfer Distance, IoU, F-Score

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 7 results
📏 Metrics: 3DIoU, F-Score

ShapeNetCore

ShapeNetCore is a subset of the full ShapeNet dataset with single clean 3D models and manually verified category and alignment …

📊 6 results
📏 Metrics: 3DIoU

SynthEVox3D-Tiny

Event cameras are sensors that are inspired by biological systems and specialize in capturing changes in brightness. These emerging cameras …

📊 2 results
📏 Metrics: A-mIoU

TransProteus

The dataset contains procedurally generated images of transparent vessels containing liquid and objects. The data for each image includes …

📊 1 results
📏 Metrics: R2

Single-object discovery

Object Discovery

The Object Discovery dataset was collected by downloading images from the Internet for airplane, car and horse. It is significantly larger …

📊 2 results
📏 Metrics: CorLoc

Single-step retrosynthesis

USPTO-50k

Subset and preprocessed version of chemical reactions from US patents (1976-Sep 2016) by Daniel Lowe. It includes 50K randomly selected reactions …

📊 23 results
📏 Metrics: Top-1 accuracy, Top-3 accuracy, Top-5 accuracy, Top-10 accuracy, Top-20 accuracy, Top-50 accuracy

Sketch-Based Image Retrieval

Chairs

The Chairs dataset contains rendered images of around 1000 different three-dimensional chair models. Source: Adversarial Disentanglement with Grouped Observations Image …

📊 2 results
📏 Metrics: R@1, R@10

Sketch-to-Image Translation

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 3 results
📏 Metrics: FID, FID-C

Scribble

Scribble is a new outline dataset consisting of 200 images (150 train, 50 test) for each of 10 classes – …

📊 2 results
📏 Metrics: FID, Accuracy, Human (%)

SketchyCOCO

SketchyCOCO dataset consists of two parts. Object-level data contains 20,198 (train 18,869 + val 1,329) triplets of {foreground sketch, foreground image, foreground edge …

📊 2 results
📏 Metrics: FID, Accuracy, Human (%)

Skill Generalization

RGB-Stacking

RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot is trained to learn how to grasp objects and balance …

📊 2 results
📏 Metrics: Group 1, Group 2, Group 3, Group 4, Group 5, Average

Skill Mastery

RGB-Stacking

RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot is trained to learn how to grasp objects and balance …

📊 2 results
📏 Metrics: Average, Group 1, Group 2, Group 3, Group 4, Group 5

Skills Assessment

Multimodal PISA

Dataset for multimodal skills assessment, focusing on assessing a piano player's skill level. Annotations include the player's skill level and song difficulty …

📊 1 results
📏 Metrics: Accuracy (%)

Skills Evaluation

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor …

📊 5 results
📏 Metrics: Accuracy, LogLoss, ROC AUC

Sleep Stage Detection

ISRUC-Sleep

ISRUC-Sleep is a polysomnographic (PSG) dataset. The data were obtained from human adults, including healthy subjects, and subjects with sleep …

📊 2 results
📏 Metrics: Accuracy, AUROC, Kappa, Macro-F1

Montreal Archive of Sleep Studies

The Montreal Archive of Sleep Studies (MASS) is an open-access and collaborative database of laboratory-based polysomnography (PSG) recordings O’Reilly, C., …

📊 2 results
📏 Metrics: Accuracy, Cohen's kappa, Macro-F1

PhysioNet Challenge 2018

Data for this challenge were contributed by the Massachusetts General Hospital’s (MGH) Computational Clinical Neurophysiology Laboratory (CCNL), and the Clinical …

📊 2 results
📏 Metrics: Accuracy, Cohen's Kappa, Macro-F1

SHHS

The Sleep Heart Health Study (SHHS) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute …

📊 7 results
📏 Metrics: Accuracy, Cohen's Kappa, Macro-F1

Sleep-EDF

The sleep-edf database contains 197 whole-night PolySomnoGraphic sleep recordings, containing EEG, EOG, chin EMG, and event markers. Some records also …

📊 8 results
📏 Metrics: Accuracy, Cohen's kappa, Macro-F1

Slot Filling

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 12 results
📏 Metrics: F1

CAIS

We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS), and annotate them with slot tags and intent labels. The …

📊 1 results
📏 Metrics: F1

Dialogue State Tracking Challenge

The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art …

📊 1 results
📏 Metrics: F1 score

MASSIVE

MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks …

📊 3 results
📏 Metrics: Slot F1 Score

MixATIS

Dataset is constructed from single intent dataset ATIS. This is a publicly available multi intent dataset, which can be downloaded …

📊 10 results
📏 Metrics: Micro F1

MixSNIPS

Dataset is constructed from single intent dataset SNIPS. This is a publicly available multi intent dataset, which can be downloaded …

📊 11 results
📏 Metrics: Micro F1

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 1 results
📏 Metrics: FITB

ProSLU

In the paper, to bridge the research gap, we propose a new and important task, Profile-based Spoken Language Understanding (ProSLU), …

📊 1 results
📏 Metrics: F1

SLURP

A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets. …

📊 5 results
📏 Metrics: F1

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 7 results
📏 Metrics: F1, F1 (1-shot) avg, F1 (5-shot) avg

Slovak Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Small Object Detection

SODA-D

SODA-D is a large-scale dataset tailored for small object detection in driving scenario, which is built on top of MVD …

📊 1 results
📏 Metrics: mAP@0.5:0.95

Sociology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Sound Event Detection

DESED

The DESED dataset is a dataset designed to recognize sound event classes in domestic environments. The dataset is designed to …

📊 10 results
📏 Metrics: event-based F1 score, PSDS1, PSDS2

L3DAS21

L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65-hour 3D audio corpus, accompanied with …

📊 5 results
📏 Metrics: Error Rate, SED-score, F-Score

WildDESED

WildDESED is an extension of the original DESED dataset, created to reflect various domestic scenarios by incorporating complex and unpredictable …

📊 5 results
📏 Metrics: PSDS1 (-5dB), PSDS1 (0dB), PSDS1 (5dB), PSDS1 (10dB), PSDS1 (Clean)

Sound Event Localization and Detection

L3DAS21

L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65-hour 3D audio corpus, accompanied with …

📊 1 results
📏 Metrics: SELD score

PodcastFillers

The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. …

📊 2 results
📏 Metrics: event-based F1 score

RWCP Sound Scene Database

The RWCP Sound Scene Database includes non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses …

📊 1 results
📏 Metrics: accuracy

STARSS22

The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset consists of recordings of real scenes captured with high channel-count spherical microphone array …

📊 2 results
📏 Metrics: Class-dependent localization error, Class-dependent localization recall, location-dependent F1-score (macro), location-dependent F1-score (micro), Localization-dependent error rate (20°)

TAU-NIGENS Spatial Sound Events 2021

The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated …

📊 1 results
📏 Metrics: ER≤20°, F1≤20°, LE-CD, LR-CD

Source Code Summarization

CoDesc

CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and …

📊 1 results
📏 Metrics: BLEU-4

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 1 results
📏 Metrics: F1

DeepCom-Java

The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.

📊 2 results
📏 Metrics: BLEU-4, METEOR

Java scripts

The Java dataset introduced in Hybrid-DeepCom (Deep code comment generation with hybrid lexical and syntactical information), commonly used to evaluate …

📊 1 results
📏 Metrics: BLEU-4, METEOR

ParallelCorpus-Python

The Python dataset introduced in the Parallel Corpus paper ([A Parallel Corpus of Python Functions and Documentation Strings for Automated …

📊 2 results
📏 Metrics: BLEU-4, METEOR

Source-Free Domain Adaptation

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 2 results
📏 Metrics: Average Accuracy

VisDA-2017

VisDA-2017 is a simulation-to-real dataset for domain adaptation with over 280,000 images across 12 categories in the training, validation and …

📊 10 results
📏 Metrics: Accuracy

Spanish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Sparse Learning

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 9 results
📏 Metrics: Top-1 Accuracy

Spatial Relation Recognition

Rel3D

Understanding spatial relations (e.g., “laptop on table”) in visual input is important for both humans and robots. Existing datasets are …

📊 9 results
📏 Metrics: Acc

Spatio-Temporal Video Grounding

HC-STVG1

The newly proposed HC-STVG task aims to localize the target person spatio-temporally in an untrimmed video. For this task, we …

📊 3 results
📏 Metrics: m_vIoU, vIoU@0.3, vIoU@0.5

HC-STVG2

We have added data and cleaned the labels in HC-STVG to build the HC-STVG2.0. While the original database contained 5660 …

📊 4 results
📏 Metrics: Val m_vIoU, Val vIoU@0.3, Val vIoU@0.5

VidSTG

The VidSTG dataset is a spatio-temporal video grounding dataset constructed based on the video relation dataset VidOR. VidOR contains 7,000, …

📊 3 results
📏 Metrics: Declarative m_vIoU, Declarative vIoU@0.3, Declarative vIoU@0.5, Interrogative m_vIoU, Interrogative vIoU@0.3, Interrogative vIoU@0.5

Speaker Attribution in German Parliamentary Debates (GermEval 2023, subtask 1)

GePaDe

This dataset encompasses 265 speeches (over 200,000 tokens) from the German Bundestag, primarily from the 19th legislative term (2017-2021), given …

📊 1 results
📏 Metrics: F1

Speaker Attribution in German Parliamentary Debates (GermEval 2023, subtask 2)

GePaDe

This dataset encompasses 265 speeches (over 200,000 tokens) from the German Bundestag, primarily from the 19th legislative term (2017-2021), given …

📊 1 results
📏 Metrics: F1

Speaker Diarization

AliMeeting

AliMeeting corpus consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by 8-channel microphone array as …

📊 1 results
📏 Metrics: DER(%)

DIHARD II

The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, …

📊 1 results
📏 Metrics: DER(%), DER - no overlap

Speaker Identification

VoxCeleb1

VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

📊 12 results
📏 Metrics: Top-1 (%), Top-5 (%), Number of Params, Accuracy

Speaker Recognition

VoxCeleb1

VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

📊 2 results
📏 Metrics: EER

Speaker Verification

CN-CELEB

CN-Celeb is a large-scale speaker recognition dataset collected 'in the wild'. This dataset contains more than 130,000 utterances from 1,000 …

📊 2 results
📏 Metrics: EER

VibraVox (forehead accelerometer)

This is the forehead accelerometer variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for advancing …

📊 1 results
📏 Metrics: Test EER, Test min-DCF

VibraVox (headset microphone)

This is the reference headset microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …

📊 1 results
📏 Metrics: Test EER, Test min-DCF

VibraVox (rigid in-ear microphone)

This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …

📊 1 results
📏 Metrics: Test EER, Test min-DCF

VibraVox (soft in-ear microphone)

This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …

📊 1 results
📏 Metrics: Test EER, Test min-DCF

VibraVox (temple vibration pickup)

This is the temple vibration pickup variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …

📊 1 results
📏 Metrics: Test EER, Test min-DCF

VibraVox (throat microphone)

This is the throat microphone (laryngophone) variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …

📊 1 results
📏 Metrics: Test EER, Test min-DCF

VoxCeleb1

VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

📊 16 results
📏 Metrics: EER

VoxCeleb2

VoxCeleb2 is a large scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances …

📊 1 results
📏 Metrics: EER

Speech Emotion Recognition

BERSt

We release the BERSt Dataset for various speech recognition tasks including Automatic Speech Recognition (ASR) and Speech Emotion …

📊 3 results
📏 Metrics: Unweighted Accuracy (UA), Weighted Accuracy (WA)

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …

📊 8 results
📏 Metrics: Accuracy

EmoDB Dataset

The EMODB database is the freely available German emotional database. The database is created by the Institute of Communication Science, …

📊 1 results
📏 Metrics: Accuracy, F1

IEMOCAP

The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for …

📊 5 results
📏 Metrics: UA CV, WA CV, UA, WA, F1

LSSED

LSSED is a challenging large-scale English dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in …

📊 1 results
📏 Metrics: Unweighted Accuracy (UA)

MSP-IMPROV

We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and …

📊 1 results
📏 Metrics: UA

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 1 results
📏 Metrics: Accuracy, F1 Score, Precision, Recall, F1

RESD

Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced …

📊 3 results
📏 Metrics: Weighted Accuracy (WA), Unweighted Accuracy (UA), Weighted F1

ShEMO

The database includes 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio …

📊 1 results
📏 Metrics: Unweighted Accuracy

Speech Enhancement

DNS Challenge

The DNS Challenge at INTERSPEECH 2020 intended to promote collaborative research in single-channel Speech Enhancement aimed to maximize the perceptual …

📊 4 results
📏 Metrics: PESQ-NB, PESQ-WB

EARS-WHAM

The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise …

📊 6 results
📏 Metrics: PESQ-WB, SI-SDR, ESTOI, SIGMOS, DNSMOS, POLQA

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 6 results
📏 Metrics: PESQ, STOI, ViSQOL, HASQI, Audio Quality MOS, SDR, ESTOI, HASPI, SI-SDR, SIIB, SNR, SegSNR

RealMAN

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …

📊 1 results
📏 Metrics: DNSMOS, DNSMOS BAK, DNSMOS OVRL, DNSMOS SIG, PESQ-WB

VB-DemandEx

Uses same clean speech as VoiceBank+Demand but more noise types. Features much lower SNRs ([−10, −5, 0, 5, 10, 15, …

📊 4 results
📏 Metrics: ESTOI, Number of parameters (M), PESQ (wb), SI-SDR, SSNR

VoiceBank + DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …

📊 34 results
📏 Metrics: PESQ (wb), CBAK, COVL, CSIG, STOI, ESTOI, SSNR, SI-SDR, Para. (M)

VoiceBank+DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …

📊 2 results
📏 Metrics: PESQ, DNSMOS, DNSMOS BAK, DNSMOS OVRL, DNSMOS SIG, ESTOI, SI-SDR, PESQ (wb)

WHAM!

The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …

📊 1 results
📏 Metrics: PESQ, SDR, SI-SNR

WHAMR!

WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …

📊 2 results
📏 Metrics: PESQ, SI-SDR, ΔPESQ, SI-SNR, SDR

Speech Recognition

AISHELL-1

AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin. Source: [AISHELL-1: An Open-Source Mandarin …

📊 18 results
📏 Metrics: Word Error Rate (WER), Params(M)

AISHELL-2

AISHELL-2 contains 1000 hours of clean read-speech data from iOS and is free for academic usage. Source: [AISHELL-2: Transforming Mandarin ASR …

📊 2 results
📏 Metrics: Word Error Rate (WER)

Common Voice

Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded …

📊 2 results
📏 Metrics: Test WER

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …

📊 5 results
📏 Metrics: WER (%)

GigaSpeech

GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, …

📊 1 results
📏 Metrics: Word Error Rate (WER)

Google Speech Commands - Musan

This noisy speech test set is created from the Google Speech Commands v2 [1] and the Musan dataset[2]. It could …

📊 1 results
📏 Metrics: Error rate - SNR 0dB

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 1 results
📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 4 results
📏 Metrics: Word Error Rate (WER)

LibriCSS

Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …

📊 2 results
📏 Metrics: Word Error Rate (WER)

MediaSpeech

MediaSpeech is a media speech dataset (you might have guessed this) built with the purpose of testing Automated Speech Recognition …

📊 8 results
📏 Metrics: WER for Arabic, WER for French, WER for Spanish, WER for Turkish

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 8 results
📏 Metrics: VoxPopuli (Dev), VoxPopuli (Test), VoxCeleb (Dev), VoxCeleb (Test)

SPGISpeech

SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours …

📊 2 results
📏 Metrics: Word Error Rate (WER)

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

📊 3 results
📏 Metrics: Accuracy (%)

TED-LIUM

The TED-LIUM corpus consists of English-language TED talks. It includes transcriptions of these talks. The audio is sampled at 16kHz. …

📊 2 results
📏 Metrics: Word Error Rate (WER)

TIMIT

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists …

📊 20 results
📏 Metrics: Percentage error

TUDA

Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test) Count of …

📊 3 results
📏 Metrics: Test WER

VietMed

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …

📊 8 results
📏 Metrics: Dev WER, Test WER

WenetSpeech

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours high-quality labeled speech, 2,400+ hours weakly labelled speech, and about …

📊 8 results
📏 Metrics: Character Error Rate (CER)

Speech Separation

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 8 results
📏 Metrics: SI-SNRi, SDRi, PESQ, STOI

LibriCSS

Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …

📊 2 results
📏 Metrics: 0S, 0L, 10%, 20%, 30%, 40%

VoxCeleb2

VoxCeleb2 is a large scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances …

📊 5 results
📏 Metrics: SI-SNRi, SDRi

WHAM!

The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …

📊 5 results
📏 Metrics: SI-SDRi

WHAMR!

WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …

📊 17 results
📏 Metrics: SI-SDRi, MACs (G), Number of parameters (M), SDRi

WSJ0-2mix

WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus. Source: [Deep …

📊 36 results
📏 Metrics: SI-SDRi, SDRi, Number of parameters (M), MACs (G)

Speech Synthesis

Blizzard Challenge 2013

The English data for voice building was obtained, prepared and provided to the challenge by Lessac Technologies Inc., having originally …

📊 2 results
📏 Metrics: NLL

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 4 results
📏 Metrics: Mean Opinion Score

LibriTTS

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by …

📊 15 results
📏 Metrics: PESQ, M-STFT, MCD, Periodicity, V/UV F1

Speech-to-Speech Translation

CVSS

CVSS is a massively multilingual-to-English speech to speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into …

📊 2 results
📏 Metrics: ASR-BLEU, Parameters

TAT

Taiwanese Across Taiwan (TAT) corpus is a Large-Scale database of Native Taiwanese Article/Reading Speech collected across Taiwan. This corpus contains …

📊 8 results
📏 Metrics: ASR-BLEU (Dev), ASR-BLEU (Test)

Speech-to-Text Translation

MuST-C

MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English …

📊 2 results
📏 Metrics: SacreBLEU

Splice Site Prediction

GUE

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. Contains seven tasks: promoter prediction, core …

📊 1 results
📏 Metrics: MCC

Spoken Language Understanding

Fluent Speech Commands

Fluent Speech Commands is an open source audio dataset for spoken language understanding (SLU) experiments. Each utterance is labeled with …

📊 17 results
📏 Metrics: Accuracy (%)

Snips-SmartLights

The SmartLights benchmark from Snips tests the capability of controlling lights in different rooms. It consists of 1660 requests which are …

📊 7 results
📏 Metrics: Accuracy (%)

Snips-SmartSpeaker

The SmartSpeaker benchmark tests the performance of reacting to music player commands in English as well as in French. It …

📊 5 results
📏 Metrics: Accuracy-EN (%), Accuracy-FR (%)

Spoken-SQuAD

In SpokenSQuAD, the document is in spoken form, the input question is in the form of text and the answer …

📊 4 results
📏 Metrics: F1 score

Timers and Such

Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. …

📊 3 results
📏 Metrics: Accuracy (%)

Stance Detection

ARC (AI2 Reasoning Challenge)

The AI2’s Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset, containing questions from science exams from grade 3 to …

📊 1 results
📏 Metrics: F1

Dhoroni

Climate change poses critical challenges globally, disproportionately affecting low-income countries that often lack resources and linguistic representation on the international …

📊 1 results
📏 Metrics: Accuracy, F1 Score, Precision, Recall

FNC-1

FNC-1 was designed as a stance detection dataset and it contains 75,385 labeled headline and article pairs. The pairs are …

📊 1 results
📏 Metrics: F1

MGTAB

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types …

📊 4 results
📏 Metrics: Acc, F1

P-Stance

P-Stance: A Large Dataset for Stance Detection in Political Domain 2021

📊 1 results
📏 Metrics: Average F1

Perspectrum

Perspectrum is a dataset of claims, perspectives and evidence, making use of online debate websites to create the initial data …

📊 1 results
📏 Metrics: F1

RuStance

Includes Russian tweets and news comments from multiple sources, covering multiple stories, as well as text classification approaches to stance …

📊 1 results
📏 Metrics: F1

Snopes

Fact-checking (FC) articles which contain pairs (multimodal tweet and a FC-article) from snopes.com. Source: [Where Are the Facts? Searching for …

📊 1 results
📏 Metrics: F1

VAST

VAST consists of a large range of topics covering broad themes, such as politics (e.g., ‘a Palestinian state’), education (e.g., …

📊 1 results
📏 Metrics: F1

State Change Object Detection

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 1 results
📏 Metrics: AP, AP50, AP75

Stereo Depth Estimation

Spring

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes …

📊 4 results
📏 Metrics: 1px total

Stereo Disparity Estimation

Middlebury 2014

The Middlebury 2014 dataset contains a set of 23 high resolution stereo pairs for which known camera calibration parameters and …

📊 2 results
📏 Metrics: D1 Error (2px)

Stereotypical Bias Analysis

CrowS-Pairs

CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. In CrowS-Pairs …

📊 4 results
📏 Metrics: Gender, Religion, Race/Color, Sexual Orientation, Age, Nationality, Disability, Physical Appearance, Socioeconomic status, Overall

Stochastic Optimization

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 1 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Accuracy (max), Accuracy (mean)

CoLA

The Corpus of Linguistic Acceptability (CoLA) consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by …

📊 1 results
📏 Metrics: Accuracy (max), Accuracy (mean)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NLL

Stock Market Prediction

Astock

Provides (1) financial news for each specific stock and (2) various technical and fundamental factors for each stock.

📊 16 results
📏 Metrics: Accuracy, F1-score, Recall, Precision

stocknet

This repository releases a comprehensive dataset for stock movement prediction from tweets and historical stock prices. Please cite …

📊 1 results
📏 Metrics: F1

Story Continuation

VIST

The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on …

📊 2 results
📏 Metrics: FID

Story Generation

WritingPrompts

WritingPrompts is a large dataset of 300K human-written stories paired with writing prompts from an online forum. Source: [Hierarchical Neural …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, Distinct-4

Style Transfer

GYAFC

Grammarly’s Yahoo Answers Formality Corpus (GYAFC) is the largest dataset for any style containing a total of 110K informal / …

📊 1 results
📏 Metrics: Accuracy, BLEU-4, Harmonic mean

StyleBench

To comprehensively evaluate the effectiveness and generalization ability of style transfer methods, we build StyleBench that covers 73 distinct styles, …

📊 7 results
📏 Metrics: CLIP Score

WikiArt

WikiArt contains paintings from 195 different artists. The dataset has 42129 images for training and 10628 images for testing. Source: …

📊 2 results
📏 Metrics: SSIM, ArtFID

Subjectivity Analysis

Czech Subjectivity Dataset

Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions. See the paper description …

📊 5 results
📏 Metrics: Accuracy

SUBJ

Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating …

📊 16 results
📏 Metrics: Accuracy

Supervised Image Retrieval

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 3 results
📏 Metrics: Precision@100

Surface Normals Estimation

iBims-1

iBims-1 (independent Benchmark images and matched scans - version 1) is a new high-quality RGB-D dataset, especially designed for testing …

📊 2 results
📏 Metrics: % < 11.25, % < 22.5, % < 30, Mean

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results
📏 Metrics: Mean Angle Error

Stanford-ORB

We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide …

📊 7 results
📏 Metrics: Cosine Distance

Taskonomy

Taskonomy provides a large and high-quality dataset of varied indoor scenes. - Complete pixel-level geometric information via aligned meshes. - …

📊 1 results
📏 Metrics: L1 error

Surgical Skills Evaluation

JIGSAWS

The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a surgical activity dataset for human motion modeling. The data …

📊 2 results
📏 Metrics: Accuracy, Edit Distance

Surgical Phase Recognition

Cholec80

Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured …

📊 6 results
📏 Metrics: F1, Acc

GraSP

Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a …

📊 2 results
📏 Metrics: mAP

HeiChole Benchmark

Analyzing the surgical workflow is a prerequisite for many applications in computer assisted surgery (CAS), such as context-aware visualization of …

📊 5 results
📏 Metrics: F1

MISAW

The MISAW data set is composed of 27 sequences of micro-surgical anastomosis on artificial blood vessels performed by 3 surgeons …

📊 3 results
📏 Metrics: mAP

Symmetry Detection

YCB-Video

The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects …

📊 1 results
📏 Metrics: PR AUC

Synthetic Data Generation

UNSW-NB15

UNSW-NB15 is a network intrusion dataset. It contains nine different attacks, including DoS, worms, backdoors, and fuzzers. The dataset contains …

📊 2 results
📏 Metrics: EMD

Table Detection

ICDAR 2019

Table is a compact and efficient form for summarizing and presenting correlative information in handwritten and printed archival documents, scientific …

📊 2 results
📏 Metrics: Weighted Average F1-score

STDW

STDW is a diverse large-scale dataset for table detection with more than seven thousand samples containing a wide variety of …

📊 2 results
📏 Metrics: IoU, AP

Table Recognition

PubTabNet

PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML …

📊 13 results
📏 Metrics: TEDS (all samples), TEDS-Struct

WTW

WTW (Wired Table in the Wild) is a large-scale dataset which includes well-annotated structure parsing of multiple style tables in …

📊 1 results
📏 Metrics: F1

Table-based Fact Verification

TabFact

TabFact is a large-scale dataset which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables, their relations …

📊 15 results
📏 Metrics: Test, Val

Table-to-Text Generation

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 2 results
📏 Metrics: METEOR, BLEU, BERT, BLEURT, Mover, TER, FactSpotter

E2E

End-to-End NLG Challenge (E2E) aims to assess whether recent end-to-end NLG systems can generate more complex output by learning from …

📊 2 results
📏 Metrics: BLEU, CIDEr, METEOR, NIST, ROUGE-L

WikiBio

This dataset gathers 728,321 biographies from English Wikipedia. It aims at evaluating text generation algorithms. For each article, we provide …

📊 4 results
📏 Metrics: BLEU, ROUGE, PARENT

Wikipedia Person and Animal Dataset

This dataset gathers 428,748 person and 12,236 animal infobox with descriptions based on Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

📊 2 results
📏 Metrics: BLEU, ROUGE, METEOR

Tabular Data Generation

Adult Census Income

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, Parameters(M), RF Accuracy

California Housing Prices

Median house prices for California districts derived from the 1990 census. This is the dataset used in …

📊 6 results
📏 Metrics: Parameters(M), RF Mean Squared Error, DT Mean Squared Error, LR Mean Squared Error

Diabetes

What do the instances in this dataset represent? The instances represent hospitalized patient records diagnosed with diabetes. **Are there recommended …

📊 6 results
📏 Metrics: DT Accuracy, Parameters(M), LR Accuracy, RF Accuracy

HELOC

The HELOC dataset from FICO. Each entry in the dataset is a line of credit, typically offered by …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, Parameters(M), RF Accuracy

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, Parameters(M), RF Accuracy

Travel

A tour and travels company wants to predict whether a customer will churn based on the indicators given below. …

📊 6 results
📏 Metrics: DT Accuracy, LR Accuracy, RF Accuracy, Parameters(M)

Talking Face Generation

CREMA-D

CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …

📊 1 results
📏 Metrics: EmoAcc, FID, LSE-C

LRW

The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual database that contains 500 different words from over 1,000 …

📊 1 results
📏 Metrics: LMD, SSIM

Target Sound Extraction

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 1 results
📏 Metrics: SDRi, SI-SDRi

AudioSet

AudioSet is an audio event dataset that consists of over 2M human-annotated 10-second video clips. These clips are collected from …

📊 1 results
📏 Metrics: SDRi, SI-SDRi

FSDSoundScapes

A synthetic sound mixture specification dataset for the Target Sound Extraction (TSE) task. Dataset samples consist of a .jams file …

📊 1 results
📏 Metrics: SI-SNRi

Task-Oriented Dialogue Systems

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 2 results
📏 Metrics: METEOR

Temporal Action Localization

CrossTask

CrossTask dataset contains instructional videos, collected for 83 different tasks. For each task an ordered list of steps with manual …

📊 7 results
📏 Metrics: Recall

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 6 results
📏 Metrics: Avg mAP (0.1-0.5), mAP IOU@0.1, mAP IOU@0.2, mAP IOU@0.3, mAP IOU@0.4, mAP IOU@0.5

FineAction

FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges …

📊 9 results

HACS

HACS is a dataset for human action recognition. It uses a taxonomy of 200 action classes, which is identical to …

📊 11 results

MUSES

MUSES is a large-scale dataset for temporal event (action) localization. It focuses on the temporal localization of multi-shot events, which …

MultiTHUMOS

The MultiTHUMOS dataset contains dense, multilabel, frame-level action annotations for 30 hours across 400 videos in the THUMOS'14 action detection …

THUMOS14

The THUMOS14 (THUMOS 2014) dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for …

📊 1 results
📏 Metrics: Avg mAP (0.3:0.7)

Temporal Information Extraction

TempEval-3

Within the SemEval-2013 evaluation exercise, the TempEval-3 shared task aims to advance research on temporal information processing. It follows on …

📊 1 results
📏 Metrics: Temporal awareness

Temporal Relation Extraction

Vinoground

A temporal counterfactual dataset composed of 1,000 short, natural video-caption pairs.

📊 16 results
📏 Metrics: Text Score, Video Score, Group Score

Temporal Sentence Grounding

Charades-STA

Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations. Source: [TALL: Temporal Activity Localization …

Text Classification

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 1 results
📏 Metrics: Accuracy

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 21 results
📏 Metrics: Error

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. A significant …

📊 1 results
📏 Metrics: F1 - macro

An Amharic News Text classification Dataset

In NLP, text classification is one of the primary problems we try to solve and its uses in language analyses …

📊 2 results
📏 Metrics: Accuracy

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: Accuracy

BANKING77

Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents …

📊 1 results
📏 Metrics: Accuracy

BLURB

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …

📊 3 results
📏 Metrics: F1

Bala-Copa

The Balanced Choice of Plausible Alternatives dataset is a benchmark for training machine learning models that are robust to superficial …

📊 3 results
📏 Metrics: Accuracy

DBpedia

DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia …

📊 19 results
📏 Metrics: Error

HateXplain

Covers multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly …

📊 4 results
📏 Metrics: Accuracy (2 classes), F1 Macro

IMDb Movie Reviews

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database …

📊 2 results
📏 Metrics: AUC, Accuracy (2 classes), F1 Macro

LoT-insts

LoT-insts contains over 25k classes whose frequencies are naturally long-tail distributed. Its test set from four different subsets: many-, medium-, …

📊 5 results
📏 Metrics: Accuracy, Macro-F1

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 9 results
📏 Metrics: Accuracy

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 31 results
📏 Metrics: Accuracy

Ohsumed

Ohsumed includes medical abstracts from the MeSH categories of the year 1991. [Joachims, 1997] used the first 20,000 …

📊 9 results
📏 Metrics: Accuracy

Overruling

The Overruling dataset is a law dataset corresponding to the task of determining when a sentence is overruling a prior …

📊 3 results
📏 Metrics: F1(10-fold)

RCV1

The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles produced by Reuters …

📊 3 results
📏 Metrics: Accuracy, Macro F1, Micro F1, P@1, P@3, P@5, nDCG@1, nDCG@3, nDCG@5

SILICONE Benchmark

The Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE) benchmark is a collection of resources for training, evaluating, and analyzing …

📊 1 results
📏 Metrics: 1:1 Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Accuracy

Social media attributions of YouTube comments

Dataset constructed from YouTube comments (72,098 comments posted by 43,859 users on 623 videos relevant to the crisis).

📊 2 results
📏 Metrics: Accuracy (2 classes), F1 Macro

TREC-10

A question type classification dataset with 6 classes for questions about a person, location, numeric information, etc. The test split …

📊 1 results
📏 Metrics: Accuracy

Terms of Service

The Terms of Service dataset is a law dataset corresponding to the task of identifying whether contractual terms are potentially …

📊 3 results
📏 Metrics: F1(10-fold)

This is not a Dataset

We introduce a large semi-automatically generated dataset of ~400,000 descriptive sentences about commonsense knowledge that can be true or false …

📊 2 results
📏 Metrics: Accuracy, Coherence

UK Key Stage Readability

Education is increasingly data-driven, and the ability to analyse and adapt educational materials quickly and effectively is important for keeping …

📊 15 results
📏 Metrics: F1

WNUT-2020 Task 2

📊 1 results
📏 Metrics: F1

Yahoo! Answers

The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and …

📊 9 results
📏 Metrics: Accuracy

arXiv-10

Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly …

📊 3 results
📏 Metrics: Accuracy

Text Clustering

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 2 results
📏 Metrics: Accuracy

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 31 results
📏 Metrics: V-Measure

Text Complexity Assessment (GermEval 2022)

TextComplexityDE

TextComplexityDE is a dataset consisting of 1,000 German sentences taken from 23 Wikipedia articles in 3 different article-genres …

📊 1 results
📏 Metrics: RMSE

Text Detection

UrduDoc

The UrduDoc Dataset is a benchmark dataset for Urdu text line detection in scanned documents. It is created as a …

📊 5 results
📏 Metrics: Precision, Recall

Text Generation

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN …

📊 1 results
📏 Metrics: ROUGE-L

COCO Captions

COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …

📊 4 results
📏 Metrics: BLEU-2, BLEU-3, BLEU-4, BLEU-5

CSL

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …

📊 1 results
📏 Metrics: ROUGE-L

CommonGen

CommonGen is constructed through a combination of crowdsourced and existing caption corpora, and consists of 79k commonsense descriptions over 35k unique …

📊 4 results
📏 Metrics: CIDEr, METEOR, BLEU-4, SPICE

Czech restaurant information

Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It …

📊 3 results
📏 Metrics: METEOR

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 3 results
📏 Metrics: BLEU, METEOR, FactSpotter

DailyDialog

DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4

HarmfulQA

As a part of our research efforts toward making LLMs safer for public …

📊 1 results
📏 Metrics: ASR

LCSTS

LCSTS is a large Chinese short-text summarization corpus constructed from the Chinese microblogging website Sina Weibo, which …

📊 1 results
📏 Metrics: ROUGE-L

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 2 results
📏 Metrics: eval_loss

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

📊 4 results
📏 Metrics: BLEU-1, Perplexity

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

📊 4 results
📏 Metrics: Distinct-3, Distinct-4, Distinct-2, Perplexity

SciQ

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in …

📊 3 results
📏 Metrics: Accuracy

Text Simplification

ASSET

ASSET is a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification …

📊 11 results
📏 Metrics: BLEU, SARI (EASSE>=0.2.1), METEOR, FKGL, QuestEval (Reference-less, BERTScore)

DEplain-APA-doc

DEplain-APA-doc: A German Parallel Corpus for Document Simplification on News Texts. DEplain is a new dataset of parallel, professionally …

📊 3 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

DEplain-APA-sent

DEplain-APA-sent: A German Parallel Corpus for Sentence Simplification on News Texts. DEplain is a new dataset of parallel, professionally …

📊 2 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

DEplain-web-doc

DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts. DEplain is a new dataset of parallel, professionally …

📊 3 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

DEplain-web-sent

DEplain-web-sent: A German Parallel Corpus for Sentence Simplification on Web Texts. DEplain is a new dataset of parallel, professionally …

📊 2 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

Newsela

The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus that …

📊 10 results
📏 Metrics: SARI, BLEU

TurkCorpus

TurkCorpus, a dataset with 2,359 original sentences from English Wikipedia, each with 8 manual reference simplifications. The dataset is divided …

📊 20 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, METEOR, FKGL, QuestEval (Reference-less, BERTScore)

Text Spotting

ICDAR 2015

ICDAR 2015 is a scene text detection dataset used for the ICDAR 2015 conference.

📊 17 results
📏 Metrics: F-measure (%) - Strong Lexicon, F-measure (%) - Weak Lexicon, F-measure (%) - Generic Lexicon

SCUT-CTW1500

The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text …

📊 10 results
📏 Metrics: F-measure (%) - No Lexicon, F-measure (%) - Full Lexicon

Total-Text

Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, …

📊 12 results
📏 Metrics: F-measure (%) - No Lexicon, F-measure (%) - Full Lexicon

Text Summarization

ACI-Bench

Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 27 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

BigPatent

Consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Source: [BIGPATENT: A Large-Scale Dataset …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

BillSum

BillSum is the first dataset for summarization of US Congressional and California state bills. The BillSum dataset consists of three …

📊 1 results
📏 Metrics: ROUGE-1

BookSum

BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such …

📊 3 results
📏 Metrics: ROUGE, ROUGE-2, ROUGE-L

CL-SciSumm

📊 1 results
📏 Metrics: ROUGE-2

DialogSum

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics. This work …

📊 3 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore

Gazeta

Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form training, …

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BLEU, METEOR

GovReport

GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

How2

The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2,022 development (dev), …

📊 2 results
📏 Metrics: Content F1, ROUGE-L, ROUGE-1

Klexikon

The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both …

📊 4 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

LCSTS

LCSTS is a large Chinese short-text summarization corpus constructed from the Chinese microblogging website Sina Weibo, which …

📊 1 results
📏 Metrics: ROUGE-1

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 26 results
📏 Metrics: Spearman Correlation

MeQSum

MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions. Source: https://www.aclweb.org/anthology/P19-1215.pdf

📊 1 results
📏 Metrics: ROUGE-L

MeetingBank

MeetingBank is a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets. It contains …

📊 2 results
📏 Metrics: ROUGE-L, ROUGE-1, ROUGE-2

MentSum

Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms …

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

OrangeSum

Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model. OrangeSum is a single-document extreme summarization dataset with two tasks: title and …

📊 2 results
📏 Metrics: ROUGE-1

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 28 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

QMSum

QMSum is a new human-annotated benchmark for query-based multi-domain meeting summarisation task, which consists of 1,808 query-summary pairs over 232 …

📊 1 results
📏 Metrics: ROUGE-1

Reddit TIFU

The Reddit TIFU dataset is a newly collected Reddit dataset, where TIFU denotes the name of the /r/tifu subreddit. There are 122,933 …

📊 5 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

SAMSum

A new dataset with abstractive dialogue summaries. Source: SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

📊 11 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BertScoreF1

WikiHow

WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base …

📊 3 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, Content F1

XSum

The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create …

📊 1 results
📏 Metrics: ROUGE-1

arXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from …

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers. Source: [A Discourse-Aware Attention Model for Abstractive Summarization of …

📊 4 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

Text based Person Retrieval

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five …

📊 16 results
📏 Metrics: R@1, R@5, R@10, mAP, Rank-1, Rank-10, Rank-5

ICFG-PEDES

One large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval. Compared with existing databases, ICFG-PEDES has three key advantages. …

📊 11 results
📏 Metrics: R@1, Rank-1, R@5, R@10, mAP, mINP, Rank-10, Rank-5

RSTPReid

RSTPReid contains 20,505 images of 4,101 persons from 15 cameras. Each person has 5 corresponding images taken by different cameras …

📊 9 results
📏 Metrics: R@1, R@5, R@10, mAP, Rank-1, Rank-10, Rank-5, mINP

Text to 3D

T³Bench

T³Bench is the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed …

📊 6 results
📏 Metrics: Avg

Text to Audio Retrieval

AudioCaps

AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …

📊 11 results
📏 Metrics: R@1, R@5, R@10

Clotho

Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total …

📊 12 results
📏 Metrics: R@1, R@5, R@10, mAP@10

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe …

📊 1 results
📏 Metrics: Text-to-audio R@1, Text-to-audio R@10, Text-to-audio R@5

SoundDescs

We introduce a new audio dataset called SoundDescs that can be used for tasks such as text to audio retrieval, …

📊 4 results
📏 Metrics: R@1, R@10

Text to Video Retrieval

Kinetics-GEB+

Kinetics-GEB+ (Generic Event Boundary Captioning, Grounding and Retrieval) is a dataset that consists of over 170k boundaries associated with captions …

📊 2 results
📏 Metrics: mAP, text-to-video R@1, text-to-video R@10, text-to-video R@5, text-to-video R@50

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 1 results
📏 Metrics: text-to-video R@1

MSVD-Indonesian

MSVD-Indonesian is derived from the MSVD dataset, which is obtained with the help of a machine translation service. This dataset …

📊 1 results
📏 Metrics: R@1, R@5, R@10, Median Rank, Mean Rank

Text-To-SQL

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents a pioneering, cross-domain dataset that examines the impact of extensive …

📊 16 results
📏 Metrics: Execution Accuracy % (Test), Execution Accuracy % (Dev), Execution Accuracy (Human)

KaggleDBQA

KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and …

📊 2 results
📏 Metrics: Exact Match (EM)

SEDE

SEDE is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written …

📊 1 results
📏 Metrics: PCM-F1 (dev), PCM-F1 (test)

SParC

SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces …

📊 6 results
📏 Metrics: interaction match accuracy, question match accuracy

SQL-Eval

SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed based on Spider. The original link can be found …

📊 1 results
📏 Metrics: Execution Accuracy

Spider 2.0

Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various …

📊 8 results
📏 Metrics: Success Rate

Text-To-Speech Synthesis

20000 utterances

20000 utterances

📊 1 results
📏 Metrics: 10-keyword Speech Commands dataset

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 15 results
📏 Metrics: Audio Quality MOS, Pleasantness MOS, Word Error Rate (WER), MOS, WER (%)

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

📊 1 results
📏 Metrics: MOS

Text-based Person Retrieval with Noisy Correspondence

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five …

📊 5 results
📏 Metrics: Rank-1, Rank 10, Rank-5, mAP, mINP

ICFG-PEDES

One large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval. Compared with existing databases, ICFG-PEDES has three key advantages. …

📊 5 results
📏 Metrics: Rank 1, Rank-10, Rank-5, mAP, mINP

RSTPReid

RSTPReid contains 20,505 images of 4,101 persons from 15 cameras. Each person has 5 corresponding images taken by different cameras …

📊 5 results
📏 Metrics: Rank 1, Rank 10, Rank 5, mAP, mINP

Text-based de novo Molecule Generation

ChEBI-20

Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 19 results
📏 Metrics: BLEU, Exact Match, Frechet ChemNet Distance (FCD), Levenshtein, MACCS FTS, Morgan FTS, RDK FTS, Text2Mol, Validity, Parameter Count

Text-to-3D-Human Generation

DeepFashion

DeepFashion is a dataset containing around 800K diverse fashion images with their rich annotations (46 categories, 1,000 descriptive attributes, bounding …

📊 1 results
📏 Metrics: CLIP Score, Depth Error, Fashion Accuracy, Frechet Inception Distance, Percentage of Correct Keypoints

Text-to-Image Generation

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 69 results
📏 Metrics: FID, Inception score, FID-1, FID-2, FID-4, FID-8, SOA-C, Zero shot FID

Colors

A large dataset of color names and their respective RGB values, stored in CSV.

📊 1 results
📏 Metrics: Validation Accuracy

Conceptual Captions

Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content …

📊 5 results
📏 Metrics: FID

DrawBench

DrawBench is a comprehensive and challenging benchmark for text-to-image models, introduced by the Imagen research team. …

📊 8 results
📏 Metrics: Aesthetics (LAION Aesthetics Predictor), Human Preference Alignment (HPSv2), Text Alignment (SentenceBERT)

Flickr-8k

Contains 8k Flickr images with captions. …

📊 1 results
📏 Metrics: LPIPS

GenEval

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given …

📊 20 results
📏 Metrics: Overall, Single Obj., Two Obj., Color Attri., Colors, Counting, Position

LAION COCO

LAION-COCO is the world’s largest dataset of 600M generated high-quality captions for publicly available web-images. The images are extracted from …

📊 2 results
📏 Metrics: FID

T2I-CompBench

T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional textual prompts from 3 categories (attribute …

📊 2 results
📏 Metrics: Color, Shape, Texture, Complex, Non-Spatial, Spatial

Text-to-Music Generation

MusicBench

The MusicBench dataset is a music audio-text pair dataset that was designed for text-to-music generation and released along with …

📊 1 results
📏 Metrics: FAD

MusicCaps

MusicCaps is a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. For each 10-second …

📊 20 results
📏 Metrics: FAD, FD_openl3, FD, KL_passt, IS, CLAP_LAION, CLAP_MS

Text-to-Video Generation

EvalCrafter Text-to-Video (ECTV) Dataset

This dataset contains around 10000 videos generated by various methods using the Prompt list. These videos have been evaluated using …

📊 5 results
📏 Metrics: Visual Quality, Motion Quality, Temporal Consistency, Text-to-Video Alignment, Total Score

Kinetics

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 …

📊 1 results
📏 Metrics: Accuracy

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, which consists of 10,000 …

📊 18 results
📏 Metrics: FVD, CLIPSIM, CLIP-FID, FID

Something-Something V2

The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with …

📊 1 results
📏 Metrics: FVD

WebVid

WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their …

📊 1 results
📏 Metrics: FVD

Time Series Analysis

PhysioNet Challenge 2012

The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units …

📊 7 results
📏 Metrics: F1

Speech Commands

Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

📊 6 results
📏 Metrics: % Test Accuracy, % Test Accuracy (Raw Data)

Time Series Anomaly Detection

MSL

This dataset contains expert-labeled telemetry anomaly data from the Mars Science Laboratory (MSL) rover, Curiosity. Real spacecraft and Curiosity rover …

📊 1 results
📏 Metrics: AUPR, F1 Score, Recall, precision

SMAP

Soil Moisture Active Passive (SMAP) dataset is a dataset of soil samples and telemetry information using the Mars rover by …

📊 1 results
📏 Metrics: AUPR, F1 Score, Recall, precision

SMD

A dataset for time-series anomaly detection.

📊 1 results
📏 Metrics: AUPR, F1 score, Recall, precision

UCR Anomaly Archive

The UCR Anomaly Archive is a collection of 250 univariate time series collected in human medicine, biology, meteorology and industry. …

📊 16 results
📏 Metrics: accuracy

Time Series Classification

BorealTC

Recorded with a Husky A200 wheeled UGV, BorealTC contains 116 min of Inertial Measurement Unit (IMU), motor current, and wheel …

📊 2 results
📏 Metrics: Accuracy (5-fold)

ECG200

📊 1 results
📏 Metrics: Accuracy (30-fold)

ECG5000

The original dataset for "ECG5000" is a 20-hour long ECG downloaded from Physionet. The name is BIDMC Congestive Heart Failure …

📊 1 results
📏 Metrics: Accuracy (30-fold)

EigenWorms

Caenorhabditis elegans is a roundworm commonly used as a model organism in the study of genetics. The movement of these …

📊 8 results
📏 Metrics: % Test Accuracy

PhysioNet Challenge 2012

The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units …

📊 27 results
📏 Metrics: AUC, AUC Stdev, AUPRC, AUROC

SHAPES

SHAPES is a dataset of synthetic images designed to benchmark systems for understanding of spatial and logical relations among multiple …

📊 9 results
📏 Metrics: Accuracy, NLL

Time Series Forecasting

ETTh1 (96)

The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 …

📊 2 results
📏 Metrics: MAE, MSE

Extreme Events > Natural Disasters > Hurricane

A new spatio-temporal benchmark dataset (Hurricane) suited for forecasting during extreme events and anomalies. The dataset is provided through …

📊 1 results
📏 Metrics: RMSE

MLO-Cn2

The Mauna Loa Seeing Study was performed by the EOL/Integrated Surface Flux System team, capturing surface meteorology and flux products …

📊 7 results
📏 Metrics: RMSE

PeMSD7

PeMSD7 is traffic data in District 7 of California consisting of the traffic speed of 228 sensors while the period …

📊 3 results
📏 Metrics: 9 steps MAE

USNA-Cn2 (short-duration)

The USNA long-term scintillation study is a continuing effort to characterize and measure optical turbulence in the near-maritime boundary layer. …

📊 5 results
📏 Metrics: RMSE

Weather

Weather is recorded every 10 minutes for the whole year of 2020 and contains 21 meteorological indicators, such as air temperature, …

📊 1 results
📏 Metrics: MAE, MSE

Time Series Prediction

Data Collected with Package Delivery Quadcopter Drone

This experiment was performed in order to empirically measure the energy use of small, electric Unmanned Aerial Vehicles (UAVs). We …

📊 1 results
📏 Metrics: Average mean absolute error

Time Series Regression

FinSen

Enhancing Financial Market Predictions: Causality-Driven Feature Selection. This paper introduces the FinSen dataset, which revolutionizes financial market analysis by integrating …

📊 1 results
📏 Metrics: Mean MSE

MLO-Cn2

The Mauna Loa Seeing Study was performed by the EOL/Integrated Surface Flux System team, capturing surface meteorology and flux products …

📊 5 results
📏 Metrics: RMSE

USNA-Cn2 (long-term)

The USNA long-term scintillation study is a continuing effort to characterize and measure optical turbulence in the near-maritime boundary layer. …

📊 9 results
📏 Metrics: RMSE

USNA-Cn2 (short-duration)

The USNA long-term scintillation study is a continuing effort to characterize and measure optical turbulence in the near-maritime boundary layer. …

📊 10 results
📏 Metrics: RMSE

Topic Models

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 2 results
📏 Metrics: Test perplexity

20NewsGroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 6 results
📏 Metrics: C_v

AG News

AG News (AG's News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 6 results
📏 Metrics: C_v, NPMI

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 2 results
📏 Metrics: MACC, Topic Coherence@50, Topic coherence@5

Traffic Accident Detection

A3D

A new dataset of diverse traffic accidents. Source: Unsupervised Traffic Accident Detection in First-Person Videos

📊 3 results
📏 Metrics: AUC

Traffic Prediction

Beijing Traffic

The Beijing Traffic Dataset collects traffic speeds at 5-minute granularity for 3126 roadway segments in Beijing between 2022/05/12 and 2022/07/25.

📊 1 results
📏 Metrics: MAE

EXPY-TKY

EXPY-TKY contains traffic speed information and the corresponding traffic incident information at 10-minute intervals for 1843 expressway road links …

📊 8 results
📏 Metrics: 1 step MAE, 3 step MAE, 6 step MAE

LargeST

In this work, we propose LargeST as a new benchmark dataset, with the goal of facilitating the …

📊 5 results
📏 Metrics: SD MAE, GBA MAE, GLA MAE, CA MAE

METR-LA

METR-LA is a dataset for traffic prediction.

📊 14 results
📏 Metrics: MAE @ 12 step, 12 steps MAE, 12 steps MAPE, 12 steps RMSE, MAE @ 3 step

NYCBike1

Bike flow data of New York City with grid 16x8.

📊 3 results
📏 Metrics: MAE @ in, MAE @ out, MAPE (%) @ in, MAPE (%) @ out

NYCBike2

Bike flow data of New York City.

📊 3 results
📏 Metrics: MAE @ in, MAE @ out, MAPE (%) @ in, MAPE (%) @ out

NYCTaxi

Taxi flow data of New York City with grid 20x10.

📊 4 results
📏 Metrics: MAE @ in, MAE @ out, MAPE (%) @ in, MAPE (%) @ out

PEMS-BAY

PEMS-BAY is a dataset for traffic prediction.

📊 11 results
📏 Metrics: MAE @ 12 step, RMSE

PeMS04

PeMS04 is a traffic forecasting benchmark.

📊 9 results
📏 Metrics: 12 Steps MAE, FLOPs(M), MAE, MAPE, Parameters(K), RMSE

PeMS07

PeMS07 is a traffic forecasting benchmark.

📊 12 results
📏 Metrics: MAE@1h

PeMS08

PeMS08 is a traffic forecasting dataset.

📊 10 results
📏 Metrics: MAE@1h, FLOPs(M), MAE, MAPE, Parameters(K), RMSE

PeMSD4

The dataset refers to the traffic speed data in San Francisco Bay Area, containing 307 sensors on 29 roads. The …

📊 10 results
📏 Metrics: 12 steps MAE, 12 steps MAPE, 12 steps RMSE

PeMSD7

PeMSD7 is traffic data in District 7 of California consisting of the traffic speed of 228 sensors while the period …

📊 7 results
📏 Metrics: 12 steps MAE, 12 steps MAPE, 12 steps RMSE

PeMSD8

This dataset contains the traffic data in San Bernardino from July to August in 2016, with 170 detectors on 8 …

📊 10 results
📏 Metrics: 12 steps MAE, 12 steps MAPE, 12 steps RMSE, MAE@1h

Q-Traffic

Q-Traffic is a large-scale traffic prediction dataset, which consists of three sub-datasets: query sub-dataset, traffic speed sub-dataset and road network …

📊 1 results
📏 Metrics: MAPE

SZ-Taxi

Taxi speed data at 15-minute intervals from 156 sensors on major roads of Luohu District in Shenzhen, China, from Jan. …

📊 4 results
📏 Metrics: MAE @ 15min, MAE @ 30min, MAE @ 45min, MAE @ 60min

Traffic Sign Detection

CCTSDB-AUG

The CSUST Chinese Traffic Sign Detection Benchmark (CCTSDB) is an existing dataset for traffic sign detection. It consists of nearly …

📊 2 results
📏 Metrics: Averaged Precision, avg-mAP (0.1-0.5)

CCTSDB2021

Traffic signs are among the most important sources of information guiding vehicles, and the detection of traffic signs …

📊 1 results
📏 Metrics: [email protected]

Training-free 3D Point Cloud Classification

ScanObjectNN

ScanObjectNN is a newly published real-world dataset comprising 2,902 3D objects in 15 categories. It is a challenging point …

📊 6 results
📏 Metrics: Accuracy (%), Parameters, Need 3D Data?

Trajectory Modeling

NBA SportVU

The NBA SportVU dataset contains player and ball trajectories for 631 games from the 2015-2016 NBA season. The raw tracking …

📊 1 results
📏 Metrics: 1x1 NLL

Trajectory Planning

ToolBench

ToolBench is an instruction-tuning dataset for tool use, which is created automatically using ChatGPT. Specifically, the authors collect 16,464 real-world …

📊 3 results
📏 Metrics: Win rate

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 4 results
📏 Metrics: Collision-3s, L2-3s, Collision-1s, Collision-2s, Collision-Avg, L2-1s, L2-2s, L2-Avg

Trajectory Prediction

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 1 results
📏 Metrics: ADE, FDE

Apolloscape Trajectory

Our trajectory dataset consists of camera-based images, LiDAR scanned point clouds, and manually annotated trajectories. It is collected under various …

📊 1 results
📏 Metrics: ADE

Argoverse

Argoverse is a tracking benchmark with over 30K scenarios collected in Pittsburgh and Miami. Each scenario is a sequence of …

📊 1 results
📏 Metrics: MR (K=6), brier-minFDE (K=6), minADE (K=6), minFDE (K=6)

ETH

ETH is a dataset for pedestrian detection. The testing set contains 1,804 images in three video clips. The dataset is …

📊 4 results
📏 Metrics: Avg AMD/AMV 8/12

GTA-IM Dataset

The GTA Indoor Motion dataset (GTA-IM) emphasizes human-scene interactions in indoor environments. It consists of HD RGB-D image …

📊 1 results
📏 Metrics: ADE, FDE, STB

HEV-I

Honda Egocentric View-Intersection Dataset (HEV-I) is introduced to enable research on traffic participants interaction modelling, future object localization, as well …

📊 2 results
📏 Metrics: ADE(0.5), ADE(1.0), ADE(1.5), FDE(1.5), FIOU(1.5)

JAAD

JAAD is a dataset for studying joint attention in the context of autonomous driving. The focus is on pedestrian and …

📊 4 results
📏 Metrics: MSE(0.5), MSE(1.0), MSE(1.5), C_MSE(1.5), CF_MSE(1.5)

PIE

PIE is a new dataset for studying pedestrian behavior in traffic. PIE contains over 6 hours of footage recorded in …

📊 4 results
📏 Metrics: MSE(0.5), MSE(1.0), MSE(1.5), C_MSE(1.5), CF_MSE(1.5)

PROX

A dataset composed of 12 different 3D scenes and RGB sequences of 20 subjects moving in and interacting with the …

📊 1 results
📏 Metrics: ADE, FDE, STB

SDD

SDD dataset contains a variety of indoor and outdoor scenes, designed for Image Defocus Deblurring. There are 50 indoor scenes …

📊 1 results
📏 Metrics: mADEK @4.8s, mFDEK @4.8s

UCY

The UCY dataset consists of real pedestrian trajectories with rich multi-human interaction scenarios captured at 2.5 Hz (Δt=0.4s). It is …

📊 1 results
📏 Metrics: Avg AMD/AMV 8/12

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 11 results
📏 Metrics: MinADE_5, MinADE_10, MissRateTopK_2_5, MissRateTopK_2_10, MinFDE_1, OffRoadRate

Transfer Learning

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 5 results
📏 Metrics: Accuracy

Retinal Fundus MultiDisease Image Dataset (RFMiD)

According to the WHO, World report on vision 2019, the number of visually impaired people worldwide is estimated to be …

📊 1 results
📏 Metrics: AUROC

Transferability

classification benchmark

This benchmark includes 11 image classification datasets that were used to evaluate the transferability of metrics. Datasets include FGVC Aircraft, …

📊 6 results
📏 Metrics: Kendall's Tau

Turkish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Twitter Bot Detection

MGTAB

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types …

📊 4 results
📏 Metrics: Acc, F1

Two-sample testing

HIGGS Data Set

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by …

📊 1 results
📏 Metrics: Avg accuracy

Type prediction

ManyTypes4TypeScript

Type inference dataset for TypeScript.

📊 7 results
📏 Metrics: Average Accuracy, Average Precision, Average Recall, Average F1

US Foreign Policy

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Unconditional Crystal Generation

MP20

MP20 (Xie et al., 2022) contains 45,231 metastable crystal structures from the Materials Project (Jain et al., 2013), each with …

📊 3 results
📏 Metrics: DFT Stable, Unique, Novel Rate, Validity

Unconditional Molecule Generation

GEOM-DRUGS

GEOM-DRUGS is a dataset of 430,000 large organic molecules of up to 180 atoms from [Axelrod and Gómez-Bombarelli, Nature Scientific …

📊 5 results
📏 Metrics: PoseBusters Validity, Validity, PoseBusters Atoms Connected

QM9

QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …

📊 4 results
📏 Metrics: Validity, PoseBusters Internal Energy

Unified Image Restoration

GoPro

The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 …

📊 1 results
📏 Metrics: Average PSNR (dB)

LOL

The LOL dataset is composed of 500 low-light and normal-light image pairs and divided into 485 training pairs and 15 …

📊 1 results
📏 Metrics: Average PSNR (dB)

RESIDE

A new large-scale benchmark consisting of both synthetic and real-world hazy images, called REalistic Single Image DEhazing (RESIDE). RESIDE highlights …

📊 1 results
📏 Metrics: Average PSNR (dB)

Universal Domain Adaptation

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 8 results
📏 Metrics: H-Score, Source-free

Office-31

The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …

📊 7 results
📏 Metrics: H-score, Source-Free

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 9 results
📏 Metrics: H-Score, Source-free, VLM

Unsupervised Anomaly Detection

AnoShift

AnoShift is a large-scale anomaly detection benchmark, which focuses on splitting the test data based on its temporal distance to …

📊 15 results
📏 Metrics: ROC-AUC FAR, ROC-AUC IID, ROC-AUC NEAR, ROC-AUC-ID (In-Distribution setup)

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair”) and a background category that …

📊 1 results
📏 Metrics: AUC (outlier ratio = 0.5)

DAGM2007

This is a synthetic dataset for defect detection on textured surfaces. It was originally created for a competition at the …

📊 1 results
📏 Metrics: Detection AUROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC (outlier ratio = 0.5)

KolektorSDD

The dataset is constructed from images of defective production items that were provided and annotated by Kolektor Group d.o.o. The …

📊 1 results
📏 Metrics: Segmentation AUROC

KolektorSDD2

KolektorSDD2 is a surface-defect detection dataset with over 3000 images containing several types of defects, obtained while addressing a real-world …

📊 3 results
📏 Metrics: Segmentation AP, Segmentation AUROC, Detection AP, Segmentation AUPRO

PRONTO

The PRONTO heterogeneous benchmark dataset is based on an industrial-scale multiphase flow facility. It includes data from heterogeneous sources, including …

📊 1 results
📏 Metrics: AUC, Best Delay, Best F1, F1

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 1 results
📏 Metrics: AUC (outlier ratio = 0.5)

SMAP

Soil Moisture Active Passive (SMAP) dataset is a dataset of soil samples and telemetry information using the Mars rover by …

📊 7 results
📏 Metrics: F1, Precision, Recall, AUC

SMD

A dataset for time-series anomaly detection.

📊 1 results
📏 Metrics: Precision

TIMo

TIMo (Time-of-Flight Indoor Monitoring) is a dataset of infrared and depth videos intended for use in anomaly detection and …

📊 1 results
📏 Metrics: AUROC

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on GitHub - …

📊 9 results
📏 Metrics: AUC

Unsupervised Anomaly Detection with Specified Settings -- 0.1% anomaly

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: AUC-ROC

Cats and Dogs

A large set of images of cats and dogs. Homepage: https://www.microsoft.com/en-us/download/details.aspx?id=54765. Source code: tfds.image_classification.CatsVsDogs. Versions: 4.0.0 (default): New split API …

📊 1 results
📏 Metrics: AUC-ROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC-ROC

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: AUC-ROC

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: AUC-ROC

Unsupervised Anomaly Detection with Specified Settings -- 1% anomaly

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: AUC-ROC

Cats and Dogs

A large set of images of cats and dogs. Homepage: https://www.microsoft.com/en-us/download/details.aspx?id=54765. Source code: tfds.image_classification.CatsVsDogs. Versions: 4.0.0 (default): New split API …

📊 1 results
📏 Metrics: AUC-ROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC-ROC

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: AUC-ROC

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: AUC-ROC

Unsupervised Anomaly Detection with Specified Settings -- 10% anomaly

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: AUC-ROC

Cats and Dogs

A large set of images of cats and dogs. Homepage: https://www.microsoft.com/en-us/download/details.aspx?id=54765. Source code: tfds.image_classification.CatsVsDogs. Versions: 4.0.0 (default): New split API …

📊 1 results
📏 Metrics: AUC-ROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC-ROC

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: AUC-ROC

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: AUC-ROC

Unsupervised Anomaly Detection with Specified Settings -- 20% anomaly

Cats and Dogs

A large set of images of cats and dogs. Homepage: https://www.microsoft.com/en-us/download/details.aspx?id=54765. Source code: tfds.image_classification.CatsVsDogs. Versions: 4.0.0 (default): New split API …

📊 1 results
📏 Metrics: AUC-ROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC-ROC

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: AUC-ROC

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: AUC-ROC

Unsupervised Anomaly Detection with Specified Settings -- 30% anomaly

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 1 results
📏 Metrics: AUC-ROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results
📏 Metrics: AUC-ROC

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: AUC-ROC

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 1 results
📏 Metrics: AUC-ROC

Unsupervised Domain Adaptation

CFC-DAOD

CFC-DAOD is a domain adaptation extension to the Caltech Fish Counting domain generalization benchmark. The goal is cross-domain object detection …

📊 6 results
📏 Metrics: [email protected]

ClonedPerson

The ClonedPerson dataset is a large-scale synthetic person re-identification dataset introduced in the paper "Cloning Outfits from Real-World Images to …

📊 1 results
📏 Metrics: MSMT17->mAP, MSMT17->Rank-1, Market-1501->mAP, Market-1501->Rank-1, CUHK03-NP->mAP, CUHK03-NP->Rank-1

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 3 results
📏 Metrics: Accuracy

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 4 results
📏 Metrics: Average Accuracy

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 1 results
📏 Metrics: Top 1 Error

ImageNet-C

ImageNet-C is an open source data set that consists of algorithmically generated corruptions (blur, noise) applied to the ImageNet test-set. …

📊 16 results
📏 Metrics: mean Corruption Error (mCE)

ImageNet-R

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and …

📊 8 results
📏 Metrics: Top 1 Error

Jester (Gesture Recognition)

Jester Gesture Recognition dataset includes 148,092 labeled video clips of humans performing basic, pre-defined hand gestures in front of a …

📊 4 results
📏 Metrics: Accuracy

OOD-CV

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, …

📊 2 results
📏 Metrics: pi/6 accuracy, Accuracy (Top-1)

Office-31

The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …

📊 5 results
📏 Metrics: Accuracy, Avg accuracy

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 19 results
📏 Metrics: Accuracy, Avg accuracy, Average Accuracy

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 3 results
📏 Metrics: Average Accuracy

UDA-CH

UDA-CH contains 16 objects that cover a variety of artworks which can be found in a museum like sculptures, paintings …

📊 1 results
📏 Metrics: [email protected]

VisDA-2017

VisDA-2017 is a simulation-to-real dataset for domain adaptation with over 280,000 images across 12 categories in the training, validation and …

📊 1 results
📏 Metrics: Accuracy

Unsupervised Instance Segmentation

UVO

UVO is a new benchmark for open-world class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world …

📊 1 results
📏 Metrics: AP, AP50, AP75

Unsupervised Object Segmentation

ClevrTex

ClevrTex is a new benchmark designed as the next challenge to compare, evaluate and analyze algorithms for unsupervised multi-object segmentation. …

📊 12 results
📏 Metrics: mIoU, MSE

DAVIS 2016

DAVIS16 is a dataset for video object segmentation which consists of 50 videos in total (30 videos for training and …

📊 8 results
📏 Metrics: J score

DUTS

DUTS is a saliency detection dataset containing 10,553 training images and 5,019 test images. All training images are collected from …

📊 1 results
📏 Metrics: mIoU

ECSSD

The Extended Complex Scene Saliency Dataset (ECSSD) is comprised of complex scenes, presenting textures and structures common to real-world images. …

📊 1 results
📏 Metrics: mIoU

FBMS-59

The Freiburg-Berkeley Motion Segmentation Dataset (FBMS-59) is a dataset for motion segmentation, which extends the BMS-26 dataset with 33 additional …

📊 7 results
📏 Metrics: mIoU

ObjectsRoom

The ObjectsRoom dataset is based on the MuJoCo environment used by the Generative Query Network [4] and is a multi-object …

📊 5 results
📏 Metrics: ARI-FG

SegTrack-v2

SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video. …

📊 7 results
📏 Metrics: mIoU

ShapeStacks

A simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives richly annotated regarding semantics and …

📊 5 results
📏 Metrics: ARI-FG

Shelf&Tote Training Dataset

📊 4 results
📏 Metrics: ARI

Unsupervised Panoptic Segmentation

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 5 results
📏 Metrics: PQ

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 4 results
📏 Metrics: PQ

MUSES: MUlti-SEnsor Semantic perception dataset

MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types …

📊 4 results
📏 Metrics: PQ

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 4 results
📏 Metrics: PQ

Unsupervised Semantic Segmentation

ACDC (Adverse Conditions Dataset with Correspondences)

We introduce ACDC, the Adverse Conditions Dataset with Correspondences for training and testing semantic segmentation methods on adverse visual conditions. …

📊 1 results
📏 Metrics: mIoU

Dark Zurich

Dark Zurich is an image dataset containing a total of 8779 images captured at nighttime, twilight, and daytime, along with …

📊 1 results
📏 Metrics: mIoU

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 1 results
📏 Metrics: mIoU (test), mIoU (val)

Nighttime Driving

Nighttime Driving is a dataset of road scenes consisting of 35,000 images ranging from daytime to twilight time and to …

📊 1 results
📏 Metrics: mIoU

SUIM

The Segmentation of Underwater IMagery (SUIM) dataset contains over 1500 images with pixel annotations for eight object categories: fish (vertebrates), …

📊 2 results
📏 Metrics: Pixel Accuracy, mIoU

VCGBench-Diverse

VideoInstruct

Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs. The dataset employs a combination of …

📊 6 results
📏 Metrics: mean, Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency, Dense Captioning, Spatial Understanding, Reasoning

Vehicle Key-Point and Orientation Estimation

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Vehicle Re-Identification

CityFlow

CityFlow is a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras …

📊 1 results
📏 Metrics: mAP

VeRi-776

VeRi-776 is a vehicle re-identification dataset which contains 49,357 images of 776 vehicles from 20 cameras. The dataset is collected …

📊 17 results
📏 Metrics: mAP, Rank-1, Rank-5, Rank-10

VehicleID

The “VehicleID” dataset contains cars captured during the daytime by multiple real-world surveillance cameras distributed in a small city in …

📊 1 results
📏 Metrics: Rank1

Vehicle Speed Estimation

BrnoCompSpeed

The dataset contains 21 full-HD videos, each around 1 hr long, captured at six different locations. Vehicles in the videos …

📊 2 results
📏 Metrics: Mean Speed Measurement Error (km/h), Median Speed Measurement Error (km/h), 95-th Percentile Speed Measurement Error (km/h), 99-th Percentile Speed Measurement Error (km/h)

Video & Kinematic Base Workflow Recognition

PETRAW

The PETRAW dataset is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 6 results
📏 Metrics: Average AD-Accuracy

Video Anomaly Detection

CHAD

CHAD (Charlotte Anomaly Dataset) is a high-resolution, multi-camera dataset for surveillance video anomaly detection. It includes bounding box, Re-ID, …

📊 1 results
📏 Metrics: AUC

CUHK Avenue

Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 …

📊 6 results
📏 Metrics: AUC, RBDC, TBDC

HR-Avenue

The Human-Related (HR) version of the CUHK Avenue dataset, first presented by Morais et al. in the paper "Learning Regularity in …

📊 11 results
📏 Metrics: AUC

HR-ShanghaiTech

The Human-Related (HR) version of the ShanghaiTech Campus dataset was first presented by Morais et al. in the paper "Learning Regularity in …

📊 14 results
📏 Metrics: AUC

HR-UBnormal

The Human Related version of UBnormal ("UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection," Acsintoae et al.) was introduced …

📊 8 results
📏 Metrics: AUC

IITB Corridor

An abnormal activity dataset for research use that contains 483,566 annotated frames. Source: [Multi-timescale Trajectory Prediction for Abnormal Human Activity …

📊 1 results
📏 Metrics: AUC

ShanghaiTech

The Shanghaitech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided …

📊 7 results
📏 Metrics: AUC, RBDC, TBDC

ShanghaiTech Campus

The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and …

📊 4 results
📏 Metrics: AUC

Street Scene

Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution …

📊 1 results
📏 Metrics: AUC, RBDC, TBDC

UBnormal

UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, …

📊 4 results
📏 Metrics: AUC

UCF-Crime

The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world …

📊 1 results
📏 Metrics: AUC

UCSD Ped2

The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …

📊 3 results
📏 Metrics: AUC

Video Based Workflow Recognition

PETRAW

The PETRAW dataset is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 5 results
📏 Metrics: Average AD-Accuracy

Video Captioning

ActivityNet Captions

The ActivityNet Captions dataset is built on ActivityNet v1.3 which includes 20k YouTube untrimmed videos with 100k caption annotations. The …

📊 5 results
📏 Metrics: BLEU4, BLEU-3, CIDEr, ROUGE-L, METEOR

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 22 results
📏 Metrics: CIDEr, METEOR, ROUGE-L, BLEU-4, GS

MSRVTT-CTN

MSRVTT-CTN Dataset This dataset contains CTN annotations for the MSRVTT-CTN benchmark dataset in JSON format. It has three files …

📊 3 results
📏 Metrics: CIDEr, SPICE, ROUGE-L

MSVD

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers …

📊 14 results
📏 Metrics: CIDEr, BLEU-4, METEOR, ROUGE-L, GS

MSVD-CTN

MSVD-CTN Dataset This dataset contains CTN annotations for the MSVD-CTN benchmark dataset in JSON format. It has three files …

📊 3 results
📏 Metrics: CIDEr, ROUGE-L, SPICE

MSVD-Indonesian

MSVD-Indonesian is derived from the MSVD dataset, which is obtained with the help of a machine translation service. This dataset …

📊 1 results
📏 Metrics: BLEU-4, CIDEr, METEOR, ROUGE-L

Shot2Story20K

A short clip of video may contain progression of multiple events and an interesting story line. A human needs to …

📊 2 results
📏 Metrics: CIDEr, BLEU-4, METEOR, ROUGE

TVC

TV show Caption (TVC) is a large-scale multimodal captioning dataset, containing 261,490 caption descriptions paired with 108,965 short video moments. TVC …

📊 2 results
📏 Metrics: BLEU-4, CIDEr

VATEX

VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has …

📊 8 results
📏 Metrics: BLEU-4, CIDEr, METEOR, ROUGE-L

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 14 results
📏 Metrics: BLEU-4, BLEU-3, CIDEr, ROUGE-L, METEOR

Video Chaptering

VidChapters-7M

VidChapters-7M is a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online …

📊 2 results
📏 Metrics: P@5s, CIDEr, P@0.5, P@0.7, P@3s, R@0.5, R@0.7, R@3s, R@5s, SODA

Video Classification

Breakfast

The Breakfast Actions Dataset comprises 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different …

📊 8 results
📏 Metrics: Accuracy (%)

COIN

The COIN dataset (a large-scale dataset for COmprehensive INstructional video analysis) consists of 11,827 videos related to 180 different tasks …

📊 7 results
📏 Metrics: Accuracy (%)

Charades

The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving …

📊 1 results
📏 Metrics: mAP

Hockey Fight Detection Dataset

Whereas the action recognition community has focused mostly on detecting simple actions like clapping, walking or jogging, the detection of …

📊 1 results
📏 Metrics: 1:1 Accuracy, Accuracy

Home Action Genome

Home Action Genome is a large-scale multi-view video database of indoor daily activities. Every activity is captured by synchronized multi-view …

📊 1 results
📏 Metrics: Accuracy (%)

Kinetics

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 …

📊 1 results
📏 Metrics: Top-1

MoB

A dataset of cartoon video clips. For each video clip, the presence or absence of each feature was marked by …

📊 3 results
📏 Metrics: Accuracy

Multimodal PISA

Dataset for multimodal skill assessment, focusing on assessing a piano player's skill level. Annotations include the player's skill level and song difficulty …

📊 1 results
📏 Metrics: Accuracy (%)

Something-Something V1

The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday …

📊 1 results
📏 Metrics: Top-5 Accuracy

Something-Something V2

The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with …

📊 1 results
📏 Metrics: Top-5 Accuracy

YouTube-8M

The YouTube-8M dataset is a large scale video dataset, which includes more than 7 million videos with 4716 classes labeled …

📊 3 results
📏 Metrics: Hit@1, PERR, Hit@5, Global Average Precision, mAP

Video Enhancement

MFQE v2

A dataset for compressed video quality enhancement.

📊 5 results
📏 Metrics: Incremental PSNR, Parameters(M)

Video Frame Interpolation

ATD-12K

ATD-12K is a large-scale animation triplet dataset, which comprises 12,000 triplets (train10k, test2k) selected by manual inspection, with rich annotations on test2k, …

📊 1 results
📏 Metrics: PSNR, SSIM
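
PSNR appears as a metric for most of the frame interpolation benchmarks in this category. As a rough illustration only, the sketch below computes PSNR for two equally sized frames given as flat lists of pixel values. The function name `psnr`, the flat-list input, and the default peak value of 255 are assumptions made for this example; individual benchmarks may use a different colour space, value range, or per-frame averaging.

```python
import math


def psnr(reference, predicted, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two flat pixel sequences.

    Hypothetical helper: assumes both inputs hold the same number of pixel
    values and that max_value is the largest possible pixel value.
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [52, 55, 61, 59]
    interpolated = [54, 55, 60, 57]
    print(f"PSNR: {psnr(ground_truth, interpolated):.2f} dB")
```

Higher values indicate a closer match to the ground-truth frame.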

DAVIS

The Densely Annotated Video Segmentation dataset (DAVIS) is a high quality and high resolution densely annotated video segmentation dataset under …

📊 1 results
📏 Metrics: PSNR, SSIM

GoPro

The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 …

📊 1 results
📏 Metrics: PSNR, SSIM

LAVIB

LAVIB comprises a large collection of high-resolution videos sourced from the web. Metrics are computed for each video's motion magnitudes, …

📊 3 results
📏 Metrics: LPIPS, PSNR, SSIM

MSU Video Frame Interpolation

This is a dataset for the video frame interpolation task. The dataset contains 1920×1080 videos at 240 FPS for videos …

📊 20 results
📏 Metrics: Subjective score, PSNR, SSIM, VMAF, LPIPS, MS-SSIM, FPS

Middlebury

The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities …

📊 10 results
📏 Metrics: Interpolation Error, PSNR, SSIM, LPIPS

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 18 results
📏 Metrics: PSNR, SSIM, PSNR (sRGB), LPIPS

VFITex

To test interpolation performance on various texture types, we developed a new test set, VFITex, which contains twenty 100-frame UHD …

📊 1 results
📏 Metrics: PSNR

Vimeo90K

The Vimeo-90K is a large-scale high-quality video dataset for lower-level video processing. It proposes three different video processing tasks: frame …

📊 21 results
📏 Metrics: PSNR, SSIM, LPIPS, Speed (ms/f)

X4K1000FPS

Dataset of high-resolution (4096×2160), high-fps (1000fps) video frames with extreme motion. X-TEST consists of 15 video clips with 33-length of …

📊 17 results
📏 Metrics: PSNR, SSIM, tOF, Speed (ms/f)

Video Generation

BAIR Robot Pushing

Dataset of 64x64 images of a robot pushing objects on a table top. From Berkeley AI Research (BAIR). Source: Self-Supervised …

📊 31 results
📏 Metrics: FVD score, SSIM, PSNR, LPIPS, Cond, Train, Pred, Notes

How2Sign

How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more …

📊 1 results
📏 Metrics: FVD16

Kinetics-700

Kinetics-700 is a video dataset of 650,000 clips that covers 700 human action classes. The videos include human-object interactions such …

📊 1 results
📏 Metrics: FID, FVD

LAION-400M

LAION-400M is a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity …

📊 6 results
📏 Metrics: CLIP R-Precision, CLIP

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 1 results
📏 Metrics: FVD16, Inception score

YouTube Driving

YouTube Driving Dataset contains a massive amount of real-world driving frames with various conditions, from different weather, different regions, to …

📊 1 results
📏 Metrics: FVD16

Video Grounding

MAD

MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos or …

📊 2 results
📏 Metrics: R@1,IoU=0.1, R@5,IoU=0.1, R@10,IoU=0.1, R@100,IoU=0.1, R@50,IoU=0.1, R@1,IoU=0.3, R@5,IoU=0.3

QVHighlights

The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language …

📊 6 results
📏 Metrics: R@1,IoU=0.7, R@1,IoU=0.5

Video Inpainting

DAVIS

The Densely Annotated Video Segmentation dataset (DAVIS) is a high quality and high resolution densely annotated video segmentation dataset under …

📊 11 results
📏 Metrics: PSNR, SSIM, VFID, Ewarp, LPIPS (object), LPIPS (square), PSNR (object), SSIM (object), SSIM (square)

How2Sign

How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more …

📊 1 results
📏 Metrics: L1 error

YouTube-VOS 2018

YouTube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 10 results
📏 Metrics: PSNR, SSIM, VFID, Ewarp

Video Instance Segmentation

HQ-YTVIS

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. …

📊 4 results
📏 Metrics: Tube-Boundary AP

YouTube-VIS 2021

3,859 high-resolution YouTube videos, 2,985 training videos, 421 validation videos and 453 test videos. An improved 40-category label set by …

📊 26 results
📏 Metrics: mask AP, AP50, AP75, AR1, AR10

YouTube-VIS 2022 Validation

Video object segmentation has been studied extensively in the past decade due to its importance in understanding video spatial-temporal structures …

📊 7 results
📏 Metrics: mAP_L, AP50_L, AP75_L, AR1_L, AR10_L

Video Object Segmentation

DAVIS 2016

DAVIS16 is a dataset for video object segmentation which consists of 50 videos in total (30 videos for training and …

📊 24 results
📏 Metrics: J&F, F-Score, Jaccard (Mean), mIoU, Contour Accuracy

DAVIS 2017

DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos - 60 for training, 30 …

📊 5 results
📏 Metrics: Jaccard (Mean), mIoU, J&F, F-Score

FBMS

The Freiburg-Berkeley Motion Segmentation Dataset (FBMS-59) is an extension of the BMS dataset with 33 additional video sequences. A total …

📊 2 results
📏 Metrics: F-Score, Jaccard (Mean)

FBMS-59

The Freiburg-Berkeley Motion Segmentation Dataset (FBMS-59) is a dataset for motion segmentation, which extends the BMS-26 dataset with 33 additional …

📊 1 results
📏 Metrics: mIoU

M$^3$-VOS

A new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS), to verify the ability of models …

📊 4 results
📏 Metrics: Average IOU

MOSE

CoMplex video Object SEgmentation (MOSE) is a dataset for studying the tracking and segmentation of objects in complex environments. MOSE contains …

📊 1 results
📏 Metrics: J&F

SegTrack-v2

SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video. …

📊 1 results
📏 Metrics: mIoU

YouTube-VOS 2018

YouTube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 17 results
📏 Metrics: Mean Jaccard & F-Measure, Jaccard (Seen), Jaccard (Unseen), F-Measure (Seen), F-Measure (Unseen)

Video Prediction

BAIR Robot Pushing

Dataset of 64x64 images of a robot pushing objects on a table top. From Berkeley AI Research (BAIR). Source: Self-Supervised …

📊 6 results
📏 Metrics: FVD

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 3 results
📏 Metrics: LPIPS, MS-SSIM

DAVIS 2017

DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos - 60 for training, 30 …

📊 2 results
📏 Metrics: LPIPS, MS-SSIM

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 6 results
📏 Metrics: SSIM, MSE, MAE

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 3 results
📏 Metrics: LPIPS, MS-SSIM

KTH

The effort to create a non-trivial and publicly available dataset for action recognition was initiated at the KTH Royal Institute …

📊 28 results
📏 Metrics: FVD, SSIM, PSNR, LPIPS, Cond, Train, Pred, Params (M), MSE, Diversity

MPI Sintel

MPI (Max Planck Institute) Sintel is a dataset for optical flow evaluation that has 1064 synthesized stereo images and ground …

📊 1 results
📏 Metrics: LPIPS, PSNR, SSIM, ST-RRED

Moving MNIST

The Moving MNIST dataset contains 10,000 video sequences, each consisting of 20 frames. In each video sequence, two digits move …

📊 24 results
📏 Metrics: MSE, MAE, SSIM, LPIPS, PSNR

Something-Something V2

The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with …

📊 1 results
📏 Metrics: FVD

Sprites

The Sprites dataset contains 60 pixel color images of animated characters (sprites). There are 672 sprites, 500 for training, 100 …

📊 1 results
📏 Metrics: MSE

Vimeo90K

The Vimeo-90K is a large-scale high-quality video dataset for lower-level video processing. It proposes three different video processing tasks: frame …

📊 2 results
📏 Metrics: LPIPS, MS-SSIM

YouTube-8M

The YouTube-8M dataset is a large scale video dataset, which includes more than 7 million videos with 4716 classes labeled …

📊 1 results
📏 Metrics: Average PSNR

Video Quality Assessment

KoNViD-1k

Subjective video quality assessment (VQA) strongly depends on semantics, context, and the types of visual distortions. A lot of existing …

📊 20 results
📏 Metrics: PLCC

LIVE Livestream

LIVE Livestream is a database for Video Quality Assessment (VQA), specifically designed for live streaming VQA research. The dataset is …

📊 3 results
📏 Metrics: SRCC

LIVE-ETRI

The video deployed parameter space is continuously increasing to provide more realistic and immersive experiences to global streaming and social …

📊 4 results
📏 Metrics: SRCC

LIVE-FB LSVQ

No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem to social and streaming media applications. …

📊 13 results
📏 Metrics: PLCC

LIVE-VQC

The great variations of videographic skills in videography, camera designs, compression and processing protocols, communication and bandwidth environments, and displays …

📊 19 results
📏 Metrics: PLCC

LIVE-YT-HFR

LIVE-YT-HFR comprises 480 videos with 6 different frame rates, obtained from 16 diverse contents. Source: [Subjective and Objective Quality …

📊 3 results
📏 Metrics: SRCC

MSU FR VQA Database

The dataset was created for the video quality assessment problem. It was formed with 36 clips from Vimeo, which were selected …

📊 6 results
📏 Metrics: SRCC, PLCC, KLCC
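
The video quality assessment benchmarks in this category report correlation between predicted quality scores and subjective scores: PLCC (Pearson), SRCC (Spearman), and KLCC (Kendall). The sketch below is a minimal standard-library illustration of the three coefficients, not the evaluation code of any listed benchmark. The function names are hypothetical, and the Kendall variant shown is the simple tau-a form, which makes no correction for ties; benchmarks may use a tie-corrected variant or fit a nonlinear mapping before computing PLCC.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall rank correlation (KLCC), tau-a form without tie correction."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)


if __name__ == "__main__":
    predicted = [3.1, 4.0, 2.2, 4.8]    # hypothetical model scores
    subjective = [3.0, 4.2, 2.5, 4.5]   # hypothetical mean opinion scores
    print(pearson(predicted, subjective))
    print(spearman(predicted, subjective))
    print(kendall_tau_a(predicted, subjective))
```

All three coefficients lie in [-1, 1], with values near 1 indicating that the predictions track the subjective scores closely.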

MSU NR VQA Database

The dataset was created for the video quality assessment problem. It was formed with 36 clips from Vimeo, which were selected …

📊 17 results
📏 Metrics: SRCC, PLCC, KLCC, Type

MSU SR-QA Dataset

Our dataset was made of videos from MSU Video Upscalers Benchmark Dataset, MSU Video Super-Resolution Benchmark Dataset and MSU Super-Resolution …

📊 40 results
📏 Metrics: SROCC, PLCC, KLCC, Type

YouTube-UGC

This YouTube dataset is a sampling from thousands of User Generated Content (UGC) as uploaded to YouTube distributed under the …

📊 17 results
📏 Metrics: PLCC

Video Question Answering

ActivityNet-QA

The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides …

📊 36 results
📏 Metrics: Accuracy, Confidence score

DramaQA

The DramaQA focuses on two perspectives: 1) Hierarchical QAs as an evaluation metric based on the cognitive developmental stages of …

📊 1 results
📏 Metrics: Accuracy

How2QA

To collect How2QA for video QA task, the same set of selected video clips are presented to another group of …

📊 7 results
📏 Metrics: Accuracy

IntentQA

We contribute an IntentQA dataset with diverse intents in daily social activities. We utilize NExT-QA as the source dataset to …

📊 4 results
📏 Metrics: Accuracy, CW, CH, TP&TN

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 1 results
📏 Metrics: Accuracy

MSRVTT-MC

The MSRVTT-MC (Multiple Choice) dataset is a video question-answering dataset created based on the MSR-VTT dataset. It consists of 2,990 …

📊 7 results
📏 Metrics: Accuracy

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 14 results
📏 Metrics: Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 1 results
📏 Metrics: Accuracy

MVBench

MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was introduced to evaluate the comprehension capabilities of Multi-modal Large Language …

📊 22 results
📏 Metrics: Avg.

NExT-QA

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …

📊 47 results
📏 Metrics: Accuracy

OVBench

OVBench is a benchmark tailored for real-time video understanding: - Memory, Perception, and Prediction of Temporal Contexts: Questions are framed …

📊 15 results
📏 Metrics: AVG

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos …

📊 6 results
📏 Metrics: Accuracy (Top-1)

RoadTextVQA

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition …

📊 2 results
📏 Metrics: Accuracy

STAR Benchmark

How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. …

📊 17 results
📏 Metrics: Average Accuracy

SUTD-TrafficQA

SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a dataset which takes the form of video …

📊 5 results
📏 Metrics: 1/4, 1/2

TGIF-QA

The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. …

📊 1 results
📏 Metrics: Accuracy

TVBench

TVBench is a new benchmark specifically created to evaluate temporal understanding in video QA. We identified three main issues in …

📊 28 results
📏 Metrics: Average Accuracy

TVQA

The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows …

📊 6 results
📏 Metrics: Accuracy

VLEP

VLEP contains 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog …

📊 1 results
📏 Metrics: Accuracy

WildQA

WildQA is a video understanding dataset of videos recorded in outside settings. The dataset can be used to evaluate models …

📊 5 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

iVQA

An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …

📊 7 results
📏 Metrics: Accuracy

Video Reconstruction

Event-Camera Dataset

The Event-Camera Dataset is a collection of datasets with an event-based camera for high-speed robotics. The data also include intensity …

📊 3 results
📏 Metrics: Mean Squared Error, LPIPS

MGif

MGif is a dataset of videos containing movements of different cartoon animals. Each video is a moving gif file. The …

📊 2 results
📏 Metrics: L1

MVSEC

The Multi Vehicle Stereo Event Camera (MVSEC) dataset is a collection of data designed for the development of novel 3D …

📊 3 results
📏 Metrics: Mean Squared Error, LPIPS

TED-talks

In order to create the TED-talks dataset, 3,035 YouTube videos were downloaded using the "TED talks" query. From these initial …

📊 2 results
📏 Metrics: AED, AKD, L1, MKR

Tai-Chi-HD

Tai-Chi-HD is a high resolution dataset which can be used as a reference benchmark for evaluating frameworks for image animation and …

📊 1 results
📏 Metrics: L1

Video Restoration

SEPE 8K

SEPE 8K dataset is made of 40 different 8K (8192 x 4320) video sequences and 40 variant 8K (8192 x …

📊 2 results
📏 Metrics: Average PSNR (dB)

Video Retrieval

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 31 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video R@50, text-to-video Mean Rank, text-to-video Median Rank, video-to-text R@1, video-to-text R@5, video-to-text Mean Rank, video-to-text Median Rank, video-to-text R@10, video-to-text R@50
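
The retrieval benchmarks in this category share the same family of metrics: Recall at K (R@K), Median Rank, and Mean Rank, reported for both text-to-video and video-to-text directions. The sketch below shows one common way to derive them from a query-by-candidate similarity matrix. It assumes, purely for illustration, that the correct candidate for query `i` is candidate `i`; the function names and the toy scores are hypothetical, and datasets with several captions per video need a different ground-truth mapping.

```python
from statistics import mean, median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption for this sketch: candidate i is the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target_score = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the target.
        ranks.append(1 + sum(1 for score in row if score > target_score))
    return ranks


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute R@K (as a percentage), Median Rank, and Mean Rank."""
    metrics = {
        f"R@{k}": 100.0 * sum(rank <= k for rank in ranks) / len(ranks) for k in ks
    }
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


if __name__ == "__main__":
    # Toy text-to-video similarity matrix: 3 text queries x 3 videos.
    similarity = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.1, 0.7],
    ]
    print(retrieval_metrics(ranks_from_similarity(similarity)))
```

Transposing the matrix gives the video-to-text direction under the same one-to-one assumption. Higher R@K and lower Median and Mean Rank are better.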

Charades-STA

Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations. Source: [TALL: Temporal Activity Localization …

📊 1 results
📏 Metrics: text-to-video Mean Rank, text-to-video Median Rank, text-to-video R@1, text-to-video R@10, video-to-text Mean Rank, video-to-text Median Rank, video-to-text R@1, video-to-text R@10

Condensed Movies

A large-scale video dataset, featuring clips from movies with detailed captions.

📊 3 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10

DiDeMo

The Distinct Describable Moments (DiDeMo) dataset is one of the largest and most diverse datasets for the temporal localization of …

📊 39 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video R@50, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank

EgoExoLearn

EgoExoLearn is a dataset designed to bridge the gap between egocentric and exocentric views of procedural activities. …

📊 2 results
📏 Metrics: Accuracy

FIVR-200K

The FIVR-200K dataset has been collected to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The dataset comprises 225,960 …

📊 15 results
📏 Metrics: mAP (ISVR), mAP (CSVR), mAP (DSVR)

LSMDC

This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …

📊 38 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@10, video-to-text R@5, video-to-text Median Rank, video-to-text Mean Rank

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 38 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Mean Rank, text-to-video Median Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank

MSVD

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers …

📊 24 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, text-to-video R@50, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank

MSVD-Indonesian

MSVD-Indonesian is derived from the MSVD dataset, which is obtained with the help of a machine translation service. This dataset …

📊 1 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank

QuerYD

A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of …

📊 5 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10

TGIF

The Tumblr GIF (TGIF) dataset contains 100K animated GIFs and 120K sentences describing visual content of the animated GIFs. The …

📊 2 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Mean Rank, text-to-video Median Rank

TVR

A new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it …

📊 2 results
📏 Metrics: R@10, R@1, R@100

VATEX

VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has …

📊 12 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video R@50, text-to-video MedianR, text-to-video MeanR, video-to-text R@1, video-to-text R@10

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 15 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank

Video Semantic Segmentation

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 6 results
📏 Metrics: Mean IoU

LaRS

LaRS is the largest and most diverse panoptic maritime obstacle detection dataset. Highlights: * Diverse scenes from manual capture, public …

📊 3 results
📏 Metrics: Q, F1, μ, mIoU

VSPW

A Large-scale Dataset for Video Scene Parsing in the Wild

📊 5 results
📏 Metrics: mIoU

Video Summarization

Query-Focused Video Summarization Dataset

Collects dense per-video-shot concept annotations. Source: Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach

📊 2 results
📏 Metrics: F1 (avg)

Shot2Story20K

A short clip of video may contain progression of multiple events and an interesting story line. A human needs to …

📊 2 results
📏 Metrics: CIDEr, BLEU-4, METEOR, ROUGE

SumMe

The SumMe dataset is a video summarization dataset consisting of 25 videos, each annotated with at least 15 human summaries …

📊 3 results
📏 Metrics: F1-score (Canonical), F1-score (Augmented), Kendall's Tau, Spearman's Rho

TvSum

Introduced by Song et al. in TVSum: Summarizing web videos using titles. The TVSum dataset comprises 50 videos, with durations …

📊 3 results
📏 Metrics: F1-score (Canonical), F1-score (Augmented), Kendall's Tau, Spearman's Rho

Video Super-Resolution

Falling Objects

📊 3 results
📏 Metrics: SSIM, PSNR, TIoU

MSU Super-Resolution for Video Compression

This is a dataset for a super-resolution task. The dataset contains 480x270 videos that were decoded with 6 different bitrates …

📊 60 results
📏 Metrics: BSQ-rate over Subjective Score, BSQ-rate over ERQA, BSQ-rate over VMAF, BSQ-rate over PSNR, BSQ-rate over MS-SSIM, BSQ-rate over LPIPS

MSU Video Super Resolution Benchmark: Detail Restoration

This is a dataset for a video super-resolution task. The dataset contains the most complex content for the restoration task: …

📊 26 results
📏 Metrics: Subjective score, ERQAv1.0, 1 - LPIPS, SSIM, QRCRv1.0, PSNR, FPS

MSU Video Upscalers: Quality Enhancement

The dataset aims to find the algorithms that produce the most visually pleasant image possible and generalize well to a …

📊 28 results
📏 Metrics: LPIPS, SSIM, PSNR, VMAF

TbD

📊 3 results
📏 Metrics: SSIM, PSNR, TIoU

TbD-3D

📊 3 results
📏 Metrics: SSIM, PSNR, TIoU

Vimeo90K

The Vimeo-90K is a large-scale high-quality video dataset for lower-level video processing. It proposes three different video processing tasks: frame …

📊 3 results
📏 Metrics: PSNR, SSIM

Video deraining

VRDS

We generate a synthesized dataset, namely VRDS, with 102 rainy videos from diverse scenarios, and each video frame has the …

📊 8 results
📏 Metrics: SSIM, PSNR

Video Waterdrop Removal Dataset

Due to the lack of training data for video waterdrop removal, we propose a large-scale synthetic dataset with simulated waterdrops …

📊 4 results
📏 Metrics: PSNR, SSIM

Video scene graph generation

ImageNet-VidVRD

The ImageNet-VidVRD dataset contains 1,000 videos selected from the ILSVRC2016-VID dataset based on whether the video contains clear visual relations. It is …

📊 1 results
📏 Metrics: Recall@50

Video, Kinematic & Segmentation Base Workflow Recognition

PETRAW

PETRAW data set was composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 4 results
📏 Metrics: Average AD-Accuracy

Video-Adverb Retrieval

AIR

Adverbs in Recipes (AIR) is a dataset specifically collected for adverb recognition. AIR is a subset of HowTo100M where recipe …

📊 4 results
📏 Metrics: mAP M, Acc-A, mAP W

ActivityNet Adverbs

ActivityNet Adverbs is a subset from the ActivityNet dataset with extracted verb-adverb annotations. ActivityNet Adverbs contains 20 adverbs appearing across …

📊 4 results
📏 Metrics: Acc-A, mAP M, mAP W

HowTo100M Adverbs

HowTo100M Adverbs is a subset from HowTo100M with mined adverbs from 83 tasks in HowTo100M. The annotations were obtained from …

📊 4 results
📏 Metrics: Acc-A, mAP M, mAP W

MSR-VTT Adverbs

MSR-VTT Adverbs is a subset from MSR-VTT with extracted verb-adverb annotations. MSR-VTT Adverbs contains 18 adverbs appearing across 106 actions, …

📊 4 results
📏 Metrics: Acc-A, mAP M, mAP W

VATEX Adverbs

VATEX Adverbs is a subset from VATEX with extracted verb-adverb annotations. VATEX Adverbs contains 34 adverbs appearing across 135 actions, …

📊 4 results
📏 Metrics: Acc-A, mAP M, mAP W

Video-based Generative Performance Benchmarking

VideoInstruct

The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs. It employs a combination of …

📊 23 results
📏 Metrics: mean, Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency

Video-based Generative Performance Benchmarking (Correctness of Information)

VideoInstruct

The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs. It employs a combination of …

📊 18 results
📏 Metrics: gpt-score

Video-to-image Affordance Grounding

EPIC-Hotspot

From Grounded Human-Object Interaction Hotspots from Video (ICCV'19): We collect annotations for interaction keypoints on EPIC Kitchens in order to …

📊 3 results
📏 Metrics: KLD, SIM, AUC-J

OPRA

The OPRA Dataset was introduced in Demo2Vec: Reasoning Object Affordances From Online Videos (CVPR'18) for reasoning object affordances from online …

📊 2 results
📏 Metrics: KLD, Top-1 Action Accuracy

Vietnamese Natural Language Inference

ViNLI

A large-scale and high-quality corpus is necessary for studies on NLI for Vietnamese, which can be considered a low-resource language. …

📊 1 results
📏 Metrics: 3-class test accuracy, 4-class test accuracy

Vietnamese Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Virology

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Virtual Try-on

Dress Code

Dress Code is a new dataset for image-based virtual try-on composed of image pairs coming from different catalogs of YOOX …

📊 1 results
📏 Metrics: FID

MPV

Consists of 37,723/14,360 person/clothes images, with the resolution of 256x192. Each person has different poses. We split them into the …

📊 2 results
📏 Metrics: FID, SWD

StreetTryOn

StreetTryOn, the new in-the-wild Virtual Try-On dataset, consists of 12,364 and 2,089 street person images for training and validation, respectively. …

📊 1 results
📏 Metrics: FID

VITON

VITON was a dataset for virtual try-on of clothing items. It consisted of 16,253 pairs of images of a person …

📊 6 results
📏 Metrics: FID, SSIM, LPIPS, IS, KID, PSNR

VITON-HD

VITON-HD is a dataset for high-resolution (i.e., 1024x768) virtual try-on of clothing items. Specifically, it consists of 13,679 frontal-view …

📊 5 results
📏 Metrics: FID

Vision and Language Navigation

RxR

Room-Across-Room (RxR) is a multilingual dataset for Vision-and-Language Navigation (VLN) for Matterport3D environments. In contrast to related datasets such as …

📊 6 results
📏 Metrics: nDTW

Touchdown Dataset

Touchdown is a corpus for executing navigation instructions and resolving spatial descriptions in visual real-world environments. The task is to …

📊 12 results
📏 Metrics: Task Completion (TC)

map2seq

7,672 human written natural language navigation instructions for routes in OpenStreetMap with a focus on visual landmarks. Validated in Street …

📊 5 results
📏 Metrics: Task Completion (TC)

Visual Dialog

ConvAI2

The ConvAI2 NeurIPS competition aimed at finding approaches to creating high-quality dialogue agents capable of meaningful open domain conversation. The …

📊 1 results
📏 Metrics: BLEU-4, F1, ROUGE-L

EmpatheticDialogues

The EmpatheticDialogues dataset is a large-scale multi-turn empathetic dialogue dataset collected on Amazon Mechanical Turk, containing 24,850 one-to-one open-domain …

📊 1 results
📏 Metrics: BLEU-4, F1, ROUGE-L

Image-Chat

The IMAGE-CHAT dataset is a large collection of (image, style trait for speaker A, style trait for speaker B, dialogue …

📊 1 results
📏 Metrics: BLEU-4, F1, ROUGE-L

Wizard of Wikipedia

Wizard of Wikipedia is a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. It is used to …

📊 1 results
📏 Metrics: BLEU-4, F1, ROUGE-L

Visual Localization

Aachen Day-Night v1.1 Benchmark

Aachen Day-Night v1.1 dataset is an extended version of the original Aachen Day-Night dataset. Besides the original query images, the …

📊 7 results
📏 Metrics: Acc@0.25m, 2°, Acc@0.5m, 5°, Acc@5m, 10°

Visual Navigation

AI2-THOR

AI2-Thor is an interactive environment for embodied AI. It contains four types of scenes, including kitchen, living room, bedroom and …

📊 2 results
📏 Metrics: SPL (All), SPL (L≥5), Success Rate (All), Success Rate (L≥5)

R2R

R2R is a dataset for visually-grounded natural language navigation in real buildings. The dataset requires autonomous agents to follow human-generated …

📊 11 results
📏 Metrics: SPL

Visual Object Tracking

AVisT

One of the key factors behind the recent success in visual tracking is the availability of dedicated benchmarks. While being …

📊 7 results
📏 Metrics: Success Rate

DiDi

DiDi is a distractor-distilled tracking dataset created to address the limitation of low distractor presence in current visual object tracking …

📊 10 results
📏 Metrics: Tracking quality

GOT-10k

The GOT-10k dataset contains more than 10,000 video segments of real-world moving objects and over 1.5 million manually labelled bounding …

📊 41 results
📏 Metrics: Average Overlap, Success Rate 0.5, Success Rate 0.75

ITB

Informative Tracking Benchmark (ITB) is a small and informative tracking benchmark with 7% out of 1.2 M frames of existing …

📊 1 results
📏 Metrics: AUC

LaSOT

LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames …

📊 44 results
📏 Metrics: AUC, Normalized Precision, Precision

OTB-2013

OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed …

📊 5 results
📏 Metrics: AUC

OTB-2015

OTB-2015, also referred to as the Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for …

📊 17 results
📏 Metrics: AUC, Precision

TNL2K

Tracking by Natural Language (TNL2K) is constructed for the evaluation of tracking by natural language specification. TNL2K features: - Large-scale: …

📊 14 results
📏 Metrics: AUC, precision, Normalized Precision

TrackingNet

TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos split …

📊 38 results
📏 Metrics: Accuracy, Normalized Precision, Precision, Success Rate, AUC

UAV123

📊 15 results
📏 Metrics: AUC, Precision

VOT2014

The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, …

📊 1 results
📏 Metrics: Expected Average Overlap (EAO)

VOT2016

VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps …

📊 6 results
📏 Metrics: Expected Average Overlap (EAO)

VOT2017

VOT2017 is a Visual Object Tracking dataset for different tasks that contains 60 short sequences annotated with 6 different attributes. …

📊 6 results
📏 Metrics: Expected Average Overlap (EAO)

VOT2018

VOT2018 is a dataset for visual object tracking. It consists of 60 challenging videos collected from real-life datasets. Source: [Remove …

📊 2 results
📏 Metrics: Expected Average Overlap (EAO), Accuracy

VOT2019

VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB. Source: https://www.votchallenge.net/vot2019/dataset.html

📊 3 results
📏 Metrics: Expected Average Overlap (EAO), Accuracy

VOT2022

📊 4 results
📏 Metrics: EAO

VideoCube

VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT …

📊 1 results
📏 Metrics: Normalized Precision, Precision, Success Rate

YouTube-VOS 2018

Youtube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 9 results
📏 Metrics: O (Average of Measures), Jaccard (Seen), Jaccard (Unseen), F-Measure (Seen), F-Measure (Unseen)

Visual Odometry

EuRoC MAV

EuRoC MAV is a visual-inertial dataset collected on-board a Micro Aerial Vehicle (MAV). The dataset contains stereo images, synchronized IMU …

📊 1 results
📏 Metrics: Relative Position Error Translation [cm]

Visual Place Recognition

AmsterTime

AmsterTime dataset offers a collection of 2,500 well-curated images matching the same scene from a street view matched to historical …

📊 8 results
📏 Metrics: Recall@1, Recall@10, Recall@5

CV-Cities

CV-Cities comprises 223,736 ground panoramic images and an equal number of satellite images all accompanied by high-precision GPS coordinates. These …

📊 3 results
📏 Metrics: Recall@1, Recall@5

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: Average F1

KITTI360pose

The KITTI360Pose dataset encompasses a total area of 15.51 square kilometers across nine urban regions, consisting of 43,381 point cloud- …

📊 5 results
📏 Metrics: Localization Recall@1

MSLS

The largest and most diverse dataset for lifelong place recognition from image sequences in urban and suburban settings.

📊 3 results
📏 Metrics: Recall@1, Recall@5

Nordland

📊 13 results
📏 Metrics: Recall@1, Recall@5, Recall@10

Nordland* (2760 queries)

The Nordland variant used in SALAD and BoQ (2,760 queries, 27,592 reference images, threshold: 1 frame).

📊 4 results
📏 Metrics: Recall@1, Recall@5, Recall@10

Oxford RobotCar Dataset

The Oxford RobotCar Dataset contains over 100 repetitions of a consistent route through Oxford, UK, captured over a period of …

📊 7 results
📏 Metrics: Recall@1

SF-XL Night

📊 2 results
📏 Metrics: Recall@1, Recall@5, Recall@10

SF-XL Occlusion

📊 2 results
📏 Metrics: Recall@1, Recall@5, Recall@10

SF-XL test v1

Test set version 1 for the San Francisco eXtra Large dataset

📊 5 results
📏 Metrics: Recall@1, Recall@10, Recall@5

SF-XL test v2

Test set version 2 for the San Francisco eXtra Large dataset

📊 5 results
📏 Metrics: Recall@1, Recall@5, Recall@10

SVOX

📊 2 results
📏 Metrics: Recall@1, Recall@5, Recall@10

San Francisco Landmark Dataset

The San Francisco Landmark Dataset contains a database of 1.7 million images of buildings in San Francisco with ground truth …

📊 3 results
📏 Metrics: Recall@1, Recall@10, Recall@5

Visual Question Answering

BenchLMM

Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. …

📊 10 results
📏 Metrics: GPT-3.5 score

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 1 results
📏 Metrics: Accuracy

EarthVQA

Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects …

📊 1 results
📏 Metrics: Overall Accuracy

GQA

The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced …

📊 1 results
📏 Metrics: Accuracy

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 1 results
📏 Metrics: VQA (ablation)

MM-Vet

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

📊 222 results
📏 Metrics: GPT-4 score, Params

MM-Vet v2

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

📊 17 results
📏 Metrics: GPT-4 score, Params

MMBench

MMBench is a multi-modality benchmark. It methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element …

📊 5 results
📏 Metrics: GPT-3.5 score

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 3 results
📏 Metrics: Test Accuracy, Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 2 results
📏 Metrics: Accuracy

MapEval-Visual

MapEval-Visual contains 400 image-question-answer triplets. Each question is paired with a snapshot from the Google Maps website. The task is the …

📊 1 results
📏 Metrics: Accuracy (%)

ViP-Bench

ViP-Bench is a comprehensive benchmark designed to assess the capability of multimodal models in understanding visual prompts across multiple dimensions. …

📊 13 results
📏 Metrics: GPT-4 score (bbox), GPT-4 score (human)

VisualMRC

VisualMRC is a visual machine reading comprehension dataset that proposes a task: given a question and a document image, a …

📊 1 results
📏 Metrics: CIDEr

VizWiz

The VizWiz-VQA dataset originates from a natural visual question answering setting where blind people each took an image and recorded …

📊 1 results
📏 Metrics: Accuracy

Visual Question Answering (VQA)

A-OKVQA

A-OKVQA is a crowdsourced visual question answering dataset composed of a diverse set of about 25K questions requiring a broad base …

📊 15 results
📏 Metrics: MC Accuracy, DA VQA Score

AI2D

AI2 Diagrams (AI2D) is a dataset of over 5000 grade school science diagrams with over 150000 rich annotations, their ground …

📊 4 results
📏 Metrics: EM

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 1 results
📏 Metrics: ClipMatch@1, ClipMatch@5, Contains, ExactMatch, Follow-up ClipMatch@1, Follow-up ClipMatch@5, Follow-up Contains, Follow-up ExactMatch

AutoHallusion

Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module …

📊 3 results
📏 Metrics: Overall Accuracy

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 15 results
📏 Metrics: Accuracy

CLEVR-Humans

We collect a new dataset of human-posed free-form natural language questions about CLEVR images. Many of these questions have out-of-vocabulary …

📊 5 results
📏 Metrics: Accuracy

CORE-MM

CORE-MM is an Open-ended VQA benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. CORE-MM benchmark …

📊 1 results
📏 Metrics: Abductive, Analogical, Deductive, Overall score, Params

DocVQA

DocVQA consists of 50,000 questions defined on 12,000+ document images. Source: DocVQA: A Dataset for VQA on Document Images

📊 1 results
📏 Metrics: ANLS

EgoSchema

EgoSchema is a very long-form video question-answering dataset and benchmark to evaluate long video understanding capabilities of modern vision and language …

📊 1 results
📏 Metrics: Acc

GQA

The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced …

📊 2 results
📏 Metrics: Accuracy

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 2 results
📏 Metrics: VQA (ablation), VQA (test)

HallusionBench

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvement …

📊 3 results
📏 Metrics: Question Pair Acc

IconQA

Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images in the daily-life context. Icon question …

📊 12 results
📏 Metrics: Sub-tasks (Img.), Sub-tasks (Txt.), Sub-tasks (Blank), Reasoning (Geo.), Reasoning (Cou.), Reasoning (Com.), Reasoning (Spa.), Reasoning (Sce.), Reasoning (Pat.), Reasoning (Tim.), Reasoning (Fra.), Reasoning (Est.), Reasoning (Alg.), Reasoning (Mea.), Reasoning (Sen.), Reasoning (Pro.)

IllusionVQA

IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in …

📊 7 results
📏 Metrics: Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 1 results
📏 Metrics: ClipMatch@1, ClipMatch@5, Contains, ExactMatch, Follow-up ClipMatch@1, Follow-up ClipMatch@5, Follow-up Contains, Follow-up ExactMatch

InfiMM-Eval

Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Although many benchmarks attempt to holistically …

📊 14 results
📏 Metrics: Overall score, Deductive, Abductive, Analogical, Params

InfoSeek

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with …

📊 6 results
📏 Metrics: Accuracy

InfographicVQA

InfographicVQA is a dataset that comprises a diverse collection of infographics along with natural language questions and answers annotations. The …

📊 21 results
📏 Metrics: ANLS

MM-Vet

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

📊 1 results
📏 Metrics: Acc

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 33 results
📏 Metrics: Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 35 results
📏 Metrics: Accuracy

MVBench

MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was introduced to evaluate the comprehension capabilities of Multi-modal Large Language …

📊 1 results
📏 Metrics: Acc

OK-VQA

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer. Source: [OK-VQA: A …

📊 35 results
📏 Metrics: Accuracy, Exact Match (EM), Recall@5

OVAD benchmark

Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing …

📊 1 results
📏 Metrics: Contains w. Synonyms, ExactMatch w. Synonyms

PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset that contains 227k VQA pairs of 149k images that cover various modalities …

📊 4 results
📏 Metrics: Accuracy

QLEVR

Synthetic datasets have successfully been used to probe visual question-answering datasets for their reasoning abilities. CLEVR, for example, tests a …

📊 5 results
📏 Metrics: Overall Accuracy

RetVQA

The RetVQA dataset is a large-scale dataset designed for Retrieval-Based Visual Question Answering (RetVQA). RetVQA is a more challenging task …

📊 1 results
📏 Metrics: Accuracy, Accuracy * Fluency

TDIUC

Task Directed Image Understanding Challenge (TDIUC) dataset is a Visual Question Answering dataset which consists of 1.6M questions and 170K …

📊 2 results
📏 Metrics: Accuracy

TGIF-QA

The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. …

📊 2 results
📏 Metrics: Accuracy

TextVQA

TextVQA is a dataset to benchmark visual reasoning based on text in images. TextVQA requires models to read and reason …

📊 1 results
📏 Metrics: Acc

VLM2-Bench

VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching. VLM²-Bench is the first comprehensive benchmark designed to evaluate …

📊 9 results
📏 Metrics: GC-mat, GC-trk, OC-cpr, OC-cnt, OC-grp, PC-cpr, PC-cnt, PC-grp, PC-VID, Average Score on VLM2-bench (9 subtasks)

VQA-CE

This dataset provides a new split of VQA v2 (similarly to VQA-CP v2), which is built of questions that are …

📊 9 results
📏 Metrics: Accuracy (Counterexamples)

VQA-CP

The VQA-CP dataset was constructed by reorganizing VQA v2 such that the correlation between the question type and correct answer …

📊 10 results
📏 Metrics: Score

Visual7W

Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers. Each question starts with one …

📊 4 results
📏 Metrics: Percentage correct

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers …

📊 6 results
📏 Metrics: Exact Match, BEM

WebSRC

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …

📊 1 results
📏 Metrics: EM

ZS-F-VQA

The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot problem. First, we obtain the original train/test …

📊 1 results
📏 Metrics: Top-1 Accuracy

Visual Reasoning

Bongard-OpenWorld

Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. We hope it can help us better …

📊 9 results
📏 Metrics: 2-Class Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

NLVR

NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset …

📊 1 results
📏 Metrics: Accuracy (Dev), Accuracy (Test-P), Accuracy (Test-U)

VASR

Visual Analogies of Situation Recognition (VASR) is a dataset for visual analogical mapping, adapting the classical word-analogy task into the …

📊 4 results
📏 Metrics: 1:1 Accuracy

VSR

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial …

📊 5 results
📏 Metrics: accuracy

WinoGAViL

This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …

📊 8 results
📏 Metrics: Jaccard Index

Winoground

Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two …

📊 105 results
📏 Metrics: Text Score, Image Score, Group Score

Visual Relationship Detection

VRD

The Visual Relationship Dataset (VRD) contains 4000 images for training and 1000 for testing annotated with visual relationships. Bounding boxes …

📊 1 results
📏 Metrics: R@50 k=1

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 1 results
📏 Metrics: R@100, R@50, mR@100, mR@50

Visual Social Relationship Recognition

PIPA

The PIPA database is collected from Flickr photo albums for the task of person recognition. Then the dataset is extended …

📊 6 results
📏 Metrics: Accuracy, Accuracy (domain)

PISC

The People in Social Context (PISC) dataset is a dataset that focuses on social relationships. It consists of 22,670 images …

📊 5 results
📏 Metrics: mAP, mAP (Coarse)

Visual Speech Recognition

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 2 results
📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 3 results
📏 Metrics: Word Error Rate (WER)

Visual Storytelling

VIST

The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on …

📊 21 results
📏 Metrics: BLEU-4, CIDEr, METEOR, BLEU-1, BLEU-2, BLEU-3, ROUGE-L, SPICE, BLEURT, MLTD

Visual Tracking

DAVIS

The Densely Annotated Video Segmentation dataset (DAVIS) is a high-quality, high-resolution densely annotated video segmentation dataset under …

📊 2 results
📏 Metrics: Average Jaccard

Kinetics

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 …

📊 2 results
📏 Metrics: Average Jaccard

Kubric

Kubric is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, …

📊 2 results
📏 Metrics: Average Jaccard

LaSOT

LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames …

📊 1 results
📏 Metrics: AUC

OTB-2013

OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed …

📊 1 results
📏 Metrics: AUC

RGB-Stacking

RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot is trained to learn how to grasp objects and balance …

📊 2 results
📏 Metrics: Average Jaccard

TNL2K

Tracking by Natural Language (TNL2K) is constructed for the evaluation of tracking by natural language specification. TNL2K features: - Large-scale: …

📊 5 results
📏 Metrics: AUC, precision

TrackingNet

TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos split …

📊 1 results
📏 Metrics: Accuracy, Normalized Precision

Voice Conversion

VCTK

This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …

📊 1 results
📏 Metrics: Total Length Error (TLE), Word Length Error (WLE), Phone Length Error (PLE)

Vulnerability Detection

VulScribeR

Datasets are listed in the repository's readme file. This one is extra and yields 20K+ items after filtering with a …

📊 1 results
📏 Metrics: F1 Score

Vulnerability Java Dataset

The dataset consists of two versions: $X_1$ with $P_3$ and $X_1$ without $P_3$, where $P_3$ represents a set of random …

📊 2 results
📏 Metrics: AUC, F1

Weakly Supervised Action Localization

BEOID

The BEOID dataset includes object interactions ranging from preparing a coffee to operating a weight lifting machine and opening a …

📊 5 results

FineAction

FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges …

📊 4 results

GTEA

The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities such as making sandwich, tea, or coffee. …

📊 5 results

THUMOS14

The THUMOS14 (THUMOS 2014) dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for …

📊 12 results
📏 Metrics: avg-mAP (0.3-0.7), avg-mAP (0.1:0.7), avg-mAP (0.1-0.5)

Weakly Supervised Classification

ShARe/CLEF 2014: Task 2 Disorders

📊 1 results
📏 Metrics: F1

THYME-2016

📊 1 results
📏 Metrics: F1

Weakly-supervised Temporal Action Localization

UCF101-24

📊 1 results
📏 Metrics: [email protected]

Weather Forecasting

NOAA Atmospheric Temperature Dataset

This dataset contains meteorological observations (temperature) at the land-based weather stations located in the United States, collected from the Online …

📊 4 results
📏 Metrics: MAE (t+1), MAE (t+10)

SEVIR

SEVIR is an annotated, curated and spatio-temporally aligned dataset containing over 10,000 weather events that each consist of 384 km …

📊 5 results
📏 Metrics: MSE, mCSI

Shifts

The Shifts Dataset is a dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has …

📊 2 results
📏 Metrics: R-AUC MSE

Word Sense Disambiguation

FEWS

FEWS (Few-shot Examples of Word Senses) is a few-shot dataset for English Word Sense Disambiguation (WSD) gathered from Wiktionary, an …

📊 2 results
📏 Metrics: F1 (Zeroshot Dev), F1 (Zero shot test), F1 (FewShot Dev), F1 (Fewshot Test)

RUSSE

WiC: The Word-in-Context Dataset. A reliable benchmark for the evaluation of context-sensitive word embeddings. Depending on its context, an ambiguous …

📊 5 results
📏 Metrics: Accuracy

WiC-TSV

WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense …

📊 6 results
📏 Metrics: Task 1 Accuracy: all, Task 1 Accuracy: general purpose, Task 1 Accuracy: domain specific, Task 2 Accuracy: all, Task 2 Accuracy: general purpose, Task 2 Accuracy: domain specific, Task 3 Accuracy: all, Task 3 Accuracy: general purpose, Task 3 Accuracy: domain specific

Word Similarity

WS353

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: …

📊 3 results
📏 Metrics: Spearman's Rho

Workflow Discovery

ABCD

Action-Based Conversations Dataset (ABCD) is a fully-labeled, goal-oriented dialogue dataset with over 10K human-to-human dialogues containing 55 distinct user intents …

📊 1 results
📏 Metrics: In-domain EM, In-domain CE, Cross-domain EM, Cross-domain CE

World Religions

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 1 results
📏 Metrics: Accuracy

Zero Shot Segmentation

Segmentation in the Wild

Recent advances in language-image pre-training have witnessed the emerging field of building transferable systems that can effortlessly adapt to a …

📊 12 results
📏 Metrics: Mean AP

Zero-Shot Action Recognition

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 4 results
📏 Metrics: Top-1 Accuracy

Charades

The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving …

📊 4 results
📏 Metrics: mAP

HMDB51

The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset …

📊 24 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy, Accuracy

Kinetics

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 …

📊 16 results
📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 27 results
📏 Metrics: Top-1 Accuracy, Top-5 accuracy

Zero-Shot Image Classification

Country211

Country211 is a dataset released by OpenAI, designed to assess the geolocation capability of visual representations. It filters the YFCC100m …

📊 1 results
📏 Metrics: Top-1 accuracy

Zero-Shot Learning

AwA2

Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute-based classification and zero-shot learning. …

📊 4 results
📏 Metrics: average top-1 classification accuracy, Accuracy Seen, Accuracy Unseen, H

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results
📏 Metrics: Accuracy

COCO-MLT

COCO-MLT is created from MS COCO-2017, containing 1,909 images from 80 classes. The maximum number of training images per class …

📊 2 results
📏 Metrics: Average mAP

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 14 results
📏 Metrics: average top-1 classification accuracy, Accuracy Seen, Accuracy Unseen, H, Accuracy

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 2 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 2 results
📏 Metrics: Accuracy

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Accuracy

FGVC-Aircraft

FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which …

📊 2 results
📏 Metrics: Accuracy

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 2 results
📏 Metrics: Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 2 results
📏 Metrics: Top 1 Accuracy

ImageNet_CN

A version of the ImageNet-1K classification dataset adapted for Chinese models by translating labels and prompts into Chinese.

📊 1 results
📏 Metrics: Accuracy

LSMDC

This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …

📊 1 results
📏 Metrics: Accuracy

MIT-States

The MIT-States dataset has 245 object classes, 115 attribute classes and ∼53K images. There is a wide range of objects …

📊 1 results
📏 Metrics: A-acc

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 1 results
📏 Metrics: Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 1 results
📏 Metrics: Accuracy

MedConceptsQA

MedConceptsQA is an open-source medical concepts QA benchmark. The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA

📊 12 results
📏 Metrics: Accuracy

Oxford 102 Flower

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are flowers commonly …

📊 2 results
📏 Metrics: average top-1 classification accuracy

Oxford-IIIT Pets

The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large …

📊 2 results
📏 Metrics: Accuracy

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results
📏 Metrics: k=10 mIOU

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 1 results
📏 Metrics: Accuracy

SUN Attribute

The SUN Attribute dataset consists of 14,340 images from 717 scene categories, and each category is annotated with a taxonomy …

📊 9 results
📏 Metrics: average top-1 classification accuracy, Accuracy Seen, Accuracy Unseen, H

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 2 results
📏 Metrics: Accuracy

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 2 results
📏 Metrics: Accuracy

TVQA

The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows …

📊 1 results
📏 Metrics: Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 2 results
📏 Metrics: Accuracy

VOC-MLT

We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with …

📊 2 results
📏 Metrics: Average mAP

iVQA

An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …

📊 1 results
📏 Metrics: Accuracy

Zero-Shot Semantic Segmentation

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 13 results
📏 Metrics: Transductive Setting hIoU, Inductive Setting hIoU

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 11 results
📏 Metrics: Transductive Setting hIoU, Inductive Setting hIoU

Zero-Shot Transfer Image Classification

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 5 results
📏 Metrics: Top 1 Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 20 results
📏 Metrics: Param, Accuracy (Private), Accuracy (Public)

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 12 results
📏 Metrics: Accuracy (Private), Accuracy (Public)

ImageNet-R

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and …

📊 11 results
📏 Metrics: Accuracy

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 1 results
📏 Metrics: Accuracy (Private), Top 5 Accuracy

ImageNet-Sketch

ImageNet-Sketch data set consists of 50,889 images, approximately 50 images for each of the 1000 ImageNet classes. The data set …

📊 6 results
📏 Metrics: Accuracy (Private)

ObjectNet

ObjectNet is a test set of images collected directly using crowd-sourcing. ObjectNet is unique as the objects are captured at …

📊 9 results
📏 Metrics: Accuracy (Private), Accuracy (Public), Top 5 Accuracy

SUN

When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow …

📊 3 results
📏 Metrics: Accuracy

Zero-Shot Video Retrieval

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 12 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, video-to-text R@1, video-to-text R@5, video-to-text R@10

DiDeMo

The Distinct Describable Moments (DiDeMo) dataset is one of the largest and most diverse datasets for the temporal localization of …

📊 26 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, video-to-text R@1, video-to-text R@5, video-to-text R@10, text-to-video Median Rank, video-to-text Median Rank

LSMDC

This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …

📊 16 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 41 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank

MSVD

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers …

📊 14 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank

VATEX

VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has …

📊 5 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, video-to-text R@1, video-to-text R@5, video-to-text R@10

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 8 results
📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Mean Rank, text-to-video Median Rank

Zero-shot Generalization

CALVIN

CALVIN (Composing Actions from Language and Vision) is an open-source simulated benchmark for learning long-horizon, language-conditioned robot manipulation tasks.

📊 5 results
📏 Metrics: Avg. sequence length

Zero-shot Sentiment Classification

AfriSenti

AfriSenti is the largest sentiment analysis dataset for under-represented African languages, covering 110,000+ annotated tweets in 14 African languages (Amharic, …

📊 5 results
📏 Metrics: weighted-F1 score

answerability prediction

PeerQA

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …

📊 5 results
📏 Metrics: Macro F1

audio-visual event localization

UnAV-100

Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this …

📊 2 results
📏 Metrics: mAP, [email protected]

inverse tone mapping

VDS dataset: Multi exposure stack-based inverse tone mapping

  • Seven multiple-exposure ground truth images satisfying EV 0, ±1, ±2, ±3 for static scenes.
  • 96 …

📊 5 results
📏 Metrics: HDR-VDP-2, HDR-VDP-3, PU21-PSNR, PU21-SSIM, Reinhard'TMO-PSNR, Kim and Kautz TMO-PSNR

parameter-efficient fine-tuning

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 4 results
📏 Metrics: Accuracy (%)

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 3 results
📏 Metrics: Accuracy (%)

WinoGrande

WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …

📊 3 results
📏 Metrics: Accuracy (%)

regression

California Housing Prices

Median house prices for California districts derived from the 1990 census. This is the dataset used in …

📊 3 results
📏 Metrics: R2 Score, lambda

Car_Price_Prediction

In this dataset we added [Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower …

📊 1 results
📏 Metrics: R Squared

Concrete Compressive Strength

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age …

📊 3 results
📏 Metrics: R2 Score, lambda

Medical Cost Personal Dataset

This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. …

📊 3 results
📏 Metrics: R2 Score, lambda

self-supervised scene text recognition

TextSeg

TextSeg is a large-scale fine-annotated and multi-purpose text detection and segmentation dataset, collecting scene and design text with six types …

📊 1 results
📏 Metrics: IoU (%)

TextZoom

TextZoom is a super-resolution dataset that consists of paired Low Resolution – High Resolution scene text images. The images are …

📊 1 results
📏 Metrics: Average PSNR (dB), SSIM

video narration captioning

Shot2Story20K

A short video clip may contain the progression of multiple events and an interesting story line. A human needs to …

📊 1 results
📏 Metrics: BLEU-4, CIDEr, METEOR, ROUGE