1 Image, 2*2 Stitchi

FQL-Driving

FQL-driving

ConceptNet

ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected …

📊 1 results

📏 Metrics: 1'"

2D Human Pose Estimation

COCO-WholeBody

COCO-WholeBody is an extension of the COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 14 results

📏 Metrics: WB, body, foot, face, hand

Human-Art

Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It includes twenty high-quality human …

📊 10 results

📏 Metrics: AP, AP (gt bbox), Validation AP

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses, and instance masks. This dataset contains …

UAVDB

UAVDB is a high-resolution RGB video dataset meticulously designed for UAV detection tasks across diverse scales and complex backgrounds. Comprising …

📏 Metrics: mIoU

GF-PA66 3D XCT

Stack of 2D gray images of glass fiber-reinforced polyamide 66 (GF-PA66) 3D X-ray Computed Tomography (XCT) specimen. Usage: 2D/3D image …

📊 1 results

📏 Metrics: Jaccard (Mean)

WaterScenes

WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces. WaterScenes, the first …

📊 1 results

📏 Metrics: mIoU

WildScenes

WildScenes is a bi-modal benchmark dataset consisting of multiple large-scale, sequential traversals in natural environments, including semantic annotations in high-resolution …

📊 5 results

📏 Metrics: mIoU, mIoU (Temporal DA), mIoU (Env DA)

xBD

The xBD dataset contains over 45,000 km² of polygon-labeled pre- and post-disaster imagery. The dataset provides the post-disaster imagery …

📊 5 results

📏 Metrics: Weighted Average F1-score, Localization F1-score, Classification F1-score

2D Semantic Segmentation task 3 (25 classes)

CaDIS

CaDIS: a Cataract Dataset for Image Segmentation is a dataset for semantic segmentation created by Digital Surgery Ltd. on top …

📊 6 results

📏 Metrics: Mean IoU (test), Mean IoU (val)

3D Absolute Human Pose Estimation

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 4 results

📏 Metrics: MRPE, Average MPJPE (mm), PA-MPJPE

3D Action Recognition

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 7 results

📏 Metrics: Actions Top-1, Verbs Top-1, Object Top-1

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 3 results

📏 Metrics: Cross Subject Accuracy, Cross View Accuracy

3D Anomaly Detection

Real 3D-AD

Real 3D-AD is the first point cloud anomaly detection dataset for industrial products. Real3D-AD comprises a total of 1,254 samples …

📊 19 results

📏 Metrics: Mean Performance of P. and O., Point AUROC, Object AUROC

3D Canonical Hand Pose Estimation

STB

3D hand pose data set created using stereo camera - contains 18,000 RGB images and paired depth images - 3D …

📊 1 results

📏 Metrics: AUC

3D Classification

U-10: United-10 COVID19 CT Dataset

This dataset supports the research detailed in the pre-print "Virtual Imaging Trials Improved the Transparency and Reliability of AI Systems …

📊 2 results

📏 Metrics: AUC

3D Depth Estimation

Relative Human

Relative Human (RH) contains multi-person in-the-wild RGB images with rich human annotations, including: Depth layers: relative depth relationship/ordering between all …

📊 3 results

📏 Metrics: PCDR, PCDR-Baby, PCDR-Kid, PCDR-Teen, PCDR-Adult, mPCDK

3D Face Animation

BEAT2

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, …

📊 5 results

📏 Metrics: MSE

Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2

BIWI 3D corpus comprises a total of 1109 sentences uttered by 14 native English speakers (6 males and 8 females). …

📊 5 results

📏 Metrics: Lip Vertex Error, FDD

VOCASET

VOCASET is a 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio. …

📊 2 results

📏 Metrics: Lip Vertex Error

3D Face Modelling

Voxceleb-3D

A dataset for voice and 3D face structure study. It contains about 1.4K identities with their 3D face models and …

📊 2 results

📏 Metrics: Mean ARE, ARE-ER, ARE-FR, ARE-MR, ARE-CR

3D Face Reconstruction

AFLW2000-3D

AFLW2000-3D is a dataset of 2000 images that have been annotated with image-level 68-point 3D facial landmarks. This dataset is …

📊 8 results

📏 Metrics: Mean NME

Florence

The Florence 3D faces dataset consists of: * High-resolution 3D scans of human faces from many subjects. * Several video …

📊 15 results

📏 Metrics: Mean NME, Average 3D Error, RMSE Cooperative, RMSE Indoor, RMSE Outdoor

NoW Benchmark

The goal of this benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D …

📊 15 results

📏 Metrics: Median Reconstruction Error, Mean Reconstruction Error (mm), Stdev Reconstruction Error (mm)

REALY

The REALY benchmark aims to introduce a region-aware evaluation pipeline to measure the fine-grained normalized mean square error (NMSE) of …

📊 24 results

📏 Metrics: all, @nose, @mouth, @forehead, @cheek

3D Hand Pose Estimation

DexYCB

DexYCB is a dataset for capturing hand grasping of objects. It can be used for three relevant tasks: 2D object and …

📊 9 results

📏 Metrics: Average MPJPE (mm), Procrustes-Aligned MPJPE, MPVPE, VAUC, PA-MPVPE, PA-VAUC

FreiHAND

FreiHAND is a 3D hand pose dataset which records different hand actions performed by 32 people. For each hand image, …

📊 30 results

📏 Metrics: PA-MPJPE, PA-MPVPE, PA-F@5mm, PA-F@15mm

H3WB

Human3.6M 3D WholeBody (H3WB) is a large scale dataset with 133 whole-body keypoint annotations on 100K images, made possible by …

📊 15 results

📏 Metrics: Average MPJPE (mm)

HInt: Hand Interactions in the wild

The HInt dataset is frequently used as a generalizability benchmark for 3D Hand Reconstruction. It features three data subsets: HInt-NewDays, …

📊 9 results

📏 Metrics: [email protected] (New Days) All, [email protected] (VISOR) All, [email protected] (Ego4D) All, [email protected] (NewDays) Visible, [email protected] (VISOR) Visible, [email protected] (Ego4D) Visible, [email protected] (NewDays) Occ, [email protected] (VISOR) Occ, [email protected] (Ego4D) Occ

HO-3D v2

A hand-object interaction dataset with 3D pose annotations of hand and object. The dataset contains 66,034 training images and 11,524 …

📊 21 results

📏 Metrics: PA-MPJPE (mm), PA-MPVPE, F@5mm, F@15mm, AUC_J, AUC_V

HO-3D v3

HO-3D v3 is version 3 of the HO-3D dataset, with more accurate hand-object poses. HO-3D v3 provides more …

📊 8 results

📏 Metrics: PA-MPJPE, PA-MPVPE, F@5mm, F@15mm, AUC_J, AUC_V

InterHand2.6M

The InterHand2.6M dataset is a large-scale real-captured dataset with accurate GT 3D interacting hand poses, used for 3D hand pose …

📊 1 results

📏 Metrics: MPJPE

3D Human Pose Estimation

3DPW

The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. …

📊 115 results

📏 Metrics: MPVPE, PA-MPJPE, MPJPE, Acceleration Error, FLOPs (G), Number of parameters (M)

AGORA

AGORA is a synthetic human dataset with high realism and accurate ground truth. It consists of around 14K training and …

📊 11 results

📏 Metrics: B-NMVE, B-NMJE, B-MVE, B-MPJPE

AIST++

AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance …

📊 5 results

📏 Metrics: MPJPE, Single-view, Acceleration Error

DHP19

DHP19 is the first human pose dataset with data collected from DVS event cameras. It has recordings from 4 synchronized …

📊 2 results

📏 Metrics: MPJPE3D, GFLOPs, MPJPE2D, Params (M)

EMDB

EMDB contains in-the-wild videos of human activity recorded with a hand-held iPhone. It features reference SMPL body pose and shape …

📊 13 results

📏 Metrics: Average MPJPE-PA (mm), Average MPJPE (mm), Average MVE (mm), Average MVE-PA (mm), Average MPJAE (deg), Average MPJAE-PA (deg), Jitter (10m/s^3)

H3WB

Human3.6M 3D WholeBody (H3WB) is a large scale dataset with 133 whole-body keypoint annotations on 100K images, made possible by …

📊 17 results

📏 Metrics: MPJPE

HSPACE

HSPACE (Human-SPACE) is a large-scale photo-realistic dataset of animated humans placed in complex synthetic indoor and outdoor environments. For all …

📊 1 results

📏 Metrics: MPJPE, MPVPE, PA-MPJPE, PA-MPVPE

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 80 results

📏 Metrics: Average MPJPE (mm), Using 2D ground-truth joints, Multi-View or Monocular, PA-MPJPE, Acceleration Error, Angular Error, MPVE (mm)

JTA

JTA is a dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now …

📊 1 results

📏 Metrics: F1(t=0.4m), F1(t=0.8m), F1(t=1.2m)

MPI-INF-3DHP

MPI-INF-3DHP is a 3D human body pose estimation dataset consisting of both constrained indoor and complex outdoor scenes. It records …

📊 105 results

📏 Metrics: MPJPE, AUC, PCK, 3DPCK, PA-MPJPE, Acceleration Error

Panoptic

CMU Panoptic is a large-scale dataset providing 3D pose annotations (1.5 million) for multiple people engaging in social activities. It …

📊 6 results

📏 Metrics: Average MPJPE (mm)

RICH

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object …

📊 4 results

📏 Metrics: MPJPE, MPVPE, PA-MPJPE, BoSE

SLOPER4D

SLOPER4D is a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation …

📊 4 results

📏 Metrics: Average MPJPE (mm)

UBody

UBody is a large-scale Upper-Body dataset with the following annotations: * 2D whole-body keypoints * 3D SMPLX annotations * Frame …

📊 4 results

📏 Metrics: PVE-All, PVE-Hands, PVE-Face, PA-PVE-All, PA-PVE-Hands, PA-PVE-Face

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 3 results

📏 Metrics: MOTA/L2

nuScenes

MuPoTS-3D

MuPoTS-3D (Multi-person Pose estimation Test Set in 3D) is a dataset for pose estimation composed of more than 8,000 frames …

📊 19 results

📏 Metrics: 3DPCK, MPJPE, AUC

3D Object Captioning

Objaverse

Objaverse is a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse …

📊 6 results

📏 Metrics: GPT-4, Sentence-BERT, SimCSE, Precision, Correctness, Hallucination

3D Object Detection

3RScan

A novel dataset and benchmark, which features 1482 RGB-D scans of 478 environments across multiple time steps. Each scene includes …

📊 3 results

📏 Metrics: [email protected], [email protected]

ARKitScenes

ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide …

📊 4 results

📏 Metrics: [email protected], [email protected]

Aria Everyday Objects

A small-scale, real-world Project Aria dataset with high-quality static 3D oriented bounding box annotations. Dataset Contents - Project Aria …

📊 4 results

📏 Metrics: mAP

Aria Synthetic Environments

Aria Synthetic Environments is a large-scale, fully simulated dataset created by …

📊 4 results

📏 Metrics: MAP

Cityscapes 3D

Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. …

📊 1 results

📏 Metrics: mDS

Clear Weather

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results

📏 Metrics: mod. Car [email protected]

DAIR-V2X

DAIR-V2X is a large-scale, multi-modality, multi-view dataset from real scenarios for VICAD. DAIR-V2X comprises 71254 LiDAR frames and 71254 Camera …

📊 1 results

📏 Metrics: AP50

DTTD-Mobile

Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR? We introduce DTTD-Mobile …

📊 5 results

📏 Metrics: ADD AUC, ADD-S AUC

Dense Fog

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results

📏 Metrics: mod. Car [email protected], mod. Cyclist [email protected], mod. Pedestrian [email protected], mod. mAP

Heavy Snowfall

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results

📏 Metrics: mod. Car [email protected]

Light Snowfall

We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …

📊 1 results

📏 Metrics: mod. Car [email protected]

MultiScan

We introduce MultiScan, a scalable RGBD dataset construction pipeline leveraging commodity mobile devices to scan indoor scenes with articulated objects …

📊 3 results

📏 Metrics: [email protected], [email protected]

ONCE

ONCE (One millioN sCenEs) is a dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists …

📊 2 results

📏 Metrics: mAP

OPV2V

OPV2V is a large-scale open simulated dataset for Vehicle-to-Vehicle perception. It contains over 70 interesting scenes, 11,464 frames, and 232,913 …

📊 5 results

📏 Metrics: [email protected]@Default, [email protected]@CulverCity

Rope3D

Roadside Perception 3D (Rope3D) is a dataset for autonomous driving and monocular 3D object detection task consisting of 50k images …

📊 7 results

📏 Metrics: [email protected]

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 7 results

📏 Metrics: [email protected], [email protected]

ScanNet++

ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, …

📊 3 results

📏 Metrics: [email protected], [email protected]

SimBEV

The SimBEV dataset is a collection of 320 scenes spread across all 11 CARLA maps and contains data from a …

📊 5 results

📏 Metrics: SDS, mAP, mATE, mAOE, mASE, mAVE

TruckScenes

Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public …

📊 4 results

📏 Metrics: NDS, mAP

V2X-SIM

V2X-Sim, short for vehicle-to-everything simulation, is a synthetic collaborative perception dataset in autonomous driving developed by AI4CE Lab at …

📊 5 results

📏 Metrics: mAP, mATE, mASE, mAOE

V2XSet

A large-scale V2X perception dataset using CARLA and OpenCDA

📊 6 results

📏 Metrics: AP0.5 (Perfect), AP0.7 (Perfect), AP0.5 (Noisy), AP0.7 (Noisy)

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 7 results

📏 Metrics: mAPH/L2

aiMotive Dataset

aiMotive dataset is a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with …

📊 3 results

📏 Metrics: BEV [email protected] Highway, BEV [email protected] Night, BEV [email protected] Rain, BEV [email protected] Urban

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 32 results

📏 Metrics: NDS, mAP, mATE, mASE, mAOE, mAVE, mAAE

nuScenes LiDAR only

Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have …

📊 7 results

📏 Metrics: NDS, NDS (val), mAP, mAP (val)

3D Object Reconstruction

BEHAVE

BEHAVE is a full body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits along …

📊 3 results

📏 Metrics: Chamfer Distance

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 1 results

📏 Metrics: 3DIoU

3D Object Tracking

RTB

The Robot Tracking Benchmark (RTB) is a synthetic dataset that facilitates the quantitative evaluation of 3D tracking algorithms for multi-body …

📊 1 results

📏 Metrics: ADDS AUC, Runtime [ms]

3D Open-Vocabulary Instance Segmentation

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 7 results

📏 Metrics: mAP

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 4 results

📏 Metrics: AP50 Base B8/N4, AP50 Novel B8/N4, AP50 Base B6/N6, AP50 Novel B6/N6

STPLS3D

Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry dataset with synthetic and real annotated 3D point clouds for …

📊 3 results

📏 Metrics: AP50

ScanNet200

The ScanNet200 benchmark studies 200-class 3D semantic segmentation - an order of magnitude more class categories than previous 3D scene …

📊 6 results

📏 Metrics: mAP, AP50, AP25, AP Head, AP Common, AP Tail

3D Point Cloud Classification

IntrA

IntrA is an open-access 3D intracranial aneurysm dataset that makes the application of points-based and mesh-based classification and segmentation models …

📊 11 results

📏 Metrics: F1 score (5-fold)

ModelNet40-C

ModelNet40-C is a comprehensive dataset to benchmark the corruption robustness of 3D point cloud recognition. We create ModelNet40-C based on …

📊 11 results

📏 Metrics: Error Rate

ScanObjectNN

ScanObjectNN is a newly published real-world dataset comprising 2902 3D objects in 15 categories. It is a challenging point …

📊 67 results

📏 Metrics: Overall Accuracy, Mean Accuracy, OBJ-BG (OA), OBJ-ONLY (OA), FLOPs, Number of params

Sydney Urban Objects

This dataset contains a variety of common urban road objects scanned with a Velodyne HDL-64E LIDAR, collected in the CBD …

📊 2 results

📏 Metrics: F1

3D Point Cloud Interpolation

DHB Dataset

Dynamic Human Bodies dataset (DHB), containing 10 point cloud sequences from the MITAMA dataset and 4 from the 8IVFB dataset. …

📊 5 results

📏 Metrics: CD, EMD

NL-Drive

A challenging multi-frame interpolation dataset for autonomous driving scenarios. Based on the principle of hard-sample selection and the diversity of …

📊 4 results

📏 Metrics: CD, EMD

3D Point Cloud Linear Classification

ScanObjectNN

ScanObjectNN is a newly published real-world dataset comprising 2902 3D objects in 15 categories. It is a challenging point …

📊 2 results

📏 Metrics: Overall Accuracy

3D Pose Estimation

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results

📏 Metrics: A3DP

HARPER

We introduce HARPER, a novel dataset for 3D body pose estimation and forecast in dyadic interactions between users and Spot, …

📊 1 results

📏 Metrics: Average MPJPE (mm)

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 3 results

📏 Metrics: Average MPJPE (mm)

K2HPD

Includes 100K depth images under challenging scenarios. Source: Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 8 results

📏 Metrics: IoU, Chamfer Distance, F-Score@1%

3D Scene Graph Alignment

3DSSG

3DSSG provides 3D semantic scene graphs for 3RScan. A semantic scene graph is defined by a set of tuples between …

📊 2 results

📏 Metrics: MRR, F1, Hits@1

3D Semantic Scene Completion

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 7 results

📏 Metrics: mIoU

NYUv2

The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both …

📊 26 results

📏 Metrics: mIoU

PRO-teXt

PRO-teXt is an extension of PROXD with the inclusion of text prompts to synthesize objects. There are 180/20 interactions for …

📊 3 results

📏 Metrics: F1, CD, CMD

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 1 results

📏 Metrics: CLIP

3D Scene Editing

LLFF

Local Light Field Fusion (LLFF) is a practical and robust deep learning solution for capturing and rendering novel views of …

📊 1 results

📏 Metrics: CLIP

4D Panoptic Segmentation

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 5 results

📏 Metrics: LSTQ

6D Pose Estimation

3D-BSLS-6D

The dataset consists of both real captures from Photoneo PhoXi structured light scanner devices annotated by hand and synthetic samples produced …

📊 1 results

📏 Metrics: eRE, eTE

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results

📏 Metrics: A3DP

DTTD-Mobile

Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR? We introduce DTTD-Mobile …

📊 8 results

📏 Metrics: ADD AUC, ADD-S AUC, AR CoU, AR CH, AR pCH

OPT

Accurately tracking the six degree-of-freedom pose of an object in real scenes is an important task in computer vision and …

📊 2 results

📏 Metrics: AUC

YCB-Video

The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects …

📊 9 results

📏 Metrics: ADDS AUC

Abnormal Event Detection In Video

UBI-Fights

UBI-Fights - Concerning a specific anomaly detection and still providing a wide diversity in fighting scenarios, the UBI-Fights dataset is …

📊 4 results

📏 Metrics: AUC, Decidability, EER

UCSD Ped2

The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …

📊 4 results

📏 Metrics: AUC

Action Anticipation

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 2 results

📏 Metrics: Verbs Recall@5, Objects Recall@5, Actions Recall@5

EGTEA

Extended GTEA Gaze+ (EGTEA Gaze+) is a large-scale dataset for FPV actions and gaze. It subsumes GTEA Gaze+ and comes …

📊 2 results

📏 Metrics: Top-1 Accuracy

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 8 results

📏 Metrics: Recall@5, Top-5 Verb, Top-5 Noun

EgoExoLearn

EgoExoLearn is a fascinating dataset designed to bridge the gap between egocentric and exocentric views of procedural activities. 1. **What …

📊 15 results

📏 Metrics: Frame-mAP 0.5, Video-mAP 0.1, Video-mAP 0.2, Video-mAP 0.5

Action Quality Assessment

AQA-7

Consists of 1106 action samples from seven actions with quality scores as measured by expert human judges. Source: [Action Quality …

📊 9 results

📏 Metrics: Spearman Correlation, RL2(*100)

EgoExoLearn

EgoExoLearn is a fascinating dataset designed to bridge the gap between egocentric and exocentric views of procedural activities. 1. **What …

📊 2 results

📏 Metrics: Accuracy

FineDiving

We construct a fine-grained video dataset organized by both semantic and temporal structures, where each structure contains two-level annotations. * …

📊 4 results

📏 Metrics: Spearman Correlation, RL2(*100)

JIGSAWS

The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a surgical activity dataset for human motion modeling. The data …

📊 5 results

📏 Metrics: Spearman Correlation

MTL-AQA

A new multitask action quality assessment (AQA) dataset, the largest to date, comprising more than 1600 diving samples; contains …

📊 21 results

📏 Metrics: Spearman Correlation, RL2(*100)

Rhythmic Gymnastic

The Rhythmic Gymnastics dataset contains videos of four different types of gymnastics routines: ball, clubs, hoop and ribbon. Each type …

📊 1 results

📏 Metrics: Spearman Correlation

UI-PRMD

UI-PRMD is a data set of movements related to common exercises performed by patients in physical therapy and rehabilitation programs. …

Win-Fail Action Understanding

First of its kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick …

UCF101

The UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 5 results

📏 Metrics: 3-fold Accuracy

Action Segmentation

50 Salads

Activity recognition research has shifted focus from distinguishing full-body motion patterns to recognizing complex interactions of multiple entities. Manipulative gestures …

📊 21 results

📏 Metrics: F1@50%, F1@25%, F1@10%, Acc, Edit

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 5 results

📏 Metrics: F1@10%, F1@25%, F1@50%, Edit, MoF

Breakfast

The Breakfast Actions Dataset comprises 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different …

📊 28 results

📏 Metrics: Average F1, F1@50%, F1@25%, F1@10%, Edit, Acc, mIoU, F1

COIN

The COIN dataset (a large-scale dataset for COmprehensive INstructional video analysis) consists of 11,827 videos related to 180 different tasks …

📊 9 results

📏 Metrics: Frame accuracy

GTEA

The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities such as making sandwich, tea, or coffee. …

📊 19 results

📏 Metrics: F1@50%, F1@25%, F1@10%, Acc, Edit

JIGSAWS

The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a surgical activity dataset for human motion modeling. The data …

📊 7 results

📏 Metrics: Edit Distance, Accuracy, F1@10, F1@25, F1@50

MPII Cooking 2 Dataset

A dataset which provides detailed annotations for activity recognition. Source: [Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script …

📊 1 results

📏 Metrics: Accuracy, mIoU

Youtube INRIA Instructional

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car …

📊 2 results

📏 Metrics: Acc, F1

Action Understanding

Win-Fail Action Understanding

First of its kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick …

📊 1 results

📏 Metrics: 2-Class Accuracy

Activity Detection

AVA-Speech

Contains densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. …

📊 3 results

📏 Metrics: ROC-AUC

Activity Recognition

RWF-2000

A database with 2,000 videos captured by surveillance cameras in real-world scenes. Source: [RWF-2000: An Open Large Scale Video Database …

📊 4 results

📏 Metrics: Accuracy

Stanford40

The Stanford 40 Action Dataset contains images of humans performing 40 actions. In each image, we provide a bounding box …

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 6 results

📏 Metrics: Attack: PGD20, Attack: AutoAttack, Attack: DeepFool, Robust Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 2 results

📏 Metrics: Attack: AutoAttack

WSJ0-2mix

WSJ0-2mix is a speech recognition corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus. Source: [Deep …

📊 1 results

📏 Metrics: SDR

Affordance Detection

3D AffordanceNet

3D AffordanceNet is a dataset of 23k shapes for visual affordance. It consists of 56,307 well-defined affordance information annotations for …

📊 1 results

📏 Metrics: AIOU, mAP

Affordance Recognition

HICO-DET

HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and …

📊 5 results

📏 Metrics: mIoU, road, car, truck, bus, motorcycle, bicycle, rider, pedestrian

Binary Classification

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results

📏 Metrics: F1-Score

fake

[Real or Fake]: Fake Job Description Prediction. This dataset contains 18K job descriptions out of which about 800 are …

📊 8 results

📏 Metrics: AUROC

kickstarter

Kickstarter is a community of more than 10 million people comprising creative, tech enthusiasts who help in bringing creative …

📊 4 results

📏 Metrics: AUROC

Bird's-Eye View Semantic Segmentation

SimBEV

The SimBEV dataset is a collection of 320 scenes spread across all 11 CARLA maps and contains data from a …

📊 5 results

📏 Metrics: mIoU, road, car, truck, bus, motorcycle, bicycle, rider, pedestrian

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 15 results

📏 Metrics: IoU veh - 224x480 - Vis filter. - 100x100 at 0.5, IoU veh - 448x800 - Vis filter. - 100x100 at 0.5, IoU veh - 224x480 - No vis filter - 100x100 at 0.5, IoU veh - 448x800 - No vis filter - 100x100 at 0.5, IoU ped - 224x480 - Vis filter. - 100x100 at 0.5, IoU lane - 224x480 - 100x100 at 0.5, IoU veh - 224x480 - No vis filter - 100x50 at 0.25, IoU vehicle - Setting 3

Blind Face Restoration

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 6 results

📏 Metrics: FID, LPIPS, PSNR

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of 5749 identities with …

📊 9 results

📏 Metrics: FID

WIDER

WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and …

📊 9 results

📏 Metrics: FID

Boundary Detection

CoAuthor

CoAuthor is a dataset designed for revealing GPT-3's capabilities in assisting creative and argumentative writing. CoAuthor captures rich interactions between …

📊 3 results

📏 Metrics: Cohen’s Kappa score

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results

📏 Metrics: odsF

RoFT

RoFT is a dataset of 21,000 human annotations of generated text. The task is "Boundary detection" i.e. given a passage …

📊 4 results

📏 Metrics: Accuracy (%), MSE

RoFT-chatgpt

RoFT-chatgpt is a variation of the RoFT dataset, where the same human prompts are continued with the gpt-3.5-turbo model. Each dataset …

📊 4 results

📏 Metrics: Accuracy (%), MSE

UruDendro

UruDendro is a database of wood cross section images of commercially grown Pinus taeda trees from northern Uruguay. It is …

📊 2 results

📏 Metrics: Average Precision, Average Recall, F1-score, FScore

Breast Cancer Histology Image Classification

BreakHis

The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 …

📊 3 results

📏 Metrics: Accuracy (%), 1:1 Accuracy, Accuracy (Inter-Patient)

Calving Front Delineation In Synthetic Aperture Radar Imagery

CaFFe

The temporal variability in calving front positions of marine-terminating glaciers permits inference on the frontal ablation. Frontal ablation, the sum …

📊 1 results

📏 Metrics: Mean Distance Error

Camera Pose Estimation

KITTI Odometry Benchmark

The odometry benchmark consists of 22 stereo sequences, saved in lossless PNG format. We provide 11 sequences (00-10) with …

📊 7 results

📏 Metrics: Average Translational Error et[%], Average Rotational Error er[%], Absolute Trajectory Error [m]

Camouflaged Object Segmentation

CAMO

The Camouflaged Object (CAMO) dataset is specifically designed for the task of camouflaged object segmentation. We focus on two categories, i.e., naturally …

📊 10 results

📏 Metrics: S-Measure, Weighted F-Measure, MAE

Camouflaged Animal Dataset

The nine (moving camera) videos in this benchmark exhibit camouflaged animals that are difficult to see in a single frame, …

📊 2 results

📏 Metrics: S-measure, weighted F-measure, MAE, mDice, mIoU

MoCA-Mask

The original Moving Camouflaged Animals (MoCA) Dataset includes 37K frames from 141 YouTube Video sequences with resolution and sampling rate …

📊 3 results

📏 Metrics: S-measure, weighted F-measure, MAE, mDice, mIoU

NC4K

As far as we know, there only exists one large camouflaged object testing dataset, the COD10K, while the sizes of …

📊 6 results

📏 Metrics: S-measure, weighted F-measure, MAE

Cancer Classification

Multi-omics mRNA, miRNA, and DNA Methylation Dataset

The dataset contains multi-omics data, including mRNA, miRNA, and DNA methylation. The dataset comprises 8,464 samples involving 2,794 omics features …

CDD Dataset (season-varying)

Source: CHANGE DETECTION IN REMOTE SENSING IMAGES USING CONDITIONAL ADVERSARIAL NETWORKS

📊 13 results

📏 Metrics: F1-Score, F1, Precision, Recall, Overall Accuracy, KC, IoU

CLCD

The CLCD dataset consists of 600 image pairs of cropland change samples, with 360 pairs for training, 120 pairs for …

📊 3 results

📏 Metrics: F1

ChangeSim

ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation …

📊 1 results

📏 Metrics: Category mIoU

DSIFN-CD

The dataset is manually collected from Google Earth. It consists of six large bi-temporal high resolution images covering six cities …

📊 7 results

📏 Metrics: F1, Precision, Recall, Overall Accuracy, KC, IoU, Params(M)

EGY-BCD

Bi-temporal images in the EGY-BCD dataset are taken from 4 different regions located in Egypt, including New Mansoura, El Galala …

📊 3 results

📏 Metrics: F1

GVLM

For change detection tasks, current open-source datasets mainly focus on building extraction (e.g., WHU building dataset and LEVIR-CD dataset) (Chen …

📊 4 results

📏 Metrics: F1

LEVIR-CD

LEVIR-CD is a new large-scale remote sensing building Change Detection dataset. The introduced dataset would be a new benchmark for …

📊 24 results

📏 Metrics: F1, IoU, Overall Accuracy, F1-score, Recall, Precision

PCD

The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total number of verses is …

📊 1 results

📏 Metrics: F1 score

S2Looking

S2Looking is a building change detection dataset that contains large-scale side-looking satellite images captured at varying off-nadir angles. The S2Looking …

📊 11 results

📏 Metrics: F1-Score, Precision, Recall, OA, KC, IoU, F1

SECOND

SECOND is a well-annotated semantic change detection dataset. To ensure data diversity, we firstly collect 4662 pairs of aerial images …

📊 2 results

📏 Metrics: SeK, Fscd, mIoU

WHU Building Dataset

We manually edited an aerial and a satellite imagery dataset of building samples and named it a WHU building dataset. …

📊 7 results

📏 Metrics: F1-score

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results

📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results

📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results

📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results

📏 Metrics: Test Accuracy

XImageNet-12

Enlarges the dataset to understand how image backgrounds affect computer vision ML models, with the following topics: Blur Background …

📊 3 results

📏 Metrics: Robustness Score

Clothing Attribute Recognition

Clothing Attributes Dataset

We introduce the Clothing Attribute Dataset for promoting research in learning visual attributes for objects. The dataset contains 1856 images, …

📊 3 results

📏 Metrics: Accuracy

Colorization

ImageNet ctest10k

Colorization validation set for unconditional/conditional colorization tasks. A subset of the ImageNet validation images that excludes any grayscale single-channel images.

Set5

The Set5 dataset is a dataset consisting of 5 images (“baby”, “bird”, “butterfly”, “head”, “woman”) commonly used for testing performance …

Cross-Modal Retrieval

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five …

📊 1 results

📏 Metrics: Text-to-image Medr

ChEBI-20

Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 5 results

📏 Metrics: Hits@1, Hits@10, Mean Rank, Test MRR

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 23 results

📏 Metrics: Image-to-text R@1, Image-to-text R@5, Image-to-text R@10, Text-to-image R@1, Text-to-image R@5, Text-to-image R@10

MSCOCO

📊 1 results

📏 Metrics: Image-to-text R@1

RSICD

The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for remote sensing image captioning task. It contains more than …

📊 7 results

📏 Metrics: Mean Recall, Image-to-text R@1, text-to-image R@1

RSITMD

📊 7 results

📏 Metrics: Image-to-text R@1, Mean Recall, text-to-image R@1

Recipe1M+

Recipe1M+ is a dataset which contains one million structured cooking recipes with 13M associated images. Source: [Recipe1M+: A Dataset for …

📊 2 results

📏 Metrics: Image-to-text R@1, Text-to-image R@1

SoundingEarth

SoundingEarth consists of co-located aerial imagery and audio samples all around the world.

📊 2 results

📏 Metrics: Median Rank, Image-to-sound R@100, Sound-to-image R@100

Data Augmentation

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 5 results

📏 Metrics: Percentage error

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 17 results

📏 Metrics: Accuracy (%)

Deblurring

Beam-Splitter Deblurring (BSD)

Using the proposed beam-splitter acquisition system, we have collected a new real-world video deblurring dataset (BSD). We collected blurry/sharp video …

📊 4 results

📏 Metrics: PSNR

GoPro

The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 …

📊 52 results

📏 Metrics: PSNR, SSIM

HIDE

Consists of 8,422 blurry and sharp image pairs with 65,784 densely annotated FG human bounding boxes. Source: Human-Aware Motion Deblurring

📊 1 results

📏 Metrics: PSNR

MSU BASED

A qualitative dataset of real blurred videos, created using a beam-splitter setup in a lab environment.

📊 11 results

📏 Metrics: Subjective, SSIM, PSNR, VMAF, LPIPS, ERQAv2.0

REDS

The realistic and dynamic scenes (REDS) dataset was proposed in the NTIRE19 Challenge. The dataset is composed of 300 video …

📊 3 results

📏 Metrics: Average PSNR

RSBlur

The RSBlur dataset provides pairs of real and synthetic blurred images with ground truth sharp images. The dataset enables the …

Benchmarking Denoising Algorithms with Real Photographs. This dataset consists of 50 pairs of noisy and (nearly) noise-free images captured with …

📊 1 results

📏 Metrics: Average PSNR, SSIM (sRGB)

Darmstadt Noise Dataset

The dataset contains data about hydrogen storage in metal hydrides.

📊 10 results

📏 Metrics: PSNR

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 2 results

📏 Metrics: RMSE

PLAD

PLAD is a dataset where sparse depth is provided by line-based visual SLAM to verify StructMDC.

📊 1 results

📏 Metrics: MAE, RMSE

VOID

The dataset was collected using the Intel RealSense D435i camera, which was configured to produce synchronized accelerometer and gyroscope measurements …

📊 6 results

📏 Metrics: MAE, RMSE, iMAE, iRMSE

Depth Estimation

DCM

The DCM dataset is composed of 772 annotated images from 27 golden age comic books. We freely collected them from …

📊 3 results

📏 Metrics: Abs Rel, RMSE, RMSE log, Sq Rel

DIODE

Diode Dense Indoor/Outdoor DEpth (DIODE) is the first standard dataset for monocular depth estimation comprising diverse indoor and outdoor scenes …

📊 2 results

📏 Metrics: Delta < 1.25, Delta < 1.25^2, Delta < 1.25^3

Matterport3D

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 1 results

📏 Metrics: Abs Rel

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 2 results

📏 Metrics: RMSE, absolute relative error

Taskonomy

Taskonomy provides a large and high-quality dataset of varied indoor scenes. - Complete pixel-level geometric information via aligned meshes. - …

📊 1 results

📏 Metrics: L1 error

eBDtheque

The eBDtheque database is a selection of one hundred comic pages from America, Japan (manga) and Europe. Image source: http://ebdtheque.univ-lr.fr/database/

📊 3 results

📏 Metrics: Abs Rel, RMSE, RMSE log, Sq Rel

Dimensionality Reduction

EMNIST

EMNIST (extended MNIST) has 4 times more data than MNIST. It is a set of handwritten digits with a 28 …

📊 2 results

📏 Metrics: Classification Accuracy

Document AI

EPHOIE

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE …

📊 1 results

📏 Metrics: Average F1

Document Image Classification

RVL-CDIP

The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. …

📊 29 results

📏 Metrics: Accuracy, Parameters

Tobacco-3482

The Tobacco-3482 dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The …

📊 9 results

📏 Metrics: Accuracy, Memory

Document Layout Analysis

D4LA

The D4LA dataset is a diverse benchmark for document layout analysis (DLA) derived from the RVL-CDIP dataset. It focuses on …

📊 3 results

📏 Metrics: mAP, Model Parameters

RVL-CDIP

The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. …

📊 1 results

📏 Metrics: FAR, WAR

U-DIADS-Bib

U-DIADS-Bib is a proprietary dataset developed through the collaboration of computer scientists and humanities at the University of Udine. It …

📊 1 results

📏 Metrics: Class Average IoU, Class Average IoU (Few-shot setting)

Document Text Classification

Tobacco-3482

The Tobacco-3482 dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The …

📊 3 results

📏 Metrics: Accuracy, Training time (hours)

Domain Adaptation

ECG-Image-Database

The George B. Moody PhysioNet Challenges are annual competitions that invite participants to develop automated approaches for addressing important physiological …

📊 1 results

📏 Metrics: SNR

Edge Detection

BIPED

It contains 250 outdoor images of 1280×720 pixels each. These images have been carefully annotated by experts on the …

📊 4 results

📏 Metrics: ODS, Number of parameters (M)

BRIND

BRIND, short for BSDS-RIND, is the first public benchmark dedicated to simultaneously studying the four edge …

📊 2 results

📏 Metrics: ODS, Number of parameters (M)

BSDS500

Berkeley Segmentation Data Set 500 (BSDS500) is a standard benchmark for contour detection. This dataset is designed for evaluating natural …

📊 1 results

📏 Metrics: F1

CID

The CID (Campus Image Dataset) is a dataset captured in low-light environments with the help of Android programming. Its basic …

📊 1 results

📏 Metrics: ODS

MDBD

In order to study the interaction of several early visual cues (luminance, color, stereo, motion) during boundary detection in challenging …

📊 5 results

📏 Metrics: ODS, Number of parameters (M)

SBD

The Semantic Boundaries Dataset (SBD) is a dataset for predicting pixels on the boundary of the object (as opposed to …

📊 2 results

📏 Metrics: Maximum F-measure

UDED

This dataset is a collection of 1, 2, or 3 images from: BIPED, BSDS500, BSDS300, DIV2K, WIRE-FRAME, CID, CITYSCAPES, ADE20K, …

📊 3 results

📏 Metrics: ODS

Emotion Classification

CAER-Dynamic

13,201 clips from 79 TV shows. Each video clip was manually annotated with six emotion categories, including “anger”, “disgust”, “fear”, …

📊 1 results

📏 Metrics: Accuracy

CMU-MOSEI

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence-level sentiment analysis and emotion recognition in …

📊 4 results

📏 Metrics: Accuracy, Weighted Accuracy

MFA

The MFA (Many Faces of Anger) dataset includes 200 in-the-wild videos from North American and Persian cultures with fine-grained labels …

📊 2 results

📏 Metrics: F-F1 score (Comb.), F-F1 score (Persian), V-F1 score (Comb.), V-F1 score (NA), F-F1 score (NA), V-F1 score (Persian)

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions. Source: [A Low Power, Fully Event-Based Gesture Recognition …

📊 1 results

📏 Metrics: Accuracy (%)

N-Caltech 101

The Neuromorphic-Caltech101 (N-Caltech101) dataset is a spiking version of the original frame-based Caltech101 dataset. The original dataset contained both a …

📊 1 results

📏 Metrics: Accuracy (%)

Explainable Artificial Intelligence (XAI)

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

📊 1 results

📏 Metrics: AD-Related Brain Areas Identified

Explanatory Visual Question Answering

GQA-REX

A GQA-based dataset with 1,040,830 multi-modal explanations of visual reasoning processes.

📊 4 results

📏 Metrics: BLEU-4, CIDEr, GQA-test, GQA-val, Grounding, METEOR, ROUGE-L, SPICE

Eyeblink detection

HUST-LEBW

An eyeblink detection in the wild dataset.

📊 1 results

📏 Metrics: Avg. F1

MPEblink

The color FERET database is a dataset for face recognition. It contains 11,338 color images of size 512×768 pixels captured …

📊 1 results

📏 Metrics: Pearson Correlation

LFW

The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of the 5749 identities with …

📊 1 results

📏 Metrics: Equal Error Rate

mebeblurf

An evaluation protocol for face verification focusing on a large intra-pair image quality difference. Real-world face recognition applications often deal …

📊 1 results

📏 Metrics: Accuracy

mebeblurf

📊 3 results

📏 Metrics: FNMR [%] @ 10-3 FMR

Face Verification

The Radboud Faces Database (RaFD) is a set of pictures of 67 models (both adult and children, males and females) …

📊 1 results

📏 Metrics: Accuracy

SFEW

The Static Facial Expressions in the Wild (SFEW) dataset is a dataset for facial expression recognition. It was created by …

📊 3 results

📏 Metrics: Accuracy

Facial Landmark Detection

300W

The 300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large …

📊 13 results

📏 Metrics: NME, Mean Error Rate

AFLW2000-3D

AFLW2000-3D is a dataset of 2000 images that have been annotated with image-level 68-point 3D facial landmarks. This dataset is …

📊 1 results

📏 Metrics: GTE

COCO-WholeBody

📊 1 results

📏 Metrics: ACCURACY

UT Zappos50K

UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into …

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 1 results

📏 Metrics: Harmonic mean

Few-Shot Object Detection

CAMO-FS

CAMO-FS Dataset comes with the paper entitled The Art of Camouflage: Few-shot Learning for Animal Detection and Segmentation. DOI: https://doi.org/10.1109/ACCESS.2024.3432873 …

📊 26 results

📏 Metrics: box AP

Few-Shot Semantic Segmentation

FSS-1000

FSS-1000 is a 1000-class dataset for few-shot segmentation. The dataset contains a significant number of objects that have never been …

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into …

📊 18 results

📏 Metrics: Accuracy

iNaturalist

The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. Those categories belong to …

325 word images intended for font recognition, whose fonts are included in [VFR-447] (and [VFR-2420]). > (...) 325 real world …

📊 1 results

📏 Metrics: Top 1 Accuracy, Top 5 Error Rate, Top-1 Error Rate, Top 10 Accuracy, Top 5 Accuracy

Future Hand Prediction

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 1 results

📏 Metrics: Disp(Total), M.Disp(Left), C.Disp(Left), M.Disp(Right), C.Disp(Right)

Gait Recognition

Gait3D

Gait3D is a large-scale 3D representation-based gait recognition dataset. It contains 4,000 subjects and over 25,000 sequences extracted from 39 …

📊 2 results

📏 Metrics: Rank-1, Rank-5, mAP, mINP

OUMVLP

The OU-ISIR Gait Database, Multi-View Large Population Dataset (OU-MVLP) is meant to aid research efforts in the general area of …

📊 6 results

📏 Metrics: Averaged rank-1 acc(%)

Gaze Estimation

ETH-XGaze

Consists of over one million high-resolution images of varying gaze under extreme head poses. The dataset is collected from 110 …

📊 1 results

📏 Metrics: Angular Error

Gaze360

Understanding where people are looking is an informative social cue. In this work, we present Gaze360, a large-scale gaze-tracking dataset …

📊 4 results

📏 Metrics: Angular Error

GazeCapture

From scientific research to commercial applications, eye tracking is an important tool across many domains. Despite its range of applications, …

📊 2 results

📏 Metrics: Euclidean Mean Error (EME), FPS

MPSGaze

This is a synthetic dataset containing full images (instead of only cropped faces) that provides ground truth 3D gaze directions …

gRefCOCO

gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset that contains multi-target, no-target, and single-target expressions.

📊 5 results

📏 Metrics: Precision@(F1=1, IoU≥0.5), N-acc.

Generative 3D Object Classification

Objaverse

Objaverse is a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse …

📊 7 results

📏 Metrics: Objaverse (Average), Objaverse (I), Objaverse (C)

Geometric Matching

HPatches

The HPatches is a recent dataset for local patch descriptor evaluation that consists of 116 sequences of 6 images with …

📊 1 results

📏 Metrics: 1:1 Accuracy

Benchmark for HMER and OHMER. Source: CROHME 2014

📊 14 results

📏 Metrics: ExpRate

CROHME 2016

Source: ICFHR2016 CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions

📊 13 results

📏 Metrics: ExpRate

CROHME 2019

Source: ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection

📊 12 results

📏 Metrics: ExpRate

HME100K

Source: HME100K

📊 11 results

📏 Metrics: ExpRate

Handwritten Text Recognition

📏 Metrics: F-score (average), F-score (stroke), F-score (word), F-score (text-line), F-score (para., layout)

Highlight Detection

QVHighlights

The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language …

📊 20 results

📏 Metrics: mAP, Hit@1

TvSum

Introduced by Song et al. in TVSum: Summarizing web videos using titles. The TVSum dataset comprises 50 videos, with durations …

📊 7 results

📏 Metrics: mAP

Holdout Set

xView3-SAR

Unsustainable fishing practices worldwide pose a major threat to marine resources and ecosystems. Identifying vessels that do not show up …

📊 5 results

📏 Metrics: Aggregate xView3 Score

Human Instance Segmentation

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses, and instance masks. This dataset contains …

📊 14 results

📏 Metrics: AP

Human Interaction Recognition

EPIC-SOUNDS

EPIC-SOUNDS is a large scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of …

📊 1 results

📏 Metrics: Top-1 accuracy %

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 4 results

📏 Metrics: Accuracy (Cross-Subject), Accuracy (Cross-View)

NTU RGB+D 120

NTU RGB+D 120 is a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and …

📊 5 results

📏 Metrics: Accuracy (Cross-Setup), Accuracy (Cross-Subject)

SBU / SBU-Refine

SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities that are recorded using the Microsoft …

📊 2 results

📏 Metrics: Accuracy

UT-Interaction

The UT-Interaction dataset contains videos of continuous executions of 6 classes of human-human interactions: shake-hands, point, hug, push, kick and …

📊 1 results

📏 Metrics: Accuracy (Set 1), Accuracy (Set 2)

Human Mesh Recovery

BEDLAM

BEDLAM is a large-scale synthetic video dataset designed to train and test algorithms on the task of 3D human pose …

📊 3 results

📏 Metrics: PVE-All

Human Parsing

4D-DRESS

4D-DRESS is the first real-world 4D dataset of human clothing, capturing 64 human outfits in more than 520 motion sequences. …

📊 6 results

📏 Metrics: mAcc, mIoU

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results

📏 Metrics: mIoU

Human Part Segmentation

CIHP

The Crowd Instance-level Human Parsing (CIHP) dataset has 38,280 diverse human images. Each image in CIHP is labeled with pixel-wise …

📊 6 results

📏 Metrics: Mean IoU

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 3 results

📏 Metrics: mIoU

PASCAL-Part

PASCAL-Part is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL object detection task …

📊 7 results

📏 Metrics: mIoU

Human action generation

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 5 results

📏 Metrics: MMDa, MMDs

HumanAct12

HumanAct12 is a new 3D human motion dataset adopted from the polar image and 3D pose dataset PHSPD, with proper …

📊 1 results

📏 Metrics: Accuracy, Diversity, FID, Multimodality

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 2 results

📏 Metrics: FID (CS), FID (CV)

NTU RGB+D 120

NTU RGB+D 120 is a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and …

📊 2 results

📏 Metrics: FID (CS), FID (CV)

NTU RGB+D 2D

NTU RGB+D 2D is a curated version of NTU RGB+D often used for skeleton-based action prediction and synthesis. It contains …

📊 5 results

📏 Metrics: MMDa (CS), MMDs (CS), MMDa (CV), MMDs (CV)

UESTC RGB-D

UESTC RGB-D Varying-view action database contains 40 categories of aerobic exercise. We utilized 2 Kinect V2 cameras in 8 fixed …

📊 1 results

📏 Metrics: Accuracy, Diversity, FID, Test

Human-Object Interaction Anticipation

VidHOI

VidHOI is a video-based human-object interaction detection benchmark. VidHOI is based on VidOR which is densely annotated with all humans …

📊 3 results

📏 Metrics: Person-wise Top5: t=1(mAP@0.5), Person-wise Top5: t=3(mAP@0.5), Person-wise Top5: t=5(mAP@0.5)

Human-Object Interaction Concept Discovery

HICO-DET

HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and …

📊 3 results

📏 Metrics: Unknown (AP)

Human-Object Interaction Detection

HICO

HICO is a benchmark for recognizing human-object interactions (HOI). Key features: - A diverse set of interactions with common object …

📊 8 results

📏 Metrics: mAP

HICO-DET

HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and …

📊 54 results

📏 Metrics: mAP, Time Per Frame (ms), Detection: Full (mAP@0.5), Detection: Non-Rare (mAP@0.5), Detection: Rare (mAP@0.5)

MECCANO

The MECCANO dataset is the first dataset of egocentric videos to study human-object interactions in industrial-like settings. The MECCANO dataset …

📊 1 results

📏 Metrics: mAP@0.5 role

V-COCO

Verbs in COCO (V-COCO) is a dataset that builds off COCO for human-object interaction detection. V-COCO provides 10,346 images (2,533 …

📊 34 results

📏 Metrics: AP(S1), AP(S2), Time Per Frame(ms), MAP

VidHOI

VidHOI is a video-based human-object interaction detection benchmark. VidHOI is based on VidOR which is densely annotated with all humans …

📊 3 results

📏 Metrics: Detection: Full (mAP@0.5), Detection: Non-Rare (mAP@0.5), Detection: Rare (mAP@0.5), Oracle: Full (mAP@0.5), Oracle: Non-Rare (mAP@0.5), Oracle: Rare (mAP@0.5)

Image Attribution

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 8 results

📏 Metrics: Insertion AUC score (ResNet-101), Deletion AUC score (ResNet-101)

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 8 results

📏 Metrics: Insertion AUC score (ArcFace ResNet-101), Deletion AUC score (ArcFace ResNet-101)

VGGFace2

VGGFace2 is a large-scale face recognition dataset. Images are downloaded from Google Image Search and have large variations in pose, …

We construct Gaze-CIFAR-10, a gaze-augmented image dataset based on the standard CIFAR-10 benchmark, enhanced with human eye-tracking annotations collected using …

📊 2 results

📏 Metrics: 1:1 Accuracy

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 3 results

📏 Metrics: Consistency, FID

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 4 results

📏 Metrics: Consistency, FID

NIR2RGB VCIP Challange Dataset

Image Deblurring

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 3 results

📏 Metrics: FID, PSNR, SSIM

GoPro

The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 …

📊 44 results

📏 Metrics: PSNR, SSIM, Params (M), FID, LPIPS

HIDE

Consists of 8,422 blurry and sharp image pairs with 65,784 densely annotated FG human bounding boxes. Source: Human-Aware Motion Deblurring

📊 5 results

📏 Metrics: PSNR, SSIM

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

Image Editing

GEdit-Bench-EN

This dataset is a new benchmark, grounded in real-world usage, developed to support more authentic and comprehensive evaluation of …

📊 3 results

📏 Metrics: Overall, Perceptual Quality, Semantic Consistency

ImgEdit-Data

ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex …

Fashion-MNIST

Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 5 results

📏 Metrics: FID, Precision, Recall

KMNIST

📊 1 results

📏 Metrics: FID

LLVIP

Visible-infrared Paired Dataset for Low-light Vision * 30976 images (15488 pairs) * 24 dark scenes, 2 daytime scenes * …

📊 1 results

📏 Metrics: PSNR, SSIM

LSUN

The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN …

📊 1 results

📏 Metrics: Average FID

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 11 results

📏 Metrics: bits/dimension, FID, Precision, Recall, PSNR, SSIM

MetFaces

MetFaces is an image dataset of human faces extracted from works of art. The dataset consists of 1336 high-quality PNG …

📊 3 results

📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 1 results

📏 Metrics: MAE, PSNR, RMSE, SSIM

Apolloscape Inpainting

The Inpainting dataset consists of synchronized Labeled image and LiDAR scanned point clouds. It's captured by HESAI Pandora All-in-One Sensing …

📊 1 results

📏 Metrics: RMSE

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 5 results

📏 Metrics: FID, PSNR, SSIM, LPIPS

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 6 results

📏 Metrics: FID, P-IDS, U-IDS

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 5 results

📏 Metrics: FID, PSNR, SSIM

Image Manipulation

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 2 results

📊 7 results

📏 Metrics: SAD, MSE, MAD

Composition-1K

Composition-1K is a large-scale image matting dataset including 49300 training images and 1000 testing images. Image source: https://arxiv.org/pdf/1703.03872v3.pdf

📊 13 results

📏 Metrics: MSE, SAD, Conn, Grad

Distinctions-646

Distinctions-646 is composed of 646 foreground images with manually annotated alpha mattes.

📊 4 results

📏 Metrics: SAD, MSE, Grad, Conn, Trimap

P3M-10k

P3M-10k contains 10421 high-resolution real-world face-blurred portrait images, along with their manually labeled alpha mattes. The Dataset is aimed to …

📊 5 results

📏 Metrics: SAD, MSE, MAD

PPM-100

PPM is a portrait matting benchmark with the following characteristics: - Fine Annotation - All images are labeled and checked …

📊 1 results

📏 Metrics: MAD, MSE

Image Outpainting

Image Restoration

CDD-11

An image restoration dataset

📊 11 results

📏 Metrics: Average PSNR (dB), SSIM

UHDM

The first ultra-high-definition image demoireing dataset, consisting of 4,500 4K resolution training pairs and 500 standard 4K resolution validation pairs.

📏 Metrics: Average PSNR (dB)

Image Stitching

HPatches

HPatches is a recent dataset for local patch descriptor evaluation that consists of 116 sequences of 6 images with …

📏 Metrics: Test error

PhysioNet Challenge 2012

The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units …

📊 1 results

📏 Metrics: AUROC

Sprites

The Sprites dataset contains 60 pixel color images of animated characters (sprites). There are 672 sprites, 500 for training, 100 …

iShape is an irregular shape dataset for instance segmentation. iShape contains six sub-datasets with one real and five synthetics, each …

📊 1 results

📏 Metrics: mask AP

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 1 results

📏 Metrics: MOTA

Instance Shadow Detection

SOBA

A new dataset called SOBA, named after Shadow-OBject Association, with 3,623 pairs of shadow and object instances in 1,000 photos, …

📊 2 results

📏 Metrics: mask SOAP, Bounding Box SOAP, Asso. AP_segm, Asso. AP_bbox, Instance AP_segm, Instance AP_bbox

Instruction Following

IFEval

This dataset evaluates instruction following ability of large language models. There are 500+ prompts with instructions such as "write an …

📊 4 results

📏 Metrics: Inst-level loose-accuracy, Inst-level strict-accuracy, Prompt-level loose-accuracy, Prompt-level strict-accuracy

Interactive Segmentation

DAVIS

The Densely Annotated Video Segmentation dataset (DAVIS) is a high-quality, high-resolution densely annotated video segmentation dataset under …

📊 13 results

📏 Metrics: NoC@90, NoC@85, NoC@95

DAVIS-585

A dataset for interactive segmentation with simulated initial masks.

📊 2 results

📏 Metrics: NoC@90, NoC@85

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 2 results

📏 Metrics: NoC@95, NoC@90, NoC@85

SBD

The Semantic Boundaries Dataset (SBD) is a dataset for predicting pixels on the boundary of the object (as opposed to …

📊 10 results

📏 Metrics: NoC@90, NoC@85, NoC@95

Inverse Rendering

Stanford-ORB

We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide …

📊 7 results

📏 Metrics: HDR-PSNR

Inverse-Tone-Mapping

MSU HDR Video Reconstruction Benchmark

This is a dataset for a video inverse-tone-mapping task. The dataset contains various contents for the task of restoring HDR …

📊 7 results

📏 Metrics: HDR-PSNR, HDR-SSIM, HDR-VQM

JPEG Decompression

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 6 results

📏 Metrics: FID-5K, IS, CA, PD

Key Information Extraction

Keyword Spotting

FKD

The football keyword dataset (FKD), as a new keyword spotting dataset in Persian, is collected with crowdsourcing. This dataset contains …

📊 2 results

📏 Metrics: Accuracy

TAU Urban Acoustic Scenes 2019

TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …

📊 2 results

📏 Metrics: Accuracy

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …

📊 2 results

📏 Metrics: Accuracy (%)

Kinematic Based Workflow Recognition

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 6 results

📏 Metrics: Average AD-Accuracy

Kinship Verification

KinFaceW-I

KinFaceW-I dataset contains 533 pairs of facial images of persons with a kin relation. Four different kin relations are considered …

📊 3 results

📏 Metrics: Mean Accuracy

KinFaceW-II

KinFaceW-II Dataset consists of 1000 pairs of facial images of individuals with a kin relation. This database also considers four …

📊 3 results

📏 Metrics: Mean Accuracy

Knowledge Distillation

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 27 results

📏 Metrics: Top-1 Accuracy (%)

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 4 results

📏 Metrics: box AP, mask AP, mAP

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 1 results

📏 Metrics: AP

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 50 results

📏 Metrics: Top-1 accuracy %, model size, CRD training setting

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results

📏 Metrics: RMSE, model size

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 2 results

📏 Metrics: mAP

LIDAR Semantic Segmentation

Paris-Lille-3D

Paris-Lille-3D is a benchmark for point cloud classification. The point cloud has been labeled entirely by hand with 50 …

📊 7 results

📏 Metrics: mIOU

S.MID

SeMantic InDustry (S.MID) is a dataset designed to advance the field of LiDAR semantic segmentation, specifically for robotic applications and …

📊 4 results

📏 Metrics: val mIoU

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 3 results

📏 Metrics: mIOU, val mIoU

SemanticSTF

SemanticSTF is an adverse-weather point cloud dataset that provides dense point-level annotations and allows studying 3DSS under various adverse …

📊 2 results

📏 Metrics: Mean IoU

ULS labeled data

UAV Laser Scanning data collected over neotropical forest (Paracou, French Guiana). Four flights conducted over a one-hectare plot in 2021 …

📊 1 results

📏 Metrics: Binary Accuracy, G-mean, Specificity

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

VidChapters-7M

VidChapters-7M is a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online …

📊 1 results

📏 Metrics: [email protected], R@10s

Lidar Scene Completion

SemanticKITTI

SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark …

📊 4 results

📏 Metrics: Chamfer Distance, JSD 3D, JSD BEV, Voxel IoU 0.1m, Voxel IoU 0.2m, Voxel IoU 0.5m

Lifelike 3D Human Generation

THuman2.0 Dataset

THuman2.0 Dataset contains 500 high-quality human scans captured by a dense DSLR rig. For each scan, we provide the 3D …

📊 6 results

📏 Metrics: CLIP Similarity, SSIM, LPIPS, PSNR

Line Detection

NKL

NKL (short for NanKai Lines) is a dataset for semantic line detection. Semantic lines are meaningful line structures that outline …

📊 2 results

📏 Metrics: F_measure (EA)

SEL

The semantic line (SEL) dataset contains 1,750 outdoor images in total, which are split into 1,575 training and 175 testing …

📊 2 results

📏 Metrics: AUC_F, HIoU

Lip to Speech Synthesis

LRW

The Lip Reading in the Wild (LRW) dataset is a large-scale audio-visual database that contains 500 different words from over 1,000 …

📊 1 results

📏 Metrics: ESTOI, PESQ, STOI

Lipreading

CAS-VSR-W1k (LRW-1000)

LRW-1000 has been renamed CAS-VSR-W1k. It is a naturally-distributed large-scale benchmark for word-level lipreading in the wild, including 1000 …

📊 9 results

📏 Metrics: Top-1 Accuracy

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 18 results

📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

📊 20 results

📏 Metrics: Word Error Rate (WER)

Long Video Retrieval (Background Removed)

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 6 results

📏 Metrics: Cap. Avg. R@1, Cap. Avg. R@5, Cap. Avg. R@10, DTW R@1, DTW R@5, DTW R@10, OTAM R@1, OTAM R@5, OTAM R@10

Lung Nodule Classification

LIDC-IDRI

The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1,010 lung …

This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and …

📊 10 results

📏 Metrics: Dice, mIoU, Recall, Precision, AHD95, ASD

ACDC

The goal of the Automated Cardiac Diagnosis Challenge (ACDC) is to: - compare the performance of automatic methods on …

📊 6 results

📏 Metrics: Dice Score

AMOS

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the …

📊 1 results

📏 Metrics: Average Dice

BKAI-IGH NeoPolyp-Small

This dataset contains 1200 images (1000 WLI images and 200 FICE images) with fine-grained segmentation annotations. The training set consists …

📊 9 results

📏 Metrics: Average Dice, mIoU, Average Dice (5-folds), MAE (5-folds), mIoU (5-folds)

Brain US

This brain anatomy segmentation dataset has 1300 2D US scans for training and 329 for testing. A total of 1629 …

📊 3 results

📏 Metrics: F1, IoU

CHASE_DB1

CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels …

📊 3 results

📏 Metrics: DSC

CVC-ClinicDB

CVC-ClinicDB is an open-access dataset of 612 images with a resolution of 384×288 from 31 colonoscopy sequences. It is used for …

📊 40 results

📏 Metrics: mean Dice, Average MAE, S-Measure, mIoU, max E-Measure, F-measure

Cell

The CELL benchmark is made of fluorescence microscopy images of cells. Source: Multi-Domain Adversarial Learning Image Source: https://arxiv.org/pdf/1903.09239v1.pdf

📊 1 results

📏 Metrics: IoU

DRIVE

The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a …

📊 4 results

📏 Metrics: mIoU, F1 score, Recall, Specificity, Precision

Electron Microscopy Dataset

The dataset available for download on this webpage represents a 5x5x5µm section taken from the CA1 hippocampus region of the …

📊 1 results

📏 Metrics: AHD95, ASD, Dice, IoU

Endotect Polyp Segmentation Challenge Dataset

A challenge that consists of three tasks, each targeting a different requirement for in-clinic use. The first task involves classifying …

📊 2 results

📏 Metrics: DSC, mIoU, FPS

Extended Task10_Colon Medical Decathlon

A dataset of abdominal CT studies in NIfTI format from the open-source medical data repository Medical Decathlon was utilized. To …

📊 1 results

📏 Metrics: Average Dice

GlaS

The dataset used in this challenge consists of 165 images derived from 16 H&E stained histological sections of stage T3 …

📊 9 results

📏 Metrics: F1, IoU, Dice

Kvasir-Instrument

Consists of annotated frames containing GI procedure tools such as snares, balloons and biopsy forceps. Besides the images, …

📊 2 results

📏 Metrics: DSC, Dice Score, Intersection over Union

Kvasir-SEG

Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and …

📊 51 results

📏 Metrics: mean Dice, Average MAE, S-Measure, max E-Measure, mIoU, FPS, F-measure, Precision, Recall

KvasirCapsule-SEG

A video capsule endoscopy dataset for polyp segmentation. The dataset can be downloaded from here: https://www.kaggle.com/debeshjha1/kvasircapsuleseg https://www.dropbox.com/home/KvasirCapsule-SEG …

📊 2 results

📏 Metrics: DSC, mIoU

MICCAI 2015 Head and Neck Challenge

This database is provided and maintained by Dr. Gregory C Sharp (Harvard Medical School – MGH, Boston) and his group. …

📊 1 results

📏 Metrics: Dice

MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge

Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans were randomly selected from a combination of an ongoing …

📊 6 results

📏 Metrics: Avg DSC, Avg HD

Medical Segmentation Decathlon

The Medical Segmentation Decathlon is a collection of medical image segmentation datasets. It contains a total of 2,633 three-dimensional images …

📊 5 results

📏 Metrics: Dice (Average), NSD

Medico automatic polyp segmentation challenge (dataset)

The “Medico automatic polyp segmentation challenge” aims to develop computer-aided diagnosis systems for automatic polyp segmentation to detect all types …

📊 2 results

📏 Metrics: DSC, mIoU, Recall, Precision, FPS

MoNuSAC

Different types of cells play a vital role in the initiation, development, invasion, metastasis and therapeutic response of tumors of …

📊 1 results

📏 Metrics: Dice, IoU

MoNuSeg

The dataset for this challenge was obtained by carefully annotating tissue images of several patients with tumors of different organs …

📊 14 results

📏 Metrics: F1, IoU, AHD95, ASD, mIoU

MosMedData

MosMedData contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A …

📊 1 results

📏 Metrics: Average Dice

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 3 results

📏 Metrics: Dice, Jaccard Index

ROBUST-MIS

The ROBUST-MIS dataset was made available to support the Robust Medical Instrument Segmentation (ROBUST-MIS) Challenge 2019, part of the Endoscopic …

📊 3 results

📏 Metrics: DSC, mIoU, FPS

TNBC

Involves a large number of annotated cells, including normal epithelial and myoepithelial breast cells (localized in ducts and lobules), …

📊 1 results

📏 Metrics: AHD95, Dice, IoU

Meme Classification

Hateful Memes

The Hateful Memes data set is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new …

📊 17 results

📏 Metrics: ROC-AUC, Accuracy

MultiOFF

Introduced in "Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text".

📊 4 results

📏 Metrics: Accuracy, F1

Tamil Memes

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among …

📊 2 results

📏 Metrics: Micro-F1

Meter Reading

Copel-AMR

This dataset contains 12,500 meter images acquired in the field by the employees of the Energy Company of Paraná (Copel), …

📊 2 results

📏 Metrics: Rank-1 Recognition Rate

UFPR-ADMR-v1

This dataset contains 2,000 dial meter images obtained on-site by employees of the Energy Company of Paraná (Copel), which serves …

📊 11 results

📏 Metrics: Rank-1 Recognition Rate

UFPR-AMR

This dataset contains 2,000 images taken from inside a warehouse of the Energy Company of Paraná (Copel), which directly serves …

iMiGUE

iMiGUE is a dataset for emotional artificial intelligence research: identity-free video dataset for Micro-Gesture Understanding and Emotion analysis (iMiGUE). Different …

📊 1 results

📏 Metrics: Top 1 Accuracy, Top 5 Accuracy

Moment Retrieval

Charades-STA

Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations. Source: [TALL: Temporal Activity Localization …

📊 25 results

📏 Metrics: R@1 IoU=0.5, R@1 IoU=0.7, R@5 IoU=0.5, R@5 IoU=0.7, R@1 IoU=0.3, mIoU

QVHighlights

The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language …

📊 31 results

📏 Metrics: mAP, R@1 IoU=0.5, R@1 IoU=0.7, [email protected], [email protected]

Motion Captioning

HumanML3D

HumanML3D is a 3D human motion-language dataset that originates from a combination of the HumanAct12 and AMASS datasets. It covers a …

📊 4 results

📏 Metrics: BLEU-4, BERTScore

KIT Motion-Language

The KIT Motion-Language is a dataset linking human motion and natural language. Source: The KIT Motion-Language Dataset

📊 3 results

📏 Metrics: BLEU-4, BERTScore

Motion Detection

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 2 results

📏 Metrics: F1 (%)

Motion Segmentation

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 5 results

📏 Metrics: Accuracy

Hopkins155

The Hopkins 155 dataset consists of 156 video sequences of two or three motions. Each video sequence motion corresponds to …

📊 4 results

📏 Metrics: Classification Error

KT3DMoSeg

Please find more details of this dataset at https://alex-xun-xu.github.io/ProjectPage/CVPR_18/index.html . 3D motion segmentation has been the key problem in computer vision …

📊 1 results

📏 Metrics: Error

Motion Synthesis

AIOZ-GDANCE

AIOZ-GDANCE comprises 16.7 hours of whole-body motion and music audio of group dancing. The duration of each video in our …

📊 4 results

📏 Metrics: FID, MMC, GenDiv, PFC, GMR, GMC, TIF

AIST++

AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance …

📊 12 results

📏 Metrics: FID, Beat alignment score

BRACE

BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task: - strong music-dance correlation - …

📊 3 results

📏 Metrics: Frechet Inception Distance, Beat alignment score, Beat DTW cost, Footwork average, Powermove average, Toprock average

FineDance

📊 7 results

📏 Metrics: fid_k, BAS

HumanAct12

HumanAct12 is a new 3D human motion dataset adopted from the polar image and 3D pose dataset PHSPD, with proper …

📊 2 results

📏 Metrics: Accuracy, FID, Multimodality

HumanML3D

HumanML3D is a 3D human motion-language dataset that originates from a combination of the HumanAct12 and AMASS datasets. It covers a …

📊 35 results

📏 Metrics: FID, R Precision Top3, Diversity, Multimodality

Inter-X

Inter-X is a large-scale dataset containing ~11K interaction sequences, more than 8.1M frames and 34K fine-grained human textual descriptions.

📊 5 results

📏 Metrics: FID, R-Precision Top3, MMDist, MModality

InterHuman

InterHuman is a multimodal dataset. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal …

📊 8 results

📏 Metrics: FID, R-Precision Top3, MMDist, MModality

KIT Motion-Language

The KIT Motion-Language is a dataset linking human motion and natural language. Source: The KIT Motion-Language Dataset

📊 29 results

📏 Metrics: FID, R Precision Top3, Diversity, Multimodality

LaFAN1

Ubisoft La Forge Animation Dataset ("LAFAN1"): the dataset and accompanying code for the SIGGRAPH 2020 paper …

📊 4 results

📏 Metrics: L2Q@5, L2Q@15, L2Q@30, L2P@5, L2P@15, L2P@30, NPSS@5, NPSS@15, NPSS@30

Motion-X

Motion-X is a large-scale 3D expressive whole-body motion dataset, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering …

📊 4 results

📏 Metrics: FID, TMR-R-Precision Top3, TMR-Matching Score, MModality, Diversity

TMD

The Text-Music-Dance (TMD) dataset establishes a pioneering benchmark comprising 2,153 text-music-motion pairs. Dance motions and corresponding text annotations are sourced …

📊 1 results

📏 Metrics: FID, BAS, MModality, MMDist

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

BigEarthNet

BigEarthNet consists of 590,326 Sentinel-2 image patches, each of which is a section of i) 120x120 pixels for 10m bands; …

📊 10 results

📏 Metrics: mAP (micro), mAP (macro), FScore, official split

VizWiz-Classification

Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform …

SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles …

📊 3 results

📏 Metrics: MOTA

SportsMOT

Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate objects (e.g., pedestrians and vehicles) …

📊 22 results

📏 Metrics: HOTA, IDF1, AssA, MOTA, DetA

Synthehicle

Synthehicle is a massive CARLA-based synthetic multi-vehicle multi-camera tracking dataset and includes ground truth for 2D detection and tracking, 3D …

📊 1 results

📏 Metrics: MOTA

TAO

TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are …

📊 9 results

📏 Metrics: TETA, LocA, AssocA, ClsA, Track mAP

UAVDT

UAVDT is a large-scale, challenging UAV Detection and Tracking benchmark (i.e., about 80,000 representative frames from 10 hours …

📊 2 results

📏 Metrics: IDF1, MOTA

Wildtrack

Wildtrack is a large-scale and high-resolution dataset. It has been captured with seven static cameras in a public open area, …

📊 9 results

📏 Metrics: IDF1, MOTA

Multi-Object Tracking and Segmentation

KITTI MOTS

The Multi-Object Tracking and Segmentation (MOTS) benchmark [2] consists of 21 training sequences and 29 test sequences. It is based on …

📊 1 results

📏 Metrics: AssA, DetA, HOTA

Multi-Person Pose Estimation

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 15 results

📏 Metrics: AP, Test AP, Validation AP

COCO-WholeBody

COCO-WholeBody is an extension of the COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 2 results

📏 Metrics: keypoint AP

CrowdPose

The CrowdPose dataset contains about 20,000 images and a total of 80,000 human poses with 14 labeled keypoints. The test …

📊 28 results

📏 Metrics: mAP @0.5:0.95, AP Easy, AP Medium, AP Hard, FPS

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses and instance masks. This dataset contains …

📊 8 results

📏 Metrics: AP50, AP75, Validation AP

Multi-View 3D Reconstruction

ETH3D

ETH3D is a multi-view stereo benchmark / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground …

📊 4 results

📏 Metrics: F1 score

Multi-class Classification

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results

📏 Metrics: F1-Score

Multi-target Domain Adaptation

DomainNet

DomainNet is a dataset of common objects in six different domains. All domains include 345 categories (classes) of objects such …

📊 4 results

📏 Metrics: Accuracy

OBJ-MDA

The dataset contains images of 16 artworks included in the cultural site “Galleria Regionale di Palazzo Bellomo”. The collection covers …

📊 1 results

📏 Metrics: [email protected]

Office-31

The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …

📊 5 results

📏 Metrics: Accuracy

Office-Home

Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …

📊 4 results

📏 Metrics: Accuracy

Multimodal Emotion Recognition

IEMOCAP

The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for …

📊 2 results

📏 Metrics: Weighted F1, Accuracy

MELD

The Multimodal EmotionLines Dataset (MELD) has been created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available …

📊 3 results

📏 Metrics: Weighted F1, Accuracy

Multimodal Machine Translation

Multi30K

Multi30K is a large-scale multilingual multimodal dataset for interdisciplinary machine learning research. It extends the Flickr30K dataset with German translations …

📊 14 results

📏 Metrics: BLEU (EN-DE), BLEU (DE-EN), Meteor (EN-DE), Meteor (EN-FR)

Multimodal Reasoning

AlgoPuzzleVQA

We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new …

📊 1 results

📏 Metrics: Acc

MATH-V

Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math …

📊 4 results

📏 Metrics: Accuracy

REBUS

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data …

📊 8 results

📏 Metrics: Accuracy

Multiple Object Tracking

GMOT-40

GMOT-40 is the first public dense dataset for Generic Multiple Object Tracking (GMOT). It contains 40 carefully annotated sequences evenly …

📊 1 results

📏 Metrics: HOTA, IDF1, MOTA

RADIATE

RADIATE (RAdar Dataset In Adverse weaThEr) is a new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera …

📊 2 results

📏 Metrics: MOTA

SportsMOT

Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate objects (e.g., pedestrians and vehicles) …

📊 19 results

📏 Metrics: HOTA, IDF1, AssA, MOTA, DetA

UA-DETRAC

Consists of 100 challenging video sequences captured from real-world traffic scenes (over 140,000 frames with rich annotations, including occlusion, weather, …

📊 1 results

📏 Metrics: MOTA

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver …

📊 2 results

📏 Metrics: Category, MOTA, mAP

Multispectral Object Detection

KAIST Multispectral Pedestrian Detection Benchmark

The KAIST Multispectral Pedestrian Dataset is captured with imaging hardware consisting of a color camera, a thermal camera …

📊 13 results

📏 Metrics: All Miss Rate, Reasonable Miss Rate

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30976 images (15488 pairs); 24 dark scenes, 2 daytime scenes; …

SVHN

Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …

📊 1 results

📏 Metrics: Clustering Accuracy

Novel View Synthesis

ACID

ACID consists of thousands of aerial drone videos of different coastline and nature scenes on YouTube. Structure-from-motion is used to …

📊 3 results

📏 Metrics: FID, NLL, PSIM, PSNR, SSIM

BLEFF

Synthetic (Blender) dataset of forward-facing scenes, used to evaluate NVS quality and camera parameter accuracy.

📊 3 results

📏 Metrics: PSNR/SSIM

DONeRF: Evaluation Dataset

This is the dataset for the CGF 2021 paper "DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth …

📊 6 results

📏 Metrics: PSNR

Deep Blending

The Deep Blending Dataset comprises 19 diverse scenes, offering comprehensive resources for free-viewpoint image-based rendering (IBR). Each scene includes input …

📊 1 results

📏 Metrics: LPIPS, PSNR, SSIM, Size (MB)

HDR-GS

This is a dataset for high dynamic range novel view synthesis. It is collected by HDR-NeRF and recalibrated by HDR-GS for …

📊 2 results

📏 Metrics: Average PSNR, SSIM, LPIPS

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results

📏 Metrics: Average PSNR

LLFF

Local Light Field Fusion (LLFF) is a practical and robust deep learning solution for capturing and rendering novel views of …

📊 15 results

📏 Metrics: PSNR, LPIPS, SSIM

Mip-NeRF 360

Mip-NeRF 360 is an extension to the Mip-NeRF that uses a non-linear parameterization, online distillation, and a novel distortion-based regularize …

📊 12 results

📏 Metrics: LPIPS, PSNR, SSIM, Size (MB)

NeRF

Neural Radiance Fields (NeRF) is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric …

📊 12 results

📏 Metrics: PSNR, SSIM, LPIPS, Average PSNR, Size (MB)

PhotoShape

The PhotoShape dataset consists of photorealistic, relightable, 3D shapes produced by the method proposed in the work of [Park et …

📊 1 results

📏 Metrics: LPIPS, PSNR

RTMV

RTMV is a large-scale synthetic dataset for novel view synthesis consisting of ∼300k images rendered from nearly 2000 complex scenes …

📊 4 results

📏 Metrics: PSNR, SSIM

RealEstate10K

RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered …

📊 2 results

📏 Metrics: FID, NLL, PSIM, PSNR, SSIM

RefRef

RefRef is a synthetic dataset and benchmark designed for the task of reconstructing scenes with complex refractive and reflective objects. …

📊 5 results

📏 Metrics: Average PSNR (dB)

SWORD

The new dataset contains around 1,500 train videos and 290 test videos, with 50 frames per video on average. The …

📊 3 results

📏 Metrics: LPIPS, PSNR, SSIM

ScanNet++

ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, …

📊 4 results

📏 Metrics: PSNR, SSIM, LPIPS

Tanks and Temples

We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth …

📊 10 results

📏 Metrics: PSNR, SSIM, LPIPS, Size (MB)

X3D

X3D is a dataset containing 15 scenes and covering 4 applications for X-ray 3D reconstruction. More specifically, the X3D dataset …

📊 5 results

📏 Metrics: PSNR, SSIM

iFF

Real-world dataset of forward-facing scenes with different camera intrinsic parameters.

📊 2 results

📏 Metrics: Average PSNR, SSIM, Focal Error

Object Categorization

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 2 results

📏 Metrics: mAP

LDD

The Instance Segmentation task, an extension of the well-known Object Detection task, is of great help in many areas, such …

📊 1 results

📏 Metrics: box mAP

LLVIP

Visible-infrared Paired Dataset for Low-light Vision * 30976 images (15488 pairs) * 24 dark scenes, 2 daytime scenes * …

📊 2 results

📏 Metrics: AP(l), AP(m), AP(s), AP50, AP75, AP85, AR, AR(l), AR(m), AR(s), MAP

Object Localization

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 3 results

📏 Metrics: Localization (ablation), Localization (test)

IllusionVQA

IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in …

📊 9 results

📏 Metrics: Accuracy

Mall

The Mall is a dataset for crowd counting and profiling research. Its images are collected from a publicly accessible webcam. It …

📊 1 results

📏 Metrics: Precision

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

The 'shape bias' dataset was introduced in Geirhos et al. (ICLR 2019) and consists of 224x224 images with conflicting texture …

📊 18 results

📏 Metrics: shape bias

Object Segmentation

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

VisEvent (Visible-Event benchmark) is a dataset constructed for the evaluation of tracking by combining visible and event cameras. VisEvent is …

📊 1 results

📏 Metrics: Precision Plot

Occluded 3D Object Symmetry Detection

YCB-Video

The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects …

📊 1 results

📏 Metrics: PR AUC

One-Shot Segmentation

Cluttered Omniglot

Dataset for one-shot segmentation. Source: One-Shot Segmentation in Clutter

📊 2 results

📏 Metrics: IoU [32 distractors], IoU [4 distractors], IoU [256 distractors]

Open Vocabulary Action Detection

JHMDB

JHMDB is an action recognition dataset that consists of 960 video sequences belonging to 21 actions. It is a subset …

📊 1 results

📏 Metrics: val mAP

MultiSports

Spatio-temporal action detection is an important and challenging problem in video understanding. The existing action detection benchmarks are limited in …

📊 1 results

📏 Metrics: val mAP

UCF101-24

📊 1 results

📏 Metrics: val mAP

Open Vocabulary Object Detection

MSCOCO

📊 32 results

📏 Metrics: AP 0.5

Objects365

Objects365 is a large-scale object detection dataset with 365 object categories and over 600K training images. More than 10 …

📊 2 results

📏 Metrics: mask AP50

Open Vocabulary Panoptic Segmentation

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 10 results

📏 Metrics: PQ

Open Vocabulary Semantic Segmentation

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 5 results

📏 Metrics: mIoU

ISPRS Potsdam

The data set contains 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a …

📊 1 results

📏 Metrics: mIoU

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 2 results

📏 Metrics: mIoU

Optical Character Recognition (OCR)

FSNS - Test

French Street Name Signs (FSNS) is a dataset of images of French street name signs for text recognition.

📊 3 results

📏 Metrics: Sequence error

I2L-140K

Introduced by Singh, Sumeet S.. “Teaching Machines to Code: Neural Markup Generation with Visual Attention.” ArXiv abs/1802.05415 (2018): n. pag. …

📊 2 results

📏 Metrics: BLEU

VideoDB's OCR Benchmark Public Collection

This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It …

📊 5 results

📏 Metrics: Average Accuracy, Character Error Rate (CER), Word Error Rate (WER)

im2latex-100k

A prebuilt dataset for OpenAI's image-to-LaTeX task. It includes a total of ~100k formulas and images split into train, validation …

📊 1 results

📏 Metrics: BLEU

Optical Flow Estimation

Spring

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes …

📊 10 results

📏 Metrics: 1px total

Out-of-Distribution Detection

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 2 results

📏 Metrics: AUROC, FPR95

ADE-OoD

ADE-OoD is a public benchmark for dense out-of-distribution detection in general natural images. It measures the ability to detect and …

📊 4 results

📏 Metrics: AP, FPR@95

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 9 results

📏 Metrics: AUROC, FPR95

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 4 results

📏 Metrics: FPR95, AUROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 2 results

📏 Metrics: AUROC

ImageNet-1k vs NINCO

The NINCO (No ImageNet Class Objects) dataset is introduced in the ICML 2023 paper In or Out? Fixing ImageNet Out-of-Distribution …

📊 5 results

📏 Metrics: AUROC, FPR@95, Latency, ms

ImageNet-1k vs OpenImage-O

OpenImage-O is built for the ID dataset ImageNet-1k. It is manually annotated, comes with a naturally diverse distribution, and has …

📊 6 results

📏 Metrics: AUROC, FPR95, Latency, ms

ImageNet-1k vs Places

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while Places is out-of-distribution.

📊 22 results

📏 Metrics: FPR95, AUROC

ImageNet-1k vs SUN

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while SUN is out-of-distribution.

📊 19 results

📏 Metrics: FPR95, AUROC

ImageNet-1k vs Textures

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while Textures is out-of-distribution.

📊 30 results

📏 Metrics: AUROC, FPR95, Latency, ms

ImageNet-1k vs iNaturalist

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while iNaturalist is out-of-distribution.

📊 24 results

📏 Metrics: AUROC, FPR95, Latency, ms

SST

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 1 results

📏 Metrics: AUROC, FPR95

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

Action-Camera Parking

The Action-Camera Parking Dataset contains 293 images captured at a roughly 10-meter height using a GoPro Hero 6 camera. It …

📊 7 results

📏 Metrics: F1-score, F1

PKLot

The PKLot dataset contains 12,417 images of parking lots and 695,899 images of parking spaces segmented from them, which were …

📊 2 results

📏 Metrics: Average-mAP, F1-score

SPKL

The SPKL dataset contains 1203 images of parking lots divided into 11 categories regarding vision conditions (including the 'winter' category …

📊 7 results

📏 Metrics: F1-score

Part-aware Panoptic Segmentation

Cityscapes Panoptic Parts

The Cityscapes Panoptic Parts dataset introduces part-aware panoptic segmentation annotations for the Cityscapes dataset. It extends the original panoptic annotations …

📊 4 results

📏 Metrics: PartPQ

Pascal Panoptic Parts

The Pascal Panoptic Parts dataset consists of annotations for the part-aware panoptic segmentation task on the PASCAL VOC 2010 dataset. …

📊 4 results

📏 Metrics: PartPQ

Partial Point Cloud Matching

4DMatch

A benchmark for matching and registration of partial point clouds with time-varying geometry. It is constructed using randomly selected 1761 …

📊 9 results

📏 Metrics: NFMR, IR

Pedestrian Attribute Recognition

PA-100K

PA-100K is a recently proposed large pedestrian attribute dataset, with 100,000 images in total collected from outdoor surveillance cameras. It is …

📊 10 results

📏 Metrics: Accuracy, F1 score

PETA

The PEdesTrian Attribute dataset (PETA) is a dataset for recognizing pedestrian attributes, such as gender and clothing style, at a …

📊 5 results

📏 Metrics: Accuracy

RAP

The Richly Annotated Pedestrian (RAP) dataset is a dataset for pedestrian attribute recognition. It contains 41,585 images collected from indoor …

📊 2 results

📏 Metrics: Accuracy

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects …

📊 2 results

📏 Metrics: Backpack, Gender, Hat, LCC, LCS, UCC, UCS

Pedestrian Detection

CityPersons

The CityPersons dataset is a subset of Cityscapes which only consists of person annotations. There are 2975 images for training, …

📊 18 results

📏 Metrics: Reasonable MR^-2, Heavy MR^-2, Partial MR^-2, Bare MR^-2, Small MR^-2, Medium MR^-2, Large MR^-2, Test Time

LLVIP

Visible-infrared Paired Dataset for Low-light Vision * 30976 images (15488 pairs) * 24 dark scenes, 2 daytime scenes * …

📊 9 results

📏 Metrics: AP, log average miss rate

MMPD-Dataset

MMPD Dataset is proposed in ECCV'2024 "When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset".

📊 1 results

📏 Metrics: box mAP

Period Estimation

OmniArt

Presents half a million samples and structured meta-data to encourage further research and societal engagement. Source: [OmniArt: Multi-task Deep Learning …

📊 2 results

📏 Metrics: Mean absolute error

Perpetual View Generation

LHQ

A dataset of 90,000 high-resolution nature landscape images, crawled from Unsplash and Flickr and preprocessed with Mask R-CNN and Inception …

📊 2 results

📏 Metrics: FID (first 20 steps), IS (first 20 steps), KID (first 20 steps), FID (full 100 steps), IS (full 100 steps), KID (full 100 steps)

Person Identification

EEG Motor Movement/Imagery Dataset

This data set consists of over 1500 one- and two-minute EEG recordings, obtained from 109 volunteers.

📊 3 results

📏 Metrics: Accuracy

WiGesture

The WiGesture dataset contains data related to gesture recognition and person identification in a meeting room scenario. The dataset provides …

📊 3 results

📏 Metrics: Accuracy (%)

Person Re-Identification

AG-ReID

Person re-ID matches persons across multiple non-overlapping cameras. Despite the increasing deployment of airborne platforms in surveillance, current existing person …

📊 2 results

📏 Metrics: Averaged rank-1 acc(%)

AG-ReID.v2

Aerial-ground person re-identification (Re-ID) presents unique challenges in computer vision, stemming from the distinct differences in viewpoints, poses, and resolutions …

📊 1 results

📏 Metrics: Average mAP

CCVID

Clothes-Changing Video person re-ID (CCVID) is a dataset constructed from the raw data of a gait recognition dataset, i.e. FVG. …

📊 3 results

📏 Metrics: Rank-1, mAP

CUHK-SYSU

The CUHK-SYSU dataset is a large scale benchmark for person search, containing 18,184 images and 8,432 identities. Different from previous …

📊 3 results

📏 Metrics: MAP, Rank-1

CUHK03

The CUHK03 consists of 14,097 images of 1,467 different identities, where 6 campus cameras were deployed for image collection and …

📊 19 results

📏 Metrics: MAP, Rank-1, Rank-5, Rank-10

CUHK03-C

CUHK03-C is an evaluation set that consists of algorithmically generated corruptions applied to the CUHK03 test-set. These corruptions consist of …

📊 8 results

📏 Metrics: Rank-1, mAP, mINP

ClonedPerson

The ClonedPerson dataset is a large-scale synthetic person re-identification dataset introduced in the paper "Cloning Outfits from Real-World Images to …

📊 1 results

📏 Metrics: mAP, Rank-1

DukeMTMC-VideoReID

The DukeMTMC-VideoReID (Duke Multi-Tracking Multi-Camera Video-based ReIDentification) dataset is a subset of the DukeMTMC for video-based person re-ID. The dataset …

📊 1 results

📏 Metrics: mAP

DukeMTMC-reID

The DukeMTMC-reID (Duke Multi-Tracking Multi-Camera ReIDentification) dataset is a subset of the DukeMTMC for image-based person re-ID. The dataset is …

📊 89 results

📏 Metrics: mAP, Rank-1, Rank-5, Rank-10

ENTIRe-ID

The growing importance of person re-identification in computer vision has highlighted the need for more extensive and diverse datasets. In …

📊 1 results

📏 Metrics: mAP

IUST_PersonReID

The IUST_PersonReID dataset was developed to address limitations in existing person re-identification datasets by including cultural and environmental contexts unique …

📊 2 results

📏 Metrics: Rank-1, Rank-5, Rank-10, mAP

LTCC

LTCC contains 17,119 person images of 152 identities, and each identity is captured by at least two cameras. The dataset …

📊 8 results

📏 Metrics: Rank-1, mAP

MARS

MARS (Motion Analysis and Re-identification Set) is a large scale video based person reidentification dataset, an extension of the Market-1501 …

📊 20 results

📏 Metrics: mAP, Rank-1, Rank-5, Rank-10, Rank-20

MSMT17

MSMT17 is a multi-scene multi-time person re-identification dataset. The dataset consists of 180 hours of videos, captured by 12 outdoor …

📊 43 results

📏 Metrics: mAP, Rank-1, Rank-10, Rank-5

MSMT17-C

MSMT17-C is an evaluation set that consists of algorithmically generated corruptions applied to the MSMT17 test-set. These corruptions consist of …

📊 5 results

📏 Metrics: Rank-1, mAP, mINP

Market-1501

Market-1501 is a large-scale public benchmark dataset for person re-identification. It contains 1501 identities which are captured by six different …

📊 125 results

📏 Metrics: Rank-1, mAP, Rank-5, mINP

Market-1501-C

Market-1501-C is an evaluation set that consists of algorithmically generated corruptions applied to the Market-1501 test-set. These corruptions consist of …

📊 22 results

📏 Metrics: Rank-1, mAP, mINP

Occluded REID

Occluded REID is an occluded person dataset captured by mobile cameras, consisting of 2,000 images of 200 occluded persons (see …

📊 5 results

📏 Metrics: mAP, Rank-1

Occluded-DukeMTMC

Occluded-DukeMTMC contains 15,618 training images, 17,661 gallery images, and 2,210 occluded query images. The experiment results on Occluded-DukeMTMC will demonstrate …

📊 25 results

📏 Metrics: Rank-1, mAP

Occluded-PoseTrack-ReID

We introduce Occluded PoseTrack-ReID (or simply Occ-PTrack), a new ReID dataset we built out of the annotation available with PoseTrack21, …

📊 1 results

📏 Metrics: MAP, Rank-1

P-DukeMTMC-reID

P-DukeMTMC-reID is a modified version of the DukeMTMC-reID dataset. There are 12,927 images (665 identities) in the training set, 2,163 images …

📊 2 results

📏 Metrics: mAP, Rank-1, Rank-5, Rank-10

PRCC

This dataset consists of 33698 images from 221 identities. Each person in Cameras A and B is wearing the same …

📊 10 results

📏 Metrics: mAP, Rank-1

PRID2011

PRID 2011 is a person reidentification dataset that provides multiple person trajectories recorded from two different static surveillance cameras, monitoring …

📊 10 results

📏 Metrics: Rank-1, Rank-20, Rank-5, Rank-10

Partial-REID

Partial REID is a specially designed partial person reidentification dataset that includes 600 images from 60 people, with 5 full-body …

📊 2 results

📏 Metrics: Rank-1

RegDB

RegDB is used for Visible-Infrared Re-ID which handles the cross-modality matching between the daytime visible and night-time infrared images. The …

📊 1 results

📏 Metrics: Rank-1

SYSU-30k

SYSU-30k contains 30k categories of persons, which is about 20 times larger than CUHK03 (1.3k categories) and Market1501 (1.5k categories), …

📊 10 results

📏 Metrics: Rank-1

SYSU-MM01

The SYSU-MM01 is a dataset collected for the Visible-Infrared Re-identification problem. The images in the dataset were obtained from 491 …

📊 1 results

📏 Metrics: rank1

SYSU-MM01-C

SYSU-MM01-C is an evaluation set that consists of algorithmically generated corruptions applied to the SYSU-MM01 test-set. These corruptions consist of …

📊 2 results

📏 Metrics: Rank-1 (All Search), mAP (All Search), mINP (All Search), Rank-1 (Indoor Search), mAP (Indoor Search), mINP (Indoor Search)

SenseReID

SenseReID is a person re-identification dataset for evaluating ReID models. It is captured from real surveillance cameras and the person …

📊 1 results

📏 Metrics: Top-1

SoccerNet-v2

A novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research …

📊 1 results

📏 Metrics: Rank-1, mAP

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects …

📊 4 results

📏 Metrics: Rank-1, Rank-5, mAP

VC-Clothes

Person re-identification (Reid) is now an active research topic for AI-based video surveillance applications such as specific person search, but …

📊 4 results

📏 Metrics: Rank-1, mAP

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor …

📊 5 results

📏 Metrics: Accuracy, LogLoss, ROC AUC

iLIDS-VID

The iLIDS-VID dataset is a person re-identification dataset which involves 300 different pedestrians observed across two disjoint camera views in …

📊 8 results

📏 Metrics: Rank-1, Rank-5, Rank-10, Rank-20

Person Search

CUHK-SYSU

The CUHK-SYSU dataset is a large scale benchmark for person search, containing 18,184 images and 8,432 identities. Different from previous …

📊 14 results

📏 Metrics: MAP, Top-1

PRW

PRW is a large-scale dataset for end-to-end pedestrian detection and person recognition in raw video frames. PRW is introduced to …

📊 13 results

📏 Metrics: mAP, Top-1

Personality Trait Recognition

Essays

J. W. Pennebaker and L. A. King, “Linguistic styles: Language use as an individual difference,” J. Pers. Soc. Psychol., vol. …

📊 2 results

📏 Metrics: Accuracy, F-Measure, Precision, Recall

SynthPAI

SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of …

📊 18 results

📏 Metrics: Average accuracy in %

Personalized Image Generation

DreamBooth

The DreamBooth dataset is a collection of images used for fine-tuning text-to-image diffusion models for subject-driven generation. Here are some …

📊 7 results

📏 Metrics: Overall (CP * PF), Concept Preservation (CP), Prompt Following (PF)

Personalized Segmentation

PerSeg

PerSeg is a dataset for personalized segmentation. The raw images are collected from the training data of subject driven diffusion …

📊 5 results

📏 Metrics: mIoU

Physical Attribute Prediction

Sound of Water 50

We collect a dataset of 805 clean videos that show the action of pouring water in a container. Our dataset …

📊 1 results

📏 Metrics: Mean Squared Error

Point Cloud Classification

PointCloud-C

PointCloud-C is the very first test-suite for point cloud robustness analysis under corruptions. - Two sets: ModelNet-C for point cloud …

📊 23 results

📏 Metrics: mean Corruption Error (mCE)

Point Cloud Completion

Completion3D

The Completion3D benchmark is a dataset for evaluating state-of-the-art 3D Object Point Cloud Completion methods. Given a partial 3D object …

📊 6 results

📏 Metrics: Chamfer Distance

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 9 results

📏 Metrics: Chamfer Distance, F-Score@1%, Earth Mover's Distance, Frechet Point cloud Distance, Chamfer Distance L2

ShapeNet-ViPC

A large-scale dataset for the point cloud completion task on the ShapeNet dataset.

📊 3 results

📏 Metrics: Chamfer Distance

Point Cloud Generation

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 1 results

📏 Metrics: CD, EMD, 1-NNA-CD, 1-NNA-EMD

Point Cloud Registration

3RScan

A novel dataset and benchmark, which features 1482 RGB-D scans of 478 environments across multiple time steps. Each scene includes …

📊 2 results

📏 Metrics: CD, RRE, RTE

FPv1

FPv1 (prior name FAUST-partial) is a 3D registration benchmark dataset created to address the lack of data variability in the …

📊 7 results

📏 Metrics: Recall (3cm, 10 degrees), RRE (degrees), RTE (cm)

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 5 results

📏 Metrics: Success Rate

Point Cloud Segmentation

PointCloud-C

PointCloud-C is the very first test-suite for point cloud robustness analysis under corruptions. - Two sets: ModelNet-C for point cloud …

📊 11 results

📏 Metrics: mean Corruption Error (mCE)

Point Clouds

DTU

DTU MVS 2014 is a multi-view stereo dataset, which is an order of magnitude larger in number of scenes and …

📊 1 results

📏 Metrics: Overall

Tanks and Temples

We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth …

📊 17 results

📏 Metrics: Mean F1 (Advanced), Mean F1 (Intermediate)

Point Tracking

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos …

📊 1 results

📏 Metrics: Average Jaccard

PointOdyssey

PointOdyssey is a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. …

📊 2 results

📏 Metrics: Survival, δ, MTE

TAP-Vid

TAP-Vid is a benchmark which contains both real-world videos with accurate human annotations of point tracks, and synthetic videos with …

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

📊 1 results

📏 Metrics: Hit@1, Hit@10

MPI-INF-3DHP

MPI-INF-3DHP is a 3D human body pose estimation dataset consisting of both constrained indoor and complex outdoor scenes. It records …

📊 1 results

📏 Metrics: Hit@1, Hit@10

Precipitation Forecasting

SEVIR

SEVIR is an annotated, curated and spatio-temporally aligned dataset containing over 10,000 weather events that each consist of 384 km …

📊 1 results

📏 Metrics: CSI-pool16, CSI-pool4

Prediction Of Occupancy Grid Maps

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in …

📊 1 results

📏 Metrics: mIoU

Procedure Step Recognition

IndustReal

IndustReal is an ego-centric, multi-modal dataset where 27 participants are challenged to perform assembly and maintenance procedures on a construction-toy …

📏 Metrics: S-measure, F-measure, MAE, mBA

DUT-OMRON

The DUT-OMRON dataset is used for evaluation of Salient Object Detection task and it contains 5,168 high quality images. The …

📊 17 results

📏 Metrics: S-Measure, F-measure, mean E-Measure, MAE, mean F-Measure, Weighted F-Measure

ECSSD

The Extended Complex Scene Saliency Dataset (ECSSD) is comprised of complex scenes, presenting textures and structures common to real-world images. …

📊 13 results

📏 Metrics: S-Measure, F-measure, MAE, mean F-Measure, mean E-Measure, F-Score, Weighted F-Measure

HKU-IS

HKU-IS is a visual saliency prediction dataset which contains 4447 challenging images, most of which have either low contrast or …

📊 13 results

📏 Metrics: S-Measure, F-measure, MAE, mean F-Measure, mean E-Measure, Weighted F-Measure, F-Score

HRSOD

There exist several datasets for saliency detection, but none of them is specifically designed for high-resolution salient object detection. High-Resolution …

📊 14 results

📏 Metrics: S-Measure, max F-Measure, MAE, mBA

ISTD

The Image Shadow Triplets dataset (ISTD) is a dataset for shadow understanding that contains 1870 image triplets of shadow image, …

📊 4 results

📏 Metrics: Balanced Error Rate

PASCAL-S

PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation …

📊 11 results

📏 Metrics: S-Measure, F-measure, MAE, mean F-Measure, mean E-Measure, F-Score, Weighted F-Measure

SBU / SBU-Refine

SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities that are recorded using the Microsoft …

📊 4 results

📏 Metrics: Balanced Error Rate

SOC

SOC (Salient Objects in Clutter) is a dataset for Salient Object Detection (SOD). It includes images with salient and non-salient …

📊 3 results

📏 Metrics: Average MAE, S-Measure, mean E-Measure

SOD

Aims to detect small obstacles, similar to Lost and Found. Contains 3000+ frames; 3000+ are claimed to be labelled, of which 1600 are actually labelled.

📊 1 results

📏 Metrics: MAE, F-measure

UHRSD

Recent salient object detection (SOD) methods based on deep neural networks have achieved remarkable performance. However, most existing SOD …

📊 12 results

📏 Metrics: S-Measure, max F-Measure, MAE, mBA

Rain Removal

Nightrain

Synthetically Generated Night-time Weather Degraded Database

📊 4 results

📏 Metrics: PSNR

📏 Metrics: PSNR, R-FID

Referring Expression

SQA3D

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from an …

📊 1 results

📏 Metrics: Acc@0.5m, Acc@1.0m, Acc@15°, Acc@30°

Referring Expression Segmentation

A2D Sentences

The Actor-Action Dataset (A2D) by Xu et al. [29] serves as the largest video dataset for the general actor and …

📊 20 results

📏 Metrics: AP, IoU overall, IoU mean, [email protected], [email protected], [email protected], [email protected], [email protected]

CLEVR-Ref+

CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily …

📊 1 results

📏 Metrics: IoU

PhraseCut

PhraseCut is a dataset consisting of 77,262 images and 345,486 phrase-region pairs. The dataset is collected on top of the …

📊 6 results

📏 Metrics: Mean IoU, [email protected], [email protected], [email protected]

RefCOCO

The RefCOCO dataset is a referring expression generation (REG) dataset used for tasks related to understanding natural language expressions that …

📊 4 results

📏 Metrics: IoU, IoU (%)

Refer-YouTube-VOS

There exist previous works [6, 10] that constructed referring segmentation datasets for videos. Gavrilyuk et al. [6] extended the A2D …

📊 2 results

📏 Metrics: Mean IoU, [email protected], [email protected]

Referring Expressions for DAVIS 2016 & 2017

Our task is to localize and provide a pixel-level mask of an object on all video frames given a language …

📊 1 results

📏 Metrics: F, J, J&F 1st frame

Referring expression generation

ColonINST-v1 (Seen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy …

📊 17 results

📏 Metrics: Accuracy

ColonINST-v1 (Unseen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy …

📊 17 results

📏 Metrics: Accuracy

Reinforcement Learning (RL)

ProcGen

Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns …

📊 2 results

📏 Metrics: Mean Normalized Performance

Repetitive Action Counting

Countix

Countix is a real-world dataset of repetition videos collected in the wild (i.e. YouTube) covering a wide range of semantic …

📊 3 results

📏 Metrics: OBO, MAE, OBZ, RMSE

RepCount

Repetitive actions are widely seen in human activities such as physical exercise. Existing methods focus on performing repetitive action …

📊 6 results

📏 Metrics: OBO, MAE, OBZ, RMSE

UCFRep

The UCFRep dataset contains 526 annotated repetitive action videos. This dataset is built from the action recognition dataset UCF101. Source: …

📊 2 results

📏 Metrics: MAE, OBO, OBZ, RMSE

Representation Learning

Animals-10

It contains about 28K medium quality animal images belonging to 10 categories: dog, cat, horse, spyder, butterfly, chicken, sheep, cow, …

📊 1 results

📏 Metrics: 1:1 Accuracy

SciDocs

SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks. Source: Allen Institute for AI

📊 7 results

📏 Metrics: Avg.

Sports10

Games dataset containing 100,000 Gameplay Images of 175 Video Games across 10 Sports Genres - AMERICAN FOOTBALL, BASKETBALL, BIKE …

📊 1 results

📏 Metrics: Silhouette Score

Retinal Vessel Segmentation

CHASE_DB1

CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels …

📊 15 results

📏 Metrics: AUC, F1 score, mIOU, Sensitivity, MCC, 1:1 Accuracy, Acc, Average IOU, DSC

DRIVE

The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a …

📊 19 results

📏 Metrics: AUC, F1 score, Accuracy, mIoU, sensitivity, Specificity, MCC, 1:1 Accuracy, Average IOU, DSC

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 4 results

📏 Metrics: AUC, F1 score, MCC, mIoU, 1:1 Accuracy, Acc, Average IOU, DSC, Sensitivity

INSPIRE-AVR (LUNet subset)

This dataset contains 65 DFIs acquired from patients with POAG at the University of Iowa Hospitals and Clinics. DFIs were …

📊 1 results

📏 Metrics: Average Dice

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 9 results

📏 Metrics: AUC, F1 score, mIOU, Sensitivity, Acc, MCC, 1:1 Accuracy, Average IOU, DSC

UZLF

The Leuven-Haifa dataset contains 240 disc-centered fundus images of 224 unique patients (75 patients with normal tension glaucoma, 63 patients …

The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out …

📊 1 results

📏 Metrics: COMP@

Road Segmentation

ChesapeakeRSC

A novel remote sensing dataset for evaluating a geospatial machine learning model's ability to learn long range dependencies and spatial …

📊 4 results

📏 Metrics: DWR

DeepGlobe

We observe that satellite imagery is a powerful source of information as it contains more structured and uniform data, compared …

📊 2 results

📏 Metrics: APLS, IoU, mIoU

Massachusetts Roads Dataset

The datasets introduced in Chapter 6 of my PhD thesis are below. See the thesis for more details. If you …

📊 2 results

📏 Metrics: IoU, F1, APLS

Room Layout Estimation

SUN RGB-D

The SUN RGBD dataset contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and …

PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation …

📊 1 results

📏 Metrics: MAE

Saliency Prediction

CAT2000

Includes 4000 images; 200 from each of 20 categories covering different types of scenes such as Cartoons, Art, Objects, Low …

📊 1 results

📏 Metrics: KL

SALICON

The SALIency in CONtext (SALICON) dataset contains 10,000 training images, 5,000 validation images and 5,000 test images for saliency prediction. …

📊 5 results

📏 Metrics: AUC, CC, KLD, NSS, SIM, sAUC, IG

Salient Object Detection

DUT-OMRON

The DUT-OMRON dataset is used to evaluate salient object detection and contains 5,168 high-quality images. The …

📊 8 results

📏 Metrics: S-measure, E-measure, MAE, max_F1

ECSSD

The Extended Complex Scene Saliency Dataset (ECSSD) is comprised of complex scenes, presenting textures and structures common to real-world images. …

📊 10 results

📏 Metrics: S-measure, E-measure, MAE, max_F1

HKU-IS

HKU-IS is a visual saliency prediction dataset which contains 4447 challenging images, most of which have either low contrast or …

📊 9 results

📏 Metrics: S-measure, E-measure, MAE, max_F1

PASCAL-S

PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation …

📊 10 results

📏 Metrics: S-measure, E-measure, MAE, max_F1

SOD

Aimed at detecting small obstacles, like lost and found. Frames: 3000+ pictures, of which 3000+ are claimed to be labelled and 1600 are actually labelled.

📊 1 results

📏 Metrics: Fwβ, MAE, Sm, relaxFbβ, {max}Fβ

Scanpath prediction

CapMIT1003

The CapMIT1003 database contains captions and clicks collected for images from the MIT1003 database, for which reference eye scanpaths are …

📊 2 results

📏 Metrics: SBTDE

Scene Change Detection

ChangeSim

ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation …

📊 3 results

📏 Metrics: Category mIoU, macro F1

ChangeVPR

A scene change detection (SCD) dataset tailored for generalizable SCD algorithms. It consists of change-labeled images from SF-XL, St Lucia, Nordland …

📊 1 results

📏 Metrics: F1 score

PCD

The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total number of verses is …

📊 2 results

📏 Metrics: F1-score

Unaligned-VL-CMU-CD (neighbor distance 2)

Street-View images captured at different timestamps often undergo geometric transformations. To make the VL-CMU-CD dataset more challenging and closer to …

📊 2 results

📏 Metrics: F1-score

Scene Classification

UC Merced Land Use Dataset

This is a 21-class land-use image dataset meant for research purposes. There are 100 images for each of …

📊 4 results

📏 Metrics: Accuracy (%)

Scene Flow Estimation

Argoverse 2

Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated …

📊 6 results

📏 Metrics: EPE 3-Way, EPE Foreground Dynamic, EPE Foreground Static, EPE Background Static

Spring

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes …

📊 6 results

📏 Metrics: 1px total

Scene Generation

AVD

AVD focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images …

📊 3 results

📏 Metrics: FID, SwAV-FID

GoogleEarth

The GoogleEarth dataset is collected from Google Earth Studio, including 400 orbit trajectories in Manhattan and Brooklyn. Each trajectory consists …

📊 4 results

📏 Metrics: Depth Error, KID, Camera Error, FID

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results

📏 Metrics: FID, KID

OSM

The OSM dataset, sourced from OpenStreetMap, is composed of the rasterized semantic maps and height fields of 80 cities worldwide, …

📊 1 results

📏 Metrics: Average FID, KID

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 3 results

📏 Metrics: FID, SwAV-FID

VizDoom

ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode …

📊 3 results

📏 Metrics: FID, SwAV-FID

Scene Graph Generation

4D-OR

4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted to the ceiling of …

📊 5 results

📏 Metrics: F1

MM-OR

Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing …

📊 1 results

📏 Metrics: Macro F1

VRD

The Visual Relationship Dataset (VRD) contains 4000 images for training and 1000 for testing annotated with visual relationships. Bounding boxes …

📊 2 results

📏 Metrics: Recall@50

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 16 results

📏 Metrics: Recall@50, mean Recall @20, Recall@100, Recall@20, mean Recall @100, R@100, mR@100, mR@50, zR@100, zR@20, zR@50

Scene Parsing

PGDP5K

PGDP5K is a dataset consisting of 5000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types …

📊 2 results

📏 Metrics: Total Accuracy

Scene Segmentation

MovieNet

MovieNet is a holistic dataset for movie understanding. MovieNet contains 1,100 movies with a large amount of multi-modal data, e.g. …

📊 1 results

📏 Metrics: AP

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 3 results

📏 Metrics: Average Accuracy, 3DIoU

StreetHazards

StreetHazards is a synthetic dataset for anomaly detection, created by inserting a diverse array of foreign objects into driving scenes …

📊 3 results

📏 Metrics: Open-mIoU

UAVid

UAVid is a high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving …

📊 1 results

📏 Metrics: Category mIoU

Scene Text Detection

COCO-Text

The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which …

📊 6 results

📏 Metrics: F-Measure, Precision, Recall

ICDAR 2013

The ICDAR 2013 dataset consists of 229 training images and 233 testing images, with word-level annotations provided. It is the …

📊 14 results

📏 Metrics: F-Measure, Precision, Recall, H-Mean

ICDAR 2015

ICDAR 2015 is a scene text detection dataset used for the ICDAR 2015 conference.

📊 41 results

📏 Metrics: F-Measure, Precision, Recall, Accuracy, FPS

MSRA-TD500

The MSRA-TD500 dataset is a text detection dataset that contains 300 training images and 200 test images. Text regions are …

📊 18 results

📏 Metrics: F-Measure, Precision, Recall, FPS

SCUT-CTW1500

The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text …

📊 16 results

📏 Metrics: F-Measure, Precision, Recall, FPS

Total-Text

Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, …

IIIT5K

The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street …

📊 16 results

📏 Metrics: Accuracy

MSDA

5 domains: synthetic domain, document domain, street view domain, handwritten domain, and car license domain * over five million …

📊 2 results

📏 Metrics: Average Accuracy

SVT

The Street View Text (SVT) dataset was harvested from Google Street View. Image text in this data exhibits high variability …

📊 34 results

📏 Metrics: Accuracy

SVTP

SVTP dataset stands for Scene Text Recognition Datasets. It is a collection of 4 popular Latin/English scene text recognition datasets, …

📊 16 results

📏 Metrics: Accuracy

WOST

The Weakly Occluded Scene Text (WOST) dataset is a public dataset for scene text segmentation. It is used to generate …

📊 5 results

📏 Metrics: 1:1 Accuracy

Scene-Aware Dialogue

AVSD

The Audio Visual Scene-Aware Dialog (AVSD) dataset, or DSTC7 Track 3, is an audio-visual dataset for dialogue understanding. The goal …

📊 1 results

📏 Metrics: CIDEr

Seeing Beyond the Visible

KITTI360-EX

KITTI360-EX is a dataset for outer- and inner-FoV expansion. It contains 76k pinhole images as well as 76k spherical …

📊 6 results

📏 Metrics: Average PSNR

Segmentation

SA-1B

SA-1B consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks. Source: Segment Anything …

📊 2 results

📏 Metrics: Average Precision, AR-small, AR-medium, AR-large

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results

📏 Metrics: IoU, Precision, Recall

Segmentation Based Workflow Recognition

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

ACDC Scribbles

We release expert-made scribble annotations for the medical ACDC dataset [1]. The released data must be considered as extending the …

📊 6 results

📏 Metrics: Dice (Average)

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 229 results

📏 Metrics: Validation mIoU, Test Score, Params (M), GFLOPs (512 x 512), GFLOPs, Mean IoU (class)

AI-TOD

AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in …

📊 2 results

📏 Metrics: Dice

AIRS

The AIRS (Aerial Imagery for Roof Segmentation) dataset provides a wide coverage of aerial imagery with 7.5 cm resolution and …

📊 1 results

📏 Metrics: IoU

ATLANTIS

ATLANTIS is a benchmark for semantic segmentation of waterbody images. This dataset covers a wide range of natural waterbodies such …

📊 1 results

📏 Metrics: A-acc, A-mIoU, Accuracy, mIoU

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 2 results

📏 Metrics: mIoU

BIG

A high-resolution semantic segmentation dataset with 50 validation and 100 test objects. Image resolution in BIG ranges from 2048×1600 to …

📊 4 results

📏 Metrics: mBA, IoU

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …

📊 4 results

📏 Metrics: mIoU

CEMS-W

The dataset includes annotations for burned area delineation and land cover segmentation, with a focus on European soil. The dataset …

📊 3 results

📏 Metrics: mIoU

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 9 results

📏 Metrics: mIoU

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 1 results

📏 Metrics: F.W. IU, Per-Class Accuracy, Pixel Accuracy, mIoU

Cam2BEV

The dataset contains two subsets of synthetic, semantically segmented road-scene images, which have been created for developing and applying the …

📊 1 results

📏 Metrics: Mean IoU

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 20 results

📏 Metrics: Mean IoU, Global Accuracy

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 2 results

📏 Metrics: mIoU, Pixel Accuracy

Cityscapes 3D

Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. …

📊 1 results

📏 Metrics: mIoU

Cityscapes VIPriors subset

The training and validation data are subsets of the training split of the Cityscapes dataset. The test set is taken …

📊 1 results

📏 Metrics: Accuracy, mIoU

DADA-seg

DADA-seg is a pixel-wise annotated accident dataset, which contains a variety of critical scenarios from traffic accidents. It is used …

📊 27 results

📏 Metrics: mIoU

DDD17

DDD17 has over 12 h of a 346x260 pixel DAVIS sensor recording highway and city driving in daytime, evening, night, …

📊 9 results

📏 Metrics: mIoU

DELIVER

DELIVER is an arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, the dataset is …

📊 9 results

📏 Metrics: mIoU, test mIoU

DIVA-HisDB

The database consists of 150 annotated pages of three different medieval manuscripts with challenging layouts. Furthermore, we provide a layout …

📊 2 results

📏 Metrics: Mean IoU (class)

DSEC

DSEC is a stereo camera dataset in driving scenarios that contains data from two monochrome event cameras and two global …

📊 9 results

📏 Metrics: mIoU

Dark Zurich

Dark Zurich is an image dataset containing a total of 8779 images captured at nighttime, twilight, and daytime, along with …

📊 14 results

📏 Metrics: mIoU

DensePASS

DensePASS - a novel densely annotated dataset for panoramic segmentation under cross-domain conditions, specifically built to study the Pinhole-to-Panoramic transfer …

📊 35 results

📏 Metrics: mIoU

DroneDeploy

From DroneDeploy: We’ve collected a dataset of aerial orthomosaics and elevation images. These have been annotated into 6 different classes: …

📊 1 results

📏 Metrics: Mean IoU (test), Mean IoU (val)

Endoscapes

Cholecystectomy is a very common abdominal surgical procedure almost ubiquitously performed with a laparoscopic approach, hence guided by an endoscopic …

📊 2 results

📏 Metrics: Mean F1

FLAIR (French Land cover from Aerospace ImageRy)

The French National Institute of Geographical and Forest Information (IGN) has the mission to document and measure land-cover on French …

📊 4 results

📏 Metrics: mIoU

FMB Dataset

FMB contains 1500 well-registered infrared and visible image pairs with 14 annotated pixel-level categories. Also, it covers a wide range …

📊 13 results

📏 Metrics: mIoU

Fine-Grained Cloud Segmentation Dataset

The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat 8 OLI and TIRS, covering diverse biomes. This variety supports …

📊 3 results

📏 Metrics: mIoU

Fine-Grained Grass Segmentation Dataset

The dataset was created using high-resolution (8 m) satellite imagery from the Gaofen series (Gaofen-2 and Gaofen-6), captured in 2019 …

📊 9 results

📏 Metrics: mIoU

FoodSeg103

FoodSeg103 is a new food image dataset containing 7,118 images. Images are annotated with 104 ingredient classes and each image …

📊 7 results

📏 Metrics: mIoU

Forward-Looking Sonar Marine Debris Datasets

This dataset is made up of forward-looking sonar images containing ten classes of underwater debris. The dataset can be used …

📊 1 results

📏 Metrics: mIOU

Freiburg Forest

The Freiburg Forest dataset was collected using a Viona autonomous mobile robot platform equipped with cameras for capturing multi-spectral and …

📊 2 results

📏 Metrics: Mean IoU

HAM10000

HAM10000 is a dataset of 10000 training images for detecting pigmented skin lesions. The authors collected dermatoscopic images from different …

📊 1 results

📏 Metrics: Average Dice, Average IOU

HERA RFI Detection

This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South …

📊 2 results

📏 Metrics: AUPRC, AUROC, F1

Hypersim

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. …

📊 5 results

📏 Metrics: mIoU, mIoU (test)

INRIA Aerial Image Labeling

The INRIA Aerial Image Labeling dataset is comprised of 360 RGB tiles of 5000×5000px with a spatial resolution of 30cm/px …

📊 6 results

📏 Metrics: IoU, mIOU

ISPRS Potsdam

The data set contains 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a …

📊 17 results

📏 Metrics: Overall Accuracy, Mean F1, Mean IoU

ISPRS Vaihingen

The data set contains 33 patches (of different sizes), each consisting of a true orthophoto (TOP) extracted from a larger …

📊 10 results

📏 Metrics: Overall Accuracy, Average F1, Category mIoU

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 20 results

📏 Metrics: mIoU (val), mIoU (test)

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 14 results

📏 Metrics: mIoU

Kvasir-Instrument

Consists of annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, …

📊 2 results

📏 Metrics: DSC, mIoU

LOFAR RFI Detection

This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South …

📊 2 results

📏 Metrics: AUPRC, AUROC, F1

LaRS

LaRS is the largest and most diverse panoptic maritime obstacle detection dataset. Highlights: * Diverse scenes from manual capture, public …

📊 20 results

📏 Metrics: Q, F1, μ, mIoU

LoveDA

1. 5987 high spatial resolution (0.3 m) remote sensing images from Nanjing, Changzhou, and Wuhan. 2. Focus on different geographical …

📊 16 results

📏 Metrics: Category mIoU

MCubeS

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four …

📊 21 results

📏 Metrics: mIoU

MCubeS (P)

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four …

📊 8 results

📏 Metrics: mIoU

MUSES: MUlti-SEnsor Semantic perception dataset

MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types …

📊 2 results

📏 Metrics: mIoU

Matterport3D

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 4 results

📏 Metrics: Test mIoU, Validation mIoU

Mila Simulated Floods

Mila Simulated Floods Dataset is a 1.5 square km virtual world using the Unity3D game engine including urban, suburban and …

📊 1 results

📏 Metrics: mIoU

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38,000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results

📏 Metrics: Dice, Mean IoU

Montgomery County X-ray Set

X-ray images in this data set have been acquired from the tuberculosis control program of the Department of Health and Human …

📊 3 results

📏 Metrics: F1-score

Nighttime Driving

Nighttime Driving is a dataset of road scenes consisting of 35,000 images ranging from daytime to twilight time and to …

📊 12 results

📏 Metrics: mIoU

OpenEDS

OpenEDS (Open Eye Dataset) is a large scale data set of eye-images captured using a virtual-reality (VR) head mounted display …

📊 1 results

📏 Metrics: mIOU

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 62 results

📏 Metrics: mIoU, Mean Accuracy, Pixel Accuracy

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 1 results

📏 Metrics: mIoU

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 2 results

📏 Metrics: Mean IoU

PASCAL VOC 2011

PASCAL VOC 2011 is an image segmentation dataset. It contains around 2,223 images for training, consisting of 5,034 objects. Testing …

📊 1 results

📏 Metrics: Mean IoU

PASCAL VOC 2012 test

SCC Data Set

📊 51 results

📏 Metrics: Mean IoU, FLOPS, Params

PASTIS

PASTIS is a benchmark dataset for panoptic and semantic segmentation of agricultural parcels from satellite image time series. It is …

📊 3 results

📏 Metrics: Mean IoU (test), Number of Params, Overall Accuracy

PASTIS-R

Extension of the PASTIS benchmark with radar and optical image time series.

📊 1 results

📏 Metrics: IoU

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 4 results

📏 Metrics: Mean IoU (class)

PH2

The increasing incidence of melanoma has recently promoted the development of computer-aided diagnosis systems for the classification of dermoscopic images. …

📊 2 results

📏 Metrics: Average Dice, Average IOU

Pothole Mix

This dataset for the semantic segmentation of potholes and cracks on the road surface was assembled from 5 other datasets …

📊 7 results

📏 Metrics: Test Dice Multiclass, Test mIoU, Validation Dice Multiclass, Validation mIoU

Potsdam

https://paperswithcode.com/sota/semantic-segmentation-on-isprs-potsdam

📊 3 results

📏 Metrics: mIoU

RUGD

A Video Dataset for Visual Perception and Autonomous Navigation in Unstructured Environments. Website: http://rugd.vision/ The RUGD dataset focuses on semantic …

📊 1 results

📏 Metrics: AIOU, mIoU

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 5 results

📏 Metrics: mIoU

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 50 results

📏 Metrics: Mean IoU, mAcc, oAcc, FLOPs, Number of params, mIoU, Params (M)

SBCoseg

The SBCoseg dataset includes 889 groups of images and each group consists of 18 images with a common object, leading …

📊 1 results

📏 Metrics: Jaccard

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 1 results

📏 Metrics: AUC

SWIMSEG

The SWIMSEG dataset contains 1013 images of sky/cloud patches, along with their corresponding binary segmentation maps. The ground truth annotation …

📊 1 results

📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SWINSEG

The SWINSEG dataset contains 115 nighttime images of sky/cloud patches along with their corresponding binary ground truth maps. The ground …

📊 1 results

📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SWINySEG

The SWINySEG dataset contains 6768 daytime- and nighttime-images of sky/cloud patches along with their corresponding binary ground truth maps. The …

📊 1 results

📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SYNTHIA

The SYNTHIA dataset is a synthetic dataset that consists of 9400 multi-viewpoint photo-realistic frames rendered from a virtual city and …

📊 2 results

📏 Metrics: mIoU

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 44 results

📏 Metrics: val mIoU, test mIoU

Semantic3D

Semantic3D is a point cloud dataset of scanned outdoor scenes with over 3 billion points. It contains 15 training and …

📊 13 results

📏 Metrics: mIoU, oAcc

SemanticPOSS

The SemanticPOSS dataset for 3D semantic segmentation contains 2988 various and complicated LiDAR scans with large quantity of dynamic instances. …

📊 1 results

📏 Metrics: Mean IoU

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 4 results

📏 Metrics: Mean IoU

SpaceNet 1

SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data is comprised of 382,534 building footprints, …

📊 10 results

📏 Metrics: Mean IoU

Structured3D

Structured3D is a large-scale photo-realistic dataset containing 3.5K house designs (a) created by professional designers with a variety of ground …

📊 4 results

📏 Metrics: Test mIoU, Validation mIoU

Trans10K

A large-scale dataset for transparent object segmentation, named Trans10K, consisting of 10,428 images of real scenarios with careful manual annotations, …

📊 14 results

📏 Metrics: mIoU, GFLOPs

UAVid

UAVid is a high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving …

📊 6 results

📏 Metrics: Mean IoU

UPLight

UPLight is an underwater RGB-Polarization multimodal semantic segmentation dataset with 12 typical underwater semantic classes.

📊 6 results

📏 Metrics: mIoU

VDD

Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential semantic details to …

📊 7 results

📏 Metrics: mIoU

WildDash

WildDash is a benchmark evaluation method that uses meta-information to calculate the robustness of a given algorithm …

📊 1 results

📏 Metrics: Mean IoU

ZJU-RGB-P

Research on semantic segmentation of traffic scenes using color and polarization information (including training and testing sets).

📊 13 results

📏 Metrics: mIoU, Frame (fps)

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 15 results

📏 Metrics: mIoU

Semantic correspondence

AP-10K

AP-10K is the first large-scale benchmark for general animal pose estimation, to facilitate the research in animal pose estimation. AP-10K …

📊 1 results

📏 Metrics: PCK

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of …

📊 1 results

📏 Metrics: Mean [email protected], Mean [email protected]

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 2 results

📏 Metrics: IoU, LT-ACC, IoU (weak), LT-ACC (weak)

PF-PASCAL

📊 14 results

📏 Metrics: PCK, PCK (weak)

PF-WILLOW

📊 7 results

📏 Metrics: PCK, PCK (weak)

SPair-71k

SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger …

📊 21 results

📏 Metrics: PCK

Semi-Supervised Image Classification

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 1 results

📏 Metrics: Accuracy

Caltech-256

📊 9 results

📏 Metrics: J&F, J, F

Long Video Dataset (3X)

We randomly selected three videos from the Internet that are longer than 1.5K frames and have their main objects continuously …

📊 2 results

📏 Metrics: J&F, J, F

MOSE

CoMplex video Object SEgmentation (MOSE) is a dataset for studying the tracking and segmentation of objects in complex environments. MOSE contains …

📊 17 results

📏 Metrics: J&F, J, F, FPS

VOT2020

VOT2020 is a Visual Object Tracking benchmark for short-term tracking in RGB.

📊 20 results

📏 Metrics: EAO, EAO (real-time)

YouTube-VOS 2018

YouTube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 52 results

📏 Metrics: Overall, Jaccard (Seen), Jaccard (Unseen), F-Measure (Seen), F-Measure (Unseen), Speed (FPS), Params(M)

Semi-supervised Anomaly Detection

UBI-Fights

UBI-Fights - Concerning a specific anomaly detection and still providing a wide diversity in fighting scenarios, the UBI-Fights dataset is …

📊 4 results

📏 Metrics: AUC, Decidability, EER

Sentiment Analysis

BanglaBook

This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis …

📊 13 results

📏 Metrics: Weighted Average F1-score

DBRD

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly …

📊 3 results

📏 Metrics: Accuracy, F1

DynaSent

DynaSent is an English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using …

📊 12 results

📏 Metrics: Macro F1, 10 fold Cross validation

HARD

The Hotel Arabic-Reviews Dataset (HARD) contains 93,700 hotel reviews in the Arabic language. The hotel reviews were collected from the Booking.com website …

📊 1 results

📏 Metrics: Accuracy

IMDb Movie Reviews

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database …

📊 2 results

📏 Metrics: Accuracy (2 classes), F1 Macro

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 18 results

📏 Metrics: Accuracy, Training Time

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 8 results

📏 Metrics: Recall (%), F1 (%), Text model

SST-3

SST-5 is the Stanford Sentiment Treebank 5-way classification dataset (positive, somewhat positive, neutral, somewhat negative, negative). To create SST-3 (positive, …

📊 11 results

📏 Metrics: Macro F1

Sentiment Merged

This is a dataset for 3-way sentiment classification of reviews (negative, neutral, positive). It is a merge of [Stanford Sentiment …

📊 10 results

📏 Metrics: Macro F1

TweetEval

TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks. Source: [TweetEval: Unified Benchmark and Comparative Evaluation for …

📊 7 results

📏 Metrics: Emoji, Emotion, Hate, Irony, Offensive, Sentiment, Stance, ALL

Shadow Detection

CUHK-Shadow

Collects shadow images for multiple scenarios and compiles a new dataset of 10,500 shadow images, each with labeled ground-truth mask, …

📊 6 results

📏 Metrics: BER

SBU / SBU-Refine

SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities that are recorded using the Microsoft …

📊 6 results

📏 Metrics: BER

Shadow Removal

INS Dataset

A significant challenge in removing shadows from indoor scenes is obtaining shadow-free images. To overcome this challenge, we propose a …

📊 1 results

📏 Metrics: Average PSNR (dB)

ISTD

The Image Shadow Triplets dataset (ISTD) is a dataset for shadow understanding that contains 1870 image triplets of shadow image, …

📊 9 results

📏 Metrics: MAE

ISTD+

ISTD+ consists of shadow images, shadow-free images, and shadow masks, with 1,330 training images and 540 testing images from 135 …

📊 20 results

📏 Metrics: RMSE, PSNR, SSIM, LPIPS

SRD

SRD is a dataset for shadow removal that contains 3088 shadow and shadow-free image pairs.

📊 19 results

📏 Metrics: RMSE, PSNR, SSIM, LPIPS

WSRD+

A version of the WSRD Dataset will be used as a benchmark for the NTIRE24 Challenge on Image Shadow Removal.

📊 1 results

📏 Metrics: LPIPS, PSNR, SSIM

Short-term Object Interaction Anticipation

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

WLASL

WLASL is a large video dataset for Word-Level American Sign Language (ASL) recognition, which features 2,000 common different words in …

📊 2 results

📏 Metrics: Top-1 Accuracy

Znaki

The first and only open dataset for Russian finger-spelling, containing 1,593 annotated phrases and over 37 thousand HD+ …

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 8 results

📏 Metrics: Accuracy

Single-View 3D Reconstruction

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of …

📊 1 results

📏 Metrics: FID

Common Objects in 3D

Common Objects in 3D is a large-scale dataset with real multi-view images of object categories annotated with camera poses and …

📊 3 results

📏 Metrics: Avg. F1

GSO

Scanned Objects by Google Research is a dataset of common household objects that have been 3D scanned for use in …

📊 3 results

📏 Metrics: Chamfer Distance, IoU, F-Score

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 7 results

📏 Metrics: 3DIoU, F-Score

ShapeNetCore

ShapeNetCore is a subset of the full ShapeNet dataset with single clean 3D models and manually verified category and alignment …

📊 6 results

📏 Metrics: 3DIoU

SynthEVox3D-Tiny

Event cameras are sensors that are inspired by biological systems and specialize in capturing changes in brightness. These emerging cameras …

📊 2 results

📏 Metrics: A-mIoU

TransProteus

The dataset contains procedurally generated images of transparent vessels containing liquid and objects. The data for each image includes …

📊 1 results

📏 Metrics: R2

Single-object discovery

Object Discovery

The Object Discovery dataset was collected by downloading images from the Internet for the airplane, car and horse categories. It is significantly larger …

📊 2 results

📏 Metrics: CorLoc

Sketch-Based Image Retrieval

Chairs

The Chairs dataset contains rendered images of around 1000 different three-dimensional chair models. Source: Adversarial Disentanglement with Grouped Observations Image …

📊 2 results

📏 Metrics: R@1, R@10

Sketch-to-Image Translation

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 3 results

📏 Metrics: FID, FID-C

Scribble

Scribble is a new outline dataset consisting of 200 images (150 train, 50 test) for each of 10 classes – …

📊 2 results

📏 Metrics: FID, Accuracy, Human (%)

SketchyCOCO

The SketchyCOCO dataset consists of two parts. Object-level data: 20,198 (train 18,869 + val 1,329) triplets of {foreground sketch, foreground image, foreground edge …

📊 2 results

📏 Metrics: FID, Accuracy, Human (%)

Skills Assessment

Multimodal PISA

Dataset for multimodal skills assessment, focusing on assessing a piano player's skill level. Annotations include the player's skill level and song difficulty …

📊 1 results

📏 Metrics: Accuracy (%)

Skills Evaluation

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor …

📊 5 results

📏 Metrics: Accuracy, LogLoss, ROC AUC

Small Object Detection

SODA-D

SODA-D is a large-scale dataset tailored for small object detection in driving scenarios, which is built on top of MVD …

📊 1 results

📏 Metrics: [email protected]:0.95

Source-Free Domain Adaptation

PACS

PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …

📊 2 results

📏 Metrics: Average Accuracy

VisDA-2017

VisDA-2017 is a simulation-to-real dataset for domain adaptation with over 280,000 images across 12 categories in the training, validation and …

📊 10 results

📏 Metrics: Accuracy

Spatial Relation Recognition

Rel3D

Understanding spatial relations (e.g., “laptop on table”) in visual input is important for both humans and robots. Existing datasets are …

📊 9 results

📏 Metrics: Acc

Spatio-Temporal Video Grounding

HC-STVG1

The newly proposed HC-STVG task aims to localize the target person spatio-temporally in an untrimmed video. For this task, we …

📊 3 results

📏 Metrics: m_vIoU, vIoU@0.3, vIoU@0.5

HC-STVG2

We have added data and cleaned the labels in HC-STVG to build the HC-STVG2.0. While the original database contained 5660 …

📊 4 results

📏 Metrics: Val m_vIoU, Val vIoU@0.3, Val vIoU@0.5

VidSTG

The VidSTG dataset is a spatio-temporal video grounding dataset constructed based on the video relation dataset VidOR. VidOR contains 7,000, …

📊 3 results

📏 Metrics: Declarative m_vIoU, Declarative vIoU@0.3, Declarative vIoU@0.5, Interrogative m_vIoU, Interrogative vIoU@0.3, Interrogative vIoU@0.5

State Change Object Detection

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 1 results

📏 Metrics: AP, AP50, AP75

Stereo Depth Estimation

Spring

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes …

📊 4 results

📏 Metrics: 1px total

Stereo Disparity Estimation

Middlebury 2014

The Middlebury 2014 dataset contains a set of 23 high resolution stereo pairs for which known camera calibration parameters and …

📊 2 results

📏 Metrics: D1 Error (2px)

Story Continuation

VIST

The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on …

📊 2 results

📏 Metrics: FID

Style Transfer

GYAFC

Grammarly’s Yahoo Answers Formality Corpus (GYAFC) is the largest dataset for any style containing a total of 110K informal / …

📊 1 results

📏 Metrics: Accuracy, BLEU-4, Harmonic mean

StyleBench

To comprehensively evaluate the effectiveness and generalization ability of style transfer methods, we build StyleBench that covers 73 distinct styles, …

📊 7 results

📏 Metrics: CLIP Score

WikiArt

WikiArt contains paintings from 195 different artists. The dataset has 42,129 images for training and 10,628 images for testing. Source: …

📊 2 results

📏 Metrics: SSIM, ArtFID

Supervised Image Retrieval

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 3 results

📏 Metrics: Precision@100

Surface Normals Estimation

IBims-1

iBims-1 (independent Benchmark images and matched scans - version 1) is a new high-quality RGB-D dataset, especially designed for testing …

📊 2 results

📏 Metrics: % < 11.25, % < 22.5, % < 30, Mean

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 1 results

📏 Metrics: Mean Angle Error

Stanford-ORB

We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide …

📊 7 results

📏 Metrics: Cosine Distance

Taskonomy

Taskonomy provides a large and high-quality dataset of varied indoor scenes. - Complete pixel-level geometric information via aligned meshes. - …

📊 1 results

📏 Metrics: L1 error

Surgical phase recognition

Cholec80

Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured …

📊 6 results

📏 Metrics: F1, Acc

GraSP

Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a …

📊 2 results

📏 Metrics: mAP

HeiChole Benchmark

Analyzing the surgical workflow is a prerequisite for many applications in computer assisted surgery (CAS), such as context-aware visualization of …

📊 5 results

📏 Metrics: F1

MISAW

The MISAW data set is composed of 27 sequences of micro-surgical anastomosis on artificial blood vessels performed by 3 surgeons …

CrossTask dataset contains instructional videos, collected for 83 different tasks. For each task an ordered list of steps with manual …

📊 7 results

📏 Metrics: Recall

EPIC-KITCHENS-100

This paper introduces the pipeline to scale the largest dataset in egocentric vision EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a …

📊 6 results

📏 Metrics: Avg mAP (0.1-0.5), mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected]

FineAction

FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges …

📊 9 results

📏 Metrics: mAP, mAP [email protected], mAP [email protected], mAP [email protected]

HACS

HACS is a dataset for human action recognition. It uses a taxonomy of 200 action classes, which is identical to …

📊 11 results

📏 Metrics: Average-mAP, [email protected], [email protected], [email protected]

MUSES

MUSES is a large-scale dataset for temporal event (action) localization. It focuses on the temporal localization of multi-shot events, which …

📊 2 results

📏 Metrics: mAP, [email protected], [email protected], [email protected], [email protected], [email protected]

MultiTHUMOS

The MultiTHUMOS dataset contains dense, multilabel, frame-level action annotations for 30 hours across 400 videos in the THUMOS'14 action detection …

📊 7 results

📏 Metrics: Average mAP, mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected], mAP [email protected]

THUMOS14

The THUMOS14 (THUMOS 2014) dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for …

📊 1 results

📏 Metrics: Avg mAP (0.3:0.7)

Temporal Sentence Grounding

Charades-STA

Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations. Source: [TALL: Temporal Activity Localization …

📊 11 results

📏 Metrics: [email protected], [email protected], [email protected], [email protected]

Text Detection

UrduDoc

The UrduDoc Dataset is a benchmark dataset for Urdu text line detection in scanned documents. It is created as a …

📊 5 results

📏 Metrics: Precision, Recall

Text Spotting

ICDAR 2015

ICDAR 2015 is a scene text detection dataset used for the ICDAR 2015 conference.

📊 17 results

📏 Metrics: F-measure (%) - Strong Lexicon, F-measure (%) - Weak Lexicon, F-measure (%) - Generic Lexicon

SCUT-CTW1500

The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text …

📊 10 results

📏 Metrics: F-measure (%) - No Lexicon, F-Measure (%) - Full Lexicon

Total-Text

Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, …

📊 12 results

📏 Metrics: F-measure (%) - No Lexicon, F-measure (%) - Full Lexicon

Text based Person Retrieval

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five …

📊 16 results

📏 Metrics: R@1, R@5, R@10, mAP, Rank-1, Rank-10, Rank-5

ICFG-PEDES

A large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval. Compared with existing databases, ICFG-PEDES has three key advantages. …

📊 11 results

📏 Metrics: R@1, Rank-1, R@5, R@10, mAP, mINP, Rank-10, Rank-5

RSTPReid

RSTPReid contains 20,505 images of 4,101 persons from 15 cameras. Each person has 5 corresponding images taken by different cameras …

📊 9 results

📏 Metrics: R@1, R@5, R@10, mAP, Rank-1, Rank-10, Rank-5, mINP

Text to 3D

T$^3$Bench

T$^3$Bench is the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed …

📊 6 results

📏 Metrics: Avg

Text to Video Retrieval

DrawBench is a comprehensive and challenging benchmark for text-to-image models, introduced by the Imagen research team. …

📊 8 results

📏 Metrics: Aesthetics (LAION Aesthetics Predictor), Human Preference Alignment (HPSv2), Text Alignment (SentenceBERT)

Flickr-8k

Contains 8k Flickr images with captions. Visit this page to explore the data. Cite this paper if you find it …

📊 1 results

📏 Metrics: LPIPS

GenEval

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given …

📊 20 results

📏 Metrics: Overall, Single Obj., Two Obj., Color Attri., Colors, Counting, Position

LAION COCO

LAION-COCO is the world’s largest dataset of 600M generated high-quality captions for publicly available web-images. The images are extracted from …

📊 2 results

📏 Metrics: FID

T2I-CompBench

T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional textual prompts from 3 categories (attribute …

A3D

A new dataset of diverse traffic accidents. Source: Unsupervised Traffic Accident Detection in First-Person Videos

📊 3 results

📏 Metrics: AUC

Traffic Sign Detection

CCTSDB-AUG

The CSUST Chinese Traffic Sign Detection Benchmark (CCTSDB) is an existing dataset for traffic sign detection. It consists of nearly …

📊 2 results

📏 Metrics: Averaged Precision, avg-mAP (0.1-0.5)

CCTSDB2021

Traffic signs are among the most important sources of information guiding vehicles on the road, and the detection of traffic signs …

📊 1 results

📏 Metrics: [email protected]

Training-free 3D Point Cloud Classification

ScanObjectNN

ScanObjectNN is a newly published real-world dataset comprising 2,902 3D objects in 15 categories. It is a challenging point …

The code to create the dataset is available here. The dataset used in the paper is available on GitHub - …

Cats and Dogs

A large set of images of cats and dogs. Homepage: https://www.microsoft.com/en-us/download/details.aspx?id=54765 Source code: tfds.image_classification.CatsVsDogs Versions: 4.0.0 (default): New split API …

📊 1 results

📏 Metrics: AUC-ROC

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 1 results

📏 Metrics: AUC-ROC

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results

📏 Metrics: AUC-ROC

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

UVO

UVO is a new benchmark for open-world class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world …

📏 Metrics: mean, Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency, Dense Captioning, Spatial Understanding, Reasoning

Vehicle Key-Point and Orientation Estimation

ApolloCar3D

ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results

📏 Metrics: A3DP

Vehicle Re-Identification

CityFlow

CityFlow is a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras …

📊 1 results

📏 Metrics: mAP

VeRi-776

VeRi-776 is a vehicle re-identification dataset which contains 49,357 images of 776 vehicles from 20 cameras. The dataset is collected …

📊 17 results

📏 Metrics: mAP, Rank-1, Rank1, Rank5, Rank-10, Rank-5

VehicleID

The “VehicleID” dataset contains cars captured during the daytime by multiple real-world surveillance cameras distributed in a small city in …

📊 1 results

📏 Metrics: Rank1

Vehicle Speed Estimation

BrnoCompSpeed

The dataset contains 21 full-HD videos, each around 1 hr long, captured at six different locations. Vehicles in the videos …

📊 2 results

📏 Metrics: Mean Speed Measurement Error (km/h), Median Speed Measurement Error (km/h), 95-th Percentile Speed Measurement Error (km/h), 99-th Percentile Speed Measurement Error (km/h)
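
The four BrnoCompSpeed metrics above are summary statistics over per-vehicle speed errors. As a minimal sketch, assuming the error is the absolute difference between measured and reference speed in km/h (the function name `speed_error_summary` and the use of absolute error are illustrative assumptions, not taken from the benchmark's evaluation code), they could be computed like this:

```python
import numpy as np

def speed_error_summary(measured_kmh, reference_kmh):
    """Summarise speed measurement errors (km/h) for a set of vehicles.

    Assumption: the error for each vehicle is |measured - reference|.
    """
    measured = np.asarray(measured_kmh, dtype=float)
    reference = np.asarray(reference_kmh, dtype=float)
    errors = np.abs(measured - reference)
    return {
        "mean": float(errors.mean()),
        "median": float(np.median(errors)),
        # Percentiles use NumPy's default linear interpolation.
        "p95": float(np.percentile(errors, 95)),
        "p99": float(np.percentile(errors, 99)),
    }

if __name__ == "__main__":
    # Toy values for illustration only, not benchmark data.
    print(speed_error_summary([50, 62, 71, 88], [52, 60, 70, 90]))
```

The high percentiles are reported alongside the mean because a few badly mismeasured vehicles can hide behind a low average error.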

Video & Kinematic Base Workflow Recognition

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 6 results

📏 Metrics: Average AD-Accuracy

Video Anomaly Detection

Video Based Workflow Recognition

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 5 results

📏 Metrics: Average AD-Accuracy

Video Captioning

ActivityNet Captions

The ActivityNet Captions dataset is built on ActivityNet v1.3 which includes 20k YouTube untrimmed videos with 100k caption annotations. The …

📊 5 results

📏 Metrics: BLEU4, BLEU-3, CIDEr, ROUGE-L, METEOR

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, which consists of 10,000 …

📊 22 results

📏 Metrics: CIDEr, METEOR, ROUGE-L, BLEU-4, GS

MSRVTT-CTN

This dataset contains CTN annotations for the MSRVTT-CTN benchmark dataset in JSON format. It has three files …

📊 3 results

📏 Metrics: CIDEr, SPICE, ROUGE-L

MSVD

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers …

📊 14 results

📏 Metrics: CIDEr, BLEU-4, METEOR, ROUGE-L, GS

MSVD-CTN

This dataset contains CTN annotations for the MSVD-CTN benchmark dataset in JSON format. It has three files …

📊 3 results

📏 Metrics: CIDEr, ROUGE-L, SPICE

MSVD-Indonesian

MSVD-Indonesian is derived from the MSVD dataset, which is obtained with the help of a machine translation service. This dataset …

📊 1 results

📏 Metrics: BLEU-4, CIDEr, METEOR, ROUGE-L

Shot2Story20K

A short clip of video may contain progression of multiple events and an interesting story line. A human needs to …

📊 2 results

📏 Metrics: CIDEr, BLEU-4, METEOR, ROUGE

TVC

TV show Caption is a large-scale multimodal captioning dataset, containing 261,490 caption descriptions paired with 108,965 short video moments. TVC …

📊 2 results

📏 Metrics: BLEU-4, CIDEr

VATEX

VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has …

📊 8 results

📏 Metrics: BLEU-4, CIDEr, METEOR, ROUGE-L

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 14 results

📏 Metrics: BLEU-4, BLEU-3, CIDEr, ROUGE-L, METEOR

Video Chaptering

VidChapters-7M

VidChapters-7M is a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online …

MFQE v2

A dataset for compressed video quality enhancement.

The Vimeo-90K is a large-scale high-quality video dataset for lower-level video processing. It proposes three different video processing tasks: frame …

📊 21 results

📏 Metrics: PSNR, SSIM, LPIPS, Speed (ms/f)

X4K1000FPS

Dataset of high-resolution (4096×2160), high-fps (1000fps) video frames with extreme motion. X-TEST consists of 15 video clips with 33-length of …

YouTube Driving Dataset contains a massive amount of real-world driving frames with various conditions, from different weather, different regions, to …

📊 1 results

📏 Metrics: FVD16

Video Grounding

MAD

MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos or …

📊 2 results

📏 Metrics: R@1,IoU=0.1, R@5,IoU=0.1, R@10,IoU=0.1, R@100,IoU=0.1, R@50,IoU=0.1, R@1,IoU=0.3, R@5,IoU=0.3

QVHighlights

The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language …

📊 6 results

📏 Metrics: R@1,IoU=0.7, R@1,IoU=0.5

Video Inpainting

DAVIS

The Densely Annotated Video Segmentation dataset (DAVIS) is a high quality and high resolution densely annotated video segmentation dataset under …

📊 11 results

📏 Metrics: PSNR, SSIM, VFID, Ewarp, LPIPS (object), LPIPS (square), PSNR (object), SSIM (object), SSIM (square)

How2Sign

The How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more …

📊 1 results

📏 Metrics: L1 error

YouTube-VOS 2018

YouTube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 10 results

📏 Metrics: PSNR, SSIM, VFID, Ewarp

Video Instance Segmentation

HQ-YTVIS

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. …

📊 4 results

📏 Metrics: Tube-Boundary AP

YouTube-VIS 2021

3,859 high-resolution YouTube videos, 2,985 training videos, 421 validation videos and 453 test videos. An improved 40-category label set by …

📊 26 results

📏 Metrics: mask AP, AP50, AP75, AR1, AR10

YouTube-VIS 2022 Validation

Video object segmentation has been studied extensively in the past decade due to its importance in understanding video spatial-temporal structures …

📊 7 results

📏 Metrics: mAP_L, AP50_L, AP75_L, AR1_L, AR10_L

Video Object Segmentation

📏 Metrics: Average PSNR (dB)

Video Retrieval

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 31 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video R@50, text-to-video Mean Rank, text-to-video Median Rank, video-to-text R@1, video-to-text R@5, video-to-text Mean Rank, video-to-text Median Rank, video-to-text R@10, video-to-text R@50

Charades-STA

Charades-STA is a new dataset built on top of Charades by adding sentence temporal annotations. Source: [TALL: Temporal Activity Localization …

📊 1 results

📏 Metrics: text-to-video Mean Rank, text-to-video Median Rank, text-to-video R@1, text-to-video R@10, video-to-text Mean Rank, video-to-text Median Rank, video-to-text R@1, video-to-text R@10

Condensed Movies

A large-scale video dataset, featuring clips from movies with detailed captions.

📊 3 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10

DiDeMo

The Distinct Describable Moments (DiDeMo) dataset is one of the largest and most diverse datasets for the temporal localization of …

📊 39 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video R@50, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank, text-to-videoR@1

EgoExoLearn

EgoExoLearn is a dataset designed to bridge the gap between egocentric and exocentric views of procedural activities. …

📊 2 results

📏 Metrics: Accuracy

FIVR-200K

The FIVR-200K dataset has been collected to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The dataset comprises 225,960 …

📊 15 results

📏 Metrics: mAP (ISVR), mAP (CSVR), mAP (DSVR)

LSMDC

This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …

📊 38 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@10, video-to-text R@5, video-to-text Median Rank, video-to-text Mean Rank

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, which consists of 10,000 …

📊 38 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Mean Rank, text-to-video Median Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank, text-to-video MedianR, text-to-videoMedian Rank

MSVD

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers …

📊 24 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, text-to-video R@50, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank

MSVD-Indonesian

MSVD-Indonesian is derived from the MSVD dataset, which is obtained with the help of a machine translation service. This dataset …

📊 1 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank

QuerYD

A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of …

📊 5 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10

TGIF

📊 3 results

📏 Metrics: SSIM, PSNR, TIoU

Vimeo90K

The Vimeo-90K is a large-scale high-quality video dataset for lower-level video processing. It proposes three different video processing tasks: frame …

📊 3 results

📏 Metrics: PSNR, SSIM

Video deraining

VRDS

We generate a synthesized dataset, namely VRDS, with 102 rainy videos from diverse scenarios, and each video frame has the …

📊 8 results

📏 Metrics: SSIM, PSNR

Video Waterdrop Removal Dataset

Due to the lack of training data for video waterdrop removal, we propose a large-scale synthetic dataset with simulated waterdrops …

📊 4 results

📏 Metrics: PSNR, SSIM

Video, Kinematic & Segmentation Base Workflow Recognition

PETRAW

The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

EPIC-Hotspot

From Grounded Human-Object Interaction Hotspots from Video (ICCV'19): We collect annotations for interaction keypoints on EPIC Kitchens in order to …

📊 3 results

📏 Metrics: KLD, SIM, AUC-J

OPRA

The OPRA Dataset was introduced in Demo2Vec: Reasoning Object Affordances From Online Videos (CVPR'18) for reasoning object affordances from online …

ConvAI2

The ConvAI2 NeurIPS competition aimed at finding approaches to creating high-quality dialogue agents capable of meaningful open domain conversation. The …

📊 1 results

📏 Metrics: BLEU-4, F1, ROUGE-L

EmpatheticDialogues

The EmpatheticDialogues dataset is a large-scale multi-turn empathetic dialogue dataset collected on the Amazon Mechanical Turk, containing 24,850 one-to-one open-domain …

📊 1 results

📏 Metrics: BLEU-4, F1, ROUGE-L

Image-Chat

The IMAGE-CHAT dataset is a large collection of (image, style trait for speaker A, style trait for speaker B, dialogue …

📊 1 results

📏 Metrics: BLEU-4, F1, ROUGE-L

Wizard of Wikipedia

Wizard of Wikipedia is a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. It is used to …

📊 1 results

📏 Metrics: BLEU-4, F1, ROUGE-L

Visual Localization

Aachen Day-Night v1.1 Benchmark

Aachen Day-Night v1.1 dataset is an extended version of the original Aachen Day-Night dataset. Besides the original query images, the …

📊 7 results

📏 Metrics: Acc@0.25m, 2°, Acc@0.5m, 5°, Acc@5m, 10°

Visual Navigation

AI2-THOR

AI2-Thor is an interactive environment for embodied AI. It contains four types of scenes, including kitchen, living room, bedroom and …

📊 2 results

📏 Metrics: SPL (All), SPL (L≥5), Success Rate (All), Success Rate (L≥5)

R2R

R2R is a dataset for visually-grounded natural language navigation in real buildings. The dataset requires autonomous agents to follow human-generated …

📊 11 results

📏 Metrics: spl

Visual Object Tracking

AVisT

One of the key factors behind the recent success in visual tracking is the availability of dedicated benchmarks. While being …

📊 7 results

📏 Metrics: Success Rate

DiDi

DiDi is a distractor-distilled tracking dataset created to address the limitation of low distractor presence in current visual object tracking …

📊 10 results

📏 Metrics: Tracking quality

GOT-10k

The GOT-10k dataset contains more than 10,000 video segments of real-world moving objects and over 1.5 million manually labelled bounding …

📊 41 results

📏 Metrics: Average Overlap, Success Rate 0.5, Success Rate 0.75
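
The three GOT-10k metrics above are all derived from per-frame overlap (IoU) between predicted and ground-truth boxes. The sketch below is a simplified illustration that assumes boxes are given as `(x, y, width, height)` and that scores are averaged over the frames of a single sequence. The official toolkit may aggregate across sequences and classes differently, and the names `box_iou` and `overlap_metrics` are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def overlap_metrics(pred_boxes, gt_boxes, thresholds=(0.5, 0.75)):
    """Average Overlap (mean IoU) and Success Rate at each IoU threshold."""
    ious = np.array([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    metrics = {"AO": float(ious.mean())}
    for t in thresholds:
        # Success Rate: fraction of frames whose IoU exceeds the threshold.
        metrics[f"SR_{t}"] = float((ious > t).mean())
    return metrics

if __name__ == "__main__":
    # Toy boxes for illustration only.
    pred = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
    gt = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 5, 5)]
    print(overlap_metrics(pred, gt))
```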

ITB

Informative Tracking Benchmark (ITB) is a small and informative tracking benchmark with 7% out of 1.2 M frames of existing …

📊 1 results

📏 Metrics: AUC

LaSOT

LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames …

📊 44 results

📏 Metrics: AUC, Normalized Precision, Precision

OTB-2013

OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed …

📊 5 results

📏 Metrics: AUC

OTB-2015

OTB-2015, also referred to as the Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for …

📊 17 results

📏 Metrics: AUC, Precision

TNL2K

Tracking by Natural Language (TNL2K) is constructed for the evaluation of tracking by natural language specification. TNL2K features: - Large-scale: …

📊 14 results

📏 Metrics: AUC, precision, Normalized Precision

TrackingNet

TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos split …

📊 38 results

📏 Metrics: Accuracy, Normalized Precision, Precision, Success Rate, AUC

UAV123

📊 15 results

📏 Metrics: AUC, Precision

VOT2014

The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, …

📊 1 results

📏 Metrics: Expected Average Overlap (EAO)

VOT2016

VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps …

📊 6 results

📏 Metrics: Expected Average Overlap (EAO)

VOT2017

VOT2017 is a Visual Object Tracking dataset for different tasks that contains 60 short sequences annotated with 6 different attributes. …

📊 6 results

📏 Metrics: Expected Average Overlap (EAO)

VOT2018

VOT2018 is a dataset for visual object tracking. It consists of 60 challenging videos collected from real-life datasets. Source: [Remove …

📊 2 results

📏 Metrics: Expected Average Overlap (EAO), Accuracy

VOT2019

VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB. Source: https://www.votchallenge.net/vot2019/dataset.html

📊 3 results

📏 Metrics: Expected Average Overlap (EAO), Accuracy

VOT2022

📊 4 results

📏 Metrics: EAO

VideoCube

VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT …

📊 1 results

📏 Metrics: Normalized Precision, Precision, Success Rate

YouTube-VOS 2018

YouTube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 …

📊 9 results

📏 Metrics: O (Average of Measures), Jaccard (Seen), Jaccard (Unseen), F-Measure (Seen), F-Measure (Unseen)

Visual Odometry

EuRoC MAV

EuRoC MAV is a visual-inertial dataset collected on board a Micro Aerial Vehicle (MAV). The dataset contains stereo images, synchronized IMU …

📊 1 results

📏 Metrics: Relative Position Error Translation [cm]

Visual Place Recognition

AmsterTime

AmsterTime dataset offers a collection of 2,500 well-curated images matching the same scene from a street view matched to historical …

📊 8 results

📏 Metrics: Recall@1, Recall@10, Recall@5

CV-Cities

CV-Cities comprises 223,736 ground panoramic images and an equal number of satellite images, all accompanied by high-precision GPS coordinates. These …

📊 3 results

📏 Metrics: Recall@1, Recall@5

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 result

📏 Metrics: Average F1

KITTI360pose

The KITTI360Pose dataset encompasses a total area of 15.51 square kilometers across nine urban regions, consisting of 43,381 point cloud- …

📊 5 results

📏 Metrics: Localization Recall@1

MSLS

The largest and most diverse dataset for lifelong place recognition from image sequences in urban and suburban settings.

📊 3 results

📏 Metrics: Recall@1, Recall@5

Nordland

📊 13 results

📏 Metrics: Recall@1, Recall@5, Recall@10

Nordland* (2760 queries)

The Nordland variant used in SALAD and BoQ (2,760 queries, 27,592 reference images, threshold: 1 frame).

📊 4 results

📏 Metrics: Recall@1, Recall@5, Recall@10
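
The Recall@K metrics listed for this entry can be illustrated with a minimal sketch. It assumes (this is not stated in the entry) that query and reference frames share one index space, so that a retrieved reference counts as correct when its frame index lies within the threshold of the query's index. The function name `recall_at_k` and the toy rankings are hypothetical and only serve the illustration.

```python
from typing import Sequence


def recall_at_k(
    ranked_refs: Sequence[Sequence[int]],
    query_frames: Sequence[int],
    k: int,
    tolerance: int = 1,
) -> float:
    """Fraction of queries with a correct reference among the top-k results.

    Assumption: a reference is correct when its frame index is within
    `tolerance` frames of the query's frame index.
    """
    hits = 0
    for frame, ranking in zip(query_frames, ranked_refs):
        # Only the k best-ranked reference frames are considered.
        if any(abs(ref - frame) <= tolerance for ref in ranking[:k]):
            hits += 1
    return hits / len(query_frames)


# Toy data: four queries, each with three retrieved reference frames,
# ordered from best to worst match.
query_frames = [10, 20, 30, 40]
ranked_refs = [
    [10, 55, 3],
    [99, 21, 7],
    [5, 6, 33],
    [80, 81, 41],
]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(ranked_refs, query_frames, k):.2f}")
```

Recall@K can only stay equal or grow as K increases, which is why leaderboards usually report it at several cutoffs such as 1, 5, and 10.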

Oxford RobotCar Dataset

The Oxford RobotCar Dataset contains over 100 repetitions of a consistent route through Oxford, UK, captured over a period of …

📊 7 results

📏 Metrics: Recall@1

SF-XL Night

📊 2 results

📏 Metrics: Recall@1, Recall@5, Recall@10

SF-XL Occlusion

📊 2 results

📏 Metrics: Recall@1, Recall@5, Recall@10

SF-XL test v1

Test set version 1 for the San Francisco eXtra Large (SF-XL) dataset.

📊 5 results

📏 Metrics: Recall@1, Recall@10, Recall@5

SF-XL test v2

Test set version 2 for the San Francisco eXtra Large (SF-XL) dataset.

📊 5 results

📏 Metrics: Recall@1, Recall@5, Recall@10

SVOX

📊 2 results

📏 Metrics: Recall@1, Recall@5, Recall@10

San Francisco Landmark Dataset

The San Francisco Landmark Dataset contains a database of 1.7 million images of buildings in San Francisco with ground truth …

📊 3 results

📏 Metrics: Recall@1, Recall@10, Recall@5

Visual Question Answering

📏 Metrics: Top-1 Accuracy

Visual Reasoning

Bongard-OpenWorld

Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. We hope it can help us better …

📊 9 results

📏 Metrics: 2-Class Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 result

📏 Metrics: 1-of-100 Accuracy

NLVR

NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset …

📊 1 result

📏 Metrics: Accuracy (Dev), Accuracy (Test-P), Accuracy (Test-U)

VASR

Visual Analogies of Situation Recognition (VASR) is a dataset for visual analogical mapping, adapting the classical word-analogy task into the …

📊 4 results

📏 Metrics: 1:1 Accuracy

VSR

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial …

📊 5 results

📏 Metrics: accuracy

WinoGAViL

This dataset was gathered via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …

📊 8 results

📏 Metrics: Jaccard Index

Winoground

Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two …

📊 105 results

📏 Metrics: Text Score, Image Score, Group Score

Visual Relationship Detection

VRD

The Visual Relationship Dataset (VRD) contains 4000 images for training and 1000 for testing annotated with visual relationships. Bounding boxes …

📊 1 result

📏 Metrics: R@50 k=1

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 1 result

📏 Metrics: R@100, R@50, mR@100, mR@50

Visual Social Relationship Recognition

PIPA

The PIPA database is collected from Flickr photo albums for the task of person recognition. Then the dataset is extended …

📊 6 results

📏 Metrics: Accuracy, Accuracy (domain)

PISC

The People in Social Context (PISC) dataset focuses on social relationships. It consists of 22,670 images …

📊 5 results

📏 Metrics: mAP, mAP (Coarse)

Visual Speech Recognition

LRS2

The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …

📊 2 results

📏 Metrics: Word Error Rate (WER)

LRS3-TED

LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …

TrackingNet

TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos split …

📊 1 result

📏 Metrics: ACCURACY, Normalized Precision

Weakly Supervised Action Localization

BEOID

The BEOID dataset includes object interactions ranging from preparing a coffee to operating a weight lifting machine and opening a …

📊 5 results

📏 Metrics: [email protected]:0.7, [email protected]

FineAction

FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges …

📊 4 results

📏 Metrics: mAP, mAP [email protected], mAP [email protected], mAP [email protected]

GTEA

The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities such as making sandwich, tea, or coffee. …

📊 5 results

📏 Metrics: [email protected]:0.7, [email protected]

THUMOS14

The THUMOS14 (THUMOS 2014) dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for …

📊 12 results

📏 Metrics: avg-mAP (0.3-0.7), avg-mAP (0.1:0.7), avg-mAP (0.1-0.5)

Weakly-supervised Temporal Action Localization

UCF101-24

📊 1 result

📏 Metrics: [email protected]

Zero Shot Segmentation

Segmentation in the Wild

Recent advances in language-image pre-training have given rise to the emerging field of building transferable systems that can effortlessly adapt to a …

📊 12 results

📏 Metrics: Mean AP

Zero-Shot Action Recognition

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 4 results

📏 Metrics: Top-1 Accuracy

Charades

The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving …

📊 4 results

📏 Metrics: mAP

HMDB51

The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset …

📊 24 results

📏 Metrics: Top-1 Accuracy, Top-5 Accuracy, Accuracy

Kinetics

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 …

📊 16 results

📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 27 results

📏 Metrics: Top-1 Accuracy, Top-5 Accuracy

Zero-Shot Image Classification

Country211

Country211 is a dataset released by OpenAI, designed to assess the geolocation capability of visual representations. It filters the YFCC100m …

An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …

📊 1 result

📏 Metrics: Accuracy

Zero-Shot Semantic Segmentation

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 13 results

📏 Metrics: Transductive Setting hIoU, Inductive Setting hIoU

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 11 results

📏 Metrics: Transductive Setting hIoU, Inductive Setting hIoU

Zero-Shot Transfer Image Classification

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 5 results

📏 Metrics: Top 1 Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 20 results

📏 Metrics: Param, Accuracy (Private), Accuracy (Public)

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 12 results

📏 Metrics: Accuracy (Private), Accuracy (Public)

ImageNet-R

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and …

📊 11 results

📏 Metrics: Accuracy

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 1 result

📏 Metrics: Accuracy (Private), Top 5 Accuracy

ImageNet-Sketch

ImageNet-Sketch data set consists of 50,889 images, approximately 50 images for each of the 1000 ImageNet classes. The data set …

📊 6 results

📏 Metrics: Accuracy (Private)

ObjectNet

ObjectNet is a test set of images collected directly using crowd-sourcing. ObjectNet is unique as the objects are captured at …

📊 9 results

📏 Metrics: Accuracy (Private), Accuracy (Public), Top 5 Accuracy

SUN

When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow …

📊 3 results

📏 Metrics: Accuracy

Zero-Shot Video Retrieval

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 12 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, video-to-text R@1, video-to-text R@5, video-to-text R@10

DiDeMo

The Distinct Describable Moments (DiDeMo) dataset is one of the largest and most diverse datasets for the temporal localization of …

📊 26 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, video-to-text R@1, video-to-text R@5, video-to-text R@10, text-to-video Median Rank, video-to-text Median Rank

LSMDC

This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …

📊 16 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, which consists of 10,000 …

📊 41 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank

MSVD

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers …

📊 14 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank

VATEX

VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has …

📊 5 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, video-to-text R@1, video-to-text R@5, video-to-text R@10

YouCook2

YouCook2 is the largest task-oriented, instructional video dataset in the vision community. It contains 2000 long untrimmed videos from 89 …

📊 8 results

📏 Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Mean Rank, text-to-video Median Rank

inverse tone mapping

VDS dataset: Multi exposure stack-based inverse tone mapping

Seven multi-exposure ground-truth images are needed, satisfying EV 0, ±1, ±2, ±3 for static scenes. * 96 …

📊 5 results

📏 Metrics: HDR-VDP-2, HDR-VDP-3, PU21-PSNR, PU21-SSIM, Reinhard's TMO-PSNR, Kim and Kautz TMO-PSNR

regression

California Housing Prices

Median house prices for California districts derived from the 1990 census. This is the dataset used in …

📊 3 results

📏 Metrics: R2 Score, lambda

Car_Price_Prediction

In this dataset we added [Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower …

📊 1 result

📏 Metrics: R Squared

Concrete Compressive Strength

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age …

📊 3 results

📏 Metrics: R2 Score, lambda

Medical Cost Personal Dataset

This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. …

📊 3 results

📏 Metrics: R2 Score, lambda

self-supervised scene text recognition

TextSeg

TextSeg is a large-scale, finely annotated, multi-purpose text detection and segmentation dataset, collecting scene and design text with six types …

📊 1 result

📏 Metrics: IoU (%)

TextZoom

TextZoom is a super-resolution dataset that consists of paired Low Resolution – High Resolution scene text images. The images are …

📊 1 result

📏 Metrics: Average PSNR (dB), SSIM

video narration captioning

Shot2Story20K

A short video clip may contain a progression of multiple events and an interesting story line. A human needs to …

📊 1 result

📏 Metrics: BLEU-4, CIDEr, METEOR, ROUGE

Machine Learning Benchmarks

1 Image, 2*2 Stitchi

10-shot image generation

16k

2D Human Pose Estimation

2D Object Detection

2D Panoptic Segmentation

2D Pose Estimation

2D Semantic Segmentation

2D Semantic Segmentation task 3 (25 classes)

3D Absolute Human Pose Estimation

3D Action Recognition

3D Anomaly Detection

3D Canonical Hand Pose Estimation

3D Classification

3D Depth Estimation

3D Face Animation

3D Face Modelling

3D Face Reconstruction

3D Hand Pose Estimation

3D Human Pose Estimation

3D Instance Segmentation

3D Multi-Object Tracking

3D Multi-Person Pose Estimation

3D Multi-Person Pose Estimation (absolute)

3D Multi-Person Pose Estimation (root-relative)

3D Object Captioning

3D Object Detection

3D Object Reconstruction