Machine Learning Datasets

2785 datasets with active ML research benchmarks

All Datasets (A-Z)

1

1 - 111
10,000 People - Human Pose Recognition Data - This dataset …
100STYLE-Labelled - Over 4 million frames of motion capture data for 100 …

2

20000 utterances - 20000 utterances
2000 HUB5 English - **2000 HUB5 English Evaluation Transcripts** was developed by the Linguistic …
2010 i2b2/VA - **2010 i2b2/VA** is a biomedical dataset for relation classification and …
2012 i2b2 Temporal Relations - 2012 i2b2 Temporal Relations Corpus
2018 Data Science Bowl - Find the nuclei in divergent images to advance medical discovery
2024 AI City Challenge - The AI City Challenge, hosted at CVPR 2024, focuses on …
20Newsgroup (10 tasks) - This dataset has 20 classes and each class has about …
20 Newsgroups - The 20 Newsgroups data set is a collection of approximately …
20NewsGroups - The 20 Newsgroups data set is a collection of approximately …

3

300W - 300 Faces-In-The-Wild
3D AffordanceNet - 3D AffordanceNet is a dataset of 23k shapes for visual …
3D-BSLS-6D - 3D scans of Bins by Structured-Light Scanner for 6D pose estimation
3DOH50K - 3DOH50K is the first real 3D human dataset for the …
3D Platelet EM - Platelet Electron Microscopy
3DPW - The **3D Poses in the Wild dataset** is the first …
3DSSG - 3DSSG provides 3D semantic scene graphs for 3RScan. A semantic …
3RScan - A novel dataset and benchmark, which features 1482 RGB-D scans …

4

4D-DRESS - A 4D Dataset of Real-world Human Clothing with Semantic Annotations
4D Light Field Dataset - 4D Light Field Dataset is a light field benchmark consisting …
4DMatch - A benchmark for matching and registration of partial point clouds …
4D-OR - 4D-OR includes a total of 6734 scenes, recorded by six …

5

50 Salads - Activity recognition research has shifted focus from distinguishing full-body motion …

9

97 synthetic datasets - 97 synthetic datasets consists of 97 datasets (as illustrated in …

A

A2D - Actor-Action Dataset
A2D Sentences - Sentences for the Actor-Action Dataset (A2D)
A3D - AnAn Accident Detection
Aachen Day-Night v1.1 Benchmark - Aachen Day-Night v1.1 dataset is an extended version of the …
A Ball Collision Dataset (ABCD) - A Ball-Collision Dataset (ABCD) serves as a comprehensive benchmark for …
Abalone - Predicting the age of abalone from physical measurements. The age …
ABCD - Action-Based Conversations Dataset
Abstractive Text Summarization from Fanpage - Fanpage dataset, containing news articles taken from Fanpage. There are …
Abstractive Text Summarization from Il Post - IlPost dataset, containing news articles taken from IlPost. There are …
AbstRCT - Neoplasm - The AbstRCT dataset consists of randomized controlled trials retrieved from …
Abt-Buy - The Abt-Buy dataset for entity resolution derives from the online …
ACDC - Automated Cardiac Diagnosis Challenge
ACDC (Adverse Conditions Dataset with Correspondences) - Adverse Conditions Dataset with Correspondences
ACDC Scribbles - We release expert-made scribble annotations for the medical ACDC dataset …
ACE 2004 - ACE 2004 Multilingual Training Corpus
ACE 2005 - ACE 2005 Multilingual Training Corpus
ACES - A Translation Accuracy Challenge Set
ACI-Bench - Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic …
ACID - Aerial Coastline Imagery Dataset
ACM - Association for Computing Machinery
ACNE04 - The ACNE04 dataset includes 3756 Chinese face images with Acne. …
ACOS - Aspect Category Opinion Sentiment
Acted Facial Expressions In The Wild (AFEW) - Acted Facial Expressions In The Wild (AFEW) is a dynamic …
Action-Camera Parking - The Action-Camera Parking Dataset contains 293 images captured at a …
ActivityNet - The **ActivityNet** dataset contains 200 different types of activities and …
ActivityNet Adverbs - ActivityNet Adverbs is a subset from the ActivityNet dataset with …
ActivityNet Captions - The **ActivityNet Captions** dataset is built on ActivityNet v1.3 which …
ActivityNet-QA - The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 …
ADAM - Adam: automatic detection challenge on age-related macular degeneration
A Dataset of Multispectral Potato Plants Images - The dataset contains aerial agricultural images of a potato field …
ADE20K - The **ADE20K** semantic segmentation dataset contains more than 20K scene-centric …
ADE-OoD - ADE-OoD is a public benchmark for dense out-of-distribution detection in …
ADNI - Alzheimer's Disease NeuroImaging Initiative
AdobeVFR real - Adobe Visual Font Recognition real-world images dataset
AdobeVFR syn - Adobe Visual Font Recognition synthetic dataset
ADORE - A benchmark dataset for machine learning in ecotoxicology
Adult - Data Set Information: Extraction was done by Barry Becker from …
Adult Census Income - adult_census_income
AdversarialQA - We have created three new Reading Comprehension datasets constructed using …
Adverse Drug Events (ADE) Corpus - Development of a benchmark corpus to support the automatic extraction …
AdvGLUE - Adversarial GLUE
Advising Corpus - Advising Corpus is a dataset based on an entirely new …
AE-110k - AliExpress - 110k
AESLC - To study the task of email subject line generation: automatically …
AFAD - Asian Face Age Dataset
AffectNet - A database for facial expression, valence, and arousal computing in the wild
Aff-Wild2 - Aff-Wild2 is a large-scale in-the-wild database and an extension of …
AFHQ - Animal Faces-HQ
AFLW2000-3D - **AFLW2000-3D** is a dataset of 2000 images that have been …
AfriSenti - AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
AgeDB - AgeDB contains 16,488 images of various famous people, such …
AgeGroup Transactions MTPP - Marked Temporal Point Processes on financial transactions data
AGENDA - Abstract GENeration DAtaset
AG News - AG’s News Corpus
AGORA - AGORA is a synthetic human dataset with high realism and …
AG-ReID - Aerial-Ground Person Re-identification
AG-ReID.v2 - Aerial-Ground Person Re-identification
AI2D - AI2 Diagrams
AI2-THOR - AI2-Thor is an interactive environment for embodied AI. It contains …
AIC - AI Challenger
AIDA/testc - AIDA/testc is a new challenging test set for entity linking …
AIDER - Dataset aimed to do automated aerial scene classification of disaster …
AIDERV2 - Aerial Image Dataset for Emergency Response Applications (version 2)
AIDS - **AIDS** is a graph dataset. It consists of 2000 graphs …
AIM-500 - Automatic Image Matting-500
aiMotive Dataset - aiMotive Multimodal Dataset
AIOZ-GDANCE - AIOZ-GDANCE comprises 16.7 hours of whole-body motion and music audio …
AIR - Adverbs in Recipes
AIRS - The **AIRS** (Aerial Imagery for Roof Segmentation) dataset provides a …
AISHELL-1 - AISHELL-1 is a corpus for speech recognition research and building …
AISHELL-2 - AISHELL-2 contains 1000 hours of clean read-speech data from iOS …
AIST++ - **AIST++** is a 3D dance dataset which contains 3D motion …
AI-TOD - Tiny Object Detection in Aerial Images
Alexa Point of View - The **Alexa Point of View** dataset is point of view …
ALG514 - 514 algebra word problems and associated equation systems gathered from …
AlgoPuzzleVQA - We introduce the novel task of multimodal puzzle solving, framed …
Alibaba Cluster Trace - **Alibaba Cluster Trace** captures detailed statistics for the co-located workloads …
AliMeeting - Multi-Channel Multi-Party Meeting Transcription Challenge
AlpacaEval - The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, …
AM-2K - Animal Matting 2,000 Dataset
Amazon Beauty - Amazon Beauty 5-core
Amazon-Book - N/A
Amazon Fashion - This dataset is a subset of the Amazon reviews dataset …
Amazon-Fraud - Multi-relational Graph Dataset for Amazon Fraudulent Account Detection
Amazon-Google - The Amazon-Google dataset for entity resolution derives from the online …
Amazon Men - This dataset is a subset of the Amazon reviews dataset …
Amazon MTPP - Marked Temporal Point Processes on Amazon data
Amazon Photo - Amazon Photo
Amazon Polarity - The Amazon Polarity dataset is a set of reviews from …
amazon-ratings - amazon-ratings is a product co-purchasing network based on data from …
AMIGOS - AMIGOS: A Dataset for Affect, Personality and Mood Research on Individuals and Groups
AMI Meeting Corpus - The **AMI Meeting Corpus** is a **multi-modal data set** comprising …
AMOS - Despite the considerable progress in automatic abdominal multi-organ segmentation from …
AMR3.0 - Abstract Meaning Representation (AMR) Annotation Release 3.0
AmsterTime - AmsterTime: A Visual Place Recognition Benchmark Dataset for Severe Domain Shift
AMZ Computers - amazon_electronics_computers
An Amharic News Text classification Dataset - In NLP, text classification is one of the primary problems …
AND Dataset - The **AND Dataset** contains 13700 handwritten samples and 15 corresponding …
Animal3D - Accurately estimating the 3D pose and shape is an essential …
Animal Kingdom - Animal Kingdom is a large and diverse dataset that provides …
Animal-Pose Dataset - **Animal-Pose Dataset** is an animal pose dataset to facilitate training …
Animals-10 - It contains about 28K medium quality animal images belonging to …
ANLI - Adversarial NLI
AnoShift - AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection
ANTILLES - ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus
AODRaw - Adverse condition Object Detection with RAW images
A-OKVQA - **A-OKVQA** is a crowdsourced visual question answering dataset composed of a …
AP - Adversarial Paraphrase
AP-10K - AP-10K is the first large-scale benchmark for general animal pose …
ApolloCar3D - **ApolloCar3D** is a dataset that contains 5,277 driving images and …
ApolloScape - **ApolloScape** is a large dataset consisting of over 140,000 video …
Apolloscape Inpainting - The **Inpainting** dataset consists of synchronized Labeled image and LiDAR …
Apolloscape Trajectory - Our trajectory dataset consists of camera-based images, LiDAR scanned point …
APPS - Automated Programming Progress Standard
aPY - Attribute Pascal and Yahoo
AQA-7 - Consists of 1106 action samples from seven actions with quality …
AQUAINT - The **AQUAINT** Corpus consists of newswire text data in English, …
AquaTrash - This dataset contains 369 images of Trash used for deep …
ARC (AI2 Reasoning Challenge) - The AI2’s Reasoning Challenge (**ARC**) dataset is a multiple-choice question-answering …
ARCH2S - Dataset, Benchmark for Learning Exterior Architectural Structures from Point Clouds
Argoverse - **Argoverse** is a tracking benchmark with over 30K scenarios collected …
Argoverse 2 - Argoverse 2 (AV2) is a collection of three datasets for …
ArgSciChat - **ArgSciChat** is an argumentative dialogue dataset. It consists of 498 …
Aria Digital Twin Dataset - A real-world dataset, with hyper-accurate digital counterpart & comprehensive ground-truth …
Aria Everyday Objects - A small-scale, real-world Project Aria dataset with high quality static …
Aria Synthetic Environments - **Aria …
Aristo-v4 - Aristo Tuple KB Version 4
ARKitScenes - **ARKitScenes** is an RGB-D dataset captured with the widely available …
ArmanEmo - **ArmanEmo** is a human-labeled emotion dataset of more than 7000 …
ARMBench - **ARMBench** is a large-scale, object-centric benchmark dataset for robotic manipulation …
ARQMath - The goal of ARQMath is to advance techniques for mathematical …
ArtBench-10 (32x32) - We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and …
ArtDL - ArtDL is a novel painting data set for iconography classification …
ArtQuest - The task of Visual Question Answering (VQA) has been studied …
arXiv - For nearly 30 years, ArXiv has served the public and …
arXiv-10 - Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific …
Arxiv HEP-TH citation graph - **Arxiv HEP-TH (high energy physics theory) citation graph** is from …
arXiv Summarization Dataset - This is a dataset for evaluating summarisation methods for research …
ASAP - Aligned Scores and Performances
ASAP-AES - Automated Student Assessment Prize
ASLG-PC12 - English-ASL Gloss Parallel Corpus 2012
ASOS Data - Automated Surface/Weather Observing Systems (ASOS/AWOS) Data
ASQP - Aspect Sentiment Quad Prediction
Assembly101 - Assembly101 is a new procedural activity dataset featuring 4321 videos …
ASSET - ASSET is a new dataset for assessing sentence simplification in …
ASTD - Arabic Sentiment Tweets Dataset
ASTE - Aspect Sentiment Triplet Extraction
Astock - (1) provide financial news for each specific stock. (2) provide …
ATC-GRAPH - ATC-GRAPH is the most extensive ATC benchmark dataset. All drugs …
ATC-SMILES - The benchmark ATC-SMILES is built for ATC classification. ATC-SMILES consists …
ATD-12K - **ATD-12K** is a large-scale animation triplet dataset, which comprises 12,000 …
ATIS - Airline Travel Information Systems
ATIS (vi) - Vietnamese Intent Detection and Slot Filling
ATLANTIS - **ATLANTIS** is a benchmark for semantic segmentation of waterbody images. …
AudioCaps - **AudioCaps** is a dataset of sounds with event descriptions that …
AudioSet - Audioset is an audio event dataset, which consists of over …
AUR & UMB dataset - Anticancer Efficacy of Auraptene & Umbelliprenin: In Vitro Viability Dataset
australian - Statlog (Australian Credit Approval) Data Set
AutoHallusion - Large vision-language models (LVLMs) are prone to hallucinations, where certain …
Autooral dataset - A multi-tasking oral ulcer dataset (Autooral dataset) is proposed. Autooral …
AUTSL - Ankara University Turkish Sign Language Dataset
AVA - Atomic Visual Actions
AVA-Speech - Contains densely labeled speech activity in YouTube videos, with the …
AVD - Active Vision Dataset
AVeriTeC - AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web
AviationQA - AviationQA was introduced in the paper titled: There is No …
AVisT - A Benchmark for Visual Object Tracking in Adverse Visibility
AVSD - Audio-Visual Scene-Aware Dialog
AwA2 - Animals with Attributes 2
AWARE - AWARE: Aspect-Based Sentiment Analysis Dataset of Apps Reviews for Requirements Elicitation

B

BA - Binary Alphabet
BACE (β-secretase enzyme) - The BACE dataset focuses on inhibitors of human beta-secretase 1 …
BAIR Robot Pushing - Dataset of 64x64 images of a robot pushing objects on …
bajer_danish_misogyny - Bajer Online Misogyny
Bala-Copa - Balanced-COPA
Ballroom - This data set includes beat and bar annotations of the …
Bamboogle - The Bamboogle dataset is a collection of questions that was …
BanglaBook - Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
BanglaLekhaImageCaptions - This dataset consists of images and annotations in Bengali. The …
Banglish - A Bilingual Dataset for Bangla and English Voice Commands Colloquial …
BANKING77 - Dataset composed of online banking queries annotated with their corresponding …
BAR - Biased Action Recognition
Basketball - NUST-NBA181
BBAI Dataset - Black-box Agent Integration
BBBP (Blood-Brain Barrier Penetration) - The BBBP dataset comes from a study focused on modeling …
BBH - BIG-Bench Hard
BC2GM - Created by Smith et al. in 2008, the BioCreative II …
BC4CHEMD - BioCreative IV Chemical compound and drug name recognition
BC5CDR - BioCreative V CDR corpus
BC7 NLM-Chem - BioCreative VII NLM-Chem
Bc8 - Bc8BioRED
BCI - Breast Cancer Immunohistochemical Image Generation
BCI Competition IV: ECoG to Finger Movements - Prediction of Finger Flexion IV Brain-Computer Interface Data Competition The …
BDD100K - Datasets drive vision progress, yet existing driving datasets are impoverished …
BEAT2 - BEAT-SMPLX-FLAME
Beatles - This dataset includes the beat and downbeat annotations for Beatles …
BEDLAM - **BEDLAM** is a large-scale synthetic video dataset designed to train …
Bee4Exp Honeybee Detection - A dataset for flying honeybee detection introduced in ["A Method …
BeerAdvocate - BeerAdvocate is a dataset that consists of beer reviews from …
BEHAVE - BEHAVE is a full body human-object interaction dataset with multi-view …
Beijing Traffic - The Beijing Traffic Dataset collects traffic speeds at 5-minute granularity …
Belfort - The Belfort dataset: Handwritten Text Recognition from Crowdsourced Annotations
BenchIE - BenchIE: a benchmark and evaluation framework for comprehensive evaluation of …
BenchLMM - BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Benchmark for AMR Metrics based on Overt Objectives - Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the …
Bengali Ekman's Six Basic Emotions Corpus - The dataset contains 36000 Bangla data based on Ekman's six …
Bentham - Bentham project
BEOID - Bristol Egocentric Object Interactions Dataset
BERSt - Basic Emotion Random phrase Shouts
BGP - Border Gateway Protocol (BGP) Network
BiasBios - Bias in Bios
Biase et al - Source: [Cell fate inclination within 2-cell and 4-cell mouse embryos …
BIG - A high-resolution semantic segmentation dataset with 50 validation and 100 …
BIG-bench - Beyond the Imitation Game Benchmark
BigEarthNet - BigEarthNet consists of 590,326 Sentinel-2 image patches, each of which …
BigPatent - Consists of 1.3 million records of U.S. patent documents along …
BillSum - BillSum is the first dataset for summarization of US Congressional …
Binarized MNIST - A binarized version of MNIST. Source: [Binarized MNIST](http://www.dmi.usherb.ca/~larocheh/mlpython/_modules/datasets/binarized_mnist.html)
BindingDB - The Binding Database
Bio - Bio AMR Corpus
BioASQ - Biomedical Semantic Indexing and Question Answering
BioNLI - Biomedical Natural Language Inference
BioRED - BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple …
BIOSCAN_1M_Insect Dataset - In an effort to catalog insect biodiversity, we propose a …
BIOSSES - Biomedical Semantic Similarity Estimation System
BIPED - Barcelona Images for Perceptual Edge Detection
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) - BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents …
Birdsnap - **Birdsnap** is a large bird dataset consisting of 49,829 images …
BKAI-IGH NeoPolyp-Small - This dataset contains 1200 images (1000 WLI images and 200 …
BLEFF - Blender Forward Facing Dataset
Blizzard Challenge 2013 - Blizzard Challenge 2013 - English language tasks
BLURB - Biomedical Language Understanding and Reasoning Benchmark
BN-AuthProf - Bangla Author Profiling Dataset
BOBSL - BBC-Oxford British Sign Language
Bongard-HOI - Bongard-HOI tests to what extent your few-shot visual learner can …
Bongard-OpenWorld - Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning …
Books3 - The **Books3 dataset** emerged as part of a broader effort …
BookSum - **BookSum** is a collection of datasets for long-form narrative summarization. …
BoolQ - Boolean Questions
BorealTC - Boreal Terrain Classification Dataset
BottleCap - The BottleCap dataset contains over 1100 color images and 7 …
Box-IS - RGB-D instance segmentation box dataset. The Box-IS dataset was created …
BP4D - The **BP4D**-Spontaneous dataset is a 3D video database of spontaneous …
BRACE - The Breakdancing Competition Dataset for Dance Motion Synthesis
Brain Tumor MRI Dataset - This dataset is a combination of the following three datasets …
Brain US - This brain anatomy segmentation dataset has 1300 2D US scans …
Brazil Air-Traffic - Brazil Air-Traffic
Breakfast - The Breakfast Actions Dataset
BreakHis - Breast Cancer Histopathological Database
BreastDICOM4 - [MIMBCD-UI] UTA4: Medical Imaging DICOM Files Dataset
BRIND - BSDS-RIND
BrnoCompSpeed - The dataset contains 21 full-HD videos, each around 1 hr …
Broad Twitter Corpus - This paper introduces the Broad Twitter Corpus (BTC), which is …
BSARD - Belgian Statutory Article Retrieval Dataset
BSDS500 - Berkeley Segmentation Dataset 500
BS-RSC - BS-RSC is a real-world rolling shutter (RS) correction dataset and …
BTAD - beanTech Anomaly Detection
BTS3.1 - Expanding Accurate Person Recognition to New Altitudes and Ranges: The BRIAR Dataset
Bukva - Bukva: Russian Sign Language Alphabet
Burned Area Delineation from Satellite Imagery - A Dataset for Burned Area Delineation and Severity Estimation from Satellite Imagery
Burr classification images - Original images and images with RUSTICO filters applied. Also a …

C

C2A: Human Detection in Disaster Scenarios - Combination to Application
C4 - Colossal Clean Crawled Corpus
CACD - Cross-Age Celebrity Dataset
CaDIS - Cataract Dataset for Image Segmentation
CAER - Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, Kwanghoon Sohn; …
CAER-Dynamic - 13,201 clips from 79 TV shows. Each video clip was …
CaFFe - CAlving Fronts and where to Find thEm
CAIS - Chinese Artificial Intelligence Speakers
CALFW - Cross-Age LFW
California Housing Prices - Median house prices for California districts derived from the 1990 …
Caltech-101 - The Caltech101 dataset contains images from 101 object categories (e.g., …
Caltech-256 - **Caltech-256** is an object recognition dataset containing 30,607 real-world images, …
Caltech Pedestrian Dataset - The Caltech Pedestrian Dataset consists of approximately 10 hours of …
CALVIN - Composing Actions from Language and Vision
Cam2BEV - The [dataset](https://gitlab.ika.rwth-aachen.de/cam2bev/cam2bev-data) contains two subsets of synthetic, semantically segmented road-scene …
CAMELYON16 - Cancer Metastases in Lymph Nodes Challenge 2016
CAMO - Camouflaged Object
CAMO-FS - CAMO-FS Dataset comes with the paper entitled The Art of …
Camouflaged Animal Dataset - The nine (moving camera) videos in this benchmark exhibit camouflaged …
CamVid - Cambridge-driving Labeled Video Database
CANARD - A Dataset for Question-in-Context Rewriting
Candombe - Candombe Recordings Dataset
Canon RAW Low Light - Canon Camera Low Light RAW Image Dataset
CapMIT1003 - The CapMIT1003 database contains captions and clicks collected for images …
CaRB - Crowdsourced automatic open Relation extraction Benchmark
CarFusion - We provide manual annotations of 14 semantic keypoints for 100,000 …
CARLA - Car Learning to Act
CARPK - car parking lot dataset
Car_Price_Prediction - Second_Hand-Car_Price_Prediction
CARS196 - CARS196 is composed of 16,185 car images of 196 classes.
CaseHOLD - Case Holdings On Legal Decisions
CASIA (OSN-transmitted - Facebook) - This dataset is an OSN-transmitted (OSN = Online Social Network) …
CASIA (OSN-transmitted - Wechat) - This dataset is an OSN-transmitted (OSN = Online Social Network) …
CASIA (OSN-transmitted - Weibo) - This dataset is an OSN-transmitted (OSN = Online Social Network) …
CASIA (OSN-transmitted - Whatsapp) - This dataset is an OSN-transmitted (OSN = Online Social Network) …
Casia V1+ - Casia V1 is a dataset for forgery classification. Casia V1+ …
CASIA-WebFace+masks - The COVID-19 pandemic raises the problem of adapting face recognition …
CAS-VSR-S101 - A new large-scale, in-the-wild Mandarin dataset, CAS-VSR-S101, with 101.1 hours …
CAS-VSR-W1k (LRW-1000) - **LRW-1000 has been renamed as CAS-VSR-W1k.** It is a naturally-distributed …
CAT2000 - Includes 4000 images; 200 from each of 20 categories covering …
catbAbI LM-mode - concatenated-bAbI
catbAbI QA-mode - concatenated-bAbI
CATER - Rendered synthetically using a library of standard 3D objects, and …
CatFLW - The Cat Facial Landmarks in the Wild (CatFLW) dataset contains …
CATH 4.2 - The CATH (Class, Architecture, Topology, Homology) [65] database is a …
CATH 4.3 - The CATH (Class, Architecture, Topology, Homology) [65] database is a …
Cats and Dogs - A large set of images of cats and dogs. Homepage: …
CATT - CATT Arabic Diacritization Benchmark Dataset
Causal3DIdent - Update on 3DIdent, where we introduce six additional object classes …
CausalGym - SyntaxGym, adapted for interventional interpretability.
CBVS - A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search …
CC3M-TagMask - The dataset offers tag and mask annotations for image-text pairs …
CCGbank - **CCGbank** is a translation of the Penn Treebank into a …
CCTSDB2021 - Traffic signs are one of the most important pieces of information that …
CCTSDB-AUG - From CCSPNet-Joint, IJCNN 2024
CCVID - Clothes-Changing Video person re-ID
CD18 - Cellphone Dataset with 18 Features
CDCP - Cornell eRulemaking Corpus
CDD-11 - Composite Degradation Dataset 11
CDD Dataset (season-varying) - Source: [CHANGE DETECTION IN REMOTE SENSING IMAGES USING CONDITIONAL ADVERSARIAL …
CDR - BioCreative V CDR Task Corpus
CEDAR Signature - CEDAR Signature is a database of off-line signatures for signature …
C# EditCompletion - We scraped the 53 most popular C# repositories from GitHub …
CelebA - CelebFaces Attributes Dataset
CelebA-HQ - The **CelebA-HQ** dataset is a high-quality version of CelebA that …
CelebAMask-HQ - **CelebAMask-HQ** is a large-scale face image dataset that has 30,000 …
CelebA+masks - The COVID-19 pandemic raises the problem of adapting face recognition …
Cell - The CELL benchmark is made of fluorescence microscopy images of …
CellTypeGraph Benchmark - Classifying all cells in an organ is a relevant and …
CEMS-W - CEMS Wildfires
CeyMo - CeyMo is a novel benchmark dataset for road marking detection …
CFC-DAOD - Caltech Fish Counting – Domain Adaptive Object Detection
CFQ - Compositional Freebase Questions
CHAD - Charlotte Anomaly Dataset
ChAII - Hindi and Tamil Question Answering - The dataset covers Hindi and Tamil, collected without the use …
Chairs - The **Chairs** dataset contains rendered images of around 1000 different …
Chalearn-AutoML-1 - This meta-dataset is first used in the AutoML1 challenge organized …
Chameleon (48%/32%/20% fixed splits) - Node classification on Chameleon with the fixed 48%/32%/20% splits provided …
Chameleon (60%/20%/20% random splits) - Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.
ChangeSim - **ChangeSim** is a dataset aimed at online scene change detection …
ChangeVPR - Scene change detection (SCD) dataset tailored for generalizable SCD algorithm. …
Chaoyang - Chaoyang dataset contains 1111 normal, 842 serrated, 1404 adenocarcinoma, 664 …
Charades - The **Charades** dataset is composed of 9,848 videos of daily …
Charades-Ego - Contains 68,536 activity instances in 68.8 hours of first and …
Charades-STA - Charades-STA is a new dataset built on top of Charades …
ChartQA - Charts are very popular for analyzing data. When exploring charts, …
CHASE_DB1 - **CHASE_DB1** is a dataset for retinal vessel segmentation which contains …
CHB-MIT - CHB-MIT Scalp EEG
ChEBI-20 - Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. …
CheGeKa - CheGeKa is a Jeopardy!-like Russian QA dataset collected from the …
ChemProt - **ChemProt** consists of 1,820 PubMed abstracts with chemical-protein interactions annotated …
ChesapeakeRSC - Chesapeake Roads Spatial Context
Chest wall lung sound dataset - Annotated audio files (separate combined annotation file) of lung sounds …
ChestX-ray14 - **ChestX-ray14** is a medical imaging dataset which comprises 112,120 frontal-view …
Chest X-ray images - chest X-ray images for pneumonia detection
CheXpert - The **CheXpert** dataset contains 224,316 chest radiographs of 65,240 patients …
CheXphoto - CheXphoto is a competition for x-ray interpretation based on a …
ChicagoFSWild - This is the home of a collaborative data collection effort …
ChicagoFSWild+ - This is the home of a collaborative data collection effort …
Chikusei Dataset - Airborne hyperspectral data taken over Chikusei
Children's Book Test
CHILI-100K - The CHILI-100K dataset is a large-scale graph dataset (with overall …
CHILI-3K - The CHILI-3K dataset is a medium-scale graph dataset (with overall …
ChinaOpen-1k - ChinaOpen is a new video dataset targeted at open-world multimodal …
CHIP-CTC - CHIP Clinical Trial Classification, a dataset aimed at classifying clinical …
CHIP-STS - Semantic Textual Similarity Dataset
CHOCOLATE - Captions Have Often ChOsen Lies About The Evidence
Cholec80 - Surgical Workflow Dataset
Ciao - The **Ciao** dataset contains rating information of users given to …
CICIDS2017 - Intrusion Detection Evaluation Dataset (CIC-IDS2017)
CID - Campus Image Dataset
CIFAKE: Real and AI-Generated Synthetic Images - The quality of AI-generated images has rapidly increased, leading to …
CIFAR-10 - The CIFAR-10 database (Canadian Institute For Advanced Research database) is …
CIFAR-100 - The **CIFAR-100** dataset (Canadian Institute for Advanced Research, 100 classes) …
CIFAR-10C - Common corruptions dataset for CIFAR10
CIFAR10-DVS - **CIFAR10-DVS** is an event-stream dataset for object classification. 10,000 frame-based …
CIHP - Crowd Instance-level Human Parsing
CINIC-10 - **CINIC-10** is a dataset for image classification. It has a …
CirCor DigiScope - CirCor DigiScope is currently the largest pediatric heart sound dataset. …
CIRR - Compose Image Retrieval on Real-life images
Citeseer - The CiteSeer dataset consists of 3312 scientific publications classified into …
Citeseer (48%/32%/20% fixed splits) - Node classification on Citeseer with the fixed 48%/32%/20% splits provided …
CiteSum - CiteSum is a large-scale scientific extreme summarization benchmark.
CityFlow - CityFlow is a city-scale traffic camera dataset consisting of more …
CityPersons - The **CityPersons** dataset is a subset of Cityscapes which only …
Cityscapes - **Cityscapes** is a large-scale database which focuses on semantic understanding …
Cityscapes 3D - Detecting vehicles and representing their position and orientation in the …
Cityscapes Panoptic Parts - The Cityscapes Panoptic Parts dataset introduces part-aware panoptic segmentation annotations …
Cityscapes VIPriors subset - The training and validation data are subsets of the training …
CK+ - Extended Cohn-Kanade dataset
classification benchmark - This benchmark includes 11 image classification datasets that were used …
CLCD - Cropland-CD
CLCXray - Cutters and Liquid Containers X-ray Dataset
Clear Weather - DENSE
CLEVR - Compositional Language and Elementary Visual Reasoning
CLEVR-Humans - We collect a new dataset of human-posed free-form natural language …
CLEVR-Ref+ - CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. …
ClevrTex - **ClevrTex** is a new benchmark designed as the next challenge …
CLEVR-X - **CLEVR-X** is a dataset that extends the CLEVR dataset with …
CliCR - CliCR is a new dataset for domain specific reading comprehension …
Climabench - The topic of Climate Change (CC) has received limited attention …
CLINC150 - This dataset is for evaluating the performance of intent classification …
Clinical Admission Notes from MIMIC-III - This dataset is created from **MIMIC-III** ([Medical Information Mart for …
clintox - The ClinTox dataset compares drugs approved by the FDA and …
Clipart1k - In Clipart1k, the target domain classes to be detected are …
ClonedPerson - The ClonedPerson dataset is a large-scale synthetic person re-identification dataset …
Clothing1M - **Clothing1M** contains 1M clothing images in 14 classes. It is …
Clothing Attributes Dataset - We introduce the Clothing Attribute Dataset for promoting research in …
Clotho - **Clotho** is an audio captioning dataset, consisting of 4981 audio …
CLUSTER - CLUSTER is a node classification task generated with [Stochastic Block …
Cluttered Omniglot - Dataset for one-shot segmentation. Source: One-Shot Segmentation in Clutter
CMeEE - Chinese Medical Named Entity Recognition Dataset
CMeIE - Chinese Medical Information Extraction Dataset
CMU-MOSEI - CMU Multimodal Opinion Sentiment and Emotion Intensity (**CMU-MOSEI**) is the …
CN-CELEB - CN-Celeb is a large-scale speaker recognition dataset collected in the …
CNN/Daily Mail - **CNN/Daily Mail** is a dataset for text summarization. Human generated …
Coastal Inundation Maps with Floodwater Depth Values - Simulated Flood Inundation Maps of Abu Dhabi's Coast Under Different Shoreline Protection Scenarios
CoAuthor - **CoAuthor** is a dataset designed for revealing GPT-3's capabilities in …
CochlScene - **CochlScene** is a dataset for acoustic scene classification. The dataset …
COCO 10% labeled data - Semi-Supervised Object Detection on COCO 10% labeled data
COCO Captions - COCO Captions contains over one and a half million captions …
COCO-CN - COCO-CN is a bilingual image description dataset enriching MS-COCO with …
COCO (Common Objects in Context) - Common Objects in Context
COCO-MLT - The COCO-MLT is created from MS COCO-2017, containing 1,909 images …
COCO-N Medium - COCO-N Medium introduces a stochastic benchmark that simulates common real-world …
COCO-O - COCO-O(ut-of-distribution) contains 6 domains (sketch, cartoon, painting, weather, handmake, tattoo) …
COCO-OOC - COCO-OOC goes beyond standard object detection to ask the question: …
COCO-Stuff - Common Objects in COntext-stuff
COCO-Text - The **COCO-Text** dataset is a dataset for text detection and …
COCO-WholeBody - **COCO-WholeBody** is an extension of [COCO](/dataset/coco) dataset with whole-body annotations. …
CODAH - COmmonsense Dataset Adversarially-authored by Humans
CodeContests - CodeContests is a competitive programming dataset for machine-learning. This dataset …
CoDesc - **CoDesc** is a large dataset of 4.2m Java source code …
CodeSearchNet - The **CodeSearchNet** Corpus is a large dataset of functions with …
CoDEx Large - CoDEx comprises a set of knowledge graph completion datasets extracted …
CoDEx Medium - CoDEx comprises a set of knowledge graph completion datasets extracted …
CoDEx Small - CoDEx comprises a set of knowledge graph completion datasets extracted …
COESOT - In this work, we propose a general dataset for Color-Event …
COFAR - Commonsense and Factual Reasoning in Image Search
COFW - Caltech Occluded Faces in the Wild
COIN - The **COIN** dataset (a large-scale dataset for COmprehensive INstructional video …
CoIR - Code Information Retrieval Benchmark
CoLA - Corpus of Linguistic Acceptability
COLLAB - **COLLAB** is a scientific collaboration dataset. A graph corresponds to …
ColonINST-v1 (Seen) - ColonINST is a large-scale instruction tuning dataset designed for multimodal …
ColonINST-v1 (Unseen) - ColonINST is a large-scale instruction tuning dataset designed for multimodal …
Colored-MNIST(with spurious correlation) - This is a dataset with spurious correlations which can be …
Color FERET - The color FERET database is a dataset for face recognition. …
Colors - A large dataset of color names and their respective RGB …
Columbia (OSN-transmitted - Facebook) - This dataset is an OSN-transmitted (Online Social Network) version of …
Columbia (OSN-transmitted - Wechat) - This dataset is an OSN-transmitted (Online Social Network) version of …
Columbia (OSN-transmitted - Weibo) - This dataset is an OSN-transmitted (Online Social Network) version of …
Columbia (OSN-transmitted - Whatsapp) - This dataset is an OSN-transmitted (Online Social Network) version of …
COMA - CoMA contains 17,794 meshes of the human face in various …
Comic2k - **Comic2k** is a dataset used for cross-domain object detection which …
CommitmentBank - The CommitmentBank is a corpus of 1,200 naturally occurring discourses …
CommonGen - CommonGen is constructed through a combination of crowdsourced and existing …
Common Objects in 3D - Common Objects in 3D is a large-scale dataset with real …
CommonsenseQA - CSQA
Common Voice - **Common Voice** is an audio dataset that consists of a …
CompCars - Comprehensive Cars
Completion3D - The Completion3D benchmark is a dataset for evaluating state-of-the-art 3D …
Complex-CronQuestions - A filtered version of CronQuestions which can better demonstrate …
ComplexWebQuestions - ComplexWebQuestions is a dataset for answering complex questions that require …
Composition-1K - Composition-1K is a large-scale image matting dataset including 49300 training …
CoNaLa - CMU CoNaLa, the Code/Natural Language Challenge
CoNaLa-Ext - CoNaLa Extended With Question Text
ConceptNet - ConceptNet is a knowledge graph that connects words and phrases …
Conceptual Captions - Automatic image captioning is the task of producing a natural-language …
CONCODE - A new large dataset with over 100,000 examples consisting of …
Concrete Compressive Strength - Concrete is the most important material in civil engineering. The …
Condensed Movies - A large-scale video dataset, featuring clips from movies with detailed …
ConditionalQA - ConditionalQA is a Question Answering (QA) dataset that contains complex …
CoNLL++ - CoNLL++ is a corrected version of the CoNLL03 NER dataset …
CoNLL04 - The CoNLL04 dataset is a benchmark dataset used for relation …
CoNLL 2003 - **CoNLL-2003** is a named entity recognition dataset released as a …
CoNLL-2009 - The task builds on the CoNLL-2008 task and extends it …
CoNLL-2020 - CoNLLpp
CoNSeP - Colorectal Nuclear Segmentation and Phenotypes
Consumer Spendings - Finance > US Economy > Consumer Spendings
Contract Discovery - A new shared task of semantic retrieval from legal texts, …
ConvAI2 - Conversational Intelligence Challenge 2
ConvFinQA - Conversational Finance Question Answering
COPA - Choice of Plausible Alternatives
Copel-AMR - This dataset contains 12,500 meter images acquired in the field …
CoQA - Conversational Question Answering Challenge
Cora - The **Cora** dataset consists of 2708 scientific publications classified into …
Cora (48%/32%/20% fixed splits) - Node classification on Cora with the fixed 48%/32%/20% splits provided …
CORBEL - Conveyor belt pressure signal dataset
CORD - Consolidated Receipt Dataset for Post-OCR Parsing
CORD-19 - CORD-19 is a free resource of tens of thousands of …
CORD-r - We introduce FUNSD-r and CORD-r in [Token Path Prediction](https://arxiv.org/abs/2310.11016), the …
CORE-MM - CORE-MM is an Open-ended VQA benchmark dataset specifically designed for …
Cornell (48%/32%/20% fixed splits) - Node classification on Cornell with the fixed 48%/32%/20% splits provided …
Cornell (60%/20%/20% random splits) - Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.
Countix - Countix is a real world dataset of repetition videos collected …
Country211 - Country211 is a dataset released by OpenAI, designed to assess …
Coveo Data Challenge Dataset - The 2021 SIGIR workshop on eCommerce is hosting the Coveo …
COVERAGE - Copy-Move Forgery Database with Similar but Genuine Objects
COVID-19 Fake News Dataset - COVID19 Fake News Detection in English
COVID-19 Image Data Collection - Contains hundreds of frontal view X-rays and is the largest …
COVIDGR - Under a close collaboration with an expert radiologist team of …
COVIDx - COVIDx CXR-2
COVIDx CXR-3 - COVIDx CXR-3 is an open access benchmark dataset that we …
CPED - Chinese Personalized and Emotional Dialogue
CPLFW - Cross-Pose LFW
CPP - Chinese Polyphones with Pinyin
CPPE-5 - Medical Personal Protective Equipment Dataset
CQADupStack - CQADupStack is a benchmark dataset for community question-answering research. It …
CREMA-D - **CREMA-D** is an emotional multimodal actor data set of 7,442 …
CREPE (Compositional REPresentation Evaluation) - A fundamental characteristic common to both human vision and natural …
Criteo - Display Advertising Challenge
CROHME 2014 - Benchmark for HMER and OHMER. Source: [CROHME 2014](https://ieeexplore.ieee.org/document/6981117)
CROHME 2016 - Source: [ICFHR2016 CROHME: Competition on Recognition of Online Handwritten Mathematical …
CROHME 2019 - Source: [ICDAR 2019 CROHME + TFD: Competition on Recognition of …
CronQuestions - CRONQUESTIONS, the Temporal KGQA dataset consists of two parts: a …
CrossTask - **CrossTask** dataset contains instructional videos, collected for 83 different tasks. …
CrowdHuman - **CrowdHuman** is a large and rich-annotated human detection dataset, which …
CrowdPose - The **CrowdPose** dataset contains about 20,000 images and a total …
CrowS-Pairs - CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine …
CSIQ - Categorical Subjective Image Quality
CSL - CSL is a synthetic dataset introduced in [Murphy et al. …
CSL-Daily - CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous SLT …
CUB-200-2011 - Caltech-UCSD Birds-200-2011
CUHK03 - Chinese University of Hong Kong Re-identification
CUHK03-C - **CUHK03-C** is an evaluation set that consists of algorithmically generated …
CUHK Avenue - Avenue Dataset contains 16 training and 21 testing video clips. …
CUHK-PEDES - The **CUHK-PEDES** dataset is a caption-annotated pedestrian dataset. It contains …
CUHK-Shadow - Collects shadow images for multiple scenarios and compiles a new …
CUHK-SYSU - CUHK-SYSU Person Search Dataset
CULane - **CULane** is a large scale challenging dataset for academic research …
Curation Corpus - The Curation Corpus is a collection of 40,000 professionally-written summaries …
CurveLanes - CurveLanes is a new benchmark lane detection dataset with 150K …
Custom FINNgers - A dataset with 3200 images (200 for each number quantity …
CUTE80 - The CUTE80 dataset is a lightweight collection of images specifically …
CVC-ClinicDB - **CVC-ClinicDB** is an open-access dataset of 612 images with a …
CV-Cities - CV-Cities comprises 223,736 ground panoramic images and an equal number …
CVR - Congressional Voting Records Data Set
CVSS - **CVSS** is a massively multilingual-to-English speech to speech translation (S2ST) …
CWL EEG/fMRI Dataset - EEG/fMRI data from 8 subjects doing a simple eyes open/eyes …
CWRU Bearing Dataset - Data was collected for normal bearings, single-point drive end and …
CxC - Crisscrossed Captions
Czech restaurant information - Czech restaurant information is a dataset for NLG in task-oriented …
Czech Subjectivity Dataset - Czech subjectivity dataset of 10k manually annotated subjective and objective …

D

D4LA - The D4LA dataset is a diverse benchmark for document layout …
D4RL - **D4RL** is a collection of environments for offline reinforcement learning. …
DABS - Domain-Agnostic Benchmark for Self-supervised learning
DADA-seg - DADA-seg is a pixel-wise annotated accident dataset, which contains a …
DAGM2007 - This is a synthetic dataset for defect detection on textured …
DailyDialog - **DailyDialog** is a high-quality multi-turn open-domain English dialog dataset. It …
DAIR-V2X - **DAIR-V2X** is a large-scale, multi-modality, multi-view dataset from real scenarios …
DaLAJ - DaLAJ 1.0, a dataset for Linguistic Acceptability Judgments for Swedish, …
DALES - DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation
DanceTrack - A large-scale multi-object tracking dataset for human tracking in occlusion, …
DaNE - Danish Dependency Treebank
DaNetQA - Yes/no Question Answering Dataset for Russian
DanFEVER - We present a dataset, DANFEVER, intended for claim verification in …
DaReCzech - Dataset for text relevance ranking in Czech
Dark Zurich - **Dark Zurich** is an image dataset containing a total of …
DART - DART is a large dataset for open-domain structured data record …
Data Collected with Package Delivery Quadcopter Drone - This experiment was performed in order to empirically measure the …
Dataset: Relationship extraction for knowledge graph creation from biomedical literature (Gene-Disease relationships) - This is the dataset used for classifying Gene-Disease relationship types …
DAVIS - Densely Annotated VIdeo Segmentation
DAVIS 2016 - DAVIS16 is a dataset for video object segmentation which consists …
DAVIS 2017 - DAVIS17 is a dataset for video object segmentation. It contains …
DAVIS-585 - A dataset for interactive segmentation with simulated initial masks.
DAVIS-DTA - Dataset Description: The interaction of 72 kinase inhibitors with 442 …
DAVIS-S - To enrich the diversity, we also collect 92 images which …
DBLP - Citation Network Dataset
DBP1M FR-EN - A large-scale cross-lingual dataset for entity alignment
DBP2.0 zh-en - The DBP2.0 dataset can be downloaded from the figshare repository. …
DBP-5L (English) - DBP-5L is a Multilingual KG dataset containing 5 KGs in …
DBP-5L (Greek) - DBP-5L is a Multilingual KG dataset containing 5 KGs in …
DBpedia - **DBpedia** (from "DB" for "database") is a project aiming to …
DBRD - Dutch Book Reviews Dataset
DCASE 2019 Mobile - TAU Urban Acoustic Scenes 2019 Mobile
DCM - The DCM dataset is composed of 772 annotated images from …
DDD17 - DAVIS Driving Dataset 2017
DDD17-SEG - Based on the [DDD17](https://pkuml.org/resources/pku-ddd17-car.html) dataset, we select some image-event pairs …
DDI - The **DDI**Extraction 2013 task relies on the DDI corpus which …
DebateSum - **DebateSum** consists of 187328 debate documents, arguments (also can be …
Decagon - Bio-decagon
Deep Blending - The Deep Blending Dataset comprises 19 diverse scenes, offering comprehensive …
DeepCAD - **DeepCAD** is a CAD dataset consisting of 179,133 models and …
DeepCom-Java - The Java dataset introduced in DeepCom ([Deep Code Comment Generation](https://dl.acm.org/doi/10.1145/3196321.3196334)), …
DeepFashion - **DeepFashion** is a dataset containing around 800K diverse fashion images …
DeepFix - **DeepFix** consists of a program repair dataset (fix compiler errors …
DeepGlobe - We observe that satellite imagery is a powerful source of …
Deep Indices - multi-spectral leaf/vegetation segmentation
DeepPatent - The dataset consists of over 350,000 public domain patent drawings …
Deep PCB - Deep Printed Circuit Board
Defects4J - Defects4J is a collection of reproducible bugs and a supporting …
Delicious - **Delicious** : This data set contains tagged web pages retrieved …
DELIVER - **DELIVER** is an arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple …
Deng et al - Source: [Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in …
Dense Fog - DENSE
DensePASS - DensePASS - a novel densely annotated dataset for panoramic segmentation …
DEplain-APA-doc - DEplain-APA-doc: A German Parallel Corpus for Document Simplification on …
DEplain-APA-sent - DEplain-APA-sent: A German Parallel Corpus for Sentence Simplification on …
DEplain-web-doc - DEplain-web-doc: A German Parallel Corpus for Document Simplification on …
DEplain-web-sent - DEplain-web-sent: A German Parallel Corpus for Sentence Simplification on …
DESED - Domestic environment sound event detection
Desert Locust - **Desert Locust** is an animal pose estimation dataset for desert …
DET - DET is a lane detection dataset that consists of the …
DexYCB - DexYCB is a dataset for capturing hand grasping of objects. …
DF20 - Danish Fungi 2020
DF20 - Mini - Danish Fungi 2020 - Mini
DFDC - Deepfake Detection Challenge
DHB Dataset - Dynamic Human Bodies Dataset
Dhoroni - Dhoroni: A Multi-Perspective Bengali Climate Change and Environmental News Dataset
DHP19 - Dynamic Vision Sensor 3D Human Pose Dataset
Diabetes - Diabetes 130-US Hospitals for Years 1999-2008
DiaBLa - A new English-French test set for the evaluation of Machine …
DialogSum - DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 …
Dialogue State Tracking Challenge - The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were …
DIB-10K - DongNiao International Birds 10000
DIC-C2DH-HeLa - HeLa cells on a flat glass Dr. G. van Cappellen. …
DiCOVA - The DiCOVA Challenge dataset is derived from the Coswara dataset, …
DiDeMo - Distinct Describable Moments
DiDi - Distractor Distilled Dataset
DiffusionDB - **DiffusionDB** is a large-scale text-to-image prompt dataset. It contains 2 …
Digital Peter - Digital Peter is a dataset of Peter the Great's manuscripts …
Digital twin-supported deep learning for fault diagnosis - This is a dataset used to test deep learning-supported deep …
Digits - Optical Recognition of Handwritten Digits
DIHARD II - The DIHARD II development and evaluation sets draw from a …
DIODE - Dense Indoor and Outdoor Depth
DIOR
DIR-LAB COPDgene - The Deformable Image Registration Laboratory
Discovery - The *Discovery* dataset consists of adjacent sentence pairs (s1,s2) with …
DISE 2021 Dataset - The 2021 Document Image Skew Estimation Dataset
DISFA - Denver Intensity of Spontaneous Facial Action
Distinctions-646 - Distinctions-646 is composed of 646 foreground images with manually annotated …
DIV2K - **DIV2K** is a popular single-image super-resolution dataset which contains 1,000 …
DIVA-HisDB - The database consists of 150 annotated pages of three different …
DiveFace - A new face annotation dataset with balanced distribution between genders …
Django - The **Django** dataset is a dataset for code generation comprising …
DKhate - A corpus of Offensive Language and Hate Speech Detection for …
DND - Darmstadt Noise Dataset
DNS Challenge - Deep Noise Suppression Challenge
DocRED - **DocRED** (Document-Level Relation Extraction Dataset) is a relation extraction dataset …
DocRED-IE - The *DocRED Information Extraction (DocRED-IE)* dataset extends the DocRED dataset …
DocUNet - Document Image Unwarping via a Stacked U-Net
DocVQA - DocVQA consists of 50,000 questions defined on 12,000+ document images. …
DomainNet - **DomainNet** is a dataset of common objects in six different …
DONeRF: Evaluation Dataset - This is the dataset for the CGF 2021 paper "DONeRF: …
DotPrompts - DotPrompts is a set of testcases derived from PragmaticCode, such …
Douban - Douban Conversation Corpus
Douban Conversation Corpus - We release Douban Conversation Corpus, comprising a training data set, …
DPB-5L (French) - DBP-5L is a Multilingual KG dataset containing 5 KGs in …
DPM - Don’t Patronize Me!
DramaQA - The DramaQA focuses on two perspectives: 1) Hierarchical QAs as …
DRAW-1K - Diverse Algebra Word Problem Set
DrawBench - **DrawBench** is a comprehensive and challenging benchmark for text-to-image models, …
DreamBooth - The **DreamBooth dataset** is a collection of images used for …
Dress Code - Dress Code is a new dataset for image-based virtual try-on …
DRI Corpus - Dr. Inventor Multi-layer Scientific Corpus
Drinking Waste Classification - About the dataset: 4 classes of drinking waste: Aluminium …
DRIVE - Digital Retinal Images for Vessel Extraction
Drone-Action - Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition
DroneDeploy - From DroneDeploy: We’ve collected a dataset of aerial orthomosaics and …
DroneVehicle - VisDrone-DroneVehicle
Drone vs Bird - Drone vs Bird Detection Challenge
DROP - Discrete Reasoning Over Paragraphs
DSC (10 tasks) - Task Incremental Document Sentiment Classification
DSD100 - DSD100 is a dataset of 100 full lengths of …
DSEC - A Stereo Event Camera Dataset for Driving Scenarios
DSEC-SEG - Based on the [DSEC](https://dsec.ifi.uzh.ch/) dataset, we select some image-event pairs …
DSEval-LeetCode - In this paper, we introduce a novel benchmarking framework designed …
DSIFN-CD - The dataset is manually collected from Google Earth. It consists …
DSO (OSN-transmitted - Facebook) - This dataset is an OSN-transmitted (Online Social Network) version of …
DSO (OSN-transmitted - Wechat) - This dataset is an OSN-transmitted (Online Social Network) version of …
DSO (OSN-transmitted - Weibo) - This dataset is an OSN-transmitted (Online Social Network) version of …
DSO (OSN-transmitted - Whatsapp) - This dataset is an OSN-transmitted (Online Social Network) version of …
DTD - Describable Textures Dataset
DTTD-Mobile - Are current 3D object tracking methods truly robust enough for …
DTU - DTU MVS dataset - 2014
DUC 2004 - The DUC2004 dataset is a dataset for document summarization. Is …
DukeMTMC-attribute - The images in the **DukeMTMC-attribute** dataset come from Duke University. There …
DukeMTMC-reID - The **DukeMTMC-reID** (Duke Multi-Tracking Multi-Camera ReIDentification) dataset is a subset …
DukeMTMC-VideoReID - The DukeMTMC-VideoReID (Duke Multi-Tracking Multi-Camera Video-based ReIDentification) dataset is a …
DUO - Detecting Underwater Objects
DuoRC - DuoRC contains 186,089 unique question-answer pairs created from a collection …
DuReader - **DuReader** is a large-scale open-domain Chinese machine reading comprehension dataset. …
DUT-OMRON - The **DUT-OMRON** dataset is used for evaluation of Salient Object …
DUTS - **DUTS** is a saliency detection dataset containing 10,553 training images …
DVS128 Gesture - Comprises 11 hand gesture categories from 29 subjects under 3 …
DWIE - Deutsche Welle corpus for Information Extraction
DyML-Animal - Dynamic Metric Learning Animal
DyML-Product - Dynamic Metric Learning Product
DyML-Vehicle - Dynamic Metric Learning Vehicle
DynaSent - DynaSent is an English-language benchmark task for ternary (positive/negative/neutral) sentiment …

E

E2E - End-to-End NLG Challenge
EarlyNSD - Early Nutrient Stress Detection of Plants
EARS-WHAM - The EARS-WHAM dataset mixes speech from the EARS dataset with …
EarthVQA - A multi-modal multi-task VQA dataset for remote sensing
EasyCom - The Easy Communications (EasyCom) dataset is a world-first dataset designed …
EBD - A large-scale benchmark with 1605 high-resolution, well-annotated images, featuring more …
eBDtheque - The eBDtheque database is a selection of one hundred comic …
EBM-NLP - EBM-NLP annotates PICO (Participants, Interventions, Comparisons and Outcomes) spans in …
EC-FUNSD - EC-FUNSD is introduced in [[arXiv:2402.02379]](https://arxiv.org/abs/2402.02379) as a benchmark of semantic …
ECG200 - ECG200
ECG5000 - The original dataset for "ECG5000" is a 20-hour long ECG …
ECG-Image-Database - Digitization and Classification of ECG Images: The George B. Moody PhysioNet Challenge 2024
ECLAIR - ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation
E-commerce - We release E-commerce Dialogue Corpus, comprising a training data set, …
EconLogicQA - EconLogicQA is a benchmark designed to test the sequential reasoning …
ECSSD - Extended Complex Scene Saliency Dataset
EdNet - A large-scale hierarchical dataset of diverse student activities collected by …
EEG Motor Movement/Imagery Dataset - This data set consists of over 1500 one- and two-minute …
Ego4D - Ego4D is a massive-scale egocentric video dataset and benchmark suite. …
EgoBody - **EgoBody** dataset is a novel large-scale dataset for egocentric 3D …
EgoExoLearn - **EgoExoLearn** is a fascinating dataset designed to bridge the gap …
EgoGesture - The **EgoGesture** dataset contains 2,081 RGB-D videos, 24,161 gesture samples …
EgoSchema - **EgoSchema** is a very long-form video question-answering dataset and benchmark to …
EgoTaskQA - **EgoTask QA** benchmark contains 40K balanced question-answer pairs selected from …
EGTEA - EGTEA Gaze+
EGY-BCD - Bi-temporal images in the EGY-BCD dataset are taken from 4 …
EHE - Elderly Home Exercise
EIBench - For Emotion Interpretation task
EigenWorms - Caenorhabditis elegans is a roundworm commonly used as a model …
Electricity - Individual household electric power consumption Data Set
Electronics - This data was collected by performing a breadth-first search on …
Electron Microscopy Dataset - The dataset available for download on this webpage represents a …
Elephant - The Elephant MIL dataset is a benchmark used in multiple …
ELEVATER - Evaluation of Language-augmented Visual Task-level Transfer
ELI5 - ELI5 is a dataset for long-form question answering. It contains …
eLife - Scientific Lay Summarization
Elliptic Dataset
EMDB - EMDB contains in-the-wild videos of human activity recorded with a …
EMNIST - Extended MNIST
EmoCause - **EmoCause** is a dataset of annotated emotion cause words in …
EmoDB Dataset - Berlin Database of Emotional Speech
Emomusic - Emotion in Music Database
EMOTIC - EMOTIons in Context
EmpatheticDialogues - The **EmpatheticDialogues** dataset is a large-scale multi-turn empathetic dialogue dataset …
Endoscapes - Endoscapes - Semantic Segmentation
Endotect Polyp Segmentation Challenge Dataset - A challenge that consists of three tasks, each targeting a …
ENSeg - ENSeg Dataset Overview: This dataset represents an enhanced subset …
ENT-DESC - ENT-DESC involves retrieving abundant knowledge of various types of main …
ENTIRe-ID - The growing importance of person re-identification in computer vision has …
ENZYMES - **ENZYMES** is a dataset of 600 protein tertiary structures obtained …
EPIC-Hotspot - From Grounded Human-Object Interaction Hotspots from Video (ICCV'19): We collect …
EPIC-KITCHENS-100 - This paper introduces the pipeline to scale the largest dataset …
EPIC-KITCHENS-55 - The EPIC-KITCHENS-55 dataset comprises a set of 432 egocentric videos …
EPIC-SOUNDS - **EPIC-SOUNDS** is a large scale dataset of audio annotations capturing …
Epilepsy seizure prediction - The original dataset from the reference consists of 5 different …
Epinions - The **Epinions** dataset is built from a who-trust-whom online social …
EQ-Bench - This dataset contains benchmark scores for EQ-Bench, a novel benchmark …
ESC-50 - The **ESC-50** dataset is a labeled collection of 2000 environmental …
e-SNLI - e-SNLI is used for various goals, such as obtaining full …
e-SNLI-VE - e-SNLI-VE is a large VL (vision-language) dataset with NLEs (natural …
ESOL (Estimated SOLubility) - ESOL is a water solubility prediction dataset consisting of 1128 …
eSports Sensors Dataset - The eSports Sensors dataset contains sensor data collected from 10 …
Essays - Stream-of-consciousness Essays
ETD500 - The paper used 500 scanned Electronic Theses and Dissertation cover …
ETDII Dataset - Electric Transmission and Distribution Infrastructure Imagery Dataset
ETH - ETH Pedestrian
ETH3D - ETH3D is a multi-view stereo benchmark / 3D reconstruction benchmark …
Ethics (per ethics) - Ethics (per ethics) dataset is created to test the knowledge …
ETH-XGaze - Consists of over one million high-resolution images of varying gaze …
ETTh1 (96) - ETT (Electricity Transformer Temperature)
E.T. the Exceptional Trajectories
EurekaAlert - Eureka Alert
EuRoC MAV - **EuRoC MAV** is a visual-inertial dataset collected on board a Micro …
Europarl - European Parliament Proceedings Parallel Corpus
EuroSAT - **EuroSAT** is a dataset and deep learning benchmark for land …
EuroSAT-SAR - A SAR version of the EuroSAT dataset. The images were …
EvalCrafter Text-to-Video (ECTV) Dataset - This dataset contains around 10000 videos generated by various methods …
EVD4UAV - EVD4UAV is an altitude-sensitive benchmark dataset designed to evade vehicle …
Event-Camera Dataset - The **Event-Camera Dataset** is a collection of datasets with an …
EventNarrative - **EventNarrative** is a knowledge graph-to-text dataset from publicly available open-world …
EVICAN - Deep learning use for quantitative image analysis is exponentially increasing. …
Exact Street2Shop - A dataset containing 404,683 shop photos collected from 25 different …
ExDark - Exclusively Dark Image Dataset
Explor_all - Explor_all font image dataset https://drive.google.com/file/d/1P2DbNbVw4Q__WcV1YdzE7zsDKilmd3pO/view
Exposure-Errors - A dataset of over 24,000 images exhibiting the broadest range …
ExpW - Expression in-the-Wild
EXPY-TKY - Expressway-Tokyo
Extended heartSeg - The dataset X of this work is an extension of …
Extended Task10_Colon Medical Decathlon - Extended Task10_Colon of Medical Segmentation Decathlon dataset
ExtMarker - 3D motion of chest external markers
Extreme Events > Natural Disasters > Hurricane - Tourism > Finance > Sales Revenue
EYEDIAP - The **EYEDIAP** dataset is a dataset for gaze estimation from …

F

FABSA - An aspect-based sentiment analysis dataset of Customer Feedback reviews
FaceForensics - FaceForensics is a video dataset consisting of more than 500,000 …
FaceForensics++ - FaceForensics++ is a forensics dataset consisting of 1000 original video …
FairytaleQA - **FairytaleQA** is a dataset focusing on narrative comprehension of kindergarten …
fake - Real / Fake Job Posting Prediction
FakeAVCeleb - FakeAVCeleb is a novel Audio-Video Deepfake dataset that not only …
FarsTail - Natural Language Inference (NLI), also called Textual Entailment, is an …
Fashion IQ - Fashion IQ supports and advances research on interactive fashion image …
Fashion-MNIST - **Fashion-MNIST** is a dataset comprising 28×28 grayscale images of …
FB122 - Freebase-122
FB15k - Freebase 15K
FB15k-237 - **FB15k-237** is a link prediction dataset created from FB15k. While …
FBMS - Freiburg-Berkeley Motion Segmentation
FBMS-59 - Freiburg-Berkeley Motion Segmentation
F-CelebA (10 tasks) - Federated-CelebA (10 tasks)
FCGEC - FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction
FDCompCN - A new fraud detection dataset FDCompCN for detecting financial statement …
FDDB - Face Detection Dataset and Benchmark
FDMSE-ISL - A large-scale isolated Indian sign language dataset. It contains 2002 …
FE108 - Large-scale single-object tracking dataset, containing 108 sequences with a total …
FEMNIST - Federated Extended MNIST
FER+ - Face Expression Recognition Plus dataset
FER2013 - Facial Expression Recognition 2013 Dataset
FERG - Facial Expression Research Group Database
FETA Car-Manuals - FETA Car-Manuals dataset, image-text retrieval for foundation models' expert data performance.
FEVER - Fact Extraction and VERification
FewRel - Few-Shot Relation Classification Dataset
FEWS - FEWS: Large-Scale, Low-Shot Word Sense Disambiguation with the Dictionary
FFHQ - Flickr-Faces-HQ
FGVC-Aircraft - FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for …
FIGER - Fine-Grained Entity Recognition
Film(48%/32%/20% fixed splits) - Node classification on Film with the fixed 48%/32%/20% splits provided …
Film (60%/20%/20% random splits) - Node classification on Film with 60%/20%/20% random splits for training/validation/test.
Filosax - 48 multitrack jazz recordings with many annotations.
FindVehicle - The ***first*** NER dataset in the field of traffic, which …
FineAction - **FineAction** contains 103K temporal instances of 106 action categories, annotated …
FineDance
FineDiving - We construct a fine-grained video dataset organized by both semantic …
Fine-Grained Cloud Segmentation Dataset - The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat …
Fine-Grained Grass Segmentation Dataset - The dataset was created using high-resolution (8 m) satellite imagery …
FinQA - FinQA is a new large-scale dataset with Question-Answering pairs over …
FinSen - Enhancing Financial Market Predictions: Causality-Driven Feature Selection. This paper …
FIRE - Fundus Image Registration Dataset
Fire and Smoke Dataset - This dataset is collected by DataCluster Labs, India. To download …
FireRisk - FireRisk: A Remote Sensing Dataset for Fire Risk Assessment
First-Person Hand Action Benchmark - **First-Person Hand Action Benchmark** is a collection of RGB-D video …
Fish-100 - Schools of inland silversides (Menidia beryllina, n=14 individuals per school) …
FishEye8K - FishEye8K: A Benchmark and Dataset for Fisheye Camera Object Detection
Fishyscapes - **Fishyscapes** is a public benchmark for uncertainty estimation in a …
FIVR-200K - The FIVR-200K dataset has been collected to simulate the problem …
FixaTons - FixaTons is a large collection of datasets of human scanpaths (temporally …
FKD - Football Keywords Dataset
FLAIR (French Land cover from Aerospace ImageRy) - The French National Institute of Geographical and Forest Information (IGN) …
FLEURS - Few-shot Learning Evaluation of Universal Representations of Speech
Flickr30k - The **Flickr30k** dataset contains 31,000 images collected from Flickr, together …
Flickr-8k - Contains 8k Flickr images with captions. Visit [this](http://hockenmaier.cs.illinois.edu/8k-pictures.html) page to …
FlickrLogos-32 - Object detection benchmark for logo detection. Images are natural scenes. …
FlickrStyle10K - FlickrStyle10K is collected and built on Flickr30K image caption dataset. …
FloCo - Flow chart Image to Code
Florence - Florence 3D Faces
FLoRes-200 - FLoRes-200 doubles the existing language coverage of FLoRes-101. Given the …
Fluent Speech Commands - Fluent Speech Commands is an open source audio dataset for …
Fluo-C3DL-MDA231 - MDA231 human breast carcinoma cells infected with a pMSCV vector …
Fluo-N2DH-GOWT1 - GFP-GOWT1 mouse stem cells. Dr. E. Bártová. Institute of Biophysics, …
Fluo-N2DH-SIM+ - Simulated nuclei of HL60 cells stained with Hoechst. Dr. V. …
Fluo-N2DL-HeLa - HeLa cells stably expressing H2b-GFP Mitocheck Consortium
FlyingThings3D - **FlyingThings3D** is a synthetic dataset for optical flow, disparity and …
FMB Dataset - Full-time Multi-modality Benchmark Dataset
FMC-MWO2KG - The MWO2KG Failure Mode Classification Dataset
FMD - Fluorescence Microscopy Denoising
FMD (materials) - Flickr Material Dataset
fMoW - Functional Map of the World
FNC-1 - Fake News Challenge Stage 1
Foggy Cityscapes - **Foggy Cityscapes** is a synthetic foggy dataset which simulates fog …
Fongbe audio - Fongbe dataset
Food-101 - The **Food-101** dataset consists of 101 food categories with 750 …
Food-101N - The Food-101N dataset is introduced in "CleanNet: Transfer Learning for …
FoodSeg103
FoodX-251 - FoodX-251 is a dataset of 251 fine-grained classes with 118k …
Forest CoverType - Predicting forest cover type from cartographic variables only (no remotely …
ForgeryNet - We construct the ForgeryNet dataset, an extremely large face forgery …
Forward-Looking Sonar Marine Debris Datasets - This dataset is made up of forward-looking sonar images containing …
FP4S - Floor plan image segmentation via scribble-based semi-weakly-supervised learning
FPv1 - FPv1 (prior name FAUST-partial) is a 3D registration benchmark dataset …
FQL-Driving - FQL-driving
FQuAD - French Question Answering Dataset
FreeLaw - **Free Law Project** is a leading nonprofit organization that aims …
FreeSolv (Free Solvation) - The FreeSolv database offers a curated collection of experimental and …
Freiburg Forest - The **Freiburg Forest** dataset was collected using a Viona autonomous …
FreiHAND - **FreiHAND** is a 3D hand pose dataset which records different …
FRGC - Face Recognition Grand Challenge
FSC147 - We introduce a dataset of 147 object categories containing over …
FSD50K - Freesound Database 50K
FSDSoundScapes - A synthetic sound mixture specification dataset for the Target Sound …
FSNS - Test - Arabic handwriting dataset.
FSS-1000 - **FSS-1000** is a 1000 class dataset for few-shot segmentation. The …
Full-body Parkinson’s disease dataset - A public data set of walking full-body kinematics and kinetics …
FUNSD - Form Understanding in Noisy Scanned Documents
FUNSD-r - We introduce FUNSD-r and CORD-r in [Token Path Prediction](https://arxiv.org/abs/2310.11016), the …
FusedChat - FusedChat is an inter-mode dialogue dataset. It contains dialogue sessions …

G

GAD - Gene Associations Database
Gait3D - Gait3D is a large-scale 3D representation-based gait recognition dataset. It …
Game of 24 - Game of 24 is a mathematical reasoning challenge, where the …
GAP - GAP Benchmark Suite
GasHisSDB - Four pathologists from Longhua Hospital Shanghai University of Traditional Chinese …
Gaze360 - Physically Unconstrained Gaze Estimation in the Wild
GazeCapture - Eye Tracking for Everyone
Gaze-CIFAR-10 - We construct **Gaze-CIFAR-10**, a gaze-augmented image dataset based on the …
GazeFollow - GazeFollow is a large-scale dataset annotated with the location of …
Gazeta - **Gazeta** is a dataset for automatic summarization of Russian news. …
GDA - Gene-Disease Associations Corpus
GDELT - The **GDELT Project** is a remarkable initiative that monitors our …
GEdit-Bench-EN - This dataset is a new benchmark, grounded in real-world usages …
GEN1 Detection - Prophesee GEN1 Automotive Detection Dataset
GenEval - Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning …
GENIA - The **GENIA** corpus is the primary collection of biomedical literature …
genius - node classification on genius
GenWiki - GenWiki is a large-scale dataset for knowledge graph-to-text (G2T) and …
GEOM-DRUGS - GEOM-DRUGS is a dataset of 430,000 large organic molecules of …
Geometry3K - A new large-scale geometry problem-solving dataset - 3,002 multi-choice geometry …
GeoQA - Geometric Question Answering
GeoQuestions1089 - GeoQuestions1089 is a crowdsourced geospatial question-answering dataset that targets the …
GeoS - **GeoS** is a dataset for automatic math problem solving. It …
GePaDe - This dataset encompasses 265 speeches (over 200,000 tokens) from the …
GermanQuAD - **GermanQuAD** is a Question Answering (QA) dataset of 13,722 extractive …
GerMS-AT - GERMS-AT: A Sexism/Misogyny Dataset of Forum Comments from an Austrian Online Newspaper
GF-PA66 3D XCT - Glass fiber-reinforced polyamide 66 (GF-PA66) 3D X-ray Computed Tomography (XCT)
GigaSpeech - GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 …
GitHub-Python - Repair AST parse (syntax) errors in Python code
GlaS - Gland Segmentation in Colon Histology Images Challenge
GlassTemp - Glass Transition Temperature
GLUE - General Language Understanding Evaluation benchmark
GMOT-40 - Generic Multiple Object Tracking (GMOT)
GO21 - GO21 is a biomedical knowledge graph that models genes, proteins, …
GoodsAD - The GoodsAD dataset contains 6124 images with 6 categories of …
GoogleEarth - The GoogleEarth dataset is collected from Google Earth Studio, including …
Google Speech Commands - Musan - This noisy speech test set is created from the Google …
Goolam et al - Source: [Heterogeneity in Oct4 and Sox2 Targets Biases Cell Fate …
GoPro - The **GoPro** dataset for deblurring consists of 3,214 blurred images …
GOT-10k - Generic Object Tracking Benchmark
GovReport - GovReport is a dataset for long document summarization, with significantly …
Gowalla - Gowalla is a location-based social networking website where users share …
GQA - The **GQA** dataset is a large-scale visual question answering dataset …
GQA-REX - A GQA-based dataset with 1,040,830 multi-modal explanations of visual reasoning …
GraphQuestions - GraphQuestions is a characteristic-rich dataset designed for factoid question answering. …
GraSP - Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies
GraspNet-1Billion - **GraspNet-1Billion** provides large-scale training data and a standard evaluation platform …
GRAZPEDWRI-DX - GRAZPEDWRI-DX is a public dataset of 20,327 pediatric wrist trauma …
gRefCOCO - gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset …
GRIT - General Robust Image Task Benchmark
Groove - Groove MIDI Dataset
GSL - Greek Sign Language
GSM8K - GSM8K is a dataset of 8.5K high quality linguistically diverse …
GSM-Plus - By perturbing the widely used GSM8K dataset, an adversarial dataset …
GSO - Google Scanned Objects
GTA-IM Dataset - GTA Indoor Motion
GTEA - Georgia Tech Egocentric Activity
GTSRB - German Traffic Sign Recognition Benchmark
GTZAN - The **gtzan8** audio dataset contains 1000 tracks of 30 second …
GUE - Genome Understanding Evaluation
GuitarSet - **GuitarSet** is a dataset of high-quality guitar recordings and rich …
GUM - Georgetown University Multilayer corpus
GVLM - Global Very-High-Resolution Landslide Mapping
GYAFC - Grammarly’s Yahoo Answers Formality Corpus

H

H2O (2 Hands and Objects) - We present a comprehensive framework for egocentric interaction recognition using …
H3WB - Human 3.6M 3D WholeBody
HAA500 - Human-Centric Atomic Action Dataset
HACS - Human Action Clips and Segments
Hainsworth - S. W. Hainsworth and M. D. Macleod, “Particle filtering applied …
HallusionBench - Large language models (LLMs), after being aligned with vision models …
HAM10000 - **HAM10000** is a dataset of 10000 training images for detecting …
HAMMER - The **HAMMER** dataset contains 13 scenes. Each scene has two setups, …
HANS - Heuristic Analysis for NLI Systems
HAR - Human Activity Recognition Using Smartphones
HARD - Hotel Arabic-Reviews Dataset
HarmfulQA - As a part of …
Harmonix - The Harmonix Set
HARPER - Exploring 3D Human Pose Estimation and Forecasting from the Robot’s Perspective: The HARPER Dataset
Harry Potter Dialogue Dataset - Harry Potter Dialogue is the first dialogue dataset that integrates …
Hateful Memes - The Hateful Memes data set is a multimodal dataset for …
HateMM - Hate speech has become one of the most significant issues …
HatEval - SemEval 2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter
HateXplain - Covers multiple aspects of the issue. Each post in the …
Haze4k - **Haze4k** is a synthesized dataset with 4,000 hazy images, in …
HCP Aging - Lifespan Human Connectome Project Aging
HC-STVG1 - Human-centric Spatio-Temporal Video Grounding
HC-STVG2 - We have added data and cleaned the labels in HC-STVG …
HDR-GS - HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting
HeadQA - HeadQA is a multi-choice question answering testbed to encourage research …
Healthcare Provider Fraud Detection Analysis - Inpatient claims, Outpatient claims and Beneficiary details of each provider. …
Heavy Snowfall - DENSE
HeiChole Benchmark - Surgical Workflow and Skill Analysis Challenge (HeiChole Benchmark)
HellaSwag - HellaSwag is a challenge dataset for evaluating commonsense NLI that …
HELOC - Home Equity Line of Credit
HePIC 🏛️ - Heritage Pointcloud Instance Collection dataset, acquired from two large buildings …
HERA RFI Detection - Hydrogen Epoch of Reionization Array (HERA)
Herbarium 2021 Half–Earth - The **Herbarium Half-Earth** dataset is a large and diverse dataset …
Herbarium 2022 - Identify plant species of the Americas from herbarium specimens
HErlev - HErlev Pap Smear Dataset
HEV-I - Honda Egocentric View-Intersection Dataset
HICO - Humans Interacting with Common Objects
HICO-DET - **HICO-DET** is a dataset for detecting human-object interactions (HOI) in …
HIDE - Consists of 8,422 blurry and sharp image pairs with 65,784 …
HierText - HierText is the first dataset featuring hierarchical annotations of text …
HiEve - Human-in-Events
HIGGS Data Set - The data has been produced using Monte Carlo simulations. The …
Hindi MSR-VTT - Hindi Microsoft Research Video to Text
HiNER-collapsed - HiNER: A Large Hindi Named Entity Recognition Dataset
HiNER-original - HiNER: A Large Hindi Named Entity Recognition Dataset
HInt: Hand Interactions in the wild - The HInt dataset is frequently used as a **generalizability benchmark** …
HiRID - HiRID is a freely accessible critical care dataset containing data …
HistGen WSI-Report Dataset - This dataset is composed of 7,753 pairs of whole slide …
HJDB - J. Hockman, M. E. Davies, and I. Fujinaga, “One in …
HKR - Handwritten Kazakh and Russian (HKR) Database for Text Recognition
HKU-IS - **HKU-IS** is a visual saliency prediction dataset which contains 4447 …
HMDB51 - The **HMDB51** dataset is a large collection of realistic videos …
HME100K - Source: [HME100K](https://github.com/Phymond/HME100K)
HO-3D v2 - A hand-object interaction dataset with 3D pose annotations of hand …
HO-3D v3 - HO-3D v3 is version 3 of the HO-3D …
HOC - Hallmarks of Cancer
Hockey Fight Detection Dataset - Whereas the action recognition community has focused mostly on detecting …
Home Action Genome - Home Action Genome is a large-scale multi-view video database of …
HopeEDI - HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion
Hopkins155 - The Hopkins 155 dataset consists of 156 video sequences of …
Horse-10 - **Horse-10** is an animal pose estimation dataset. It comprises 30 …
HOST - The heavily occluded scene text (HOST) dataset is a dataset …
HotpotQA - **HotpotQA** is a question answering dataset collected on the English …
How2 - The **How2** dataset contains 13,500 videos, or 300 hours of …
How2QA - To collect How2QA for video QA task, the same set …
How2Sign - A Large-scale Multimodal Dataset for Continuous American Sign Language
HowMany-QA - HowMany-QA is an object counting dataset. It is taken from …
HOWS - HOWS-CL-25
HowTo100M Adverbs - HowTo100M Adverbs is a subset from HowTo100M with mined adverbs …
HPatches - Homography-patches dataset
HQ-YTVIS - While Video Instance Segmentation (VIS) has seen rapid progress, current …
HR-Avenue - The Human-Related version of the CUHK Avenue dataset, first presented …
HRF - High-Resolution Fundus
HR-ShanghaiTech - The Human-Related version of the ShanghaiTech Campus dataset was first presented …
HRSOD - High-Resolution Salient Object Detection
HR-UBnormal - The Human Related version of UBnormal ("UBnormal: New Benchmark for …
HSPACE - Human-SPACE
HUI speech corpus - Hof University iisys speech dataset
Human3.6M - The **Human3.6M** dataset is one of the largest motion capture …
HumanAct12 - **HumanAct12** is a new 3D human motion dataset adopted from …
Human-Art - Human-Art is a versatile human-centric dataset to bridge the gap …
HumanEval - This is an evaluation harness for the HumanEval problem solving …
HumanEval-ET - Extension test cases of HumanEval, as well as generated code.
HumanEvalPack - HumanEvalPack is an extension of OpenAI's HumanEval to cover 6 …
HumanML3D - HumanML3D is a 3D human motion-language dataset that originates from …
HUME-VB - The Hume Vocal Bursts Dataset
HUST-LEBW - An eyeblink detection in the wild dataset.
Hutter Prize - The Hutter Prize Wikipedia dataset, also known as enwik8, is …
HuTu 80 - HuTu 80 cell populations
HWU64 - This project contains natural language data for human-robot interaction in …
HybridQA - A new large-scale question-answering dataset that requires reasoning on heterogeneous …
Hyper-Kvasir Dataset - HyperKvasir dataset contains 110,079 images and 374 videos where it …
Hyperpartisan News Detection - Hyperpartisan News Detection was a dataset created for PAN @ …
Hypersim - For many fundamental scene understanding tasks, it is difficult or …
HYPERVIEW - Seeing Beyond the Visible

I

i2b2 De-identification Dataset - Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset
I2L-140K - Introduced by Singh, Sumeet S. “Teaching Machines to Code: Neural …
IAM - IAM Handwriting
IAM(line-level) - Line-level Handwritten Text Recognition on IAM
IBims-1 - Independent benchmark images and matched scans v1
IC13 - The IC13 dataset contains 561 images: 420 for training and …
iCartoonFace - The **iCartoonFace** dataset is a large-scale dataset that can be …
ICBHI Respiratory Sound Database - The Respiratory Sound database - ICBHI 2017 Challenge
ICDAR 2003 - The ICDAR2003 dataset is a dataset for scene text recognition. …
ICDAR 2013 - The **ICDAR 2013** dataset consists of 229 training images and …
ICDAR 2015 - **ICDAR 2015** was a scene text detection dataset used for the …
ICDAR 2019 - cTDaR
ICFG-PEDES - Identity-Centric and Fine-Grained Person Description Dataset
ICIAR 2018 Grand Challenge on Breast Cancer Histology Images - The dataset is composed of Hematoxylin and eosin (H&E) stained …
IconQA - Icon Question Answering
ICSI Meeting Corpus - ICSI Meeting Corpus in JSON format.
iDesigner - Fashion trends are constantly evolving, but a trained eye can …
Id Pattern Dataset - After defining a taxonomy of the main stone deterioration patterns …
IDRiD - Indian Diabetic Retinopathy Image Dataset
IECSIL FIRE-2018 Shared Task - The dataset is taken from the First shared task on …
IEMOCAP - The Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database
IFEval - Instruction Following Evaluation Dataset
iFF - Intrinsic Forward Facing
IHDP - Infant Health and Development Program
IIIT5k - The IIIT5K dataset contains 5,000 text instance images: 2,000 for …
IITB Corridor - An abnormal activity dataset for research use that contains 483,566 …
IJB-A - IARPA Janus Benchmark A
IJB-B - IARPA Janus Benchmark-B
IJB-C - IARPA Janus Benchmark-C
IJB-S - IARPA Janus Benchmark-S
iKala - The **iKala** dataset is a singing voice separation dataset that …
iLIDS-VID - The **iLIDS-VID** dataset is a person re-identification dataset which involves …
IllusionVQA - IllusionVQA is a Visual Question Answering (VQA) dataset with two …
im2latex-100k - A prebuilt dataset for OpenAI's image-2-latex task. Includes …
Image-Chat - The IMAGE-CHAT dataset is a large collection of (image, style …
ImageCLEF-DA - The **ImageCLEF-DA** dataset is a benchmark dataset for ImageCLEF 2014 …
ImageCoDe - Image Retrieval from Contextual Descriptions
ImageNet - The **ImageNet** dataset contains 14,197,122 annotated images according to the …
ImageNet-100 (TEMI Split) - This split was introduced in TEMI (BMVC 2023) Adaloglou, Nikolas, …
ImageNet-1k vs iNaturalist - A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while …
ImageNet-1k vs NINCO - No ImageNet Class Objects
ImageNet-1k vs OpenImage-O - OpenImage-O is built for the ID dataset ImageNet-1k. It is …
ImageNet-1k vs Places - A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while …
ImageNet-1k vs SUN - A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while …
ImageNet-1k vs Textures - A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while …
ImageNet-32 - Imagenet32 is a huge dataset made up of small images …
ImageNet-50 (TEMI Split) - The ImageNet-50 dataset split as introduced in TEMI. Adaloglou, Nikolas, …
ImageNet-64 - Imagenet64 is a massive dataset of small images called the …
ImageNet-9 - ImageNet-9 consists of images with different amounts of background and …
ImageNet-A - The **ImageNet-A** dataset consists of real-world, unmodified, and naturally occurring …
ImageNet-C - **ImageNet-C** is an open source data set that consists of …
ImageNet_CN - Chinese ImageNet Classification
ImageNet C-OOD (class-out-of-distribution) - This dataset was presented as part of the ICLR 2023 …
ImageNet ctest10k - Colorization validation set for unconditional/conditional colorization tasks. Subset of the …
ImageNet-Hard - ImageNet-Hard is a new benchmark that comprises 10,980 images collected …
ImageNet-LT - ImageNet Long-Tailed
ImageNet-P - **ImageNet-P** consists of noise, blur, weather, and digital distortions. The …
ImageNet-R - ImageNet-Rendition
ImageNet-S - ImageNet Semantic Segmentation
ImageNet-Sketch - ImageNet-Sketch data set consists of 50,889 images, approximately 50 images …
Imagenette - **Imagenette** is a subset of 10 easily classified classes from …
ImageNet-VidVRD - The ImageNet-VidVRD dataset contains 1,000 videos selected from the ILSVRC2016-VID dataset based …
Image Paragraph Captioning - The Image Paragraph Captioning dataset allows researchers to benchmark their …
IMC PhotoTourism - Image Matching Challenge Phototourism
IMCPT-SparseGM-100 - IMCPT-SparseGM dataset is a new visual graph matching benchmark addressing …
IMCPT-SparseGM-50 - IMCPT-SparseGM dataset is a new visual graph matching benchmark addressing …
IMDB-BINARY - **IMDB-BINARY** is a movie collaboration dataset that consists of the …
IMDB-Clean - We have cleaned the noisy IMDB-WIKI dataset using a constrained …
IMDb Movie Reviews - The **IMDb Movie Reviews** dataset is a binary sentiment analysis …
IMDB-MULTI - **IMDB-MULTI** is a relational dataset that consists of a network …
ImgEdit-Data - ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million …
iMiGUE - **iMiGUE** is a dataset for emotional artificial intelligence research: identity-free …
iNaturalist - The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation …
IndicGLUE - Indic General Language Understanding Evaluation Benchmark
InDL - In-Diagram Logic
IndustReal - IndustReal Dataset of Egocentric Videos for Procedure Understanding
InfiMM-Eval - Complex Open-ended Reasoning Evaluation for Multi-Modal Language Models
InfographicVQA - **InfographicVQA** is a dataset that comprises a diverse collection of …
InfoSeek - Visual Information Seeking
InLoc - InLoc is a dataset with reference 6DoF poses for large-scale …
INRIA Aerial Image Labeling - The **INRIA Aerial Image Labeling** dataset is comprised of 360 …
INRIA Holidays Dataset - The Holidays dataset is a set of images which mainly …
INS Dataset - A significant challenge in removing shadows from indoor scenes is …
In-Shop - In-shop Clothes Retrieval Benchmark
Inshorts News - Inshorts English News dataset
Insider Threat Test Dataset - The Insider Threat Test Dataset is a collection of synthetic …
Inspec - Paper: Improved automatic keyword extraction given more linguistic knowledge Doi: …
INSPIRE-AVR (LUNet subset) - This dataset contains 65 DFIs acquired from patients with POAG …
InsPLAD - Inspection Power Line Asset Dataset
INSTRE - **INSTRE** is a benchmark for INSTance-level visual object REtrieval and …
Instructional-DT (Instr-DT) - Instructional Discourse Treebank
Intel Image Classification - This is image data of natural scenes around the …
IntentQA - We contribute an IntentQA dataset with diverse intents in daily …
InterHand2.6M - The InterHand2.6M dataset is a large-scale real-captured dataset with accurate …
InterHuman - **InterHuman** is a multimodal dataset. It consists of …
Inter-X - Inter-X is a large-scale dataset containing ~11K interaction sequences, more …
IntrA - **IntrA** is an open-access 3D intracranial aneurysm dataset that makes …
ionosphere - The original ionosphere dataset from UCI machine learning repository is …
I.PHI - **I.PHI** processes the Packard Humanities Institute (PHI) database of ancient …
iPhone (Monocular Dynamic View Synthesis) - The iPhone dataset is a challenging benchmark for dynamic reconstruction. This …
iPinYou - iPinYou Global RTB Bidding Algorithm Competition Dataset
IRFL: Image Recognition of Figurative Language - The IRFL dataset consists of idioms, similes, and metaphors with …
iris - The Iris flower data set or Fisher's Iris data set …
iRodent - iRodent Animal Pose Estimation
IRV2V - IRregular V2V Dataset
iSAID - iSAID contains 655,451 object instances for 15 categories across 2,806 …
iSarcasm - iSarcasm is a dataset of tweets, each labelled as either …
ISBNet - ISBNet is a dataset of images of recyclables. It is …
iShape - **iShape** is an irregular shape dataset for instance segmentation. iShape …
ISIC 2019 - The goal for ISIC 2019 is to classify dermoscopic images among …
ISIC 2020 Challenge Dataset - Official dataset of the SIIM-ISIC Melanoma Classification Challenge 2020
ISPRS Potsdam - 2D Semantic Labeling Contest - Potsdam
ISPRS Vaihingen - 2D Semantic Labeling - Vaihingen data
ISRUC-Sleep - **ISRUC-Sleep** is a polysomnographic (PSG) dataset. The data were obtained …
ISTD - The Image Shadow Triplets dataset (**ISTD**) is a dataset for …
ISTD+ - ISTD+ consists of shadow images, shadow-free images, and shadow masks, …
ItaCoLA - **ItaCoLA** is a corpus for monolingual and cross-lingual acceptability judgments …
ITB - Informative Tracking Benchmark
ITDD - Industrial Textile Defect Detection
Itihasa - Itihasa is a large-scale corpus for Sanskrit to English translation …
IUST_PersonReID - The IUST_PersonReID dataset was developed to address limitations in existing …
IU X-Ray - IU X-ray (Demner-Fushman et al., 2016) is a set of …
iVQA - Instructional Video Question Answering
iWildCam2020-WILDS - The iWildCam2020-WILDS dataset is a variant of the iWildCam 2020 …
IWSLT 2017 - The IWSLT 2017 translation dataset.
IXI - IXI Brain Development Dataset

J

JAAD - Joint Attention in Autonomous Driving
JAAH - Jazz Audio-Aligned Harmony
JAFFE - Japanese Female Facial Expression
JamPatoisNLI - JamPatoisNLI provides the first dataset for natural language inference in …
JaQuAD - **JaQuAD** (Japanese Question Answering Dataset) is a question answering dataset …
JARVIS-DFT - JARVIS-DFT is a repository of density functional theory based calculation …
Java scripts - The Java dataset introduced in Hybrid-DeepCom ([Deep code comment generation …
JerichoWorld - **JerichoWorld** is a dataset that enables the creation of learning …
Jester (Gesture Recognition) - **Jester Gesture Recognition** dataset includes 148,092 labeled video clips of …
JFLEG - JHU FLuency-Extended GUG corpus
JFT-300M - **JFT-300M** is an internal Google dataset used for training image …
JHMDB - Joint-annotated Human Motion Data Base
JIGSAWS - JHU-ISI Gesture and Skill Assessment Working Set
JNLPBA - **JNLPBA** is a biomedical dataset that comes from the GENIA …
Jobs - The Jobs dataset by LaLonde [36] is a widely used …
JRDB - JackRabbot Dataset and Benchmark
JSB Chorales - The **JSB** chorales are a set of short, four-voice pieces …
JTA - Joint Track Auto

K

K2HPD - Includes 100K depth images under challenging scenarios. Source: [Human Pose …
KADID-10k - Konstanz artificially distorted image quality database (KADID-10k) contains 81 pristine …
Kaggle-Credit Card Fraud Dataset - The dataset contains transactions made by credit cards in September …
KaggleDBQA - KaggleDBQA: Realistic Text-to-SQL dataset
Kaggle EyePACS - Kaggle EyePACS. Diabetic Retinopathy Detection: identify signs of diabetic retinopathy in eye images
KAIST Multispectral Pedestrian Detection Benchmark - The KAIST Multispectral Pedestrian Dataset is …
KAMEL - Knowledge Analysis with Multitoken Entities in Language Models
KANFace - KANFace Dataset
KanHope - Kannada Hope speech dataset
KDD12 - A clickthrough prediction dataset, for more information please see the …
KDD Cup 1999 - This is the data set used for The Third International …
Kepler Exoplanet Search Results - The Kepler Space Observatory is a NASA-built satellite that …
KG20C - A scholarly knowledge graph benchmark dataset
KIBA - Toward making use of the complementary information captured …
kickstarter - Funding Successful Projects on Kickstarter
Kinetics - Kinetics Human Action Video Dataset
Kinetics-600 - The **Kinetics-600** is a large-scale action recognition dataset which consists …
Kinetics-700 - Kinetics-700 is a video dataset of 650,000 clips that covers …
Kinetics-GEB+ - **Kinetics-GEB+** (Generic Event Boundary Captioning, Grounding and Retrieval) is a …
KinFaceW-I - KinFaceW-I dataset contains 533 pairs of facial images of persons …
KinFaceW-II - KinFaceW-II Dataset consists of 1000 pairs of facial images of …
KINS - Augments KITTI with more instance pixel-level annotation for 8 …
KIT Motion-Language - The KIT Motion-Language is a dataset linking human motion and …
KITTI - **KITTI** (Karlsruhe Institute of Technology and Toyota Technological Institute) is …
KITTI-360 - KITTI-360 is a large-scale dataset that contains rich sensory information …
KITTI360-EX - **KITTI360-EX** is a dataset for outer- and inner FoV expansion. …
KITTI360pose - The KITTI360Pose dataset encompasses a total area of 15.51 square …
KITTI MOTS - KITTI Multi-Object Tracking and Segmentation (MOTS) Evaluation
KITTI Odometry Benchmark - The odometry benchmark consists of 22 stereo sequences, saved in …
KKBox - The task is to predict the chances of a user …
K-Lane - KAIST-Lane
Kleister NDA - **Kleister NDA** is a dataset for Key Information Extraction (KIE). …
Klexikon - Klexikon: A German Dataset for Joint Summarization and Simplification
KLUE - Korean Language Understanding Evaluation
KOHTD - Kazakh Offline Handwritten Text Dataset
KolektorSDD - Kolektor Surface-Defect Dataset
KolektorSDD2 - Kolektor Surface-Defect Dataset 2
KonIQ-10k - Konstanz Image Quality 10k Database
KoNViD-1k - KoNViD-1k VQA Database
Korea Composite Stock Price Index - The data contains the following attributes for Korea Stock Price …
KP20k - **KP20k** is a large-scale scholarly articles dataset with 528K articles …
KPTimes - KPTimes is a large-scale dataset of news texts paired with …
KQA Pro - A large-scale dataset for Complex KBQA. Source: [KQA Pro: A …
Krapivin - A dataset for benchmarking keyphrase extraction and generation techniques from …
KT3DMoSeg - Please find more details of this dataset at https://alex-xun-xu.github.io/ProjectPage/CVPR_18/index.html 3D …
KTH - KTH Action dataset
KTH-TIPS2 - The KTH-TIPS (Textures under varying Illumination, Pose and Scale) image …
KUAKE-QIC - Query Intent Classification Dataset
KUAKE-QQR - Query-Query Relevance Dataset
KUAKE-QTR - Query-Title Relevance Dataset
Kubric - **Kubric** is a data generation pipeline for creating semi-realistic synthetic …
Kumar - The **Kumar** dataset contains 30 1,000×1,000 image tiles from seven …
Kuzushiji-MNIST - Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 …
Kvasir - The Kvasir Dataset
KvasirCapsule-SEG - A video capsule endoscopy dataset for polyp …
Kvasir-Instrument - Consists of annotated frames containing GI procedure tools such as …
Kvasir-SEG - Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and …

L

L3DAS21 - L3DAS21 is a dataset for 3D audio signal processing. It …
LabelMe - **LabelMe** database is a large collection of images with ground …
LaFAN1 - Ubisoft La Forge Animation Dataset
LAG - Large-scale Attention based Glaucoma
LAGENDA - Layer Age and Gender Dataset
LAION-400M - **LAION-400M** is a dataset with CLIP-filtered 400 million image-text pairs, …
LAION COCO - **LAION-COCO** is the world’s largest dataset of 600M generated high-quality …
LAMBADA - The **LAMBADA** (LAnguage Modeling Broadened to Account for Discourse Aspects) …
LAM(line-level) - The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition
LandCover.ai - Dataset for Automatic Mapping of Buildings, Woodlands, Water and Roads from Aerial Imagery
language-modeling-recommendation - This is the Big-Bench version of our language-based movie recommendation …
Laptop-ACOS - Laptop-ACOS is a brand new Laptop dataset collected from the …
Large COVID-19 CT scan slice dataset - "We built a large lung CT scan dataset for COVID-19 …
Large Labelled Logo Dataset (L3D) - It is composed of around 770k color 256x256 RGB …
LargeST - LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting
LaRS - Lakes, Rivers and Seas Dataset
LaSCo - Large Scale Composed Image Retrieval (LaSCo) is a new dataset …
LaSOT - Large-scale Single Object Tracking
LAV-DF - Localized Audio Visual DeepFake Dataset
LAVIB - Large-scale Video Interpolation Benchmark
LCSTS - LCSTS is a large corpus of Chinese short text summarization …
LDC2017T10 - Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 - Abstract Meaning Representation (AMR) Annotation Release 3.0
LDD - LDD: A Grape Diseases Dataset Detection and Instance Segmentation
LeafNet - LeafNet: A large-scale dataset for training image-text models in leaf disease identification
LegalNERo - Romanian Named Entity Recognition in the Legal domain
LeNER-Br - LeNER-Br is a dataset for named entity recognition (NER) in …
LES-AV - This data set comprises 22 fundus images with their corresponding …
Letter - Letter Recognition Data Set
LeukemiaAttri - The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy …
L-Eval - Although large language models (LLMs) demonstrate impressive performance for many …
LEVIR-CD - LEVIR-CD is a new large-scale remote sensing building Change Detection …
LexGLUE - Legal General Language Understanding Evaluation (LexGLUE) benchmark is a collection …
LFW - Labeled Faces in the Wild
LHQ - Landscapes High-Quality
LIAR - LIAR is a publicly available dataset for fake news detection. …
LIAR2 - The [LIAR](https://doi.org/10.18653/v1/P17-2067) dataset has been widely followed by fake news …
LIBRAS-UFOP - A multimodal LIBRAS-UFOP Brazilian sign language dataset of minimal pairs …
LibriCSS - Continuous speech separation (CSS) is an approach to handling overlapped …
LibriTTS - **LibriTTS** is a multi-speaker English corpus of approximately 585 hours …
LIDC-IDRI - The **LIDC-IDRI** dataset contains lesion annotations from four experienced thoracic …
LiDiRus - Linguistic Diagnostic for Russian
Light Snowfall - DENSE
LIMUC - Labeled Images for Ulcerative Colitis
LingOly - This dataset is a benchmark for complex reasoning abilities in …
LINNAEUS - LINNAEUS is a general-purpose dictionary matching software, capable of processing …
Lipogram-e - This is a dataset of 3 English books which do …
Lipophilicity (logd74) - The lipophilicity database refers to a collection of information related …
LitBank - LitBank is an annotated dataset of 100 works of English-language …
LIT-PCBA(ALDH1) - ALDH1 target of LIT-PCBA Dataset
LIT-PCBA(ESR1_ant) - ESR1_ant target of LIT-PCBA Dataset
LIT-PCBA(KAT2A) - KAT2A target of LIT-PCBA Dataset
LIT-PCBA(MAPK1) - MAPK1 target of LIT-PCBA Dataset
LIVE - Laboratory for Image & Video Engineering
LIVECell - Label-free In Vitro image Examples of Cells
LIVE-ETRI - ETRI-LIVE Space-Time Subsampled Video Quality (STSVQ) Database
LIVE-FB LSVQ - LIVE-FB Large-Scale Social Video Quality (LSVQ) Database
LIVE Livestream - **LIVE Livestream** is a database for Video Quality Assessment (VQA), …
Liver-US - Liver Ultrasound Dataset for Medical Image Classification
LIVE-VQC - LIVE Video Quality Challenge (VQC) Database
LIVE-YT-HFR - LIVE YouTube High Frame Rate
LJSpeech - The LJ Speech Dataset
LLAMAS - Labeled Lane Markers
LLFF - Local Light Field Fusion
LLVIP - A Visible-infrared Paired Dataset for Low-light Vision
L+M-24 - Language-molecule models have emerged as an exciting direction for molecular …
LM-KBC 2023 - A diverse set of 21 relations, each covering a different …
Localized Narratives - We propose Localized Narratives, a new form of multimodal image …
LoDoPaB-CT - LoDoPaB-CT is a dataset of computed tomography images and simulated …
LOFAR RFI Detection - Low-Frequency Array (LOFAR) Radio Frequency Interference Detection
LogiQA - LogiQA consists of 8,678 QA instances, covering multiple types of …
LOL - LOw-Light dataset
Lombardia Sentinel-2 Image Time Series for Crop Mapping - Usually, the information related to the crop types available in …
LongBench
Long Video Dataset - We randomly selected three videos from the Internet, that are …
Long Video Dataset (3X) - We randomly selected three videos from the Internet, that are …
Lost and Found - **Lost and Found** is a novel lost-cargo image sequence dataset …
LoTE-Animal - LoTE-Animal: A Long Time-span Dataset for Endangered Animal Behavior Understanding
Lot-insts - Long-Tailed institution names
LoveDA - Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation
LRS2 - Lip Reading Sentences 2
LRS3-TED - LRS3-TED is a multi-modal dataset for visual and audio-visual speech …
LRW - Lip Reading in the Wild
LSA64 - LSA64: A Dataset for Argentinian Sign Language
LSA-T - Lengua de Señas Argentina - Traducción
LSMDC - Large Scale Movie Description Challenge
LSOIE - Large-Scale dataset for Supervised Open Information Extraction
LSSED - LSSED, a challenging large-scale English dataset for speech emotion recognition. …
LSUN - Large-scale Scene UNderstanding Challenge
LTCC - LTCC contains 17,119 person images of 152 identities, and each …
LUN - LUN is used for unreliable news source classification, this dataset …

M

M$^3$-VOS - M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
MacaquePose - **MacaquePose** is an animal pose estimation dataset containing pictures of …
MAD - MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset …
MAESTRO - The **MAESTRO** dataset contains over 200 hours of paired audio …
MAFW - **MAFW** is a large-scale, multi-modal, compound affective database for dynamic …
MagnaTagATune - **MagnaTagATune** dataset contains 25,863 music clips. Each clip is a …
M-AILabs speech dataset - The M-AILABS Speech Dataset is the first large dataset that …
Malaria Dataset - The dataset contains a total of 27,558 cell images with …
Mall - Mall Dataset
MAMe - Museum Art Medium dataset
MAMS - Multi Aspect Multi-Sentiment
Manga109 - **Manga109** has been compiled by the Aizawa Yamasaki Matsui Laboratory, …
ManyTypes4TypeScript - Type Inference dataset for TypeScript. DOI: https://doi.org/10.5281/zenodo.6336113 …
map2seq - 7,672 human written natural language navigation instructions for routes in …
MapEval-API - MapEval-API contains 300 question-answer pairs. The task is to answer …
MapEval-Textual - MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is …
MapEval-Visual - MapEval-Visual contains 400 image-question-answer triplets. Each question is paired with …
MAPS - Midi Aligned Piano Dataset
MARIDA - Marine Debris Archive
Market-1501 - **Market-1501** is a large-scale public benchmark dataset for person re-identification. …
Market1501-Attributes - The **Market1501-Attributes** dataset is built from the Market1501 dataset. Market1501 …
Market-1501-C - **Market-1501-C** is an evaluation set that consists of algorithmically generated …
Marmoset-8K - DeepLabCut multi-animal Marmoset dataset
MARS - Motion Analysis and Re-identification Set
Mars DTM Estimation - This dataset is useful for doing research in the field …
MARS (Multimodal Analogical Reasoning dataSet) - Analogical reasoning is fundamental to human cognition and holds an …
MAS3K - MAS3K: An Open Dataset for Marine Animal Segmentation
MaSaC_ERC - The E-MASAC Dataset is a collection of code-mixed conversations sourced …
Massachusetts Roads Dataset - Road and Building Detection Datasets - Massachusetts Roads Dataset
MASSIVE - MASSIVE is a parallel dataset of > 1M utterances across …
MassSpecGym - MassSpecGym: A benchmark for the discovery and identification of molecules
Materials Project - The **Materials Project** is a collection of chemical compounds labelled …
MATH - MATH is a new dataset of 12,500 challenging competition mathematics …
Math23K - Math23K for Math Word Problem Solving
Mathematics Dataset - This dataset code generates mathematical question and answer pairs, from …
MathMC - Existing arithmetic benchmarks have a limited number of multiple-choice questions. …
MathQA - MathQA significantly enhances the AQuA dataset with fully-specified operational programs. …
MathToF - Existing arithmetic benchmarks have a limited number of True-or-False questions. …
MATH-V - Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 …
MATRES - Multi-Axis Temporal RElations for Start-points
Matterport3D - The **Matterport3D** dataset is a large RGB-D dataset for scene …
MAVE - MAVE: A Product Dataset for Multi-source Attribute Value Extraction
MAWPS - MAth Word ProblemS
MBPP - Mostly Basic Python Programming
MBPP-ET - Extension test cases of MBPP, as well as generated code.
MCubeS - Multimodal Material Segmentation Dataset
MCubeS (P) - Multimodal Material Segmentation Dataset
MDBD - Multicue Dataset for Edge Detection
MEAD - A Large-scale Audio-visual Dataset for Emotional Talking-face Generation
MECCANO - The MECCANO dataset is the first dataset of egocentric videos …
MED - Monotonicity Entailment Dataset
MedConceptsQA - MedConceptsQA - Open Source Medical Concepts QA Benchmark The benchmark …
Mediapi-RGB - Mediapi-RGB is a bilingual corpus of French Sign Language (LSF) …
MediaSpeech - **MediaSpeech** is a media speech dataset (you might have guessed …
MediBeng - Synthetic Code-Switched Bengali-English Speech Conversations for Healthcare Applications
Medical Cost Personal Dataset - This dataset contains demographic and personal health information for individuals, …
Medical Segmentation Decathlon - The Medical Segmentation Decathlon is a collection of medical image …
Medico automatic polyp segmentation challenge (dataset) - The “Medico automatic polyp segmentation challenge” aims to develop computer-aided …
MedMentions - MedMentions is a new manually annotated resource for the recognition …
MedNLI - Medical Natural Language Inference
MedQA - Multiple choice question answering based on the United States Medical …
MedSecId - The process by which sections in a document are demarcated …
MedTurkQuAD: Medical Turkish Question-Answering Dataset - A comprehensive Turkish dataset for question-answering tasks in medical domain
MeerKAT: Meerkat Kalahari Audio Transcripts - A large-scale reference dataset for bioacoustics. MeerKAT is a 1068h …
MeetingBank - MeetingBank, a benchmark dataset created from the city councils of …
MegaFace - **MegaFace** was a publicly available dataset which was used for …
MELD - Multimodal EmotionLines Dataset
MemeTracker - The Memetracker corpus contains articles from mainstream media and blogs …
MemexQA - A large, realistic multimodal dataset consisting of real personal photos …
MentSum - Mental Health Summarization Dataset
MeQSum - **MeQSum** is a dataset for medical question summarization. It contains …
MERL-RAV - MERL Reannotation of AFLW with Visibility
Meta-Dataset - The **Meta-Dataset** benchmark is a large few-shot learning benchmark and …
MetaQA - MoviE Text Audio QA
MetFaces - MetFaces is an image dataset of human faces extracted from …
METR-LA - **METR-LA** is a dataset for traffic prediction.
Mewsli-9 - A large new multilingual dataset for multilingual entity linking. Source: …
MFA - Many Faces of Anger
MFQE v2 - Multi-Frame Quality Enhancement v2 Dataset
MFR - Ongoing version of ICCV-2021 Masked Face Recognition Challenge & Workshop (MFR)
MFSD - Masked Face Segmentation Dataset
MFW+ (M-M) - MFW+ is a benchmark dataset for masked face recognition and …
MFW+ (U-M) - MFW+ is a benchmark dataset for masked face recognition and …
MGif - MGif is a dataset of videos containing movements of different …
MGTAB - Multi-Relational Graph-Based Twitter Account Detection Benchmark
MHIST - Minimalist Histopathology image analysis dataset
MIB Dataset - You need to request access to download and use the …
MICCAI 2015 Head and Neck Challenge - This database is provided and maintained by Dr. Gregory C …
MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge - Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans …
Microsoft Malware Classification Challenge - The Microsoft Malware Classification Challenge was announced in 2015 along …
Middlebury - Middlebury Stereo
Middlebury 2014 - The **Middlebury 2014** dataset contains a set of 23 high …
Mila Simulated Floods - Mila Simulated Floods Dataset is a 1.5 square km virtual …
Mimetics
MIMIC-CXR - **MIMIC-CXR** from Massachusetts Institute of Technology presents 371,920 chest X-rays …
MIMIC-CXR-LT - long-tailed version of MIMIC-CXR
MIMIC-III - The Medical Information Mart for Intensive Care III
MIMIC-IV ICD-10 - MIMIC-IV ICD-10 contains 122,279 discharge summaries—free-text medical documents—annotated with ICD-10 …
MIMIC-IV-ICD-10-full - The MIMIC-IV-ICD10-full dataset, including occurring labels.
MIMIC-IV-ICD10-top50 - The MIMIC-IV-ICD10 dataset, featuring the top 50 most frequently occurring …
MIMIC-IV ICD-9 - MIMIC-IV ICD-9 contains 209,326 discharge summaries—free-text medical documents—annotated with ICD-9 …
MIMIC-IV-ICD9-full - The MIMIC-IV-ICD9 dataset, including all occurring labels.
MIMIC-IV-ICD9-top50 - The MIMIC-IV-ICD9 dataset, featuring the top 50 most frequently occurring …
MINDS-Libras - Brazilian Sign Language (Libras) data set with 20 signs for …
minesweeper - minesweeper is a synthetic graph emulating the eponymous game.
mini-ImageNet-LT - mini-ImageNet was proposed by Matching networks for one-shot learning for …
Mip-NeRF 360 - Unbounded Anti-Aliased Neural Radiance Fields
MISAW - MIcro-Surgical Anastomose Workflow recognition on training sessions
MIT-Adobe FiveK - The **MIT-Adobe FiveK** dataset consists of 5,000 photographs taken with …
MIT-BIH Arrhythmia Database - The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel …
MitoEM - Contains mitochondria instances. Source: [MitoEM](https://mitoem.grand-challenge.org/)
MIT-States - The **MIT-States** dataset has 245 object classes, 115 attribute classes …
MixATIS - Dataset is constructed from single intent dataset ATIS. This is …
MixedWM38 - MixedWM38 Dataset (WaferMap) has more than 38000 wafer maps, including 1 …
MixSNIPS - Dataset is constructed from single intent dataset SNIPS. This is …
MJU-Waste - **MJU-Waste** is an RGBD waste object segmentation dataset that is …
MLB Dataset - A new dataset on the baseball domain. Source: [Data-to-text Generation …
MLFW - Masked LFW
MLO-Cn2 - Mauna Loa Seeing Study
MLRSNet - **MLRSNet** is a multi-label high spatial resolution remote sensing …
MLSum-it - The MLSum-it dataset is the translated version (Helsinki-NLP/opus-mt-es-it) of the …
MLT17
MMBench - **MMBench** is a multi-modality benchmark. It methodically develops a comprehensive …
MMConv - The main goal of the data collection is to acquire …
MMFlood - MMFlood is a remote sensing dataset derived from Sentinel-1 (VV-VH), MapZen …
MMI - MMI Facial Expression Database
MMKG - MMKG is a collection of three knowledge graphs for link …
MML - Massive Multitask Language Understanding
MMLU-Pro - The MMLU-Pro dataset is an enhanced version of the Massive …
MMNeedle - Multimodal Needle in a Haystack
MM-OR - Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding …
MMPD-Dataset - MMPD Dataset is proposed in ECCV'2024 "When Pedestrian Detection Meets …
MMPTRACK - Multi-camera Multiple People Tracking Dataset
MM-Vet - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
MM-Vet v2 - MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models …
MNIST - The **MNIST** database (**Modified National Institute of Standards and Technology** …
MoB - Malicious or Benign Cartoon Videos
MoCA-Mask - Moving Camouflaged Animals (MoCA)-Mask
ModelNet40-C - ModelNet-C
MoleculeNet - **MoleculeNet** is a large scale benchmark for molecular machine learning. …
Molweni - A machine reading comprehension (MRC) dataset with discourse structure built …
Montgomery County X-ray Set - X-ray images in this data set have been acquired from …
Montreal Archive of Sleep Studies - The Montreal Archive of Sleep Studies (MASS) is an open-access …
MoNuSAC - MoNuSAC 2020
MoNuSeg - The dataset for this challenge was obtained by carefully annotating …
MORPH - **MORPH** is a facial age estimation dataset, which contains 55,134 …
Morphosyntactic-analysis-dataset - This dataset is for evaluation of morphosyntactic analyzers.
MOSE - Complex Video Object Segmentation
MosMedData - MosMedData contains anonymised human lung computed tomography (CT) scans with …
MOT15 - Multiple Object Tracking 15
MOT16 - Multiple Object Tracking 2016
MOT17 - Multiple Object Tracking 17
MOT20 - **MOT20** is a dataset for multiple object tracking. The dataset …
Motion-X - Motion-X is a large-scale 3D expressive whole-body motion dataset, which …
MovieLens - The **MovieLens** datasets, first released in 1998, describe people’s expressed …
MovieNet - **MovieNet** is a holistic dataset for movie understanding. MovieNet contains …
Moving MNIST - The **Moving MNIST** dataset contains 10,000 video sequences, each consisting …
MP-100 - Multi-category Pose Dataset
MP20 - Metastable crystal structures from Materials Project
M-PCCD - MPEG Point Cloud Compression Dataset
MPDD - Metal Parts Defect Detection Dataset
MPEblink - The pioneering eyeblink detection dataset is characterized by three key …
MPII - MPII Human Pose
MPII Cooking 2 Dataset - A dataset which provides detailed annotations for activity recognition. Source: …
MPIIGaze - **MPIIGaze** is a dataset for appearance-based gaze estimation in the …
MPII Human Pose - **MPII Human Pose** Dataset is a dataset for human pose …
MPI-INF-3DHP - **MPI-INF-3DHP** is a 3D human body pose estimation dataset consisting …
MPI Sintel - MPI (Max Planck Institute) Sintel is a dataset for optical …
MPSGaze - Multi-Person Swap Gaze Dataset
MPV - Multi-Pose Virtual try on
MR - MR Movie Reviews
Mr. HiSum - Mr. HiSum is a large-scale video highlight detection and summarization …
MRNet - The MRNet dataset consists of 1,370 knee MRI exams performed …
MRPC - Microsoft Research Paraphrase Corpus
MRQA - The MRQA (Machine Reading for Question Answering) dataset is a …
MRR-Benchmark - Multi-Modal Reading Benchmark
MSASL-1000 - **MSASL** is a real-life large-scale sign language data set comprising …
MSAW - Multi-Sensor All Weather Mapping
MSCOCO
MSDA - Multi-source domain adaptation dataset for text recognition
MSD (Mirror Segmentation Dataset) - We construct the first large-scale mirror dataset, named MSD. It …
MSL - Mars Science Laboratory
MSLR-WEB30K - The **MSLR-WEB30K** dataset consists of 30,000 search queries over the …
MSLS - Mapillary Street-level Sequences Dataset
MS MARCO - Microsoft Machine Reading Comprehension Dataset
MSMT17 - Multi Scene Multi Time dataset for person re-id
MSMT17-C - **MSMT17-C** is an evaluation set that consists of algorithmically generated …
MSP-IMPROV - MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception
MSP-Podcast - A large naturalistic speech emotional dataset
MSRA-TD500 - MSRA Text Detection 500 Database
MSRC-12 - MSRC-12 Kinect Gesture Dataset
MSR-VTT - **MSR-VTT** (Microsoft Research Video to Text) is a large-scale dataset …
MSR-VTT Adverbs - MSR-VTT Adverbs is a subset from MSR-VTT with extracted verb-adverb …
MSRVTT-CTN - MSRVTT Causal-Temporal Narrative
MSRVTT-MC - The MSRVTT-MC (Multiple Choice) dataset is a video question-answering dataset …
MSRVTT-QA - The **MSR-VTT-QA** dataset is a benchmark for the task of …
MSU BASED - MSU BASED Video Deblurring Dataset and Benchmark
MSU FR VQA Database - MSU Full-Reference Video Quality Assessment Database
MSU HDR Video Reconstruction Benchmark - This is a dataset for a video inverse-tone-mapping task. The …
MSU NR VQA Database - MSU No-Reference Video Quality Assessment Database
MSU SR-QA Dataset - MSU Super-Resolution Quality Assessment Dataset
MSU Super-Resolution for Video Compression - This is a dataset for a super-resolution task. The dataset …
MSU Video Frame Interpolation - This is a dataset for video frame interpolation task. The …
MSU Video Super Resolution Benchmark: Detail Restoration - This is a dataset for a video super-resolution task. The …
MSU Video Upscalers: Quality Enhancement - The dataset aims to find the algorithms that produce the …
MSVD - Microsoft Research Video Description Corpus
MSVD-CTN - MSVD Causal-Temporal Narrative
MSVD-Indonesian - MSVD-Indonesian is derived from the MSVD dataset, which is obtained …
MSVD-QA - The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. …
MT-Bench - This dataset contains 3.3K expert-level pairwise human preferences for model …
MTEB - Massive Text Embedding Benchmark
MTL-AQA - A new multitask action quality assessment (AQA) dataset, the largest …
MUC-4 - Fourth Message Understanding Conference
MuCGEC - Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction
MuJoCo - **MuJoCo** (multi-joint dynamics with contact) is a physics engine used …
Multi30K - Multi30K is a large-scale multilingual multimodal dataset for interdisciplinary machine …
Multi Lingual Bug Reports - The dataset used in this study comprises …
Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems - The dataset contains training and evaluation data for 12 languages: …
MultiMNIST - The **MultiMNIST** dataset is generated from MNIST. The training and …
Multimodal PISA - Multimodal Piano Skills Assessment
MultiNLI - Multi-Genre Natural Language Inference
MultiOFF - Introduced in Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content …
Multi-omics mRNA, miRNA, and DNA Methylation Dataset - The dataset contains multi-omics data, including mRNA, miRNA, and DNA …
Multi-PIE - The **Multi-PIE** (Multi Pose, Illumination, Expressions) dataset consists of face …
MultiQ - MultiQ is a multi-hop QA dataset for Russian, suitable for …
MultiRC - Multi-Sentence Reading Comprehension
MultiScan - We introduce MultiScan, a scalable RGBD dataset construction pipeline leveraging …
MultiSports - Spatio-temporal action detection is an important and challenging problem in …
MultiSubs - MultiSubs: A Large-scale Multimodal and Multilingual Dataset
MultiTHUMOS - The **MultiTHUMOS** dataset contains dense, multilabel, frame-level action annotations for …
MultiTQ - MULTITQ is a large-scale dataset featuring ample relevant facts and …
MultiviewX - **MultiviewX** is a synthetic Multiview pedestrian detection dataset. It is …
MuMiN-large - This is the large version of the [MuMiN dataset](https://paperswithcode.com/dataset/mumin).
MuMiN-medium - This is the medium version of the [MuMiN dataset](https://paperswithcode.com/dataset/mumin).
MuMiN-small - This is the small version of the [MuMiN dataset](https://paperswithcode.com/dataset/mumin).
MuPoTS-3D - Multi-person Pose estimation Test Set in 3D
MuReD Dataset - Multi-Label Retinal Diseases Dataset
MUSDB18 - The **MUSDB18** is a dataset of 150 full-length music …
MUSDB18-HQ - **MUSDB18-HQ** is a high-quality version of the MUSDB18 music tracks …
MuSeRC - Russian Multi-Sentence Reading Comprehension
MUSES - MUlti-Shot EventS
MUSES: MUlti-SEnsor Semantic perception dataset - The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
Music21 - **Music21** is an untrimmed video dataset crawled by keyword query …
MUSIC-AVQA - The large-scale MUSIC-AVQA dataset of musical performance contains 45,867 question-answer …
MusicBench - The MusicBench dataset is a music audio-text pair dataset that …
MusicBrainz20K - The MusicBrainz20K dataset for entity resolution and entity clustering is …
MusicCaps - **MusicCaps** is a dataset composed of 5.5k music-text pairs, with …
MusicNet - MusicNet is a collection of 330 freely-licensed classical music recordings, …
MusicQA - We propose the MusicQA dataset to train Music-enabled question-answering models …
MuSiQue-Ans - MuSiQue-Ans is a new multihop QA dataset with ~25K 2-4 …
Musk v1 - The Musk dataset describes a set of molecules, and the …
Musk v2 - The Musk2 dataset is a set of 102 molecules of …
MUStARD++ - **MUStARD++** is a multimodal sarcasm detection dataset (MUStARD) pre-annotated with …
MuST-C - **MuST-C** currently represents the largest publicly available multilingual corpus (one-to-many) …
MUTAG - In particular, **MUTAG** is a collection of nitroaromatic compounds and …
Mutagenicity - **Mutagenicity** is a chemical compound dataset of drugs, which can …
MUV - The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset …
M-VAD Names - M-VAD Names Dataset
MVBench - MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was …
MVK - Marine Video Kit
MVSEC - Multi Vehicle Stereo Event Camera
MVSEC-SEG - Based on the [MVSEC](https://daniilidis-group.github.io/mvsec/) dataset, we select some image-event pairs …
MVTEC 3D-AD - The MVTec 3D Anomaly Detection Dataset
MVTec-AC - MVTec-AC is a curated refinement of the widely-used MVTec-AD dataset, …
MVTecAD - MVTec Anomaly Detection Dataset
MVTec LOCO AD - MVTec Logical Constraints Anomaly Detection

N

NABirds - North America Birds
Nam - A holistic approach to cross-channel image noise modeling and its application to image denoising
NAO - Natural Adversarial Object
NarrativeQA - The NarrativeQA dataset includes a list of documents with Wikipedia …
NASA C-MAPSS - Turbofan Engine Degradation Simulation Data Set
NASA C-MAPSS-2 - Turbofan Engine Degradation Simulation Data Set-2
NASA Li-ion Dataset - Experiments on Li-Ion batteries. Charging and discharging at different temperatures. …
NASA Perseverance - Samples from NASA Perseverance and set of GAN generated synthetic …
NAS-Bench-101 - **NAS-Bench-101** is the first public architecture dataset for NAS research. …
NAS-Bench-201 - **NAS-Bench-201** is a benchmark (and search space) for neural architecture …
Natural Questions - The **Natural Questions** corpus is a question answering dataset containing …
NBA - NBA: This is extended from a Kaggle dataset containing …
NBA SportVU - The NBA SportVU dataset contains player and ball trajectories for …
NBMOD - Noisy Background Multi-Object Dataset for grasp detection
NC4K - As far as we know, there only exists one large …
N-Caltech 101 - Neuromorphic-Caltech101
N-CARS - A large real-world event-based dataset for object classification. Source: [HATS: …
NCBI Disease - The **NCBI Disease** corpus consists of 793 PubMed abstracts, which …
NCI1 - The **NCI1** dataset comes from the cheminformatics domain, where each …
NCI109 - TUDataset: A collection of benchmark datasets for learning with graphs
NCT-CRC-HE-100K - The NCT-CRC-HE-100K dataset is a set of 100,000 non-overlapping image …
NELL - Never Ending Language Learning
NELL-995 - NELL-995 KG Completion Dataset
NEMO-Corpus - NEMO Hebrew NER and Morphology Corpus
NeRF - Neural Radiance Fields
New3 - New3, a set of 527 instances from AMR 3.0, whose …
New Plant Diseases Dataset - Image dataset containing different healthy and unhealthy crop leaves.
Newsela - The **Newsela dataset** was introduced by Xu et al. in …
NewsQA - The **NewsQA** dataset is a crowd-sourced machine reading comprehension dataset …
NExT-QA - **NExT-QA** is a VideoQA benchmark targeting the explanation of video …
NExT-QA (Open-ended VideoQA) - NExT-QA is a VideoQA benchmark targeting the explanation of video …
NH-HAZE - NH-HAZE is an image dehazing dataset. Since in many real …
Nightrain - Synthetically Generated Night-time Weather Degraded Database
Nighttime Driving - **Nighttime Driving** is a dataset of road scenes consisting of …
NIH-CXR-LT - Long-tailed (LT) NIH ChestXRay14
NII-CU MAPD - NII-CU Multispectral Aerial Person Detection Dataset
Nikon RAW Low Light - Nikon Camera Low Light RAW Image Dataset
N-ImageNet - Large-Scale Dataset for Event-Based Object Recognition
NIR2RGB VCIP Challange Dataset - VCIP2020 Grand Challenge on the NIR image colorization dataset
NIST (OSN-transmitted - Facebook) - This dataset is an OSN-transmitted (Online Social Network) version of …
NIST (OSN-transmitted - Wechat) - This dataset is an OSN-transmitted (Online Social Network) version of …
NIST (OSN-transmitted - Weibo) - This dataset is an OSN-transmitted (Online Social Network) version of …
NIST (OSN-transmitted - Whatsapp) - This dataset is an OSN-transmitted (Online Social Network) version of …
NKL - NKL (short for NanKai Lines) is a dataset for semantic …
NL-Drive - Nonlinear Autonomous Driving Dataset
NLVR - Natural Language for Visual Reasoning
N-MNIST - Neuromorphic-MNIST
NOAA Atmospheric Temperature Dataset - This dataset contains meteorological observations (temperature) at the land-based weather …
No Background RGB Arabic Alphabets Sign Language Dataset - The AASL-Clear dataset is a collection of RGB images featuring …
Nordic Language Identification - Automatic language identification is a challenging problem. Discriminating between closely …
Nordland
Nordland* (2760 queries) - The Nordland used in SALAD and BoQ (2760 queries, 27592 …
Nottingham - The **Nottingham** Dataset is a collection of 1200 American and …
Novel COVID-19 Chestxray Repository - Authors of the Dataset: Pratik Bhowal (B.E., Dept of …
NoW Benchmark - The goal of this benchmark is to introduce a standard …
NPO - Negative and Positive Obstacles
NSynth - **NSynth** is a dataset of one shot instrumental notes, containing …
NTU RGB+D - **NTU RGB+D** is a large-scale dataset for RGB-D human action …
NTU RGB+D 120 - NTU RGB+D 120 is a large-scale dataset for RGB+D human …
NTU RGB+D 2D - **NTU RGB+D 2D** is a curated version of [NTU RGB+D](https://paperswithcode.com/dataset/ntu-rgb-d) …
N-UCLA - Northwestern-UCLA Multiview Action 3D Dataset
NUS - The dataset was constructed by first finding suitable publications and …
nuScenes - The **nuScenes** dataset is a large-scale autonomous driving dataset. The …
nuScenes LiDAR only - Robust detection and tracking of objects is crucial for the …
NUS-WIDE - The **NUS-WIDE** dataset contains 269,648 images with a total of …
NYCBike1 - Bike flow data of New York City with grid 16x8.
NYCBike2 - Bike flow data of New York City.
NYCTaxi - Taxi flow data of New York City with grid 20x10.
NYT10-HRL - A dataset from A Hierarchical Framework for Relation Extraction with …
NYT11-HRL - Preprocessed version of NYT11. Each relational triple is formatted as …
NYUDv2-IS - An RGB-D dataset converted from NYUDv2 into COCO-style instance segmentation …
NYUv2 - NYU-Depth V2

O

OAB Exams - The **OAB Exams dataset** is a valuable resource used in …
OAD dataset - The Online Action Detection Dataset
OA-Mine - annotations - The dataset contains Amazon products from 10 product categories with …
OASIS - Open Annotations of Single Image Surfaces
Objaverse - **Objaverse** is a large dataset of objects with 800K+ (and …
Object Discovery - The **Object Discovery** dataset was collected by downloading images from …
Object HalBench - Object HalBench is a benchmark used to evaluate the performance …
ObjectNet - **ObjectNet** is a test set of images collected directly using …
Objects365 - Objects365 is a large-scale object detection dataset, which has …
ObjectsRoom - The **ObjectsRoom** dataset is based on the MuJoCo environment used …
OBJ-MDA - The dataset contains images of 16 artworks included in the …
OC20 - Open Catalyst 2020
Occluded COCO - **Occluded COCO** is an automatically generated subset of the COCO val dataset, …
Occluded-DukeMTMC - Occluded-DukeMTMC contains 15,618 training images, 17,661 gallery images, and 2,210 …
Occluded-PoseTrack-ReID - Occluded-PoseTrack Re-Identification
Occluded REID - **Occluded REID** is an occluded person dataset captured by mobile …
OCHuman - This dataset focuses on heavily occluded humans with comprehensive annotations …
ODAQ: Open Dataset of Audio Quality - A dataset containing the results of a MUSHRA listening test …
ODDS - Outlier Detection DataSets (ODDS)
Office-31 - Office Dataset
Office-Caltech-10 - **Office-Caltech-10** is a standard benchmark for domain adaptation, which consists of …
Office-Home - **Office-Home** is a benchmark dataset for domain adaptation which contains …
Ohsumed - **Ohsumed** includes medical abstracts from the MeSH categories of the …
OIE2016 - OIE2016 is the first large-scale OpenIE benchmark. It is created …
Oktoberfest Food Dataset - A realistic, diverse, and challenging dataset for object detection on …
Okutama-Action - A new video dataset for aerial view concurrent human action …
Okutama Drone and Swiss Drone Dataset - The Swiss Drone data set was recorded around Cheseaux-sur-Lausanne in …
OK-VQA - Outside Knowledge Visual Question Answering
OLID - Offensive Language Identification Dataset
Olivetti face - This dataset contains a set of face images taken between …
OmniArt - Presents half a million samples and structured meta-data to encourage …
OmniBenchmark - Omni-Realm Benchmark (OmniBenchmark) is a diverse (21 semantic realm-wise datasets) …
Omnicount-191 - To effectively evaluate OmniCount across open-vocabulary, supervised, and few-shot counting …
ONCE - One Million Scenes
OntoGUM - **OntoGUM** is an OntoNotes-like coreference dataset converted from GUM, an …
OntoNotes 5.0 - **OntoNotes 5.0** is a large corpus comprising various genres of …
OOD-CV - Out Of Distribution Generalization in Computer Vision
OoDIS - Anomaly Instance Segmentation Benchmark
Open6DOR V2 - Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach
OpenAPI completion refined - A human-refined dataset of OpenAPI definitions based on the APIs.guru …
OpenBookQA - OBQA
OpenEDS - OpenEDS (Open Eye Dataset) is a large scale data set …
Open Entity - The **Open Entity** dataset is a collection of about 6,000 …
OpenImages-v6 - OpenImages V6 is a large-scale dataset that consists of 9 …
OpenLane - **OpenLane** is the first real-world and the largest-scale 3D …
OpenLane-V2 val - **OpenLane-V2** is the world's first perception and reasoning benchmark for …
OpenMIC-2018 - **OpenMIC-2018** is an instrument recognition dataset containing 20,000 examples of …
OpenSLR - Open Speech and Language Resources
OpenSubtitles - OpenSubtitles is a collection of multilingual parallel corpora. The dataset is …
OpenTrench3D - OpenTrench3D, the first publicly available point cloud dataset of underground …
OpenWebText - **OpenWebText** is an open-source recreation of the [WebText](/dataset/webtext) corpus. The …
OPRA - Online Product Reviews for Affordances
OPT - Object Pose Tracking
OPV2V - **OPV2V** is a large-scale open simulated dataset for Vehicle-to-Vehicle perception. …
OQM9HK - This is a large-scale dataset of quantum-mechanically calculated properties (DFT …
OQMD v1.2 - The Open Quantum Materials Database
Oracle-MNIST - Oracle-MNIST: a Realistic Image Dataset for Benchmarking Machine Learning Algorithms
OrangeSum - Source: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](/paper/barthez-a-skilled-pretrained-french-sequence) **OrangeSum** is …
ORCAS-I - Queries Annotated with Intent using Weak Supervision
OrdinalDataset - Ordinal Encoding Data set
OSM - The OSM dataset, sourced from OpenStreetMap, is composed of the …
OTB-2013 - OTB2013 is the previous version of the current OTB2015 Visual …
OTB-2015 - **OTB-2015**, also referred to as Visual Tracker Benchmark, is a visual …
OTT-QA - The Open Table-and-Text Question Answering (**OTT-QA**) dataset contains open questions …
Oulu-CASIA - Oulu-CASIA NIR&VIS facial expression database
OUMVLP - The OU-ISIR Gait Database, Multi-View Large Population Dataset (OU-MVLP) is …
OVAD benchmark - Open-Vocabulary Attribute Detection
OVBench - OVBench is a benchmark tailored for **real-time video understanding**: - …
Overruling - The **Overruling** dataset is a law dataset corresponding to the …
Oxford 102 Flower - 102 Category Flower Dataset
Oxford5k - Oxford Buildings
Oxford-IIIT Pet Dataset - The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 …
Oxford-IIIT Pets - The Oxford-IIIT Pet Dataset is a 37-category pet dataset with …
Oxford Radar RobotCar Dataset - The Oxford Radar RobotCar Dataset is a radar extension to …
Oxford RobotCar Dataset - The Oxford RobotCar Dataset contains over 100 repetitions of a …

P

P3M-10k - Privacy-Preserving Portrait Matting Dataset
PA-100K - PA-100K Dataset
PackIt - The ability to jointly understand the geometry of objects and …
PACS - Photo-Art-Cartoon-Sketch
PAD Dataset - Pose-agnostic/Multi-pose Anomaly Detection Dataset
PAMAP2 - The PAMAP2 Physical Activity Monitoring dataset contains data of 18 …
PanNuke - PanNuke is a semi-automatically generated nuclei instance segmentation and …
Panoptic - CMU Panoptic Studio
Paper Field - **Paper Field** is built from the Microsoft Academic Graph and …
Paralex - Paralex learns from a collection of 18 million question-paraphrase pairs …
ParallelCorpus-Python - The Python dataset introduced in the Parallel Corpus paper ([A …
ParaMAWPS - Paraphrased Math Word Problem Solving Repository
Paris6k
Paris-Lille-3D - The **Paris-Lille-3D** is a Benchmark on Point Cloud Classification. The …
Partial-REID - Partial REID is a specially designed partial person reidentification dataset …
PartNet - PartNet is a consistent, large-scale dataset of 3D objects annotated …
PARus - Choice of Plausible Alternatives for Russian language
PASCAL Context - The **PASCAL Context** dataset is an extension of the PASCAL …
PASCAL Face - The PASCAL FACE dataset is a dataset for face detection …
Pascal Panoptic Parts - The Pascal Panoptic Parts dataset consists of annotations for the …
PASCAL-Part - **PASCAL-Part** is a set of additional annotations for PASCAL VOC …
PASCAL-S - **PASCAL-S** is a dataset for salient object detection consisting of …
PASCAL VOC - PASCAL Visual Object Classes Challenge
PASCAL VOC 2007 - **PASCAL VOC 2007** is a dataset for image recognition. The …
PASCAL VOC 2011 - **PASCAL VOC 2011** is an image segmentation dataset. It contains …
PASCAL VOC 2012 test - SCC Data Set
PASTIS - Panoptic Segmentation of satellite image TImes Series
PASTIS-R - Panoptic Segmentation of Radar and Optical Satellite image TIme Series
pathbased - **pathbased** is a 3-cluster data set. The data set consists …
PathQuestion - Adopts two subsets of Freebase (Bollacker et al., 2008) as …
PATTERN - PATTERN is a node classification task generated with [Stochastic Block …
PCam - PatchCamelyon
PCBA - PCBA dataset 11 is a collection of high-quality dose-response data, …
PCD - Poem Comprehensive Dataset
PCQM4Mv2-LSC - PCQM4Mv2 is a quantum chemistry dataset originally curated under the …
P-DukeMTMC-reID - P-DukeMTMC-reID is a modified version based on DukeMTMC-reID dataset. There …
PECC - PECC: Problem Extraction and Coding Challenges
PeerQA - We present PeerQA, a real-world, scientific, document-level Question Answering (QA) …
Peir Gross - Peir Gross (Jing et al., 2018) was collected with descriptions …
PeMS04 - PeMS04 is a traffic forecasting benchmark.
PeMS07 - PeMS07 is a traffic forecasting benchmark.
PeMS08 - PeMS08 is a traffic forecasting dataset.
PEMS-BAY - PEMS-BAY is a dataset for traffic prediction.
PeMSD4 - The dataset refers to the traffic speed data in San …
PeMSD7 - PeMSD7 is traffic data in District 7 of California consisting …
PeMSD8 - This dataset contains the traffic data in San Bernardino from …
PEN - Problems with Explanations for Numbers
Penn94 - Node classification on Penn94
Penn Action - The **Penn Action** Dataset contains 2326 video sequences of 15 …
Penn Treebank - The English **Penn Treebank** (**PTB**) corpus, and in particular the …
PeopleArt - People-Art is an object detection dataset which consists of people …
Perception Test - Perception Test is a benchmark designed to evaluate the perception …
Permuted MNIST - **Permuted MNIST** is an MNIST variant that consists of 70,000 …
PerSeg - PerSeg is a dataset for personalized segmentation. The raw images …
Persian-ATIS - The PATIS is a Persian language dataset for intent detection …
Persian Font Recognition (PFR) - Persian Font Recognition (PFR): a dataset in order to solve …
Persian Text Image Segmentation (PTI SEG) - Persian Text Image Segmentation (PTI SEG): this dataset is part …
PersonPath22 - PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos …
Perspectrum - Perspectrum is a dataset of claims, perspectives and evidence, making …
PETA - Pedestrian Attribute
PETRAW - PEg TRAnsfer Workflow recognition by different modalities
PG-19 - A new open-vocabulary language modelling benchmark derived from books. Source: …
PGDP5K - Plane Geometry Diagram Parsing Dataset
PGPS9K - A new large scale plane geometry problem solving dataset called …
PGR - Phenotype-Gene Relations
PH2 - The increasing incidence of melanoma has recently promoted the development …
PhC-C2DH-U373 - Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate Dr. S. Kumar. …
PhilPapers - **PhilPapers** is a remarkable resource for the philosophical community. Let …
PhoMT - **PhoMT** is a high-quality and large-scale Vietnamese-English parallel dataset of …
PhotoChat - PhotoChat, the first dataset that casts light on the photo …
PhotoShape - The PhotoShape dataset consists of photorealistic, relightable, 3D shapes produced …
PhraseCut - **PhraseCut** is a dataset consisting of 77,262 images and 345,486 …
PhysioNet Challenge 2012 - The **PhysioNet Challenge 2012** dataset is publicly available and contains …
PhysioNet Challenge 2018 - You Snooze You Win - The PhysioNet Computing in Cardiology Challenge 2018
PhysioNet Challenge 2020 - The data for this Challenge are from …
PhysioNet Challenge 2021 - The PhysioNet/Computing in Cardiology Challenge 2021
PIE - Pedestrian Intention Estimation
Pinterest - The **Pinterest** dataset contains more than 1 million images associated …
PIPA - People in Photo Album
PIQA - Physical Interaction: Question Answering
PISC - People in Social Context
PIT - Paraphrase and Semantic Similarity in Twitter
Pix3D - The **Pix3D** dataset is a large-scale benchmark of diverse image-shape …
PixelRec - an image cover dataset in short video recommendation
pixraw10P - face image datasets
PKLot - A Robust Dataset for Parking Lot Classification
PKU-MMD - The **PKU-MMD** dataset is a large skeleton-based action detection dataset. …
PKU-Reid - This dataset contains 114 individuals including 1824 images captured from …
PKU SketchRe-ID Dataset - The PKU Sketch Re-ID dataset is constructed by National Engineering …
Placenta - **Placenta** is a benchmark dataset for node classification in an …
Place Pulse 2.0 - Place Pulse is a crowdsourcing effort that aims to map …
Places - The **Places** dataset is proposed for scene recognition and contains …
Places205 - The **Places205** dataset is a large-scale scene-centric dataset with 205 …
Places365 - The **Places365** dataset is a scene recognition dataset. It is …
Places-LT - **Places-LT** has an imbalanced training set with 62,500 images for …
PLAD - Point Line and Depth dataset
PlantDoc - PlantDoc is a dataset for visual plant disease detection. The …
PlantVillage - The PlantVillage dataset consists of 54303 healthy and unhealthy leaf …
PLOS - Scientific Lay Summarization
PlotQA - PlotQA is a VQA dataset with 28.9 million question-answer pairs …
PMC-VQA - **PMC-VQA** is a large-scale medical visual question-answering dataset that contains …
PMD - We propose a large-scale benchmark here, which contains a total …
PodcastFillers - The PodcastFillers dataset consists of 199 full-length podcast episodes in …
PointCloud-C - PointCloud-C is the very first test-suite for point cloud robustness …
PointOdyssey - **PointOdyssey** is a large-scale synthetic dataset, and data generation framework, …
PolitiFact - Fact-checking (FC) articles which contain pairs (multimodal tweet and a …
Pollen et al - TPM values together with cell type annotations that were obtained …
PolyU - PolyU Dataset is a large dataset of real-world noisy images …
Polyvore - Polyvore Outfits
PopQA - **PopQA** is an open-domain QA dataset with 14k QA pairs …
Pothole Mix - Pothole Mix Semantic Segmentation Dataset for Road Damage Detection and Segmentation
Potsdam - https://paperswithcode.com/sota/semantic-segmentation-on-isprs-potsdam
PPI - Protein-Protein Interactions (PPI)
PPM-100 - PPM is a portrait matting benchmark with the following characteristics: …
PPMI - Parkinson’s Progression Markers Initiative
PRCC - This dataset consists of 33698 images from 221 identities. Each …
PreCo - A large-scale English dataset for coreference resolution. The dataset is …
PRID2011 - Person RE-ID 2011
PRImA - The Prima head pose dataset consists of 2790 images of …
Probability words NLI - Natural language inference with words estimative of probability (WEP)
ProcGen - Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a …
PROMISE12 - The **PROMISE12** dataset was made available for the MICCAI 2012 …
PRONTO - PRONTO heterogeneous benchmark dataset
ProSLU - Profile-based Spoken Language Understanding
PROTEINS - **PROTEINS** is a dataset of proteins that are classified as …
PRO-teXt - PRO-teXt is an extension of PROXD with the inclusion of …
PROX - A dataset composed of 12 different 3D scenes and RGB …
PRW - Person Re-identification in the Wild
PS4 - A dataset of 18,731 proteins with their PDB code, index …
P-Stance - P-Stance: A Large Dataset for Stance Detection in Political Domain …
PTB Diagnostic ECG Database - The ECGs in this collection were obtained using a non-commercial, …
PTB-XL - Electrocardiography (ECG) is a key diagnostic tool to assess the …
PTC - Predictive Toxicology Challenge
PubChemQA - PubChemQA consists of molecules and their corresponding textual descriptions from …
Pubmed - The **PubMed** dataset consists of 19717 scientific publications from PubMed …
PubMed (48%/32%/20% fixed splits) - Node classification on PubMed with the fixed 48%/32%/20% splits provided …
PubMed (60%/20%/20% random splits) - Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
PubMedQA - The task of PubMedQA is to answer research questions with …
PubMedQA corpus with metadata - PubMedQA-MetaGen: Metadata-Enriched PubMedQA Corpus. PubMedQA-MetaGen is a metadata-enriched …
PubTabNet - PubTabNet is a large dataset for image-based table recognition, containing …

Q

QASPER - **QASPER** is a dataset for question answering on scientific research …
QC-Science - QC-Science contains 47832 question-answer pairs belonging to the science domain …
QED - **QED** is a linguistically principled framework for explanations in question …
QLEVR - Synthetic datasets have successfully been used to probe visual question-answering …
QM7 - QM7 dataset is a subset of the GDB-13 database. GDB-13 …
QM8 - QM8 dataset is a collection of molecular data used for …
QM9 - **QM9** provides quantum chemical properties (at DFT level) for a …
QMNIST - The exact pre-processing steps used to construct the MNIST dataset …
QMSum - **QMSum** is a new human-annotated benchmark for query-based multi-domain meeting …
QMUL-SurvFace - **QMUL-SurvFace** is a surveillance face recognition benchmark that contains 463,507 …
QNLI - Question-answering NLI
Q-Traffic - **Q-Traffic** is a large-scale traffic prediction dataset, which consists of …
QuAC - Question Answering in Context
QuadTrack - Most existing MOT datasets are captured using pinhole cameras, which …
QuALITY - Question Answering with Long Input Texts, Yes!
Quechua-SER - Quechua Collao corpus for automatic emotion recognition in speech. Audios …
QuerYD - A large-scale dataset for retrieval and event localisation in video. …
Query-Focused Video Summarization Dataset - Collects dense per-video-shot concept annotations. Source: [Query-Focused Video Summarization: Dataset, …
questions - Questions is an interaction graph of users of a question-answering …
Quizbowl - Consists of multiple sentences whose clues are arranged by difficulty …
Quora Question Pairs - **Quora Question Pairs** (QQP) dataset consists of over 400,000 question …
QVHighlights - Query-based Video Highlights

R

R2R - Room-to-Room
RACE - ReAding Comprehension dataset from Examinations
Radar Dataset (DIAT-μRadHAR: Radar micro-Doppler Signature dataset for Human Suspicious Activity Recognition) - In view of national security, radar micro-Doppler (m-D) …
RADIATE - RAdar Dataset In Adverse weaThEr
RadioGalaxyNET Dataset - Automating the creation of catalogues for radio galaxies in next-generation …
RadQA - A Question Answering Dataset to Improve Comprehension of Radiology Reports
RaFD - Radboud Faces Database
RAF-DB - Real-world Affective Faces
RAFT - Real-world Annotated Few-shot Tasks
RAP - Richly Annotated Pedestrian
RARE - Randomized AMRs with Rewired Edges
RareAct - **RareAct** is a video dataset of unusual actions, including actions …
Rare Diseases Mentions in MIMIC-III - Rare disease mention annotations from a sample of MIMIC-III clinical notes
RAVDESS - Ryerson Audio-Visual Database of Emotional Speech and Song
RAWFC - For RAWFC, we constructed it from scratch by collecting the …
RB-Dust - RB-Dust: Real-world Industrial Dust Dehazing Dataset
RC-49 - RC-49 is a benchmark dataset for generating images conditional on …
RCB - Russian Commitment Bank
RCV1 - Reuters Corpus Volume 1
READ 2016 - HTR Dataset ICFHR 2016
READ2016(line-level) - Line-level Handwritten Text Recognition on READ 2016
ReadingBank - ReadingBank is a benchmark dataset for reading order detection built …
Real 3D-AD - Real 3D-AD is the first point cloud anomaly detection dataset …
RealCQA - RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order …
RealEstate10K - **RealEstate10K** is a large dataset of camera poses corresponding to …
Real Life Violence Situations Dataset - This dataset has the following citation: M. Soliman, M. Kamal, …
RealMAN - A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
REALY - Region-aware benchmark based on the LYHM
REBEL - Wikipedia abstracts automatically annotated with WikiData entities and relations that …
REBUS - A Robust Evaluation Benchmark of Understanding Symbols
ReCAM - SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning
RECCON - RECCON is a dataset for the task of recognizing emotion …
Recipe1M+ - **Recipe1M+** is a dataset which contains one million structured cooking …
RecipeNLG
RecipeQA - RecipeQA is a dataset for multimodal comprehension of cooking recipes. …
ReClor - Logical reasoning is an important ability to examine, analyze, and …
ReCoRD - **Reading Comprehension with Commonsense Reasoning Dataset** (ReCoRD) is a large-scale …
Reddit - The **Reddit** dataset is a graph dataset from Reddit posts …
REDDIT-12K - Reddit12k contains 11929 graphs each corresponding to an online discussion …
REDDIT-BINARY - **REDDIT-BINARY** consists of graphs corresponding to online discussions on Reddit. …
Reddit Ideology Database - Dataset with articles posted in the r/Liberal and r/Conservative subreddits. …
Reddit TIFU - **Reddit TIFU** dataset is a newly collected Reddit dataset, where …
ReDial - ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where …
Red MiniImageNet 20% label noise - Part of the Controlled Noisy Web Labels Dataset.
Red MiniImageNet 40% label noise - Part of the Controlled Noisy Web Labels Dataset.
Red MiniImageNet 80% label noise - Part of the Controlled Noisy Web Labels Dataset.
Re-DocRED - Revisiting Document Level Relation Extraction
REDS - REalistic and Dynamic Scenes dataset
RefCOCO - The **RefCOCO dataset** is a **referring expression generation (REG)** dataset …
Referring Expressions for DAVIS 2016 & 2017 - Our task is to localize and provide a pixel-level mask …
Refer-YouTube-VOS - There exist previous works [6, 10] that constructed referring segmentation …
RefRef - RefRef: A Synthetic Dataset and Benchmark for Reconstructing Refractive and Reflective Objects
REFUGE Challenge - Retinal Fundus Glaucoma Challenge
RegDB - Dongguk Body-based Person Recognition Database (DBPerson-Recog-DB1)
Rel3D - Understanding spatial relations (e.g., “laptop on table”) in visual input …
Relative Human - Relative Human (RH) contains multi-person in-the-wild RGB images with rich …
RELLIS-3D - **RELLIS-3D** is a multi-modal dataset for off-road robotics. It was …
Remote Flash LiDAR Vehicles Dataset - This dataset includes 3D point-cloud and 2D imagery from a …
Rendered SST2 - The **Rendered SST2** dataset is a dataset released by OpenAI, …
RepCount - Repetitive Action Counting Dataset
Replica - The Replica Dataset is a dataset of high quality reconstructions …
RESD - Russian Emotional Speech Dialogs with annotated text
RESIDE - A new large-scale benchmark consisting of both synthetic and real-world …
RESISC45 - RESISC45 dataset is a dataset for Remote Sensing Image Scene …
RespiratoryDatabase@TR - 12-channel lung sounds for each patient. Multi-channel analysis opportunity …
RES-Q - RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale
Restaurant-ACOS - The Restaurant-ACOS dataset is constructed based on the SemEval 2016 …
results-A - The results-A dataset is a dataset consisting of 22 infrared …
results-C - The results-C dataset is a dataset consisting of 22 infrared …
Re-TACRED - Revised-TACRED
Retinal Fundus MultiDisease Image Dataset (RFMiD) - According to the WHO, World report on vision 2019, the …
RetVQA - Retrieval-Based Visual Question Answering
RETWEET - **RETWEET** is a dataset of tweets and overall predominant sentiment …
Retweet MTPP - Marked Temporal Point Processes on Retweet data
Reuters-21578 - The **Reuters-21578** dataset is a collection of documents with news …
RF100 - Roboflow 100
RGB Arabic Alphabet Sign Language (AASL) dataset - RGB Arabic Alphabet Sign Language (AASL) dataset
RGBE-SEG - To perform universal event stream segmentation, we collected a large-scale …
RGB-Stacking - RGB-Stacking is a benchmark for vision-based robotic manipulation. The robot …
RHM - RHM: Robot House Multi-view human activity recognition dataset
Rhythmic Gymnastic - The Rhythmic Gymnastics dataset contains videos of four different types …
RICH - Real scenes, Interaction, Contact and Humans
RITE - Retinal Images vessel Tree Extraction
RLBench - **RLBench** is an ambitious large-scale benchmark and learning environment designed …
RMAS - Real-World Marine Animal Segmentation
Road Anomaly - This dataset contains images of unusual dangers which can be …
RoadTextVQA - Text and signs around roads provide crucial information for drivers, …
robo-vln - Robotics Vision-and-Language Navigation
ROBUST-MIS - Robust Medical Instrument Segmentation Challenge 2019
Rock Corpus - This dataset contains 200 famous songs in different genres (mostly …
RoCoG-v2 - Robot Control Gestures
ROCStories - **ROCStories** is a collection of commonsense short stories. The corpus …
RoFT - Real or Fake Text
RoFT-chatgpt - RoFT-chatgpt is a variation of RoFT dataset, where the same …
roman-empire - Roman-empire is a word dependency graph based on the Roman …
ROOR - ROOR is a reading order prediction (ROP) benchmark which annotates …
Rope3D - **Roadside Perception 3D** (**Rope3D**) is a dataset for autonomous driving …
RotoWire - This dataset consists of (human-written) NBA basketball game summaries aligned …
RRS - Restoration-200k for Response Selection
RRS Ranking Test - Restoration-200k for Response Selection with Ranking Test Set
RSBlur - The RSBlur dataset provides pairs of real and synthetic blurred …
RS-Haze - A large-scale non-homogeneous remote sensing image dehazing dataset
RSICD - Remote Sensing Image Captioning Dataset
RSITMD
RSSCN7 - The RSSCN7 dataset contains satellite images acquired from Google Earth, …
RST-DT - RST Discourse Treebank
RSTPReid - Real Scenario Text-based Person Re-identification
RTB - Robot Tracking Benchmark
RTE - Recognizing Textual Entailment
RT-GENE - Presents a diverse eye-gaze dataset. Source: [RT-GENE: Real-Time Eye Gaze …
rt-inod-bias - Red Teaming Innodata Bias
RTMV - **RTMV** is a large-scale synthetic dataset for novel view synthesis …
RuCoLA - The **Russian Corpus of Linguistic Acceptability (RuCoLA)** is built from …
RuCoS - Russian Reading Comprehension with Commonsense Reasoning
RuDaS - Synthetic Datasets for Rule Learning
RUGD - RUGD: Robot Unstructured Ground Driving
RuOpenBookQA - RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions …
RUSSE - Russian Words in Context (based on RUSSE)
Russian Event2Mind - The work provides a comprehensive overview of the corpus for …
RuStance - Includes Russian tweets and news comments from multiple sources, covering …
RuWorldTree - RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, …
RVL-CDIP - The **RVL-CDIP** dataset consists of scanned document images belonging to …
RWCP Sound Scene Database - The **RWCP Sound Scene Database** includes non-speech sounds recorded in …
RWF-2000 - A database with 2,000 videos captured by surveillance cameras in …
RWSD - The Winograd Schema Challenge (Russian)
RWTH-PHOENIX-Weather 2014 - The signing is recorded by a stationary color camera placed …
RWTH-PHOENIX-Weather 2014 T - Over a period of three years (2009 - 2011) the …
RxR - Room-across-Room
RxRx1 - **RxRx1** is a biological dataset designed specifically for the systematic …

S

S2Looking - **S2Looking** is a building change detection dataset that contains large-scale …
S2ORC - A large corpus of 81.1M English-language academic papers spanning many …
S3DIS - Stanford 3D Indoor Scene Dataset (S3DIS)
SA-1B - **SA-1B** consists of 11M diverse, high resolution, licensed, and privacy …
SA-Det-100k - SA-Det-100k is a large-scale class-agnostic object detection dataset for Research …
SAFIM - Syntax-Aware Fill-In-the-Middle
Sagalee - Speech Recognition Dataset for Oromo Language. Key features …
SAIL 2017 - Sentiment Analysis for Indian Languages
Saint Gall - Saint Gall dataset contains handwritten historical manuscripts written in Latin …
SALICON - Saliency in Context
Salinas - Salinas Scene
SALMon - The SALMon dataset and benchmark was introduced in the paper …
SALSA - A novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis. …
SAMSum - A new dataset with abstractive dialogue summaries. Source: [SAMSum Corpus: …
San Francisco Landmark Dataset - The San Francisco Landmark Dataset contains a database of 1.7 …
SAR-AIRcraft-1.0
SARDet-100K - The SARDet-100K dataset encompasses a total of 116,598 images, and …
SAT-MTB-VSR - SAT-MTB-VSR is a large-scale dataset for satellite video super-resolution made …
SAVEE - Surrey Audio-Visual Expressed Emotion
SBCoseg - SBCoseg Dataset
SBD - Semantic Boundaries Dataset
SberQuAD - Sberbank Question Answering Dataset
SBU / SBU-Refine - SBU-Kinect-Interaction dataset v2.0
Scan2CAD - **Scan2CAD** is an alignment dataset based on 1506 ScanNet scans …
ScanNet - **ScanNet** is an instance-level indoor RGB-D dataset that includes both …
ScanNet++ - ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes
ScanNet200 - The ScanNet200 benchmark studies 200-class 3D semantic segmentation - an …
ScanObjectNN - **ScanObjectNN** is a newly published real-world dataset comprising 2902 …
SCDE - **SCDE** is a human-created sentence cloze dataset, collected from public …
SceneNN - SceneNN is an RGB-D scene dataset consisting of more than …
SchizzoSQUAD - The “Mental Health” forum was used, a forum dedicated to …
SCICAP - SCICAP is a large-scale image captioning dataset that contains real-world …
SciCite - **SciCite** is a dataset of citation intents that addresses multiple …
SciDocs - SciDocs evaluation framework consists of a suite of evaluation tasks …
SciERC - **SciERC** dataset is a collection of 500 scientific abstracts annotated …
SciQ - The SciQ dataset contains 13,679 crowdsourced science exam questions about …
SciTail - The **SciTail** dataset is an entailment dataset created from multiple-choice …
SCoralDet Dataset - Soft-Coral Detection Dataset
ScreenSpot - ScreenSpot is an evaluation benchmark for …
Scribble - **Scribble** is a new outline dataset consisting of 200 images …
ScribbleKITTI - ScribbleKITTI is a scribble-annotated dataset for LiDAR semantic segmentation.
SCUT-CTW1500 - The **SCUT-CTW1500** dataset contains 1,500 images: 1,000 for training and …
SDD - **SDD** dataset contains a variety of indoor and outdoor scenes, …
SDSS Galaxies - SDSS galaxies as imaged by DESI
SeaDronesSee - SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water
Seaquest - OpenAI Gym - Dataset: The experiments are conducted using the Seaquest environment from …
SearchQA - SearchQA was built using an in-production, commercial search engine. It …
SECOND - SEmantic Change detectiON Dataset
SEDE - Stack Exchange Data Explorer
SEED - SJTU Emotion EEG Dataset
seeds - The examined group comprised kernels belonging to three different varieties …
Segmentation in the Wild - Recent advances in language-image pre-training have witnessed the emerging field …
SegTrack-v2 - SegTrack v2 is a video segmentation dataset with full pixel-level …
SEL - The semantic line (SEL) dataset contains 1,750 outdoor images in …
selfie2anime - The selfie dataset contains 46,836 selfie images annotated with 36 …
Semantic3D - **Semantic3D** is a point cloud dataset of scanned outdoor scenes …
SemanticKITTI - **SemanticKITTI** is a large-scale outdoor-scene dataset for point cloud semantic …
SemanticPOSS - The SemanticPOSS dataset for 3D semantic segmentation contains 2988 various …
SemanticSTF - **SemanticSTF** is an adverse-weather point cloud dataset that provides dense …
SemClinBr - A multi‑institutional and multi‑specialty semantically annotated corpus for Portuguese clinical NLP tasks
SemEval-2010 Task-8 - The dataset for the **SemEval-2010 Task 8** is a dataset …
SemEval-2014 Task-4 - Sentiment analysis is increasingly viewed as a vital task both …
SemEval-2017 Task-10 - We describe the SemEval task of extracting keyphrases and relations …
semi-indoor - collected by one VLP-16 in a small vehicle (1m x …
SemTabNet - This dataset accompanies the following …
SensatUrban - The SensatUrban dataset is an urban-scale photogrammetric point cloud dataset …
SenseReID - SenseReID is a person re-identification dataset for evaluating ReID models. …
SentEval - SentEval is a toolkit for evaluating the quality of universal …
Sentiment140 - Sentiment140 is a dataset that allows you to discover the …
Sentiment Merged - SST-3, DynaSent R1/R2
Separated COCO - **Separated COCO** is an automatically generated subset of the COCO val dataset, …
SEPE 8K - SEPE 8K dataset is made of 40 different 8K (8192 …
Sepehr_RumTel01 - The expansion of social networks has accelerated the transmission of …
Set14 - The **Set14** dataset is a dataset consisting of 14 images …
Set5 - The **Set5** dataset is a dataset consisting of 5 images …
SEVIR - Storm EVent ImagRy
SFCHD - This work contributes a large, complex, and realistic high-quality safety …
SFEW - Static Facial Expression in the Wild
SF-XL Night
SF-XL Occlusion
SF-XL test v1 - San Francisco eXtra Large test v1
SF-XL test v2 - San Francisco eXtra Large test v2
SGD - Schema-Guided Dialogue
SHAJ - Spoken Hate in the Albanian Jargon
ShanghaiTech - The ShanghaiTech dataset is a large-scale crowd counting dataset. It …
ShanghaiTech Campus - The ShanghaiTech Campus dataset has 13 scenes with complex light …
shape bias - The 'shape bias' dataset was introduced in Geirhos et al. …
ShapeNet - **ShapeNet** is a large scale repository for 3D CAD models …
ShapeNetCore - ShapeNetCore is a subset of the full ShapeNet dataset with …
ShapeNet-ViPC - A large-scale dataset for the point cloud completion task on …
SHAPES - Swarm Heuristics based Adaptive and Penalized Estimation of Splines
ShapeStacks - A simulation-based dataset featuring 20,000 stack configurations composed of a …
SHD - Spiking Heidelberg Digits
SHD - Adding - Spiking Heidelberg Digits - Adding
SheetCopilot - The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet …
Shelf&Tote Training Dataset - MIT-Princeton Amazon Picking Challenge 2016 Shelf&Tote Training Dataset
Shellcode_IA32 - Shellcode_IA32 is a dataset containing 20 years of shellcodes from …
ShEMO - Sharif Emotional Speech Database
SHHS - Sleep Heart Health Study
Shifts - The **Shifts Dataset** is a dataset for evaluation of uncertainty …
ShipSpotting - To construct such a dataset, a straightforward approach was scraping …
ShortPersianEmo - ShortPersianEmo is a new data set for emotion recognition in …
Shot2Story20K - A short clip of video may contain progression of multiple …
SICE-Grad - SICE_Grad is a test image dataset that represents complex mixed …
SICE-Mix - SICE_Mix is a test image dataset that represents complex mixed …
SICK - Sentences Involving Compositional Knowledge
SICKLE - Satellite Imagery for Cropping annotated with Keyparameter LabEls
SIDD - Smartphone Image Denoising Dataset
SIDER - **SIDER** contains information on marketed medicines and their recorded adverse …
SILICONE Benchmark - SILICONE
Sim10k - SIM10k is a synthetic dataset containing 10,000 images, which is …
SIMAC - F. Gouyon, “A computational approach to rhythm description — Audio …
SIMARA - SIMARA: a database for key-value information extraction from full-page handwritten documents
SimBEV - The **SimBEV** dataset is a collection of 320 scenes spread …
SimGas - Computer Simulated Gas Leakage Segmentation
SIMMC2.0 - Next generation task-oriented dialog systems need to understand conversational contexts …
SimpleQuestions - **SimpleQuestions** is a large-scale factoid question answering dataset. It consists …
SimplerEnv-Google Robot - Evaluating Real-World Robot Manipulation Policies in Simulation
SimplerEnv-Widow X - SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups
SINS - **SINS** is a database of continuous real-life audio recordings in …
SIPaKMeD - SIPaKMeD Pap Smear dataset
SIQA - Social Interaction QA
SIXray - The **SIXray** dataset is constructed by the Pattern Recognition and …
SKAB - Skoltech Anomaly Benchmark
Skeleton-Mimetics - A dataset derived from the recently introduced Mimetics dataset. Source: …
SketchyCOCO - SketchyCOCO dataset consists of two parts: **Object-level data** Object-level data …
Slakh2100 - Synthesized Lakh Dataset
Slashdot - The **Slashdot** dataset is a relational dataset obtained from Slashdot. …
Sleep-EDF - Sleep-EDF Expanded
SLOPER4D - **SLOPER4D** is a novel scene-aware dataset collected in large urban …
Slovo: Russian Sign Language Dataset - We introduce a large-scale video dataset **Slovo** for Russian Sign …
SLUE - Spoken Language Understanding Evaluation
SLURP - Spoken Language Understanding Resource Package
SMAC-Exp - StarCraft Multi-Agent Exploration Challenge
smallNORB - The **smallNORB** dataset is a dataset for 3D object recognition …
SMAP - Soil Moisture Active Passive
SMC - A. Holzapfel, M. E. Davies, J. R. Zapata, J. L. …
SMD - Server Machine Dataset
S.MID - SeMantic InDustry
SMS Spam Collection Data Set - This corpus has been collected from free or free for …
SNIPS - SNIPS Natural Language Understanding benchmark
Snips-SmartLights - The SmartLights benchmark from Snips tests the capability of controlling lights …
Snips-SmartSpeaker - The SmartSpeaker benchmark tests the performance of reacting to music …
SNLI - Stanford Natural Language Inference
Snopes - Fact-checking (FC) articles which contain pairs (multimodal tweet and a …
So2Sat LCZ42 - So2Sat LCZ42 consists of local climate zone (LCZ) labels of …
SOBA - Shadow-OBject Association
SOC - Salient Objects in Clutter
Soccer - ISSIA-CNR Soccer
SoccerNet-v2 - A novel large-scale corpus of manual annotations for the SoccerNet …
Social media attributions of YouTube comments - Social media attributions dataset of YouTube comments in the context of water crisis
SOD - small obstacle detection
SODA-D - SODA-D is a large-scale dataset tailored for small object detection …
Something-Something V1 - The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video …
Something-Something V2 - The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled …
Song Describer Dataset - The Song Describer Dataset (SDD) contains ~1.1k captions for 706 …
Songdo Vision - Songdo Vision: Vehicle Annotations from High-Altitude BeV Drone Imagery in a Smart City
Sound-based drone fault classification using multitask learning - arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound …
SoundDescs - We introduce a new audio dataset called SoundDescs that can …
SoundingEarth - SoundingEarth consists of co-located aerial imagery and audio samples all …
Sound of Water 50 - We collect a dataset of 805 clean videos that show …
SpaceNet 1 - SpaceNet 1: Building Detection v1
SpaceNet 2 - SpaceNet 2: Building Detection v2
SPair-71k - SPair-71k contains 70,958 image pairs with diverse variations in viewpoint …
SPAQ - Smartphone Photography Attribute and Quality
SParC - Semantic Parsing in Context
Species-800 - **Species-800** is a corpus for species entities, which is based …
Speech Commands - **Speech Commands** is an audio dataset of spoken words designed …
SPGISpeech - SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available …
Spider 2.0 - Spider 2.0 is a comprehensive code generation agent task that …
Spiideo SoccerNet SynLoc - Synthetic soccer players rendered on top of real world stadium …
Spike-X4K - Spike-X4K Dataset
SPKL - Seasonal Parking Lot Dataset
Spoken-SQuAD - In SpokenSQuAD, the document is in spoken form, the input …
Sports10 - Games dataset containing 100,000 Gameplay Images of 175 Video …
Sports-1M - The **Sports-1M** dataset consists of over a million videos from …
SportsMOT - SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes
SPOT-10 - Animal Pattern Benchmark Dataset for Machine Learning Algorithms
Spring - Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo
Sprites - 2D Video Game Character Sprites
SQA - SequentialQA
SQA3D - Situated Question Answering in 3D Scenes
SQL-Eval - SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, …
SQuAD - Stanford Question Answering Dataset
Squirrel (48%/32%/20% fixed splits) - Node classification on Squirrel with the fixed 48%/32%/20% splits provided …
Squirrel (60%/20%/20% random splits) - Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.
SRD - Shadow Removal Dataset
SRI-APPROVE Fine-Grained Video Classification - APPROVE consists of curated YouTube videos annotated with educational content. …
SROIE - Consists of a dataset with 1000 whole scanned receipt images …
SR-Reg - SynthRAD Registration
SSC - Spiking Speech Commands v0.2
SST - Stanford Sentiment Treebank
SST-2 - The Stanford Sentiment Treebank is a corpus with fully labeled …
SST-3 - Stanford Sentiment Treebank: 3-way
SST-5 - The SST-5, also known as the Stanford Sentiment Treebank with …
ssTEM - We provide two image stacks where each contains 20 sections …
Stacked MNIST - The **Stacked MNIST** dataset is derived from the standard MNIST …
StackOverflow MTPP - Marked Temporal Point Processes on StackOverflow data
Stanford40 - Stanford 40 Actions
Stanford Cars - The **Stanford Cars** dataset consists of 196 classes of cars …
Stanford Dogs - The **Stanford Dogs** dataset contains 20,580 images of 120 classes …
Stanford ECoG library: ECoG to Finger Movements - Electrophysiological data from implanted electrodes in the human brain are …
StanfordExtra - An 'in the wild' dataset of 20,580 dog images for …
Stanford Online Products - **Stanford Online Products** (SOP) dataset has 22,634 classes with 120,053 …
Stanford-ORB - We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering …
STAR Benchmark - Situated Reasoning
STARE - Structured Analysis of the Retina
STARSS22 - Sony-TAu Realistic Spatial Soundscapes 2022
STB - Stereo Hand Pose Benchmark
STDW - **STDW** is a diverse large-scale dataset for table detection with …
STEDUCOV: A DATASET ON STANCE DETECTION IN TWEETS TOWARDS ONLINE EDUCATION DURING COVID-19 PANDEMIC - StEduCov, a dataset annotated for stances toward online education during …
StepGame - A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
StereoSet - A large-scale natural dataset in English to measure stereotypical biases …
STL-10 - Self-Taught Learning 10
STN PLAD - STN Power Line Assets Dataset
stocknet - stocknet-dataset
StoryCloze - Representation and learning of commonsense knowledge is one of the …
STPLS3D - Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry …
StrategyQA - **StrategyQA** is a question answering benchmark where the required reasoning …
StreetHazards - StreetHazards is a synthetic dataset for anomaly detection, created by …
Street Scene - **Street Scene** is a dataset for video anomaly detection. Street …
StreetTryOn - StreetTryOn, the new in-the-wild Virtual Try-On dataset, consists of 12,364 …
STREUSLE - STREUSLE stands for Supersense-Tagged Repository of English with a Unified …
Structured3D - **Structured3D** is a large-scale photo-realistic dataset containing 3.5K house designs …
STS Benchmark - Semantic Textual Similarity
STVD-PVCD - Partial Video Copy Detection Dataset
StyleBench - To comprehensively evaluate the effectiveness and generalization ability of style …
Stylized ImageNet - The Stylized-ImageNet dataset is created by removing local texture cues …
SUBJ - Subjectivity dataset
SUDO Dataset - SUDO is a benchmark of 50 real-world malicious tasks designed …
SUIM - Segmentation of Underwater IMagery
SumMe - The **SumMe** dataset is a video summarization dataset consisting of …
SUN - SUN Database
SUN397 - The Scene UNderstanding (SUN) database contains 899 categories and 130,519 …
SUN Attribute - The **SUN Attribute** dataset consists of 14,340 images from 717 …
SUN RGB-D - The SUN RGBD dataset contains 10335 real RGB-D images of …
SUN-RGBD-IS - An RGB-D dataset converted from SUN-RGBD into COCO-style instance segmentation …
SUT - SUT: a new multi-purpose synthetic dataset for Farsi document image analysis
SUTD-TrafficQA - SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question …
SVAMP - Simple Variations on Arithmetic Math word Problems
SVHN - Street View House Numbers
SVOX
SVT - Street View Text Dataset
SVTP - SVTP dataset stands for Scene Text Recognition Datasets. It is …
SWAG - Situations With Adversarial Generations
SWDE - Structured Web Data Extraction
SWIMSEG - Singapore Whole sky IMaging SEGmentation Database
SWINSEG - Singapore Whole sky Nighttime Image SEGmentation Database
SWINySEG - Singapore Whole sky Nychthemeron Image SEGmentation Database
SWORD - 'Scenes with occluded regions' dataset
Sydney Urban Objects - This dataset contains a variety of common urban road objects …
Synthehicle - Synthehicle is a massive CARLA-based synthetic multi-vehicle multi-camera tracking dataset …
Synthetic Dynamic Networks - from Aging, Fitness Preferential Attachment mechanisms
SynthEVox3D-Tiny - Synthetic Event Camera Voxel 3D Reconstruction Dataset
SYNTHIA - SYNTHetic Collection of Imagery and Annotations
SynthPAI - SynthPAI: A Synthetic Dataset for Personal Attribute Inference
SYSU-30k - **SYSU-30k** contains 30k categories of persons, which is about 20 …
SYSU-MM01 - The **SYSU-MM01** is a dataset collected for the Visible-Infrared Re-identification …
SYSU-MM01-C - **SYSU-MM01-C** is an evaluation set that consists of algorithmically generated …
SZ-Taxi - Shenzhen Taxi Speed

T

T$^3$Bench - T$^3$Bench is the first comprehensive text-to-3D benchmark containing diverse text …
T2I-CompBench - T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation, …
TabFact - **TabFact** is a large-scale dataset which consists of 117,854 manually …
TACM12K - Table-ACM12K
TACO-BAAI - Topics in Algorithmic Code generation dataset
TACRED - The TAC Relation Extraction Dataset
TACRED-Revisited - The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation …
Tai-Chi-HD - **Tai-Chi-HD** is a high resolution dataset which can be used …
Tamil Memes - Social media are interactive platforms that facilitate the creation or …
Tanks and Temples - We present a benchmark for image-based 3D reconstruction. The benchmark …
TAO - Tracking Any Object Dataset
TapCorrect - J. Driedger, H. Schreiber, W. B. de Haas, and M. …
TAP-Vid - **TAP-Vid** is a benchmark which contains both real-world videos with …
TASD - Target Aspect Sentiment Detection
Taskonomy - Taskonomy provides a large and high-quality dataset of varied indoor …
TAT - Taiwanese Across Taiwan
TAT-QA - TAT-QA (Tabular And Textual dataset for Question Answering) is a …
TAU-NIGENS Spatial Sound Events 2021 - The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial …
TAU Urban Acoustic Scenes 2019 - **TAU Urban Acoustic Scenes 2019** development dataset consists of 10-seconds …
TBBR - Thermal Bridges on Building Rooftops
TbD
TCGA - The Cancer Genome Atlas
TCMP-300 - Traditional Chinese Medicinal Plant Dataset
TDIUC - Task Directed Image Understanding Challenge
TED-LIUM - The TED-LIUM corpus consists of English-language TED talks. It includes …
TED-talks - In order to create the TED-talks dataset, 3,035 YouTube videos …
TempEval-3 - TempEval-3: events, times, and temporal relations
TempQA-WD - **TempQA-WD** is a benchmark dataset for temporal reasoning designed to …
TempQuestions - Here, we take a key step in this direction and …
Tennis - This dataset was introduced by [1], but was not used …
TEP - Tennessee Eastman Process
Terms of Service - The **Terms of Service** dataset is a law dataset corresponding …
TERRa - Textual Entailment Recognition for Russian
Texas (48%/32%/20% fixed splits) - Node classification on Texas with the fixed 48%/32%/20% splits provided …
Text8 - [About Text8](http://mattmahoney.net/dc/textdata.html)
TextAtlasEval - A Dense-text Image Benchmark to evaluate large generation model's ability …
TextComplexityDE - TextComplexityDE is a dataset consisting of 1000 sentences in German …
TextSeg - **TextSeg** is a large-scale fine-annotated and multi-purpose text detection and …
TextVQA - TextVQA is a dataset to benchmark visual reasoning based on …
TextZoom - **TextZoom** is a super-resolution dataset that consists of paired Low …
TFix's Code Patches Data - The dataset contains more than 100k code patch pairs extracted …
TGIF - Tumblr GIF
TGIF-QA - The TGIF-QA dataset contains 165K QA pairs for the animated …
The Game of 2048 - The 2048 game task involves training an agent to achieve …
The Little Prince - The Little Prince Corpus
The Pile - The Pile is an 825 GiB diverse, open source language …
The Spoken Wikipedia Corpora - The SWC is a corpus of aligned Spoken Wikipedia articles …
This is not a Dataset - This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
Thorsten voice 21.02 neutral - Thorsten-Voice (Thorsten-21.02-neutral) is a neutrally spoken voice dataset recorded by …
ThreatGram 101 - Extreme Telegram Data - ThreatGram 101 - Extreme Telegram Replies Data with Threat Levels
THuman2.0 Dataset - THuman2.0 Dataset contains 500 high-quality human scans captured by a …
THUMOS14 - The **THUMOS14** (THUMOS 2014) dataset is a large-scale video dataset …
Thyroid - Thyroid Disease
TID2013 - **TID2013** is a dataset for image quality assessment that contains …
TII-SSRC-23 - The TII-SSRC-23 dataset offers a comprehensive collection of network traffic …
TimeBank - Enriches the TimeML annotations of TimeBank by adding information about …
TimeQuestions - Question answering over knowledge graphs (KG-QA) is a vital topic …
Timers and Such - Timers and Such is an open source dataset of spoken …
TimeTravel - TimeTravel contains 29,849 counterfactual rewritings, each with the original story, …
TIMIT - TIMIT Acoustic-Phonetic Continuous Speech Corpus
TIMo - TIMo (Time-of-Flight Indoor Monitoring)
Tiny ImageNet - **Tiny ImageNet** contains 100000 images of 200 classes (500 for …
TIP 2018 - The first large demoire dataset. The dataset contains 135,000 image …
TIQ - Existing benchmarks for temporal QA focus on a single information …
TLDR9+ - TLDR9+ is a large-scale summarization dataset containing over 9 million …
TLF2K - Table-LastFm2K
TMD - Text-Music-Dance
TML1M - Table-MovieLens1M
TNBC - Involves a large number of annotated cells, including normal …
TNL2K - Tracking by natural language
Tobacco-3482 - The Tobacco-3482 dataset consists of document images belonging to 10 …
ToLD-Br - Toxic Language Detection for Brazilian Portuguese
tolokers - Tolokers is a crowdsourcing platform workers network based on data …
TOMG-Bench - Text-based Open Molecule Generation Benchmark
ToolBench - **ToolBench** is an instruction-tuning dataset for tool use, which is …
ToolLens - The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted …
Toronto-3D - **Toronto-3D** is a large-scale urban outdoor point cloud dataset acquired …
Torque - Torque is an English reading comprehension benchmark built on 3.2k …
Total-Text - **Total-Text** is a text detection dataset that consists of 1,555 …
ToTTo - ToTTo is an open-domain English table-to-text dataset with over 120,000 …
Touchdown Dataset - Touchdown is a corpus for executing navigation instructions and resolving …
Tox21 - The **Tox21** data set comprises 12,060 training samples and 647 …
ToxCast (Toxicity Forecaster) - ToxCast is an initiative by the U.S. Environmental Protection Agency …
TQA - Textbook Question Answering
TrackingNet - **TrackingNet** is a large-scale tracking dataset consisting of videos in …
Traditional and Context-specific Spam Twitter - This data set is being released to support the spam …
Traffic - Traffic Flow Forecasting Data Set
TrajAir: A General Aviation Trajectory Dataset - This dataset contains aircraft trajectories in an untowered terminal airspace …
TRANCOS - TRaffic ANd COngestionS
Trans10K - A large-scale dataset for transparent object segmentation, named Trans10K, consisting …
Translated SNLI Dataset in Marathi - A translated version of …
TransProteus - The dataset contains procedurally generated images of transparent vessels containing …
Travel - Tour & Travels Customer Churn Prediction
TREC-10 - TREC-10 Question Classification
TrecQA - Text Retrieval Conference Question Answering
TRECVID-AVS16 (IACC.3) - Internet Archive videos (IACC.3) under Creative Commons licenses. The test …
TRECVID-AVS17 (IACC.3) - Internet Archive videos (IACC.3) under Creative Commons licenses. The test …
TRECVID-AVS18 (IACC.3) - Internet Archive videos (IACC.3) under Creative Commons licenses. The test …
TRECVID-AVS19 (V3C1) - The dataset has been designed to represent true web videos …
TRECVID-AVS20 (V3C1) - The dataset has been designed to represent true web videos …
Treutlein et al - Source: [Reconstructing lineage hierarchies of the distal lung epithelium using …
TriMouse-161 - Three wild-type (C57BL/6J) male mice ran on a paper spool …
Trinity Speech-Gesture Dataset - **Trinity Gesture Dataset** includes 23 takes, totalling 244 minutes of …
TriviaQA - **TriviaQA** is a realistic text-based question answering dataset which includes …
TRR360D - **TRR360D** is based on the ICDAR2019MTD modern table detection dataset, …
TruckScenes - MAN TruckScenes
TruthfulQA - TruthfulQA is a benchmark to measure whether a language model …
TSP/HCP Benchmark set - This is a benchmark set for the Traveling Salesman Problem (TSP) …
TSS - dataset of 400 image pairs
TSSB - Time Series Segmentation Benchmark
TSU - Toyota Smarthome Untrimmed
TT100K - Tsinghua-Tencent 100K (official training and testing set)
TTStroke-21 ME21 - TTStroke-21 for MediaEval 2021
TTStroke-21 ME22 - TTStroke-21 for MediaEval 2022
TUDA - Overall duration per microphone: about 36 hours (31 hrs train …
TUH EEG Seizure Corpus - Temple University Hospital (TUH) EEG Corpus
Turbulence - **Turbulence** is a new benchmark for systematically evaluating the correctness …
TurkCorpus - TurkCorpus, a dataset with 2,359 original sentences from English Wikipedia, …
TURL - Twitter News URL Corpus
TuSimple - The **TuSimple** dataset consists of 6,408 road images on US …
TUT Acoustic Scenes 2017 - The **TUT Acoustic Scenes 2017** dataset is a collection of …
TUT Urban Acoustic Scenes 2018 - The dataset for this task is the TUT Urban Acoustic …
TVBench - TVBench is a new benchmark specifically created to evaluate temporal …
TVC - TV show Captions
TVQA - The **TVQA** dataset is a large-scale video dataset for video …
TVR - TV show Retrieval
TvSum - TVSum: Summarizing Web Videos Using Titles
Tweebank
TweepFake - The TweepFake dataset consists of 25,572 social media messages posted …
TweetEval - TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific …
TweetQA - With social media becoming increasingly popular on which lots of …
tweetSentBR - The **TweetSentBR Dataset** is a valuable resource for sentiment analysis …
twitch-gamers - node classification on twitch-gamers
Twitter-HyDrug - Twitter-HyDrug is a real-world hypergraph dataset that describes the drug …
Twitter POS - K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, …
Twitter Sentiment Analysis - Entity-Level Twitter Sentiment Analysis Dataset
Twitter-SMNER - This task aims to extract named entities and entity types …

U

U-10: United-10 COVID19 CT Dataset - This dataset supports the research detailed in the pre-print "Virtual …
UA-DETRAC - Consists of 100 challenging video sequences captured from real-world traffic …
UA-GEC - UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
UAVBillboards - UAV Billboards
UAVDB - Trajectory-Guided Adaptable Bounding Boxes for UAV Detection
UAVDT - Unmanned Aerial Vehicle Benchmark Object Detection and Tracking
UAV-Human - UAV-Human is a large dataset for human behavior understanding with …
UAVid - UAVid is a high-resolution UAV semantic segmentation dataset as a …
UAV-PDD2023 - UAV-PDD2023: A benchmark dataset for pavement distress detection based on …
UAVVaste - The UAVVaste dataset consists to date of 772 images and …
UBI-Fights - Abnormal Event Detection Dataset
UBnormal - University of Bucharest Abnormal Videos
UBody - UBody is a large-scale Upper-Body dataset with the following annotations: …
Ubuntu IRC - The **Ubuntu IRC dataset** is a valuable resource for research …
UCF101 - UCF101 Human Actions dataset
UCF101-24
UCF-Crime - The UCF-Crime dataset is a large-scale dataset of 128 hours …
UCFRep - The **UCFRep** dataset contains 526 annotated repetitive action videos. This …
UCF Sports - The UCF Sports dataset consists of a set of actions …
UC Merced Land Use Dataset - This is a 21 class land use image dataset meant …
UCR Anomaly Archive - The UCR Anomaly Archive is a collection of 250 uni-variate …
UCR Time Series Classification Archive - The UCR Time Series Archive - introduced in 2002, has …
UCSD Ped2 - UCSD Anomaly Detection Dataset
UCY - The **UCY** dataset consists of real pedestrian trajectories with rich …
UDA-CH - Unsupervised Domain Adaptation on Cultural Heritage
UDED - Unified Dataset for Edge Detection
U-DIADS-Bib - U-DIADS-Bib is a proprietary dataset developed through the collaboration of …
UEA time-series datasets - UEA time-series datasets for series-level anomaly detection
UESTC RGB-D - UESTC RGB-D Varying-view action database
UFBA-425 - We introduce a set of 425 panoramic X-rays with Human …
UFPR-ADMR-v1 - This dataset contains 2,000 dial meter images obtained on-site by …
UFPR-AMR - This dataset contains 2,000 images taken from inside a warehouse …
UHD-IQA - We introduce a novel Image Quality Assessment (IQA) dataset comprising …
UHDM - The first ultra-high-definition image demoireing dataset, consisting of 4,500 4K …
UHRSD - Ultra High-Resolution Saliency Detection Dataset
UIIS - General Underwater Image Instance Segmentation dataset
UI-PRMD - University of Idaho – Physical Rehabilitation Movement Dataset
UK Biobank Brain MRI - UK Biobank Data - Brain MRI
UK Key Stage Readability - UK Key Stage Readability for English Texts
ULS labeled data - UAV laser scanning labelled LAS data over tropical moist forest, classified as leaf or wood points
Ultra-High Resolution Image Reconstruction Benchmark - Ultra-high definition benchmark (UHDBench) includes 2293 images at 2k resolution …
UML Classes With Specs - UML Class Diagrams Paired With Their English Specifications
UMLS - Unified Medical Language System
Unaligned-VL-CMU-CD (neighbor distance 2) - Street-View images captured at different timestamps often undergo geometric transformations. …
UnAV-100 - Existing audio-visual event localization (AVE) handles manually trimmed videos with …
UniMorph 4.0 - Universal Morphology
UniProtQA - UniProtQA consists of proteins and textual queries about their functions …
Universal Dependencies - The **Universal Dependencies** (UD) project seeks to develop cross-linguistically consistent …
UNSW-NB15 - UNSW-NB15
UPAR - Unified Pedestrian Attribute Recognition
UPFD-GOS - User Preference-aware Fake News Detection
UPFD-POL - User Preference-aware Fake News Detection
UPLight - UPLight is an underwater RGB-Polarization multimodal semantic segmentation dataset with …
UrbanSound8K - Urban Sound 8K is an audio dataset that contains 8732 …
UrduDoc - The **UrduDoc Dataset** is a benchmark dataset for Urdu text …
Urdu News Headlines Dataset - Urdu News Headlines Dataset with VOA and BBC An Urdu …
Urdu Online Reviews - This corpus was constructed by collecting 10,008 reviews from various …
URMP - University of Rochester Multi-Modal Musical Performance
UruDendro - UruDendro, a public dataset of cross-section images of Pinus taeda
USA Air-Traffic - Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel …
USNA-Cn2 (long-term) - United States Naval Academy Long-term Scintillation Study
USNA-Cn2 (short-duration) - United States Naval Academy Short-duration Optical Turbulence Dataset
USPS - **USPS** is a digit dataset automatically scanned from envelopes by …
USPTO-190 - A chemical synthesis route dataset constructed from the USPTO reaction …
USPTO-50k - Subset and preprocessed version of Chemical reactions from US patents …
USR-PersonaChat - This dataset was collected with the goal of assessing dialog …
USR-TopicalChat - This dataset was collected with the goal of assessing dialog …
UTD-MHAD - The **UTD-MHAD** dataset consists of 27 different actions performed by …
UTFPR-SBD3 - The semantic segmentation of clothes is a challenging task due …
UT-Interaction - The **UT-Interaction** dataset contains videos of continuous executions of 6 …
UTKFace - The **UTKFace** dataset is a large-scale face dataset with long …
UT Zappos50K - **UT Zappos50K** is a large shoe dataset consisting of 50,025 …
UV6K - Urban Vehicle Segmentation Dataset
UVO - Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation
UZLF - Leuven-Haifa High-Resolution Fundus Image Dataset for Retinal Blood Vessel Segmentation and Glaucoma Diagnosis

V

V2XSet - A large-scale V2X perception dataset using CARLA and OpenCDA
V2X-SIM - **V2X-Sim**, short for vehicle-to-everything simulation, is a synthetic collaborative …
ValNov Subtask A - Binary labels for Validity and Novelty respectively are given for …
ValNov Subtask B - Validity and Novelty are determined in a comparative setting between …
VASR - Visual Analogies of Situation Recognition
VAST - VAried Stance Topics
VATEX - Video And TEXt
VATEX Adverbs - VATEX Adverbs is a subset from VATEX with extracted verb-adverb …
VB-DemandEx - Uses same clean speech as VoiceBank+Demand but more noise types. …
VC-Clothes - Person re-identification (Reid) is now an active research topic for …
V-COCO - Verbs in COCO
VCR - Visual Commonsense Reasoning
VCTK - CSTR VCTK Corpus
VDD - Varied Drone Dataset for Semantic Segmentation
VDS dataset: Multi exposure stack-based inverse tone mapping - Requires seven multiple-exposure ground truth images satisfying …
VEDAI - Vehicle Detection in Aerial Imagery
Vehicle Claims - The code to create the dataset is available [here](https://github.com/ajaychawda58/UADAD/blob/main/Code/Notebooks/create_dataset.ipynb). The …
VehicleID - PKU VehicleID
VeRi-776 - **VeRi-776** is a vehicle re-identification dataset which contains 49,357 images …
Verified Smart Contract Code Comments - **Verified Smart Contracts Code Comments** is a dataset of real …
VerilogEval - The **VerilogEval Dataset** is a benchmark specifically …
VFITex - To test interpolation performance on various texture types, we developed …
VFR-2420 - A synthetic dataset containing word images of 447 typefaces with …
VFR-447 - A synthetic dataset containing 447 typefaces with only one font …
VFR-Wild - 325 word images intended for font recognition, whose fonts are …
VGGFace2 - VGGFace2: A dataset for recognising faces across pose and age
VGG-Sound - Consists of more than 210k videos for 310 audio classes. …
VibraVox (forehead accelerometer) - This is the **forehead accelerometer variant** of the VibraVox dataset. …
VibraVox (headset microphone) - This is the **reference headset microphone variant** of the VibraVox …
VibraVox (rigid in-ear microphone) - This is the in-ear rigid earpiece-embedded microphone variant of the …
VibraVox (soft in-ear microphone) - This is the **in-ear comply foam-embedded microphone variant** of the …
VibraVox (temple vibration pickup) - This is the **temple vibration pickup variant** of the VibraVox …
VibraVox (throat microphone) - This is the **throat microphone (laryngophone) variant** of the VibraVox …
VidChapters-7M - VidChapters-7M is a dataset of 817K user-chaptered videos including 7M …
VideoAttentionTarget - A dataset with fully annotated attention targets in video for …
VideoCube - VideoCube is a high-quality and large-scale benchmark to create a …
VideoDB's OCR Benchmark Public Collection - This dataset leverages VideoDB's Public Collection to …
VideoInstruct - Video Instruction Dataset
Video Waterdrop Removal Dataset - Due to the lack of training data for video waterdrop …
VidHOI - VidHOI is a video-based human-object interaction detection benchmark. VidHOI is …
VidSTG - The **VidSTG** dataset is a spatio-temporal video grounding dataset constructed …
VietMed - VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain
ViGGO - The ViGGO corpus is a set of 6,900 meaning representation …
Vimeo90K - The Vimeo-90K is a large-scale high-quality video dataset for lower-level …
Vinegar Fly - **Vinegar Fly** is a pose estimation dataset for fruit flies.
ViNLI - Vietnamese Natural Language Inference Dataset
Vinoground - A temporal counterfactual dataset composed of 1000 short and natural …
ViP-Bench - Making Large Multimodal Models Understand Arbitrary Visual Prompts
VisA - Visual Anomaly Dataset
VisA-AC - VisA-AC is a refined benchmark based on the VisA dataset, …
VisDA-2017 - **VisDA-2017** is a simulation-to-real dataset for domain adaptation with over …
VisEvent - **VisEvent** (Visible-Event benchmark) is a dataset constructed for the evaluation …
VIST - Visual Storytelling
Visual7W - **Visual7W** is a large-scale visual question answering (QA) dataset, with …
Visual Genome - **Visual Genome** contains Visual Question Answering data in a multi-choice …
VisualMRC - VisualMRC: Machine Reading Comprehension on Document Images
Visual Wake Words - Visual Wake Words represents a common microcontroller vision use-case of …
VISUELLE2.0 - Visuelle 2.0 is a dataset containing real data for 5355 …
VitalDB - This is a comprehensive dataset of 6,388 surgical patients composed …
VITON - VITON-Zalando Dataset
VITON-HD - High-Resolution VITON-Zalando Dataset
ViTT - Video Timeline Tags
VIVOS - VIVOS Corpus
VizDoom - ViZDoom is an AI research platform based on the classical …
VizWiz - VizWiz-VQA
VizWiz-Classification - Our goal is to improve upon the status quo for …
VLCS - VLCS is a dataset to test for domain generalization.
VLEP - Video-and-Language Event Prediction
VLM2-Bench - VLM²-Bench
VLN-CE - Vision-and-Language Navigation in Continuous Environments
VocalSound - VocalSound is a free dataset consisting of 21,024 crowdsourced recordings …
VOCASET - **VOCASET** is a 4D face dataset with about 29 minutes …
VOC-MLT - We construct the long-tailed version of VOC from its 2012 …
VoiceBank + DEMAND - Noisy speech database for training speech enhancement algorithms and TTS models
VoiceBank+DEMAND - VoiceBank+DEMAND is a noisy speech database for training speech enhancement …
VOID - Visual Odometry with Inertial and Depth
Volleyball - **Volleyball** is a video action recognition dataset. It has 4830 …
voraus-AD - voraus-AD contains machine data of a collaborative robot, which moves …
VOT2014 - Visual Object Tracking Challenge 2014
VOT2016 - **VOT2016** is a video dataset for visual object tracking. It …
VOT2017 - Visual Object Tracking Challenge
VOT2018 - **VOT2018** is a dataset for visual object tracking. It consists …
VOT2019 - **VOT2019** is a Visual Object Tracking benchmark for short-term tracking …
VOT2020 - VOT2020 is a Visual Object Tracking benchmark for short-term tracking …
VOT2022 - Visual Object Tracking Challenge 2022
VoxCeleb1 - **VoxCeleb1** is an audio dataset containing over 100,000 utterances for …
VoxCeleb2 - **VoxCeleb2** is a large scale speaker recognition dataset obtained automatically …
Voxceleb-3D - A dataset for voice and 3D face structure study. It …
VoxForge - **VoxForge** is an open speech dataset that was set up …
Voxforge German - VoxForge is an open speech dataset that was set up …
VOXLINGUA107 - Language Identification Dataset
VoxPopuli - VoxPopuli is a large-scale multilingual corpus providing 100K hours of …
VQA-CE - VQA Counterexamples
VQA-CP - The **VQA-CP** dataset was constructed by reorganizing VQA v2 such …
VRD - Visual Relationship Detection dataset
VRDS - Video Raindrop and Rain Streak Removal
VSPW - Video Scene Parsing in the Wild
VSR - Visual Spatial Reasoning
Vulnerability Java Dataset - The dataset consists of two versions: $X_1$ with $P_3$ and …
VulScribeR - VulScriber: 22K+ unfiltered vul samples generated with ChatGPT via Injection

W

warpPIE10P - face dataset
Watercolor2k - Watercolor2k is a dataset used for cross-domain object detection which …
WaterScenes - A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on …
Waymo Open Dataset - The Waymo Open Dataset is comprised of high resolution sensor …
WDC-PAVE - Web Data Commons - Product Attribute Value Extraction
WDC Products - **WDC Products** is an entity matching benchmark which provides for …
Weather - Max-Planck-Institut Weather Dataset for Long-term Time Series Forecasting
WebApp1k-Duo-React - Test-driven benchmark to challenge LLMs to write long JavaScript React …
WebApp1K-React - Test-driven benchmark to challenge LLMs to write JavaScript React application …
WebLINX - Real-World Website Navigation with Multi-Turn
WebNLG - The **WebNLG** corpus comprises sets of triplets describing facts …
WebQuestions - The **WebQuestions** dataset is a question answering dataset using Freebase …
WebQuestionsSP - WebQuestions Semantic Parses Dataset
WebSRC - WebSRC: A Dataset for Web-Based Structural Reading Comprehension
WebVid - WebVid contains 10 million video clips with captions, sourced from …
WebVision - The WebVision dataset is designed to facilitate the research on …
WeChat - The **WeChat** dataset for fake news detection contains more than …
Weibo NER - The **Weibo NER** dataset is a Chinese Named Entity Recognition …
WeiboPolls - The dataset described in the provided text …
Well-being Dataset - Cambridge Well-being Dataset for Psychological Distress Analysis
WenetSpeech - WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours …
WFDD - Woven Fabric Defect Detection
WFLW - Wider Facial Landmarks in the Wild
WHAM! - WSJ0 Hipster Ambient Mixtures
WHAMR! - WHAM! with synthetic reverberated sources
WHOOPS! - WHOOPS! is a dataset and benchmark for visual commonsense. The …
WHU Building Dataset - We manually edited an aerial and a satellite imagery dataset …
WiC - Words in Context
WiC-TSV - Words-in-Context: Target Sense Verification
WIDER - Web Image Dataset for Event Recognition
WiderPerson - WiderPerson contains a total of 13,382 images with 399,786 annotations, …
WiGesture - Wireless Sensing Dataset for Gesture Recognition and People ID Identification with ESP32
Wiki - Web Traffic Time Series Forecasting
Wiki-40B - A new multilingual language model benchmark that is composed of …
WikiArt - **WikiArt** contains paintings from 195 different artists. The dataset has …
WikiBio - Wikipedia Biography Dataset
WikiCoref - WikiCoref is an English corpus annotated for anaphoric relations, where …
Wiki-CS - Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. …
Wikidata5M - Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. …
WikiGraphs - **WikiGraphs** is a dataset of Wikipedia articles each paired with …
WikiHop - **WikiHop** is a multi-hop question-answering dataset. The query of WikiHop …
WikiHow - **WikiHow** is a dataset of more than 230,000 article and …
WikiOFGraph - Wikipedia Ontology-Free Graph-Text
Wikipedia Person and Animal Dataset - This dataset gathers 428,748 person and 12,236 animal infobox with …
WikiQA - Wikipedia open-domain Question Answering
WikiSQL - **WikiSQL** consists of a corpus of 87,726 hand-annotated SQL query …
WikiTableQuestions - **WikiTableQuestions** is a question answering dataset over semi-structured tables. It …
WikiText-103 - The WikiText language modeling dataset is a collection of over …
WikiText-2 - The WikiText language modeling dataset is a collection of over …
WildDash - WildDash is a benchmark evaluation method that uses …
WildDESED - Wild Domestic Environment Sound Event Detection
WildQA - **WildQA** is a video understanding dataset of videos recorded in …
WildScenes - WildScenes is a bi-modal benchmark dataset consisting of multiple large-scale, …
Wildtrack - Wildtrack is a large-scale and high-resolution dataset. It has been …
WI-LOCNESS - Cambridge English Write & Improve & LOCNESS
Wind Tunnel and Flight Test Experiments - Our dataset comprises $23.468$ non-labelled and $356$ labelled samples where …
Wine - Wine Data Set
Win-Fail Action Understanding - First of its kind paired win-fail action understanding dataset with …
WinoGAViL - This dataset is collected via the WinoGAViL game to collect …
Winograd Automatic - Winograd
WinoGrande - WinoGrande is a large-scale dataset of 44k problems, inspired by …
Winoground - Winoground is a dataset for evaluating the ability of vision …
WiRe57 - We manually performed the task of Open Information Extraction on …
Wisconsin (48%/32%/20% fixed splits) - Node classification on Wisconsin with the fixed 48%/32%/20% splits provided …
WISE - WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic …
WIT - Wikipedia-based Image Text
WITS - Why Is This Sarcastic?
Wizard of Wikipedia - **Wizard of Wikipedia** is a large dataset with conversations directly …
WLASL - Word-Level American Sign Language
WN18 - WordNet18
WN18RR - **WN18RR** is a link prediction dataset created from WN18, which …
WNLI - Winograd NLI
WNUT 2017 - WNUT 2017 Emerging and Rare entity recognition
WNUT 2020 - WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols
WNUT-2020 Task 2 - WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
WorldFloods - WorldFloods: a newly compiled dataset of 119 globally verified flooding …
WOST - The Weakly Occluded Scene Text (WOST) dataset is a public …
WPC - Waterloo Point Cloud
WritingPrompts - WritingPrompts is a large dataset of 300K human-written stories paired …
WS353 - WordSim-353
WSC - Winograd Schema Challenge
WSJ0-2mix - **WSJ0-2mix** is a speech recognition corpus of speech mixtures using …
WSJ POS - M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building …
WSRD+ - A version of the WSRD Dataset will be used as …
WTW - Wired Table in the Wild

X

X3D - X3D is a dataset containing 15 scenes and covering 4 …
X4K1000FPS - Dataset of high-resolution (4096×2160), high-fps (1000fps) video frames with extreme …
XAlign - It consists of an extensive collection of a high quality …
xBD - The xBD dataset contains over 45,000 km² of polygon labeled pre …
XCOPA - The Cross-lingual Choice of Plausible Alternatives (**XCOPA**) dataset is a …
XD-Violence - XD-Violence is a large-scale audio-visual dataset for violence detection in …
XGLUE - **XGLUE** is an evaluation benchmark which is composed of 11 …
XImageNet-12 - XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation
XQLFW - Cross-Quality Labeled Faces in the Wild
XSum - The Extreme Summarization (**XSum**) dataset is a dataset for evaluation …
XTREME - Cross-Lingual Transfer Evaluation of Multilingual Encoders
xView3-SAR - Unsustainable fishing practices worldwide pose a major threat to marine …
XWINO - XWINO is a multilingual collection of Winograd Schemas in six …

Y

YAGO3-10 - Yet Another Great Ontology 3-10
Yahoo! Answers - The Yahoo! Answers topic classification dataset is constructed using 10 …
Yan et al - Source: [Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic …
YCB-Video - The **YCB-Video** dataset is a large-scale video dataset for 6D …
Yelp - The Yelp Dataset is a valuable resource for academic research, …
Yelp2018 - The Yelp2018 dataset is adopted from the 2018 edition of …
Yelp-Fraud - Multi-relational Graph Dataset for Yelp Spam Review Detection
YouCook2 - **YouCook2** is the largest task-oriented, instructional video dataset in the …
YouTube-8M - The **YouTube-8M** dataset is a large scale video dataset, which …
YouTube Driving - YouTube Driving Dataset contains a massive amount of real-world driving …
YouTube-Hands - **YouTube-Hands** includes 240 videos which are annotated with hand trajectories.
Youtube INRIA Instructional - Unsupervised learning from narrated instruction videos
YouTube-UGC - YouTube UGC dataset
YouTube-VIS 2021 - Video Instance Segmentation on YouTube-VIS 2021 validation
Youtube-VIS 2022 Validation - Video object segmentation has been studied extensively in the past …
YouTube-VOS 2018 - Youtube Video Object Segmentation

Z

ZEB - Zero-shot Evaluation Benchmark
ZESHEL - ZESHEL is a zero-shot entity linking dataset, which places more …
ZINC - **ZINC** is a free database of commercially-available compounds for virtual …
ZJU-RGB-P - Research on semantic segmentation of traffic scenes using color and …
Znaki - The first and only open dataset for Russian finger- …
ZS-F-VQA - The ZS-F-VQA dataset is a new split of the F-VQA …

About ML Datasets

This directory contains datasets used in machine learning research benchmarks. Each dataset page includes:

  • Dataset description and metadata
  • Associated benchmarks and tasks
  • Recent research results
  • Links to official sources and papers