ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected …
CeyMo is a novel benchmark dataset for road marking detection which covers a wide variety of challenging urban, sub-urban and …
We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …
DUO is a dataset for Underwater object detection for robot picking. The dataset contains a collection of diverse underwater images …
We introduce an object detection dataset in challenging adverse weather conditions covering 12000 samples in real-world driving scenes and 1500 …
The DroneVehicle dataset consists of a total of 56,878 images collected by the drone, half of which are RGB images, …
Paper: GridTracer: Automatic Mapping of Power Grids using Deep Learning and Overhead Imagery Authors: Bohao Huang, Jichen Yang, Artem Streltsov, …
The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e 10 …
With the advance of AI, road object detection has been a prominent topic in computer vision, mostly using perspective cameras. …
RADIATE (RAdar Dataset In Adverse weaThEr) is new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera …
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set …
Automating the creation of catalogues for radio galaxies in next-generation deep surveys necessitates the identification of components within extended sources …
The SARDet-100K dataset encompasses a total of 116,598 images, and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, …
TRR360D is based on the ICDAR2019MTD modern table detection dataset, it refers to the annotation format of the DOTA dataset. …
UAV-PDD2023: A benchmark dataset for pavement distress detection based on UAV images
UAVDB is a high-resolution RGB video dataset meticulously designed for UAV detection tasks across diverse scales and complex backgrounds. Comprising …
The 300-W is a face dataset that consists of 300 Indoor and 300 Outdoor in-the-wild images. It covers a large …
ApolloCar3DT is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …
A real-world dataset, with hyper-accurate digital counterpart & comprehensive ground-truth annotation. Dataset Content - 200 sequences (~400 mins) - 398 …
[1]: https://www.projectaria.com/datasets/ase/ "" [2]: https://facebookresearch.github.io/projectaria_tools/docs/open_datasets/aria_synthetic_environments_dataset "" [3]: https://www.projectaria.com/research/ "" Aria Synthetic Environments is a large-scale, fully simulated dataset created by …
DTU MVS 2014 is a multi-view stereo dataset, which is an order of magnitude larger in number of scenes and …
Scan2CAD is an alignment dataset based on 1506 ScanNet scans with 97607 annotated keypoints pairs between 14225 (3049 unique) CAD …
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …
ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …
AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …
The BTAD ( beanTech Anomaly Detection) dataset is a real-world industrial anomaly dataset. The dataset contains a total of 2830 …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
COCO-OOC goes beyond standard object detection to ask the question: Which objects are out-of-context (OOC)? Given an image with a …
Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 …
Click to add a brief description of the dataset (Markdown and LaTeX enabled). Provide: * a high-level explanation of the …
Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
Fishyscapes is a public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates …
Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given …
HyperKvasir dataset contains 110,079 images and 374 videos where it captures anatomical landmarks and pathological and normal findings. A total …
An abnormal activity data-set for research use that contains 4,83,566 annotated frames. Source: [Multi-timescale Trajectory Prediction for Abnormal Human Activity …
The Industrial Textile Defect Detection (ITDD) dataset includes 1885 industrial textile images categorized into 4 categories: cotton fabric, dyed fabric, …
InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicles colour images. It contains 17 …
This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held …
The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …
Includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432). Source: [Attention Based Glaucoma Detection: A …
Lost and Found is a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of …
The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
MPDD is a dataset aimed at benchmarking visual defect detection methods in industrial metal parts manufacturing. It consists of more …
MVTec 3D Anomaly Detection Dataset (MVTec 3D-AD) is a comprehensive 3D dataset for the task of unsupervised anomaly detection and …
MVTec Logical Constraints Anomaly Detection (MVTec LOCO AD) dataset is intended for the evaluation of unsupervised anomaly localization algorithms. The …
The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …
Outliers or anomalies are instances that do not conform to the norm of a dataset. Outlier detection is an important …
Multi-pose Anomaly Detection (MAD) dataset, which represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD …
This dataset contains images of unusual dangers which can be encountered by a vehicle on the road – animals, rocks, …
a dataset of time-series anomaly detection
Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …
The Shanghaitech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided …
The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and …
Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution …
The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …
Thyroid is a dataset for detection of thyroid diseases, in which patients diagnosed with hypothyroid or subnormal are anomalies against …
UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, …
The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world …
The UCR Anomaly Archive is a collection of 250 uni-variate time series collected in human medicine, biology, meteorology and industry. …
The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …
Five datasets used in NeurTraL-AD paper: \textit{RacketSports (RS).} Accelerometer and gyroscope recording of players playing four different racket sports. Each …
The code to create the dataset is available here. The dataset used in the paper is available on github - …
The VisA dataset contains 12 subsets corresponding to 12 different objects as shown in the above figure. There are 10,821 …
WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric …
voraus-AD contains machine data of a collaborative robot, which moves a can by performing an industrial pick-and-place task. The samples …
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …
The AlpacaEval set contains 805 instructions form self-instruct, open-assistant, vicuna, koala, hh-rlhf. Those were selected so that the AlpacaEval ranking …
Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …
The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …
BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …
This dataset is a combination of the following three datasets : figshare, SARTAJ dataset and Br35H This dataset contains 7022 …
The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
Common corruptions dataset for CIFAR10
Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …
Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …
The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …
We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …
A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …
The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …
The goal for ISIC 2019 is classify dermoscopic images among nine different diagnostic categories.25,331 images are available for training across …
This dataset was presented as part of the ICLR 2023 paper 𝘈 𝘧𝘳𝘢𝘮𝘦𝘸𝘰𝘳𝘬 𝘧𝘰𝘳 𝘣𝘦𝘯𝘤𝘩𝘮𝘢𝘳𝘬𝘪𝘯𝘨 𝘊𝘭𝘢𝘴𝘴-𝘰𝘶𝘵-𝘰𝘧-𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 𝘥𝘦𝘵𝘦𝘤𝘵𝘪𝘰𝘯 𝘢𝘯𝘥 𝘪𝘵𝘴 𝘢𝘱𝘱𝘭𝘪𝘤𝘢𝘵𝘪𝘰𝘯 …
Dataset Introduction In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …
This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …
The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …
The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …
The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …
MixedWM38 Dataset(WaferMap) has more than 38000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …
Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …
A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification
The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …
The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …
he RSSCN7 dataset contains satellite images acquired from Google Earth, which is originally collected for remote sensing scene classification. We …
The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …
The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …
This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …
The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …
Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …
This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …
arxiv : https://arxiv.org/abs/2304.11708 Accepted at 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …
Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …
Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …
Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …
The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …
Enlarge the dataset to understand how image background effect the Computer Vision ML model. With the following topics: Blur Background …
97 synthetic datasets consists of 97 datasets (as illustrated in the figure) and can be used to test graph-based clustering …
Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.
This dataset has 20 classes and each class has about 1000 documents. The data split for train/validation/test is 1600/200/200. We …
AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral …
A set of 10 DSC datasets (reviews of 10 products) to produce sequences of tasks. The products are Sports, Toys, …
F-CelebA - This dataset is adapted from federated learning. Federated learning is an emerging machine learning paradigm with an emphasis …
Click to add a brief description of the dataset (Markdown and LaTeX enabled). Provide: * a high-level explanation of the …
Permuted MNIST is an MNIST variant that consists of 70,000 images of handwritten digits from 0 to 9, where 60,000 …
AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …
SciERC dataset is a collection of 500 scientific abstract annotated with scientific entities, their relations, and coreference clusters. The abstracts …
Description: 10,000 People - Human Pose Recognition Data. This dataset includes indoor and outdoor scenes.This dataset covers males and females. …
Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the …
Abstract: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 …
Letter Recognition Data Set is a handwritten digit dataset. The task is to identify each of a large number of …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
Engine degradation simulation was carried out using C-MAPSS. Four different were sets simulated under different combinations of operational conditions and …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
The goal of this project is to present two new datasets that seek to expand the capability of the Learning …
Comic2k is a dataset used for cross-domain object detection which contains 2k comic images with image and instance-level annotations. Image …
DomainNet is a dataset of common objects in six different domain. All domains include 345 categories (classes) of objects such …
Foggy Cityscapes is a synthetic foggy dataset which simulates fog on real scenes. Each foggy image is rendered with a …
The ImageCLEF-DA dataset is a benchmark dataset for ImageCLEF 2014 domain adaptation challenge, which contains three domains: Caltech-256 (C), ImageNet …
The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological …
Dataset release for the BMVC 2021 Paper "Few-Shot Domain Adaptation for Low Light RAW Image Enhancement" Abstract: Enhancing practical low …
The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset …
Office-Caltech-10 a standard benchmark for domain adaptation, which consists of Office 10 and Caltech 10 datasets. It contains the 10 …
Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …
PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 …
SIM10k is a synthetic dataset containing 10,000 images, which is rendered from the video game Grand Theft Auto V (GTA5). …
This corpus has been collected from free or free for research sources at the Internet: - A collection of 425 …
Large, multimodal biometric dataset: It contains still images and videos of over 1,000 people captured at various ranges (up to …
A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstraint face verification. Source: CALFW
The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical …
A renovation of Labeled Faces in the Wild (LFW), the de facto standard testbed for unconstraint face verification. There are …
The COVID-19 pandemic raises the problem of adapting face recognition systems to the new reality, where people may wear surgical …
The color FERET database is a dataset for face recognition. It contains 11,338 color images of size 512×768 pixels captured …
The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …
The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of the 5749 identities with …
During the COVID-19 coronavirus epidemic, almost everyone wears a facial mask, which poses a huge challenge to face recognition. Traditional …
The Masked LFW (MLFW), based on Cross-Age LFW (CALFW) database, is built using a simple but effective tool that generates …
MORPH is a facial age estimation dataset, which contains 55,134 facial images of 13,617 subjects ranging from 16 to 77 …
An evaluation protocol for face verification focusing on a large intra-pair image quality difference. Real-world face recognition applications often deal …
Matanga Darknet — 2025 Access Guide As internet censorship intensifies, Shadow Marketplaces remain crucial tools for anonymous transactions. Matanga Darknet …
CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprised of over 53,000+ multiple choice questions to identify the …
The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …
Eurosat is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …
"We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the …
MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …
Microsoft Research Paraphrase Corpus (MRPC) is a corpus consists of 5,801 sentence pairs collected from newswire articles. Each pair is …
MedConceptsQA - Open Source Medical Concepts QA Benchmark The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA
The MedNLI dataset consists of the sentence pairs developed by Physicians from the Past Medical History section of MIMIC-III clinical …
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …
The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …
The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …
UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …
This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified …
Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …
Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute base classification and zero-shot learning. …
When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite of this overflow …
Click to add a brief description of the dataset (Markdown and LaTeX enabled). Provide: * a high-level explanation of the …
Logical rules are a popular knowledge representation language in many domains. Recently, neural networks have been proposed to support the …
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …
This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems …
RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts. Motivation The …
The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning. Motivation The dataset …
The COCO-MLT is created from MS COCO-2017, containing 1,909 images from 80 classes. The maximum of training number per class …
Extended GTEA Gaze+ EGTEA Gaze+ is a large-scale dataset for FPV actions and gaze. It subsumes GTEA Gaze+ and comes …
ImageNet Long-Tailed is a subset of /dataset/imagenet dataset consisting of 115.8K images from 1000 categories, with maximally 1280 images per …
LoT-insts contains over 25k classes whose frequencies are naturally long-tail distributed. Its test set from four different subsets: many-, medium-, …
MIMIC-CXR-LT. We construct a single-label, long-tailed version of MIMIC-CXR in a similar manner. MIMIC-CXR is a multi-label classification dataset with …
NIH-CXR-LT. NIH ChestXRay14 contains over 100,000 chest X-rays labeled with 14 pathologies, plus a “No Findings” class. We construct a …
Places-LT has an imbalanced training set with 62,500 images for 365 classes from Places-2. The class frequencies follow a natural …
We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with …
mini-ImageNet was proposed by Matching networks for one-shot learning for few-shot learning evaluation, in an attempt to have a dataset …
This dataset is composed of 7,753 pairs of whole slide images and their corresponding diagnostic reports, extracted from the TCGA …
IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The …
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …
DyML-Animal is based on animal images selected from ImageNet-5K [1]. It has 5 semantic scales (i.e., classes, order, family, genus, …
DyML-Product is derived from iMaterialist-2019, a hierarchical online product dataset. The original iMaterialist-2019 offers up to 4 levels of hierarchical …
DyML-Vehicle merges two vehicle re-ID datasets PKU VehicleID [1], VERI-Wild [1]. Since these two datasets have only annotations on the …
In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop Clothes Retrieval. This is a large subset of DeepFashion, containing large …
Stanford Online Products (SOP) dataset has 22,634 classes with 120,053 product images. The first 11,318 classes (59,551 images) are split …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …
The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …
MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and …
PCBA dataset 11 is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening …
QM7 dataset is a subset of the GDB-13 database. GDB-13 contains nearly 1 billion stable and synthetically accessible organic molecules. …
QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited state …
QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …
SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …
The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is …
ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …
MLRSNet is a a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It provides different perspectives of …
The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. The dataset contains 1,104 (80.6%) …
The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated …
OpenImages V6 is a large-scale dataset , consists of 9 million training images, 41,620 validation samples, and 125,456 test samples. …
PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …
The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …
This data is for the Mis2-KDD 2021 under review paper: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of …
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …
The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles producd by Reuters …
The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …
CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …
ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …
The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both …
QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …
The UTKFace dataset is a large-scale face dataset with long age span (range from 0 to 116 years old). The …
The StarCraft Multi-Agent Challenges+ requires agents to learn completion of multi-stage tasks and usage of environmental factors without precise reward …
The dataset consists of 400 whole-slide images (WSIs) of lymph node sections stained with hematoxylin and eosin (H&E), collected from …
The Elephant MIL dataset is a benchmark used in multiple instance learning (MIL), which falls under the broader categories of …
The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …
The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
CINIC-10 is a dataset for image classification. It has a total of 270,000 images, 4.5 times that of CIFAR-10. It …
The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …
The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1010 lung …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
NAS-Bench-101 is the first public architecture dataset for NAS research. To build NASBench-101, the authors carefully constructed a compact, yet …
The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have a large variations …
The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …
The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …
Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …
Introduced by Singh, Sumeet S.. “Teaching Machines to Code: Neural Markup Generation with Visual Attention.” ArXiv abs/1802.05415 (2018): n. pag. …
A prebuilt dataset for OpenAI's task for image-2-latex system. Includes total of ~100k formulas and images splitted into train, validation …
The original dataset for "ECG5000" is a 20-hour long ECG downloaded from Physionet. The name is BIDMC Congestive Heart Failure …
Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ …
The goal for ISIC 2019 is classify dermoscopic images among nine different diagnostic categories.25,331 images are available for training across …
The dataset contains the annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and the …
The dataset contains historical financial transactions, including time, category and cost fields. There are 50000 clients, 205 categories and 43.7M …
The dataset includes time-stamped user product reviews behavior from January, 2008 to October, 2018. Each user has a sequence of …
The Memetracker corpus contains articles from mainstream media and blogs from August 1 to October 31, 2008 with about 1 …
RETWEET is a dataset of tweets and overall predominant sentiment of their replies. SUMMARY ------ WHAT: Message-level Polarity Classification. GOAL: …
This dataset contains time-stamped user retweet event sequences. The events are categorized into 3 types: retweets by “small,” “medium” and …
The dataset has two years of user awards on a question-answering website: each user received a sequence of badges and …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …
The IJB-B dataset is a template-based face dataset that contains 1845 subjects with 11,754 images, 55,025 frames and 7,011 videos …
The IJB-C dataset is a video-based face recognition dataset. It is an extension of the IJB-A dataset with about 138,000 …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of the 5749 identities with …
A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …
Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns …
It contains about 28K medium quality animal images belonging to 10 categories: dog, cat, horse, spyder, butterfly, chicken, sheep, cow, …
SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks. Source: Allen Institute for AI
HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …
In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with …
The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of a new Marine Video …
The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …
Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer. Source: [OK-VQA: A …
This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …
PubMedQA-MetaGen: Metadata-Enriched PubMedQA Corpus Dataset Summary PubMedQA-MetaGen is a metadata-enriched version of the PubMedQA biomedical question-answering dataset, created using the …
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …
The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out …
The BIOSSES data set comprises total 100 sentence pairs all of which were selected from the "[TAC2 Biomedical Summarization Track …
CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for …
The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
The Corpus of Linguistic Acceptability (CoLA) consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The …
According to the WHO, World report on vision 2019, the number of visually impaired people worldwide is estimated to be …
The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by …
Animals with Attributes 2 (AwA2) is a dataset for benchmarking transfer-learning algorithms, such as attribute base classification and zero-shot learning. …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
The COCO-MLT is created from MS COCO-2017, containing 1,909 images from 80 classes. The maximum of training number per class …
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …
The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …
The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …
Eurosat is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …
FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which …
The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …
transform the ImageNet-1K classification datatset for Chinese models by translating labels and prompts into Chinese.
This dataset contains 118,081 short video clips extracted from 202 movies. Each video has a caption, either extracted from the …
The MIT-States dataset has 245 object classes, 115 attribute classes and ∼53K images. There is a wide range of objects …
The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …
The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …
MedConceptsQA - Open Source Medical Concepts QA Benchmark The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA
Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen to be flower commonly …
The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large …
The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …
The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …
The SUN Attribute dataset consists of 14,340 images from 717 scene categories, and each category is annotated with a taxonomy …
The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …
The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …
The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows …
UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …
We construct the long-tailed version of VOC from its 2012 train-val set. It contains 1,142 images from 20 classes, with …
An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …
CALVIN (Composing Actions from Language and Vision), is an open-source simulated benchmark to learn long-horizon language-conditioned robot manipulation tasks.
BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …
HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are …
WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …
Median house prices for California districts derived from the 1990 census. About Dataset Context This is the dataset used in …
In this dataset we added [Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower …
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age …
This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. …