FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along …
Music21 is an untrimmed video dataset crawled by keyword query from Youtube. It contains music performances belonging to 21 categories. …
The InterHand2.6M dataset is a large-scale real-captured dataset with accurate GT 3D interacting hand poses, used for 3D hand pose …
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …
AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …
The BTAD (beanTech Anomaly Detection) dataset is a real-world industrial anomaly dataset. The dataset contains a total of 2830 …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
COCO-OOC goes beyond standard object detection to ask the question: Which objects are out-of-context (OOC)? Given an image with a …
Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured in CUHK campus avenue with 30652 …
Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
Fishyscapes is a public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates …
Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given …
The HyperKvasir dataset contains 110,079 images and 374 videos that capture anatomical landmarks and pathological and normal findings. A total …
An abnormal activity dataset for research use that contains 483,566 annotated frames. Source: [Multi-timescale Trajectory Prediction for Abnormal Human Activity …
The Industrial Textile Defect Detection (ITDD) dataset includes 1885 industrial textile images categorized into 4 categories: cotton fabric, dyed fabric, …
InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicles colour images. It contains 17 …
This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held …
The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred …
Includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432). Source: [Attention Based Glaucoma Detection: A …
Lost and Found is a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of …
The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
MPDD is a dataset aimed at benchmarking visual defect detection methods in industrial metal parts manufacturing. It consists of more …
MVTec 3D Anomaly Detection Dataset (MVTec 3D-AD) is a comprehensive 3D dataset for the task of unsupervised anomaly detection and …
MVTec Logical Constraints Anomaly Detection (MVTec LOCO AD) dataset is intended for the evaluation of unsupervised anomaly localization algorithms. The …
The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes …
Outliers or anomalies are instances that do not conform to the norm of a dataset. Outlier detection is an important …
Multi-pose Anomaly Detection (MAD) dataset, which represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD …
This dataset contains images of unusual dangers which can be encountered by a vehicle on the road – animals, rocks, …
A dataset for time-series anomaly detection.
Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits …
The ShanghaiTech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided …
The ShanghaiTech Campus dataset has 13 scenes with complex light conditions and camera angles. It contains 130 abnormal events and …
Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution …
The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …
Thyroid is a dataset for detection of thyroid diseases, in which patients diagnosed with hypothyroid or subnormal are anomalies against …
UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, …
The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world …
The UCR Anomaly Archive is a collection of 250 uni-variate time series collected in human medicine, biology, meteorology and industry. …
The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd …
Five datasets used in the NeurTraL-AD paper: RacketSports (RS). Accelerometer and gyroscope recordings of players playing four different racket sports. Each …
The code to create the dataset is available here. The dataset used in the paper is available on github - …
The VisA dataset contains 12 subsets corresponding to 12 different objects as shown in the above figure. There are 10,821 …
WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric …
voraus-AD contains machine data of a collaborative robot, which moves a can by performing an industrial pick-and-place task. The samples …
The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined …
The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from …
Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …
The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …
BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …
This dataset is a combination of the following three datasets: figshare, SARTAJ dataset, and Br35H. This dataset contains 7022 …
The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
Common corruptions dataset for CIFAR10
Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …
Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …
The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …
We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …
A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …
The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …
The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …
This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …
In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …
This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …
The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …
The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …
The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …
MixedWM38 Dataset (WaferMap) has more than 38000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …
Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …
A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification
The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …
The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …
The RSSCN7 dataset contains satellite images acquired from Google Earth, which were originally collected for remote sensing scene classification. We …
The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …
The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …
This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …
The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …
Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …
This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …
arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …
Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …
Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …
Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …
The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …
Enlarge the dataset to understand how image background affects the computer vision ML model, with the following topics: Blur Background …
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …
The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined …
The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from …
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …
AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral …
The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …
COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators …
CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …
The DIGITS dataset consists of 1797 8×8 grayscale images (1439 for training and 360 for testing) of handwritten digits. Source: …
ENZYMES is a dataset of 600 protein tertiary structures obtained from the BRENDA enzyme database. The ENZYMES dataset contains 6 …
Lifespan HCP Release 2.0 includes cross-sectional visit 1 (V1) preprocessed structural and functional imaging data, unprocessed V1 imaging data for …
IMDB-BINARY is a movie collaboration dataset that consists of the ego-networks of 1,000 actors/actresses who played roles in movies in …
IMDB-MULTI is a relational dataset that consists of a network of 1000 actors or actresses who played roles in movies …
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …
In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. …
The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …
Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen. Source: [Hierarchical …
The NCI1 dataset comes from the cheminformatics domain, where each input graph is used as representation of a chemical compound: …
TUDataset: A collection of benchmark datasets for learning with graphs
A dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. Source: [OASIS: …
PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids and two …
PTC is a collection of 344 chemical compounds represented as graphs which report the carcinogenicity for rats. There are 19 …
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …
Reddit12k contains 11929 graphs each corresponding to an online discussion thread where nodes represent users, and an edge represents the …
REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an …
SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …
This dataset accompanies the paper "Learning the mechanisms of network growth" by the same authors. The dataset contains 6733 networks …
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …
UK Biobank participants have generously provided a very wide range of information about their health and well-being since recruitment began …
The Gossipcop variant of the UPFD dataset for benchmarking. Please refer to the UPFD dataset for more details of the …
The PolitiFact variant of the UPFD dataset for benchmarking. Please refer to the UPFD dataset for more details of the …
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived …
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …
The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …
RARE consists of English AMR pairs with similarity scores that reflect the structural differences between them. Given that AMRs are …
SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger …
QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …
The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced …
ZINC is a free database of commercially-available compounds for virtual screening. ZINC contains over 230 million purchasable compounds in ready-to-dock, …
The GlassTemp dataset is collected from Polyinfo. It uses monomers as polymer graphs to predict the property of glass transition …
PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. Based on the PubChemQC, we define a meaningful …
QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …
ZINC is a free database of commercially-available compounds for virtual screening. ZINC contains over 230 million purchasable compounds in ready-to-dock, …
CoMA contains 17,794 meshes of the human face in various expressions. Source: DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects …
The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. …
COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …
A dataset with 3200 images (200 for each number quantity on each hand).
Includes 100K depth images under challenging scenarios. Source: Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning
The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an …
Open Catalyst 2020 is a dataset for catalysis in chemical engineering. Focusing on molecules that are important in renewable energy …
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …
FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …
JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive …
Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus …
The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and are divided into three classes (Database, …
The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated …
The Aristo Tuple KB contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, …
The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of …
COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators …
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …
CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …
The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of …
Bio-decagon is a dataset for polypharmacy side effect identification problem framed as a multirelational link prediction problem in a two-layer …
We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …
The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of …
FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …
The GDELT Project is a remarkable initiative that monitors our world by analyzing global news from various sources. Here are …
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate …
KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a …
NELL-995 KG Completion Dataset
protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to …
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …
SINS is a database of continuous real-life audio recordings in a home environment. The home is a vacation home and …
This is a benchmark set for Traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. …
The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding …
The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triplets. It was found …
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …
Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. …
YAGO3-10 is a benchmark dataset for knowledge base completion. It is a subset of YAGO3 (which itself is an extension of …
The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …
The Epinions dataset is built from a who-trust-whom online social network of a general consumer review site Epinions.com. Members of …
The Slashdot dataset is a relational dataset obtained from Slashdot. Slashdot is a technology-related news website known for its specific …
The Maximum Unbiased Validation (MUV) dataset is a benchmark dataset selected from PubChem BioAssay. It was created by applying a …
MoleculeNet is a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and …
PCBA dataset 11 is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening …
QM7 dataset is a subset of the GDB-13 database. GDB-13 contains nearly 1 billion stable and synthetically accessible organic molecules. …
QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited state …
QM9 provides quantum chemical properties (at DFT level) for a relevant, consistent, and comprehensive chemical space of small organic molecules. …
SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and …
The Tox21 data set comprises 12,060 training samples and 647 test samples that represent chemical compounds. There are 801 "dense …
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The …
AMZ Computers is a co-purchase graph extracted from Amazon, where nodes represent products, edges represent the co-purchased relations of products, …
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the …
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node …
CLUSTER is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social …
Classifying all cells in an organ is a relevant and difficult problem from plant developmental biology. We here abstract the …
Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …
Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …
Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.
In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. …
This is the large version of the MuMiN dataset.
This is the medium version of the MuMiN dataset.
This is the small version of the MuMiN dataset.
NELL is a dataset built from the Web via an intelligent agent called Never-Ending Language Learner. This agent attempts to …
PATTERN is a node classification task generated with Stochastic Block Models, which are widely used to model communities in social …
protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to …
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in …
Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …
The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label …
Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.
Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.
Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity.
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes …
Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.
Yelp-Fraud is a multi-relational graph dataset built upon the Yelp spam review dataset, which can be used in evaluating graph-based …
amazon-ratings is a product co-purchasing network based on data from SNAP datasets
minesweeper is a synthetic graph emulating the eponymous game.
Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.
Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.
Tolokers is a network of crowdsourcing platform workers based on data provided by Toloka.
The original dataset for "ECG5000" is a 20-hour long ECG downloaded from Physionet. The name is BIDMC Congestive Heart Failure …
Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ …
PointCloud-C is the very first test-suite for point cloud robustness analysis under corruptions. - Two sets: ModelNet-C for point cloud …
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links …
This dataset is a subset of the Amazon reviews dataset that contains fashion-related products.
This dataset is a subset of the Amazon reviews dataset that contains men-related products.
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This …
The Ciao dataset contains ratings that users gave to items, and also contains item category information. The data comes …
Delicious: This data set contains tagged web pages retrieved from the website delicious.com. Source: [Text segmentation on multilabel documents: …
We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …
The Epinions dataset is built from a who-trust-whom online social network of a general consumer review site Epinions.com. Members of …
Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and …
The Pinterest dataset contains more than 1 million images associated with Pinterest users who have “pinned” them. Source: https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf
This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …
ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …
The WeChat dataset for fake news detection contains more than 20k news labelled as fake news or not.
The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …
The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars …
AnoShift is a large-scale anomaly detection benchmark, which focuses on splitting the test data based on its temporal distance to …
The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …
This is a synthetic dataset for defect detection on textured surfaces. It was originally created for a competition at the …
Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …
The dataset is constructed from images of defective production items that were provided and annotated by Kolektor Group d.o.o.. The …
KolektorSDD2 is a surface-defect detection dataset with over 3000 images containing several types of defects, obtained while addressing a real-world …
The PRONTO heterogeneous benchmark dataset is based on an industrial-scale multiphase flow facility. It includes data from heterogeneous sources, including …
The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …
Soil Moisture Active Passive (SMAP) dataset is a dataset of soil samples and telemetry information using the Mars rover by …
TIMo (Time-of-Flight Indoor Monitoring) is a dataset of infrared and depth videos intended for use in Anomaly Detection and …
The code to create the dataset is available here. The dataset used in the paper is available on github - …