FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along β¦
Music21 is an untrimmed video dataset crawled by keyword query from Youtube. It contains music performances belonging to 21 categories. β¦
COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face β¦
Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It includes twenty high-quality human β¦
This dataset focuses on heavily occluded human with comprehensive annotations including bounding-box, humans pose and instance mask. This dataset contains β¦
The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding β¦
The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 β¦
The Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visit β¦
The Jobs dataset by LaLonde [36] is a widely used benchmark in the causal inference community, where the treatment is β¦
A large-scale cross-lingual dataset for entity alignment
The DBP2.0 dataset can be downloaded from the figshare repository. It has three entity alignment settings, i.e., ZH-EN, JA-EN and β¦
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment β¦
A diverse set of 21 relations, each covering a different set of subject-entities and a complete list of ground truth β¦
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used β¦
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used β¦
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used β¦
FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, β¦
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations β¦
The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of β¦
JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive β¦
Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus β¦
The MMLU-Pro dataset is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark. It's designed to be more β¦
GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every β¦
A new large-scale geometry problem-solving dataset - 3,002 multi-choice geometry problems - dense annotations in formal language for the diagrams β¦
GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the β¦
A new large scale plane geometry problem solving dataset called PGPS9K, labeled both fine-grained diagram annotation and interpretable solution program.
MuSiQue-Ans is a new multihop QA dataset with ~25K 2-4 hop questions using seed questions from 5 existing single-hop datasets.
MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph β¦
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links β¦
This datasets is a subset of the Amazon reviews dataset which contain Fashion related products
This datasets is a subset of the Amazon reviews dataset which contain Men related products
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This β¦
The Ciao dataset contains rating information of users given to items, and also contain item category information. The data comes β¦
Delicious : This data set contains tagged web pages retrieved from the website delicious.com. Source: [Text segmentation on multilabel documents: β¦
We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based β¦
The Epinions dataset is built form a who-trust-whom online social network of a general consumer review site Epinions.com. Members of β¦
Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and β¦
The Pinterest dataset contains more than 1 million images associated to Pinterest usersβ who have βpinnedβ them. Source: https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf
This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. β¦
ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of β¦
The WeChat dataset for fake news detection contains more than 20k news labelled as fake news or not.
The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world β¦
The Yelp2018 dataset is adopted from the 2018 edition of the yelp challenge. Wherein local businesses like restaurants and bars β¦
Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns β¦
Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a β¦
Consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Source: [BIGPATENT: A Large-Scale Dataset β¦
BillSum is the first dataset for summarization of US Congressional and California state bills. The BillSum dataset consists of three β¦
BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such β¦
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics. This work β¦
Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form training, β¦
GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by β¦
The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), β¦
The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both β¦
LCSTS is a large corpus of Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which β¦
MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 β¦
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions. Source: https://www.aclweb.org/anthology/P19-1215.pdf Image Source: https://www.aclweb.org/anthology/P19-1215.pdf
MeetingBank, a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets. It contains β¦
Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms β¦
Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model OrangeSum is a single-document extreme summarization dataset with two tasks: title and β¦
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. β¦
QMSum is a new human-annotated benchmark for query-based multi-domain meeting summarisation task, which consists of 1,808 query-summary pairs over 232 β¦
Reddit TIFU dataset is a newly collected Reddit dataset, where TIFU denotes the name of /r/tifu subbreddit. There are 122,933 β¦
A new dataset with abstractive dialogue summaries. Source: SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base β¦
The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create β¦
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from β¦
This is a dataset for evaluating summarisation methods for research papers. Source: [A Discourse-Aware Attention Model for Abstractive Summarization of β¦