COCO-WholeBody is an extension of the COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …
Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It includes twenty high-quality human …
This dataset focuses on heavily occluded humans, with comprehensive annotations including bounding boxes, human pose, and instance masks. This dataset contains …
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. …
Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations …
Existing arithmetic benchmarks have a limited number of multiple-choice questions. To address this gap, MathMC is created including 1,000 Chinese …
Existing arithmetic benchmarks have a limited number of True-or-False questions. To address this gap, MathToF is created including 1,000 Chinese …
Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …
The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …
BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …
This dataset is a combination of the following three datasets: figshare, the SARTAJ dataset, and Br35H. This dataset contains 7022 …
The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …
Common corruptions dataset for CIFAR10
Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …
Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …
The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …
We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …
A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …
The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …
The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …
This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …
In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …
This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …
The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …
The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …
The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …
MixedWM38 Dataset(WaferMap) has more than 38000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …
Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …
A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification
The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …
The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …
The RSSCN7 dataset contains satellite images acquired from Google Earth, originally collected for remote sensing scene classification. We …
The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …
The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …
This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …
The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …
Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …
This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …
arXiv: https://arxiv.org/abs/2304.11708 (accepted at the 29th International Congress on Sound and Vibration, ICSV29). The drone has been used for various …
Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …
Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …
Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …
The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …
Enlarge the dataset to understand how image backgrounds affect computer vision ML models. With the following topics: Blur Background …
The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS …
A new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new …
CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project from the Carnegie Mellon University NeuLab and Strudel …
The CoNaLa Extended With Question Text is an extension to the original CoNaLa Dataset proposed in …
CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode. It consists of programming problems, …
In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are …
The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development, and 1,805 test annotations. Each …
The FloCo dataset contains 11,884 flowchart images and their corresponding Python code.
This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained …
Extension test cases of HumanEval, as well as generated code.
The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, …
Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving …
RES-Q is a natural language instruction-based benchmark for evaluating $\textbf{R}$epository $\textbf{E}$diting $\textbf{S}$ystems, which consists of 100 handcrafted repository editing tasks …
Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources and is the largest collection of shellcodes …
TACO (Topics in Algorithmic Code generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more …
$\textbf{Turbulence}$ is a new benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code …
Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both …
Test-driven benchmark to challenge LLMs to write JavaScript React application GitHub Script
Test-driven benchmark to challenge LLMs to write long JavaScript React application GitHub Script
WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …
The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …
CommonsenseQA is a dataset for the commonsense question answering task. The dataset consists of 12,247 questions with 5 choices each. …
Choice of Plausible Alternatives for Russian language (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense …
A Winograd schema is a pair of sentences that differ in only one or two words and that contain an …
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of …
Russian reading comprehension with Commonsense reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of …
Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …
This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …
WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …
Engine degradation simulation was carried out using C-MAPSS. Four different sets were simulated under different combinations of operational conditions and …
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for the fine-grained visual categorization task. It contains 11,788 images of …
PMC-VQA is a large-scale medical visual question-answering dataset that contains 227k VQA pairs of 149k images that cover various modalities …
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …
The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an …
This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems …
RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts. Motivation The …
The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning. Motivation The dataset …
514 algebra word problems and associated equation systems gathered from Algebra.com.
By perturbing the widely used GSM8K dataset, an adversarial dataset for grade-school math called GSM-Plus is created. Motivated by the …
MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution …
MAWPS is an online repository of Math Word Problems, to provide a unified testbed to evaluate different algorithms. MAWPS allows …
Math23K is a dataset created for math word problem solving; it contains 23,162 Chinese problems crawled from the Internet. Refer …
MathQA significantly enhances the AQuA dataset with fully-specified operational programs. Source: [MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based …
This repository contains the code, data, and models of the paper titled **"Math Word Problem Solving by Generating Linguistic Variants …
A challenge set for elementary-level Math Word Problems (MWP). An MWP consists of a short Natural Language narrative that describes …
GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every …
A new large-scale geometry problem-solving dataset - 3,002 multi-choice geometry problems - dense annotations in formal language for the diagrams …
GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the …
A new large-scale plane geometry problem solving dataset called PGPS9K, labeled with both fine-grained diagram annotations and interpretable solution programs.
The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is …
ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …
MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It provides different perspectives of …
The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. The dataset contains 1,104 (80.6%) …
The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated …
OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. …
PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …
We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new …
Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math …
Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data …
BioNLI is a dataset in biomedical natural language inference. This dataset contains abstracts from biomedical literature and mechanistic premises generated …
The CommitmentBank is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment …
Natural Language Inference (NLI), also called Textual Entailment, is an important task in NLP with the goal of determining the …
The HANS (Heuristic Analysis for NLI Systems) dataset contains many examples where the heuristics fail. Source: [Right for the …
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource …
KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for …
KUAKE Query Title Relevance, a dataset used to estimate the relevance of the title of a query document, is used …
LiDiRus is a diagnostic dataset that covers a large volume of linguistic phenomena, while allowing you to evaluate information systems …
MED is a new evaluation dataset that covers a wide range of monotonicity reasoning that was created by crowdsourcing and …
Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …
The MedNLI dataset consists of the sentence pairs developed by Physicians from the Past Medical History section of MIMIC-III clinical …
The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433K sentence pairs. Its size and mode of collection are modeled closely …
This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP), e.g. words …
The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …
The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an …
The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …
The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …
The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence-pairs manually labeled as entailment, contradiction, and neutral. Premises are …
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct …
Textual Entailment Recognition has been proposed recently as a generic task that captures major semantic inference needs across many NLP …
TabFact is a large-scale dataset which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables, their relations …
The WNLI dataset is a part of the GLUE benchmark used for Natural Language Inference (NLI). It contains pairs of …
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense …
e-SNLI is used for various goals, such as obtaining full sentence justifications of a model's decisions, improving universal sentence representations …
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …
DeepFix consists of a program repair dataset (fix compiler errors in C programs). It enables research around automatically fixing programming …
HumanEvalPack is an extension of OpenAI's HumanEval to cover 6 total languages across 3 tasks. The evaluation suite is fully …
AviationQA is introduced in the paper titled There is No Big Brother or Small Brother: Knowledge Infusion in Language Models …
BIG-Bench Hard (BBH) is a subset of the BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a …
BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …
The Bamboogle dataset is a collection of questions that was constructed to investigate the ability of language models to perform …
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), …
BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …
The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …
The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. …
CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple choice questions to identify the …
The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions …
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK. Motivation The task can be …
CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case …
CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure …
A filtered version of CronQuestions that can better demonstrate the model’s inference ability for complex temporal questions.
ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set …
ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable …
ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …
CRONQUESTIONS, the Temporal KGQA dataset consists of two parts: a KG with temporal annotations, and a set of natural language …
Discrete Reasoning Over Paragraphs (DROP) is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a …
DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring: they are generated in unprompted and …
DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in …
EgoTask QA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions generated over 2K egocentric videos. It …
FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …
A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ …
FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …
FinQA is a new large-scale dataset with Question-Answering pairs over Financial reports, written by financial experts. The dataset contains 8,281 …
GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …
HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …
HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …
A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and …
JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on …
A large-scale dataset for Complex KBQA. Source: [KQA Pro: A Large-Scale Dataset with Interpretable Programs and Accurate SPARQLs for Complex …
MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively …
The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems. …
The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …
MapEval-Textual contains 300 question-answer pairs. The task is to answer questions by fetching the necessary information using external Map APIs.
MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer question …
This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This …
Multiple choice question answering based on the United States Medical License Exams (USMLE). The dataset is collected from the professional …
The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written …
A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …
MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. …
MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions, i.e., questions that can be answered by …
MULTITQ is a large-scale dataset featuring ample relevant facts and multiple temporal granularities.
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …
The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers. Source: …
The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …
The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs. * Documents are CNN news articles. …
The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to …
OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. …
PIQA is a dataset for commonsense reasoning, and was created to investigate the physical knowledge of existing models in NLP. …
We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …
PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship …
PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please …
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …
QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language …
Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer …
QuALITY (Question Answering with Long Input Texts, Yes!) is a multiple-choice question answering dataset for long document comprehension. The dataset …
Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …
The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …
Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from …
RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts. Motivation RuOpenBookQA …
SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model …
Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus …
SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from an …
The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …
Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …
A large-scale analogue of Stanford SQuAD in the Russian language is a valuable resource that has not been …
The “Mental Health” forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts …
SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding …
A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. …
StrategyQA is a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred …
TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress of QA research …
Existing benchmarks for temporal QA focus on a single information source (either a KB or a text corpus), and include …
TempQA-WD is a benchmark dataset for temporal reasoning designed to encourage research in extending the present approaches to target a …
Here, we take a key step in this direction and release a new benchmark, TempQuestions, containing 1,271 questions, that are …
Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class …
Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Source: …
Text Retrieval Conference Question Answering (TrecQA) is a dataset created from the TREC-8 (1999) to TREC-13 (2004) Question Answering tracks. …
TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …
TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises …
With social media, where lots of news and real-time events are reported, becoming increasingly popular, developing automated question answering …
UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists …
The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …
The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …
WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …
WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …
The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain …
WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …
WikiTableQuestions is a question answering dataset over semi-structured tables. It consists of question-answer pairs on HTML tables, and was …
We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …
CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following …
The Parkinson’s Progression Markers Initiative (PPMI) dataset originates from an observational clinical and longitudinal study comprising evaluations of people with …
Fashion trends are constantly evolving, but a trained eye can estimate with some accuracy the signature elements of a particular …
The ability to jointly understand the geometry of objects and plan actions for manipulating them is crucial for intelligent agents. …
The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet manipulation tasks that are applied to these workbooks. These tasks …
The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides …
DramaQA focuses on two perspectives: 1) Hierarchical QAs as an evaluation metric based on the cognitive developmental stages of …
To collect How2QA for the video QA task, the same set of selected video clips is presented to another group of …
We contribute an IntentQA dataset with diverse intents in daily social activities. We utilize NExT-QA as the source dataset to …
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, which consists of 10,000 …
The MSRVTT-MC (Multiple Choice) dataset is a video question-answering dataset created based on the MSR-VTT dataset. It consists of 2,990 …
The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …
The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …
MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was introduced to evaluate the comprehension capabilities of Multi-modal Large Language …
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …
OVBench is a benchmark tailored for real-time video understanding. Memory, Perception, and Prediction of Temporal Contexts: questions are framed …
Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos …
Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition …
How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. …
SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a dataset which takes the form of video …
The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. …
TVBench is a new benchmark specifically created to evaluate temporal understanding in video QA. We identified three main issues in …
The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows …
VLEP contains 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog …
WildQA is a video understanding dataset of videos recorded in outside settings. The dataset can be used to evaluate models …
An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …
The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs. It employs a combination of …
Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. We hope it can help us better …
The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …
NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset …
Visual Analogies of Situation Recognition (VASR) is a dataset for visual analogical mapping, adapting the classical word-analogy task into the …
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial …
This dataset is gathered via the WinoGAViL game, which collects challenging vision-and-language associations. Inspired by the popular card game Codenames, …
Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two …