Machine Learning Benchmarks

Browse 276 benchmarks across 29 tasks
Browse by Category

1 Image, 2*2 Stitching

FQL-Driving

FQL-driving

📊 1 result
📏 Metrics: 0..5sec

2D Human Pose Estimation

COCO-WholeBody

COCO-WholeBody is an extension of COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

📊 14 results
📏 Metrics: WB, body, foot, face, hand

Human-Art

Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It includes twenty high-quality human …

📊 10 results
📏 Metrics: AP, AP (gt bbox), Validation AP

OCHuman

This dataset focuses on heavily occluded humans, with comprehensive annotations including bounding boxes, human pose, and instance masks. This dataset contains …

📊 10 results
📏 Metrics: Test AP, Validation AP

Analogical Similarity

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Arithmetic Reasoning

GSM8K

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. …

📊 144 results
📏 Metrics: Accuracy, Parameters (Billion)

Game of 24

Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations …

📊 1 result
📏 Metrics: Success
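The Success metric here is simply the fraction of instances solved. As an illustration only (not the benchmark's official harness), a brute-force checker for whether four numbers can reach 24 with the basic operations might look like this; exact rational arithmetic avoids float round-off in divisions:

```python
from fractions import Fraction

def solve_24(nums, target=24):
    """Brute-force check: can the given numbers reach `target` with +, -, *, /?

    Uses Fraction so expressions like 8 / 3 are handled exactly.
    """
    def search(vals):
        if len(vals) == 1:
            return vals[0] == target
        # Pick an ordered pair, combine it, and recurse on the smaller list.
        for i in range(len(vals)):
            for j in range(len(vals)):
                if i == j:
                    continue
                rest = [vals[k] for k in range(len(vals)) if k not in (i, j)]
                a, b = vals[i], vals[j]
                results = [a + b, a - b, a * b]
                if b != 0:
                    results.append(a / b)
                if any(search(rest + [r]) for r in results):
                    return True
        return False

    return search([Fraction(n) for n in nums])
```

For example, 4, 7, 8, 8 is solvable via (7 - 8/8) * 4 = 24, while 1, 1, 1, 1 is not.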

MathMC

Existing arithmetic benchmarks have a limited number of multiple-choice questions. To address this gap, MathMC is created including 1,000 Chinese …

📊 1 result
📏 Metrics: Accuracy

MathToF

Existing arithmetic benchmarks have a limited number of True-or-False questions. To address this gap, MathToF is created including 1,000 Chinese …

📊 1 result
📏 Metrics: Accuracy

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 result
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 result
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, the SARTAJ dataset, and Br35H. It contains 7022 …

📊 1 result
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 result
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 result
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 result
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 result
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 result
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 result
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 result
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 result
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 result
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 result
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 result
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 result
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 result
📏 Metrics: 1 shot Micro-F1

MixedWM38

The MixedWM38 (WaferMap) dataset has more than 38,000 wafer maps, including 1 normal pattern, 8 single-defect patterns, and 29 mixed-defect …

📊 1 result
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 result
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time (ms / 100 ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 result
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, originally collected for remote sensing scene classification. We …

📊 1 result
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 result
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOTS-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 result
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 result
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 result
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 result
📏 Metrics: Accuracy

TCGA

📊 1 result
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 result
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 result
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarges the dataset to understand how image background affects the computer vision ML model, with the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

Code Generation

APPS

The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS …

📊 18 results
📏 Metrics: Introductory Pass@1, Interview Pass@1, Competition Pass@1, Competition Pass@any, Interview Pass@any, Introductory Pass@any, Competition Pass@5, Interview Pass@5, Introductory Pass@5, Competition Pass@1000, Interview Pass@1000, Introductory Pass@1000, Pass@1

CONCODE

A new large dataset with over 100,000 examples consisting of Java classes from online code repositories, used to develop a new …

📊 2 results
📏 Metrics: Exact Match, BLEU, CodeBLEU

CoNaLa

The CMU CoNaLa (Code/Natural Language Challenge) dataset is a joint project of the Carnegie Mellon University NeuLab and Strudel groups.

📊 7 results
📏 Metrics: BLEU, Exact Match Accuracy

CoNaLa-Ext

CoNaLa Extended With Question Text is an extension of the original CoNaLa dataset proposed in …

📊 5 results
📏 Metrics: BLEU

CodeContests

CodeContests is a competitive programming dataset for machine learning. This dataset was used when training AlphaCode. It consists of programming problems, …

📊 8 results
📏 Metrics: Test Set pass@1, Test Set pass@5, Val Set pass@1, Val Set pass@5

DSEval-LeetCode

In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are …

📊 1 result
📏 Metrics: Pass Rate, w/o Intact, w/o PE

Django

The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development and 1,805 test annotations. Each …

📊 5 results
📏 Metrics: Accuracy, BLEU Score

FloCo

The FloCo dataset contains 11,884 flowchart images and their corresponding Python code.

📊 1 result
📏 Metrics: BLEU, CodeBLEU

HumanEval

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained …

📊 7 results
📏 Metrics: Pass@1
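The Pass@k metrics reported throughout this section (HumanEval, APPS, CodeContests) are usually computed with the unbiased estimator from the HumanEval paper: draw n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples per problem,
    c of which are correct, evaluated at budget k (k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The dataset-level score is the mean of this quantity over problems.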

HumanEval-ET

Extension test cases of HumanEval, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, …

📊 95 results
📏 Metrics: Accuracy

MBPP-ET

Extension test cases of MBPP, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

PECC

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving …

📊 8 results
📏 Metrics: Pass@3

RES-Q

RES-Q is a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks …

📊 9 results
📏 Metrics: pass@1

Shellcode_IA32

Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources; it is the largest collection of shellcodes …

📊 3 results
📏 Metrics: BLEU-4, Exact Match Accuracy

TACO-BAAI

TACO (Topics in Algorithmic Code generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more …

📊 3 results
📏 Metrics: easy pass@1

Turbulence

Turbulence is a new benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code …

📊 5 results
📏 Metrics: CorrSc

Verified Smart Contract Code Comments

Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both …

📊 2 results
📏 Metrics: BLEU score

VerilogEval

The VerilogEval dataset is a benchmark specifically designed to assess the ability of large language models (LLMs) …

📊 1 result
📏 Metrics: Pass Rate

WebApp1K-React

A test-driven benchmark challenging LLMs to write JavaScript React applications.

📊 8 results
📏 Metrics: pass@1

WebApp1k-Duo-React

A test-driven benchmark challenging LLMs to write long JavaScript React applications.

📊 6 results
📏 Metrics: pass@1

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 10 results
📏 Metrics: Execution Accuracy, Exact Match Accuracy
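WikiSQL's Execution Accuracy compares the result sets of the gold and predicted queries rather than their surface form, so semantically equivalent SQL is not penalized. A minimal per-example sketch using sqlite3 (the function name and the choice to count un-executable predictions as wrong are illustrative assumptions, not WikiSQL's official evaluator):

```python
import sqlite3

def execution_match(con: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    """Do the gold and predicted queries return the same rows
    (order-insensitive)? An un-executable prediction counts as wrong."""
    gold = con.execute(gold_sql).fetchall()
    try:
        pred = con.execute(pred_sql).fetchall()
    except sqlite3.Error:  # syntax error, unknown column, etc.
        return False
    return sorted(gold) == sorted(pred)
```

Exact Match Accuracy, by contrast, compares the (canonicalized) query strings or their logical forms directly.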

Common Sense Reasoning

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 1 result
📏 Metrics: Accuracy

CommonsenseQA

The CommonsenseQA is a dataset for commonsense question answering task. The dataset consists of 12,247 questions with 5 choices each. …

📊 38 results
📏 Metrics: Accuracy

PARus

Choice of Plausible Alternatives for Russian language (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense …

📊 6 results
📏 Metrics: Accuracy

RWSD

A Winograd schema is a pair of sentences that differ in only one or two words and that contain an …

📊 6 results
📏 Metrics: Accuracy

ReCoRD

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of …

📊 33 results
📏 Metrics: EM, F1
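ReCoRD, like several QA datasets later in this list, reports EM and token-level F1. The standard SQuAD-style implementation normalizes answers (lowercase, strip punctuation and English articles) before comparing; a per-example sketch (the dataset-level score additionally takes the max over gold answers and averages over examples):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```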

RuCoS

Russian reading comprehension with Commonsense reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of …

📊 6 results
📏 Metrics: Average F1, EM

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 5 results
📏 Metrics: Test, Dev

WinoGAViL

This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …

📊 1 result
📏 Metrics: Jaccard Index
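WinoGAViL scores a prediction by the Jaccard index between the predicted and gold sets of associated images, i.e. overlap divided by union:

```python
def jaccard(pred, gold) -> float:
    """Jaccard index |P ∩ G| / |P ∪ G| between two label sets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # convention: two empty sets agree perfectly
    return len(pred & gold) / len(pred | gold)
```

A partial match such as predicting {dog, bone} against gold {dog, cat} scores 1/3.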

WinoGrande

WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …

📊 73 results
📏 Metrics: Accuracy

Decision Making

NASA C-MAPSS

Engine degradation simulation was carried out using C-MAPSS. Four different sets were simulated under different combinations of operational conditions and …

📊 1 result
📏 Metrics: Average Remaining Cycles

Emotion Interpretation

EIBench

A benchmark for the Emotion Interpretation task.

📊 13 results
📏 Metrics: Recall

Error Understanding

CUB-200-2011

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of …

📊 4 results
📏 Metrics: Average highest confidence (ResNet-101), Insertion AUC score (ResNet-101), Average highest confidence (MobileNetV2), Insertion AUC score (MobileNetV2), Average highest confidence (EfficientNetV2-M), Insertion AUC score (EfficientNetV2-M)

Generative Visual Question Answering

PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset that contains 227k VQA pairs of 149k images that cover various modalities …

📊 3 results
📏 Metrics: BLEU-1

Identify Odd Metaphor

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Image Paragraph Captioning

Image Paragraph Captioning

The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an …

📊 4 results
📏 Metrics: BLEU-4, METEOR, CIDEr

Logical Reasoning

LingOly

This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems …

📊 11 results
📏 Metrics: Delta_NoContext, Exact Match Accuracy

RuWorldTree

RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts. Motivation The …

📊 4 results
📏 Metrics: Accuracy

Winograd Automatic

The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning. Motivation The dataset …

📊 4 results
📏 Metrics: Accuracy

Math Word Problem Solving

ALG514

514 algebra word problems and associated equation systems gathered from Algebra.com.

📊 1 result
📏 Metrics: Accuracy (%)

GSM-Plus

By perturbing the widely used GSM8K dataset, an adversarial dataset for grade-school math called GSM-Plus is created. Motivated by the …

📊 1 result
📏 Metrics: 1:1 Accuracy

MATH

MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution …

📊 132 results
📏 Metrics: Accuracy, Parameters (Billions)

MAWPS

MAWPS is an online repository of Math Word Problems, to provide a unified testbed to evaluate different algorithms. MAWPS allows …

📊 15 results
📏 Metrics: Accuracy (%)

Math23K

Math23K is a dataset created for math word problem solving; it contains 23,162 Chinese problems crawled from the Internet. Refer …

📊 12 results
📏 Metrics: Accuracy (5-fold), Accuracy (training-test), weakly-supervised

MathQA

MathQA significantly enhances the AQuA dataset with fully-specified operational programs. Source: [MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based …

📊 5 results
📏 Metrics: Answer Accuracy

ParaMAWPS

This repository contains the code, data, and models of the paper titled "Math Word Problem Solving by Generating Linguistic Variants …

📊 6 results
📏 Metrics: Accuracy (%)

SVAMP

A challenge set for elementary-level Math Word Problems (MWP). An MWP consists of a short Natural Language narrative that describes …

📊 23 results
📏 Metrics: Execution Accuracy, Accuracy

Mathematical Question Answering

GeoS

GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every …

📊 1 result
📏 Metrics: Accuracy (%)

Geometry3K

A new large-scale geometry problem-solving dataset - 3,002 multi-choice geometry problems - dense annotations in formal language for the diagrams …

📊 8 results
📏 Metrics: Accuracy (%)

Mathematical Reasoning

GeoQA

GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the …

📊 2 results
📏 Metrics: Accuracy (%)

PGPS9K

PGPS9K is a new large-scale plane geometry problem-solving dataset, labeled with both fine-grained diagram annotations and interpretable solution programs.

📊 6 results
📏 Metrics: Completion accuracy

Multi-Label Classification

CheXpert

The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is …

📊 11 results
📏 Metrics: Average AUC on 14 label, Num Rads Below Curve

ChestX-ray14

ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 …

📊 4 results
📏 Metrics: Average AUC on 14 label, Macro F1

MIMIC-CXR

MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies …

📊 1 result
📏 Metrics: Average AUC on 14 label

MLRSNet

MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It provides different perspectives of …

📊 2 results
📏 Metrics: F1-score

MRNet

The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. The dataset contains 1,104 (80.6%) …

📊 1 result
📏 Metrics: Average AUC, AUC on Abnormality (ABN), AUC on ACL Tear (ACL), AUC on Meniscus Tear (MEN), Average Accuracy, Accuracy on Abnormality (ABN), Accuracy on ACL Tear (ACL), Accuracy on Meniscus Tear (MEN)

NUS-WIDE

The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated …

📊 9 results
📏 Metrics: MAP

OpenImages-v6

OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. …

📊 4 results
📏 Metrics: mAP

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 16 results
📏 Metrics: mAP

Multimodal Reasoning

AlgoPuzzleVQA

We introduce the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new …

📊 1 result
📏 Metrics: Acc

MATH-V

Math-Vision (Math-V) dataset is a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math …

📊 4 results
📏 Metrics: Accuracy

REBUS

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data …

📊 8 results
📏 Metrics: Accuracy

Natural Language Inference

BioNLI

BioNLI is a dataset in biomedical natural language inference. This dataset contains abstracts from biomedical literature and mechanistic premises generated …

📊 1 result
📏 Metrics: Macro F1

CommitmentBank

The CommitmentBank is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment …

📊 20 results
📏 Metrics: Accuracy, F1

FarsTail

Natural Language Inference (NLI), also called Textual Entailment, is an important task in NLP with the goal of determining the …

📊 10 results
📏 Metrics: % Test Accuracy

HANS

The HANS (Heuristic Analysis for NLI Systems) dataset contains many examples where the heuristics fail. Source: [Right for the …

📊 1 result
📏 Metrics: 1:1 Accuracy

JamPatoisNLI

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource …

📊 2 results
📏 Metrics: Accuracy

KUAKE-QQR

KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for …

📊 1 result
📏 Metrics: Accuracy

KUAKE-QTR

KUAKE Query Title Relevance, a dataset used to estimate the relevance of the title of a query document, is used …

📊 1 result
📏 Metrics: Accuracy

LiDiRus

LiDiRus is a diagnostic dataset that covers a large volume of linguistic phenomena, while allowing you to evaluate information systems …

📊 6 results
📏 Metrics: MCC
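LiDiRus (like MixedWM38 earlier in this list) reports MCC, the Matthews correlation coefficient, which stays informative under class imbalance where plain accuracy does not. From a binary confusion matrix it can be computed as:

```python
from math import sqrt

def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from a binary confusion matrix.
    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally defined as 0 when any marginal is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```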

MED

MED is a new evaluation dataset that covers a wide range of monotonicity reasoning that was created by crowdsourcing and …

📊 1 result
📏 Metrics: 1:1 Accuracy

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 result
📏 Metrics: Acc

MedNLI

The MedNLI dataset consists of the sentence pairs developed by Physicians from the Past Medical History section of MIMIC-III clinical …

📊 5 results
📏 Metrics: Accuracy, Params (M)

MultiNLI

The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433K sentence pairs. Its size and mode of collection are modeled closely …

📊 63 results
📏 Metrics: Matched, Mismatched, Accuracy, Dev Matched, Dev Mismatched

Probability words NLI

This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP), e.g. words …

📊 1 result
📏 Metrics: 1:1 Accuracy

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 42 results
📏 Metrics: Accuracy

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 1 result
📏 Metrics: Accuracy

RCB

The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an …

📊 6 results
📏 Metrics: Average F1, Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 89 results
📏 Metrics: Accuracy

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 1 result
📏 Metrics: 1:1 Accuracy

SNLI

The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence-pairs manually labeled as entailment, contradiction, and neutral. Premises are …

📊 88 results
📏 Metrics: % Test Accuracy, % Train Accuracy, Parameters, Dev Accuracy, % Dev Accuracy, Accuracy

SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct …

📊 11 results
📏 Metrics: Accuracy, Dev Accuracy, % Dev Accuracy, % Test Accuracy

TERRa

Textual Entailment Recognition has been proposed recently as a generic task that captures major semantic inference needs across many NLP …

📊 6 results
📏 Metrics: Accuracy

TabFact

TabFact is a large-scale dataset which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables, their relations …

📊 1 result
📏 Metrics: Accuracy

WNLI

The WNLI dataset is a part of the GLUE benchmark used for Natural Language Inference (NLI). It contains pairs of …

📊 22 results
📏 Metrics: Accuracy

XWINO

XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense …

📊 1 result
📏 Metrics: Accuracy

e-SNLI

e-SNLI is used for various goals, such as obtaining full sentence justifications of a model's decisions, improving universal sentence representations …

📊 3 results
📏 Metrics: BLEU, Accuracy

Natural Language Visual Grounding

ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from various environments, including …

📊 18 results
📏 Metrics: Accuracy (%)

Odd One Out

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their …

📊 2 results
📏 Metrics: Accuracy

Program Repair

DeepFix

DeepFix is a program repair dataset (fixing compiler errors in C programs). It enables research around automatically fixing programming …

📊 4 results
📏 Metrics: Average Success Rate

GitHub-Python

Repair AST parse (syntax) errors in Python code

📊 2 results
📏 Metrics: Accuracy (%)

HumanEvalPack

HumanEvalPack is an extension of OpenAI's HumanEval to cover 6 total languages across 3 tasks. The evaluation suite is fully …

📊 1 result
📏 Metrics: Pass@1

Question Answering

AviationQA

AviationQA is introduced in the paper There is No Big Brother or Small Brother: Knowledge Infusion in Language Models …

📊 1 result
📏 Metrics: Hits@1

BBH

BIG-Bench Hard (BBH) is a subset of the BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a …

📊 1 result
📏 Metrics: Accuracy

BLURB

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …

📊 3 results
📏 Metrics: Accuracy

Bamboogle

The Bamboogle dataset is a collection of questions that was constructed to investigate the ability of language models to perform …

📊 9 results
📏 Metrics: Accuracy

BioASQ

BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), …

📊 6 results
📏 Metrics: Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring – they are …

📊 65 results
📏 Metrics: Accuracy

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 2 results
📏 Metrics: Accuracy

COPA

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. …

📊 55 results
📏 Metrics: Accuracy

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple-choice questions to identify the …

📊 3 results
📏 Metrics: Macro F1 (10-fold)

ChAII - Hindi and Tamil Question Answering

The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions …

📊 1 result
📏 Metrics: Jaccard

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK. Motivation The task can be …

📊 4 results
📏 Metrics: Accuracy

Children's Book Test

📊 8 results
📏 Metrics: Accuracy-CN, Accuracy-NE

CliCR

CliCR is a new dataset for domain-specific reading comprehension, with around 100,000 cloze queries constructed from clinical case …

📊 2 results
📏 Metrics: F1

CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure …

📊 9 results
📏 Metrics: In-domain, Out-of-domain, Overall

Complex-CronQuestions

A filtered version of CronQuestions which can better demonstrate the model’s inference ability for complex temporal questions.

📊 3 results
📏 Metrics: Hits@1

ComplexWebQuestions

ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set …

📊 1 result
📏 Metrics: EM

ConditionalQA

ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable …

📊 3 results
📏 Metrics: Conditional (answers), Conditional (w/ conditions), Overall (answers), Overall (w/ conditions)

ConvFinQA

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …

📊 3 results
📏 Metrics: Execution Accuracy

CronQuestions

CronQuestions, the temporal KGQA dataset, consists of two parts: a KG with temporal annotations, and a set of natural language …

📊 10 results
📏 Metrics: Hits@1
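Hits@1 (the fraction of questions whose top-ranked prediction is a correct answer) is the metric used by several temporal-KGQA entries here. A minimal illustrative sketch, not any benchmark's official scorer:

```python
def hits_at_k(ranked_predictions, gold_answers, k=1):
    """Fraction of questions with at least one gold answer among the top-k predictions.

    ranked_predictions: list of prediction lists, best-first, one per question.
    gold_answers: list of sets of acceptable answers, one per question.
    """
    hits = 0
    for preds, gold in zip(ranked_predictions, gold_answers):
        if any(p in gold for p in preds[:k]):
            hits += 1
    return hits / len(gold_answers)
```

Hits@10 (as reported for MultiTQ below) is the same computation with `k=10`.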

DROP

DROP (Discrete Reasoning Over Paragraphs) is a crowdsourced, adversarially-created, 96k-question benchmark in which a system must resolve references in a …

📊 6 results
📏 Metrics: Accuracy

DaNetQA

DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring: they are generated in unprompted and …

📊 6 results
📏 Metrics: Accuracy

DuoRC

DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in …

📊 3 results
📏 Metrics: Accuracy

EgoTaskQA

The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. It …

📊 4 results
📏 Metrics: Direct

FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …

📊 7 results
📏 Metrics: EM

FQuAD

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ …

📊 6 results
📏 Metrics: EM, F1

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …

📊 4 results
📏 Metrics: F1, Rouge-L

FinQA

FinQA is a new large-scale dataset with Question-Answering pairs over Financial reports, written by financial experts. The dataset contains 8,281 …

📊 6 results
📏 Metrics: Execution Accuracy, Program Accuracy

GraphQuestions

GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …

📊 1 result
📏 Metrics: Accuracy

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 1 result
📏 Metrics: Accuracy

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 22 results
📏 Metrics: JOINT-F1, ANS-EM, ANS-F1, SUP-EM, SUP-F1, JOINT-EM

HybridQA

A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and …

📊 3 results
📏 Metrics: ANS-EM

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on …

📊 1 result
📏 Metrics: Exact Match, F1

KQA Pro

A large-scale dataset for Complex KBQA. Source: [KQA Pro: A Large-Scale Dataset with Interpretable Programs and Accurate SPARQLs for Complex …

📊 1 result
📏 Metrics: Accuracy

MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively …

📊 1 result
📏 Metrics: Accuracy

MRQA

The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems. …

📊 2 results
📏 Metrics: Average F1

MS MARCO

The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: Rouge-L, BLEU-1

MapEval-API

MapEval-API contains 300 question-answer pairs. The task is to answer questions by fetching the necessary information using external Map APIs.

📊 2 results
📏 Metrics: Accuracy (%)

MapEval-Textual

MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer questions …

📊 1 result
📏 Metrics: Accuracy (%)

Mathematics Dataset

This dataset code generates mathematical question-and-answer pairs from a range of question types at roughly school-level difficulty. This …

📊 3 results
📏 Metrics: Accuracy

MedQA

Multiple-choice question answering based on the United States Medical Licensing Examination (USMLE). The dataset is collected from the professional …

📊 27 results
📏 Metrics: Accuracy

MetaQA

The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written …

📊 1 result
📏 Metrics: AnswerExactMatch (Question Answering)

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …

📊 4 results
📏 Metrics: EM, F1

MultiQ

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. …

📊 4 results
📏 Metrics: Accuracy

MultiRC

MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions, i.e., questions that can be answered by …

📊 30 results
📏 Metrics: F1, EM

MultiTQ

MULTITQ is a large-scale dataset featuring ample relevant facts and multiple temporal granularities.

📊 9 results
📏 Metrics: Hits@1, Hits@10

NExT-QA (Open-ended VideoQA)

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …

📊 6 results
📏 Metrics: Accuracy, Confidence Score

NarrativeQA

The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers. Source: …

📊 8 results
📏 Metrics: Rouge-L, BLEU-1, BLEU-4, METEOR

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 46 results
📏 Metrics: EM

NewsQA

The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs. * Documents are CNN news articles. …

📊 16 results
📏 Metrics: EM, F1

OTT-QA

The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to …

📊 3 results
📏 Metrics: ANS-EM

OpenBookQA

OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. …

📊 40 results
📏 Metrics: Accuracy

PIQA

PIQA is a dataset for commonsense reasoning, and was created to investigate the physical knowledge of existing models in NLP. …

📊 67 results
📏 Metrics: Accuracy

PeerQA

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …

📊 5 results
📏 Metrics: Prometheus-2 Answer Correctness, Rouge-L, AlignScore

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship …

📊 2 results
📏 Metrics: Accuracy

PubChemQA

PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 26 results
📏 Metrics: Accuracy

QASPER

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language …

📊 1 result
📏 Metrics: Token F1

QuAC

Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer …

📊 2 results
📏 Metrics: F1, HEQD, HEQQ

QuALITY

QuALITY (Question Answering with Long Input Texts, Yes!) is a multiple-choice question answering dataset for long document comprehension. The dataset …

📊 1 result
📏 Metrics: Accuracy

Quora Question Pairs

The Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 19 results
📏 Metrics: Accuracy

RACE

The ReAding Comprehension from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …

📊 6 results
📏 Metrics: RACE-m, RACE-h, RACE

ReClor

Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …

📊 3 results
📏 Metrics: Accuracy, Accuracy (easy), Accuracy (hard)

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from …

📊 1 result
📏 Metrics: Accuracy

RuOpenBookQA

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts. RuOpenBookQA …

📊 4 results
📏 Metrics: Accuracy

SCDE

SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model …

📊 1 result
📏 Metrics: BA, PA, DE

SIQA

Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus …

📊 24 results
📏 Metrics: Accuracy

SQA3D

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it is situated in from an …

📊 7 results
📏 Metrics: AnswerExactMatch (Question Answering)

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 2 results
📏 Metrics: Exact Match, F1
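The Exact Match and F1 metrics listed for SQuAD (and for several other extractive QA entries here) are typically computed along these lines; this is an illustrative sketch, not the official evaluation script, and the normalization (lowercasing, stripping punctuation and English articles) follows the common convention:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (common SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice, evaluation takes the maximum score over all gold answers for a question and averages over the dataset.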

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 1 result
📏 Metrics: Accuracy

SberQuAD

SberQuAD, a large-scale analogue of Stanford SQuAD in the Russian language, is a valuable resource that has not been …

📊 3 results
📏 Metrics: EM, F1

SchizzoSQUAD

The dataset was built from the “Mental Health” forum, dedicated to people suffering from schizophrenia and other mental disorders. Relevant posts …

📊 1 result
📏 Metrics: Average F1, Averaged Precision

SimpleQuestions

SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding …

📊 1 result
📏 Metrics: F1

StepGame

A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

📊 1 result
📏 Metrics: 1-of-100 Accuracy

StoryCloze

Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. …

📊 20 results
📏 Metrics: Accuracy

StrategyQA

StrategyQA is a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred …

📊 11 results
📏 Metrics: Accuracy, EM

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress of QA research …

📊 1 result
📏 Metrics: Exact Match (EM)

TIQ

Existing benchmarks for temporal QA focus on a single information source (either a KB or a text corpus), and include …

📊 9 results
📏 Metrics: P@1

TempQA-WD

TempQA-WD is a benchmark dataset for temporal reasoning designed to encourage research in extending the present approaches to target a …

📊 1 result
📏 Metrics: F1

TempQuestions

Here, we take a key step in this direction and release a new benchmark, TempQuestions, containing 1,271 questions, that are …

📊 4 results
📏 Metrics: Hits@1, F1

TimeQuestions

Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class …

📊 16 results
📏 Metrics: P@1

Torque

Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Source: …

📊 2 results
📏 Metrics: F1, EM, C

TrecQA

Text Retrieval Conference Question Answering (TrecQA) is a dataset created from the TREC-8 (1999) to TREC-13 (2004) Question Answering tracks. …

📊 12 results
📏 Metrics: MAP, MRR

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 51 results
📏 Metrics: EM, F1

TruthfulQA

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises …

📊 30 results
📏 Metrics: MC1, MC2, % true, % info, % true (GPT-judge), BLEURT, ROUGE, BLEU, EM, Accuracy

TweetQA

With social media becoming an increasingly popular medium on which news and real-time events are reported, developing automated question answering …

📊 3 results
📏 Metrics: BLEU-1, ROUGE-L

UniProtQA

UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 36 results
📏 Metrics: EM, F1

WebQuestionsSP

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …

📊 1 result
📏 Metrics: Accuracy

WebSRC

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …

📊 1 result
📏 Metrics: F1

WikiHop

WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …

📊 9 results
📏 Metrics: Test

WikiQA

The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain …

📊 23 results
📏 Metrics: MAP, MRR
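MAP and MRR, used by WikiQA and TrecQA above to score ranked answer sentences, can be sketched as follows. This is a minimal illustration, assuming each question's candidates are given as a 0/1 relevance list in ranked order; it is not the official trec_eval tooling:

```python
def reciprocal_rank(labels):
    """1/rank of the first relevant candidate (0 if none); labels are 0/1 in ranked order."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(labels):
    """Mean of precision@k taken at each relevant rank k."""
    hits, precisions = 0, []
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_mrr(ranked_label_lists):
    """Average AP and RR over all questions: (MAP, MRR)."""
    n = len(ranked_label_lists)
    mean_ap = sum(average_precision(l) for l in ranked_label_lists) / n
    mrr = sum(reciprocal_rank(l) for l in ranked_label_lists) / n
    return mean_ap, mrr
```

Questions with no relevant candidate are often excluded before computing these averages; conventions differ between evaluation setups.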

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 2 results
📏 Metrics: Exact Match (EM)

WikiTableQuestions

WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was …

📊 2 results
📏 Metrics: Accuracy, Accuracy (Test)

catbAbI LM-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: Accuracy (mean)

catbAbI QA-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: 1:1 Accuracy

Reconstruction

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 1 result
📏 Metrics: PSNR

CelebAMask-HQ

CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following …

📊 1 result
📏 Metrics: PSNR, R-FID

PPMI

The Parkinson’s Progression Markers Initiative (PPMI) dataset originates from an observational clinical and longitudinal study comprising evaluations of people with …

📊 1 result
📏 Metrics: runtime (s)

iDesigner

Fashion trends are constantly evolving, but a trained eye can estimate with some accuracy the signature elements of a particular …

📊 1 result
📏 Metrics: PSNR, R-FID

Robot Task Planning

PackIt

The ability to jointly understand the geometry of objects and plan actions for manipulating them is crucial for intelligent agents. …

📊 4 results
📏 Metrics: Average Reward

SheetCopilot

The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet manipulation tasks that are applied to these workbooks. These tasks …

📊 2 results
📏 Metrics: Pass@1

Video Question Answering

ActivityNet-QA

The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides …

📊 36 results
📏 Metrics: Accuracy, Confidence score

DramaQA

DramaQA focuses on two perspectives: 1) Hierarchical QAs as an evaluation metric based on the cognitive developmental stages of …

📊 1 result
📏 Metrics: Accuracy

How2QA

To collect How2QA for the video QA task, the same set of selected video clips was presented to another group of …

📊 7 results
📏 Metrics: Accuracy

IntentQA

We contribute the IntentQA dataset with diverse intents in daily social activities. We utilize NExT-QA as the source dataset to …

📊 4 results
📏 Metrics: Accuracy, CW, CH, TP&TN

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning, which consists of 10,000 …

📊 1 result
📏 Metrics: Accuracy

MSRVTT-MC

The MSRVTT-MC (Multiple Choice) dataset is a video question-answering dataset created based on the MSR-VTT dataset. It consists of 2,990 …

📊 7 results
📏 Metrics: Accuracy

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 14 results
📏 Metrics: Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 1 result
📏 Metrics: Accuracy

MVBench

MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was introduced to evaluate the comprehension capabilities of Multi-modal Large Language …

📊 22 results
📏 Metrics: Avg.

NExT-QA

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …

📊 47 results
📏 Metrics: Accuracy

OVBench

OVBench is a benchmark tailored for real-time video understanding:
- Memory, Perception, and Prediction of Temporal Contexts: Questions are framed …

📊 15 results
📏 Metrics: AVG

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos …

📊 6 results
📏 Metrics: Accuracy (Top-1)

RoadTextVQA

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition …

📊 2 results
📏 Metrics: Accuracy

STAR Benchmark

How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. …

📊 17 results
📏 Metrics: Average Accuracy

SUTD-TrafficQA

SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a dataset which takes the form of video …

📊 5 results
📏 Metrics: 1/4, 1/2

TGIF-QA

The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. …

📊 1 result
📏 Metrics: Accuracy

TVBench

TVBench is a new benchmark specifically created to evaluate temporal understanding in video QA. We identified three main issues in …

📊 28 results
📏 Metrics: Average Accuracy

TVQA

The TVQA dataset is a large-scale video dataset for video question answering. It is based on 6 popular TV shows …

📊 6 results
📏 Metrics: Accuracy

VLEP

VLEP contains 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog …

📊 1 result
📏 Metrics: Accuracy

WildQA

WildQA is a video understanding dataset of videos recorded in outdoor settings. The dataset can be used to evaluate models …

📊 5 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

iVQA

An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question …

📊 7 results
📏 Metrics: Accuracy

Video-based Generative Performance Benchmarking

VideoInstruct

The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs and employs a combination of …

📊 23 results
📏 Metrics: mean, Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency

Visual Reasoning

Bongard-OpenWorld

Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. We hope it can help us better …

📊 9 results
📏 Metrics: 2-Class Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 result
📏 Metrics: 1-of-100 Accuracy

NLVR

NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset …

📊 1 result
📏 Metrics: Accuracy (Dev), Accuracy (Test-P), Accuracy (Test-U)

VASR

Visual Analogies of Situation Recognition (VASR) is a dataset for visual analogical mapping, adapting the classical word-analogy task into the …

📊 4 results
📏 Metrics: 1:1 Accuracy

VSR

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial …

📊 5 results
📏 Metrics: accuracy

WinoGAViL

This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …

📊 8 results
📏 Metrics: Jaccard Index
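The Jaccard index used here (and for ChAII above) measures set overlap between the predicted and gold selections. A minimal sketch, not the official WinoGAViL scorer:

```python
def jaccard_index(predicted, gold):
    """|intersection| / |union| of two answer sets; two empty sets count as a perfect match."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)
```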

Winoground

Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two …

📊 105 results
📏 Metrics: Text Score, Image Score, Group Score