Machine Learning Benchmarks

Browse 1020 benchmarks across 203 tasks
Browse by Category

1 Image, 2*2 Stitchi

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0..5sec

2D Semantic Segmentation

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 1 results
📏 Metrics: mIoU

GF-PA66 3D XCT

Stack of 2D gray images of glass fiber-reinforced polyamide 66 (GF-PA66) 3D X-ray Computed Tomography (XCT) specimen. Usage: 2D/3D image …

📊 1 results
📏 Metrics: Jaccard (Mean)

WaterScenes

WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces. WaterScenes, the first …

📊 1 results
📏 Metrics: mIoU

WildScenes

WildScenes is a bi-modal benchmark dataset consisting of multiple large-scale, sequential traversals in natural environments, including semantic annotations in high-resolution …

📊 5 results
📏 Metrics: mIoU, mIoU (Temporal DA), mIoU (Env DA)

xBD

The xBD dataset contains over 45,000 km² of polygon-labeled pre- and post-disaster imagery. The dataset provides the post-disaster imagery …

📊 5 results
📏 Metrics: Weighted Average F1-score, Localization F1-score, Classification F1-score

3D Action Recognition

Assembly101

Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants …

📊 7 results
📏 Metrics: Actions Top-1, Verbs Top-1, Object Top-1

NTU RGB+D

NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected …

📊 3 results
📏 Metrics: Cross Subject Accuracy, Cross View Accuracy

AMR Graph Similarity

Benchmark for AMR Metrics based on Overt Objectives

Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity …

📊 7 results
📏 Metrics: Pearson’s ρ (amean), Spearman Correlation

AMR Parsing

Bio

This corpus includes annotations of cancer-related PubMed articles, covering 3 full papers (PMID:24651010, PMID:11777939, PMID:15630473) as well as the result …

📊 3 results
📏 Metrics: Smatch

LDC2017T10

Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University …

📊 23 results
📏 Metrics: Smatch

LDC2020T02

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University …

📊 9 results
📏 Metrics: Smatch

New3

New3, a set of 527 instances from AMR 3.0, whose original source was the LORELEI DARPA project – not included …

📊 2 results
📏 Metrics: Smatch

The Little Prince

This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were …

📊 2 results
📏 Metrics: Smatch

Abstractive Text Summarization

AESLC

To study the task of email subject line generation: automatically generating an email subject line from the email body. Source: …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN …

📊 3 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

WikiHow

WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base …

📊 1 results
📏 Metrics: Content F1, ROUGE-1, ROUGE-2, ROUGE-L

Action Parsing

JerichoWorld

JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive …

📊 3 results
📏 Metrics: Set accuracy

Age And Gender Classification

BN-AuthProf

Although research on author profiling has progressed considerably for resource-rich languages, it is still in its infancy for low-resource languages …

📊 1 results
📏 Metrics: F1 score

Arabic Text Diacritization

CATT

The CATT benchmark dataset comprises 742 sentences, which were scraped from an internet news source in 2023. It covers multiple …

📊 11 results
📏 Metrics: DER(%), WER (%)

Aspect Category Detection

SemEval-2014 Task-4

Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of …

📊 1 results
📏 Metrics: Average Recall, Hit@5, MRR, NDCG

Aspect Extraction

SemEval-2014 Task-4

Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of …

📊 5 results
📏 Metrics: Laptop (F1), Restaurant (F1), Mean F1 (Laptop + Restaurant)

Aspect-Based Sentiment Analysis (ABSA)

ACOS

Most of the aspect based sentiment analysis research aims at identifying the sentiment polarities toward some explicit aspect terms while …

📊 8 results
📏 Metrics: F1 (Laptop), F1 (Restaurant)

ASQP

Aspect-based sentiment analysis (ABSA) typically focuses on extracting aspects and predicting their sentiments on individual sentences such as customer reviews. …

📊 9 results
📏 Metrics: F1 (R15), F1 (R16)

ASTE

Target-based sentiment analysis or aspect-based sentiment analysis (ABSA) refers to addressing various sentiment analysis tasks at a fine-grained level, which …

📊 9 results
📏 Metrics: F1 (L14), F1 (R14), F1 (R15), F1 (R16)

MAMS

MAMS is a challenge dataset for aspect-based sentiment analysis (ABSA), in which each sentence contains at least two aspects with …

📊 1 results
📏 Metrics: Acc, Macro-F1

SemEval-2014 Task-4

Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of …

📊 32 results
📏 Metrics: Mean Acc (Restaurant + Laptop), Restaurant (Acc), Laptop (Acc)

TASD

Aspect-based sentiment analysis (ABSA) aims to detect the targets (which are composed by continuous words), aspects and sentiment polarities in …

📊 9 results
📏 Metrics: F1 (R15), F1 (R16)

Attribute Extraction

SWDE

This dataset is a real-world web page collection used for research on the automatic extraction of structured data (e.g., attribute-value …

📊 2 results
📏 Metrics: Avg F1

Attribute Mining

AE-110k

The dataset contains product information from AliExpress Sports & Entertainment category. Each attribute value in "Item Specific" is matched against …

📊 1 results
📏 Metrics: F1-score

MAVE

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It …

📊 1 results
📏 Metrics: F1-score

OA-Mine - annotations

The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The …

📊 1 results
📏 Metrics: F1-score

Attribute Value Extraction

AE-110k

The dataset contains product information from AliExpress Sports & Entertainment category. Each attribute value in "Item Specific" is matched against …

📊 2 results
📏 Metrics: F1-score

MAVE

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It …

📊 3 results
📏 Metrics: F1-score

OA-Mine - annotations

The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The …

📊 2 results
📏 Metrics: F1-score

WDC-PAVE

The dataset contains 1,420 human-annotated product offers, systematically selected from the Web Data Commons Product Matching Corpus, featuring 24,582 …

📊 5 results
📏 Metrics: F1-Score

Automated Essay Scoring

ASAP-AES

There are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range …

📊 4 results
📏 Metrics: Quadratic Weighted Kappa

Bias Detection

StereoSet

A large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion. Source: [StereoSet: …

📊 11 results
📏 Metrics: ICAT Score, LMS, SS

rt-inod-bias

The Innodata Red Teaming Prompts aims to rigorously assess models’ factuality and safety. This dataset, due to its manual creation …

📊 5 results
📏 Metrics: Best-of

Binary Classification

TII-SSRC-23

The TII-SSRC-23 dataset offers a comprehensive collection of network traffic patterns, meticulously compiled to support the development and research of …

📊 1 results
📏 Metrics: F1-Score

fake

[Real or Fake] : Fake Job Description Prediction This dataset contains 18K job descriptions out of which about 800 are …

📊 8 results
📏 Metrics: AUROC

kickstarter

Kickstarter is a community of more than 10 million people, comprising creative and tech enthusiasts who help in bringing creative …

📊 4 results
📏 Metrics: AUROC

Binary Condescension Detection

DPM

Don’t Patronize Me! (DPM) is an annotated dataset with Patronizing and Condescending Language towards vulnerable communities.

📊 2 results
📏 Metrics: F1-score

Binary text classification

TweepFake

The TweepFake dataset consists of 25,572 social media messages posted either by bots or humans on Twitter. Each bot imitated …

📊 2 results
📏 Metrics: F1 score, Accuracy (%)

CCG Supertagging

CCGbank

CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations …

📊 4 results
📏 Metrics: Accuracy

Chatbot

AlpacaEval

The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, hh-rlhf. Those were selected so that the AlpacaEval ranking …

📊 1 results
📏 Metrics: Average win rate

Chunking

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 5 results
📏 Metrics: F1 score

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 results
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, SARTAJ dataset, and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 results
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 results
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 results
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 results
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper "A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 results
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 results
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 results
📏 Metrics: 1 shot Micro-F1

MixedWM38

The MixedWM38 dataset (WaferMap) has more than 38,000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 results
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time( ms / 100ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 results
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, originally collected for remote sensing scene classification. We …

📊 1 results
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 results
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOT-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 results
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arxiv : https://arxiv.org/abs/2304.11708 Accepted at 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 results
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 results
📏 Metrics: Accuracy

TCGA

📊 1 results
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarges the dataset to understand how image backgrounds affect computer vision ML models, with the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

Clinical Assertion Status Detection

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

📊 1 results
📏 Metrics: Micro F1

Clinical Concept Extraction

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

📊 3 results
📏 Metrics: Exact Span F1

Code Generation

APPS

The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS …

📊 18 results
📏 Metrics: Introductory Pass@1, Interview Pass@1, Competition Pass@1, Competition Pass@any, Interview Pass@any, Introductory Pass@any, Competition Pass@5, Interview Pass@5, Introductory Pass@5, Competition Pass@1000, Interview Pass@1000, Introductory Pass@1000, Pass@1

CONCODE

A new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new …

📊 2 results
📏 Metrics: Exact Match, BLEU, CodeBLEU

CoNaLa

CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project of the Carnegie Mellon University NeuLab and Strudel.

📊 7 results
📏 Metrics: BLEU, Exact Match Accuracy

CoNaLa-Ext

The CoNaLa Extended With Question Text is an extension to the original CoNaLa Dataset proposed in …

📊 5 results
📏 Metrics: BLEU

CodeContests

CodeContests is a competitive programming dataset for machine-learning. This dataset was used when training AlphaCode. It consists of programming problems, …

📊 8 results
📏 Metrics: Test Set pass@1, Test Set pass@5, Val Set pass@1, Val Set pass@5

DSEval-LeetCode

In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are …

📊 1 results
📏 Metrics: Pass Rate, w/o Intact, w/o PE

Django

The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development, and 1,805 test annotations. Each …

📊 5 results
📏 Metrics: Accuracy, BLEU Score

FloCo

The FloCo dataset contains 11,884 flowchart images and their corresponding Python code.

📊 1 results
📏 Metrics: BLEU, CodeBLEU

HumanEval

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained …

📊 7 results
📏 Metrics: Pass@1

HumanEval-ET

Extension test cases of HumanEval, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, …

📊 95 results
📏 Metrics: Accuracy

MBPP-ET

Extension test cases of MBPP, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

PECC

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving …

📊 8 results
📏 Metrics: Pass@3

RES-Q

RES-Q is a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks …

📊 9 results
📏 Metrics: pass@1

Shellcode_IA32

Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources and is the largest collection of shellcodes …

📊 3 results
📏 Metrics: BLEU-4, Exact Match Accuracy

TACO-BAAI

TACO (Topics in Algorithmic Code generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more …

📊 3 results
📏 Metrics: easy pass@1

Turbulence

Turbulence is a new benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code …

📊 5 results
📏 Metrics: CorrSc

Verified Smart Contract Code Comments

Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both …

📊 2 results
📏 Metrics: BLEU score

VerilogEval

The VerilogEval Dataset is a benchmark specifically designed to assess the ability of large language models (LLMs) …

📊 1 results
📏 Metrics: Pass Rate

WebApp1K-React

Test-driven benchmark to challenge LLMs to write JavaScript React applications.

📊 8 results
📏 Metrics: pass@1

WebApp1k-Duo-React

Test-driven benchmark to challenge LLMs to write long JavaScript React applications.

📊 6 results
📏 Metrics: pass@1

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 10 results
📏 Metrics: Execution Accuracy, Exact Match Accuracy

Common Sense Reasoning

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 1 results
📏 Metrics: Accuracy

CommonsenseQA

CommonsenseQA is a dataset for the commonsense question answering task. The dataset consists of 12,247 questions with 5 choices each. …

📊 38 results
📏 Metrics: Accuracy

PARus

Choice of Plausible Alternatives for Russian language (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense …

📊 6 results
📏 Metrics: Accuracy

RWSD

A Winograd schema is a pair of sentences that differ in only one or two words and that contain an …

📊 6 results
📏 Metrics: Accuracy

ReCoRD

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of …

📊 33 results
📏 Metrics: EM, F1

RuCoS

Russian reading comprehension with Commonsense reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of …

📊 6 results
📏 Metrics: Average F1, EM

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 5 results
📏 Metrics: Test, Dev

WinoGAViL

This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, …

📊 1 results
📏 Metrics: Jaccard Index

WinoGrande

WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the …

📊 73 results
📏 Metrics: Accuracy

Conditional Text Generation

Lipogram-e

This is a dataset of 3 English books which do not contain the letter "e" in them. This dataset includes …

📊 4 results
📏 Metrics: Ignored Constraint Error Rate

Constituency Parsing

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 22 results
📏 Metrics: F1 score

Continual Learning

20Newsgroup (10 tasks)

This dataset has 20 classes and each class has about 1000 documents. The data split for train/validation/test is 1600/200/200. We …

📊 6 results
📏 Metrics: F1 - macro

AIDS

AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral …

📊 1 results
📏 Metrics: 1:3 Accuracy

DSC (10 tasks)

A set of 10 DSC datasets (reviews of 10 products) to produce sequences of tasks. The products are Sports, Toys, …

📊 6 results
📏 Metrics: F1 - macro

F-CelebA (10 tasks)

F-CelebA - This dataset is adapted from federated learning. Federated learning is an emerging machine learning paradigm with an emphasis …

📊 6 results
📏 Metrics: Acc

MLT17

📊 1 results
📏 Metrics: Acc

Permuted MNIST

Permuted MNIST is an MNIST variant that consists of 70,000 images of handwritten digits from 0 to 9, where 60,000 …

📊 3 results
📏 Metrics: Average Accuracy, MLP Hidden Layers-width, Pretrained/Transfer Learning, BWT

Conversational Question Answering

ConvFinQA

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …

📊 2 results
📏 Metrics: Execution Accuracy, Program Accuracy

Conversational Response Selection

Advising Corpus

Advising Corpus is a dataset based on an entirely new collection of dialogues in which university students are being advised …

📊 1 results
📏 Metrics: R@1, R@10, R@50

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 10 results
📏 Metrics: MAP, MRR, P@1, R10@1, R10@2, R10@5

E-commerce

We release E-commerce Dialogue Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 11 results
📏 Metrics: R10@1, R10@2, R10@5

RRS

| | Train | Validation | Test | Ranking Test | | --------- | ----- | ---------- | ------- | …

📊 4 results
📏 Metrics: MAP, MRR, P@1, R10@1, R10@2, R10@5

RRS Ranking Test

| | Train | Validation | Test | Ranking Test | | --------- | ----- | ---------- | ------- | …

📊 3 results
📏 Metrics: NDCG@3, NDCG@5

Ubuntu IRC

The Ubuntu IRC dataset is a valuable resource for research in natural language understanding and dialogue systems.

📊 3 results
📏 Metrics: Accuracy

Conversational Web Navigation

WebLINX

WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad …

📊 17 results
📏 Metrics: Overall score, Intent Match, Element (IoU), Text (F1)

Coreference Resolution

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 3 results
📏 Metrics: Avg. F1

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Avg F1

GAP

GAP (Gendered Ambiguous Pronouns) is a gender-balanced dataset of ambiguous pronoun-name pairs sampled from Wikipedia, used to evaluate coreference resolution and its gender bias …

📊 4 results
📏 Metrics: Overall F1, Masculine F1 (M), Feminine F1 (F), Bias (F/M), F1

LitBank

LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the …

📊 2 results
📏 Metrics: Avg F1, F1

OntoGUM

OntoGUM is an OntoNotes-like coreference dataset converted from GUM, an English corpus covering 12 genres using deterministic rules.

📊 2 results
📏 Metrics: Avg F1

PreCo

A large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as …

📊 2 results
📏 Metrics: F1

Quizbowl

Consists of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity …

📊 1 results
📏 Metrics: F1

WikiCoref

WikiCoref is an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Source: …

📊 3 results
📏 Metrics: F1

Croatian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Cross-Lingual Transfer

XCOPA

The Cross-lingual Choice of Plausible Alternatives (XCOPA) dataset is a benchmark to evaluate the ability of machine learning models to …

📊 6 results
📏 Metrics: Accuracy

Cross-Modal Retrieval

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five …

📊 1 results
📏 Metrics: Text-to-image Medr

ChEBI-20

The dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 5 results
📏 Metrics: Hits@1, Hits@10, Mean Rank, Test MRR

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 23 results
📏 Metrics: Image-to-text R@1, Image-to-text R@5, Image-to-text R@10, Text-to-image R@1, Text-to-image R@5, Text-to-image R@10

MSCOCO

📊 1 results
📏 Metrics: Image-to-text R@1

RSICD

The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for the remote sensing image captioning task. It contains more than …

📊 7 results
📏 Metrics: Mean Recall, Image-to-text R@1, text-to-image R@1

RSITMD

📊 7 results
📏 Metrics: Image-to-text R@1, Mean Recall, text-to-image R@1

Recipe1M+

Recipe1M+ is a dataset which contains one million structured cooking recipes with 13M associated images. Source: [Recipe1M+: A Dataset for …

📊 2 results
📏 Metrics: Image-to-text R@1, Text-to-image R@1

SoundingEarth

SoundingEarth consists of co-located aerial imagery and audio samples all around the world.

📊 2 results
📏 Metrics: Median Rank, Image-to-sound R@100, Sound-to-image R@100

Czech Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Data Augmentation

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 5 results
📏 Metrics: Percentage error

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the …

📊 17 results
📏 Metrics: Accuracy (%)

Data-free Knowledge Distillation

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 4 results
📏 Metrics: Accuracy

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 4 results
📏 Metrics: Exact Match

Data-to-Text Generation

AMR3.0

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University …

📊 1 results
📏 Metrics: BLEU

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 1 results
📏 Metrics: BLEU, METEOR, FactSpotter

E2E

End-to-End NLG Challenge (E2E) aims to assess whether recent end-to-end NLG systems can generate more complex output by learning from …

📊 2 results
📏 Metrics: METEOR

GenWiki

GenWiki is a large-scale dataset for knowledge graph-to-text (G2T) and text-to-knowledge graph (T2G) conversion. It is introduced in the paper …

📊 1 results
📏 Metrics: BLEU

MLB Dataset

A new dataset on the baseball domain. Source: Data-to-text Generation with Entity Modeling

📊 4 results
📏 Metrics: BLEU

RotoWire

This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores. Summaries taken from rotowire.com …

📊 5 results
📏 Metrics: BLEU

ToTTo

ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a …

📊 6 results
📏 Metrics: BLEU, PARENT, METEOR

ViGGO

The ViGGO corpus is a set of 6,900 meaning representation to natural language utterance pairs in the video game domain. …

📊 2 results
📏 Metrics: BLEU

WebNLG

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in …

📊 15 results
📏 Metrics: BLEU, METEOR, Number of parameters (M), FactSpotter, BLEU-4, ROUGE-L

WikiOFGraph

We introduce WikiOFGraph, a novel large-scale, domain-diverse dataset synthesized by LLMs, ensuring …

📊 1 results
📏 Metrics: BLEU

Wikipedia Person and Animal Dataset

This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions, based on a Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

📊 1 results
📏 Metrics: BLEU

XAlign

It consists of an extensive collection of a high quality cross-lingual fact-to-text dataset in 11 languages: Assamese (as), Bengali (bn), …

📊 6 results
📏 Metrics: BLEU-4, METEOR

Deep Clustering

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 1 results
📏 Metrics: NMI

USPS

USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 …

📊 1 results
📏 Metrics: NMI

Dependency Parsing

Chinese Treebank

📊 1 results
📏 Metrics: LAS, UAS

CoNLL-2009

The task builds on the CoNLL-2008 task and extends it to multiple languages. The core of the task is to …

📊 2 results
📏 Metrics: LAS, UAS

DaNE

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme. …

📊 1 results
📏 Metrics: LAS, UAS

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 19 results
📏 Metrics: LAS, UAS, POS

Tweebank

📊 2 results
📏 Metrics: Labelled Attachment Score, Unlabeled Attachment Score
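
The two attachment scores used throughout this task differ only in whether the dependency label must match. A minimal sketch, assuming each token is represented as a `(head_index, relation_label)` pair and that every token is scored (published evaluations often exclude punctuation, which this simplified version does not do; the function name and sample parse are illustrative):

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for aligned lists of (head_index, relation_label) pairs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the relation label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return uas_hits / n, las_hits / n


# Hypothetical four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS.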

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The …

📊 4 results
📏 Metrics: LAS, UAS, BLEX

Description-guided molecule generation

TOMG-Bench

In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation …

📊 25 results
📏 Metrics: wAcc

Dialogue Evaluation

USR-PersonaChat

This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper, USR: An Unsupervised and Reference …

📊 4 results
📏 Metrics: Spearman Correlation, Pearson Correlation

USR-TopicalChat

This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper, USR: An Unsupervised and Reference …

📊 5 results
📏 Metrics: Spearman Correlation, Pearson Correlation

Dialogue Generation

FusedChat

FusedChat is an inter-mode dialogue dataset. It contains dialogue sessions fusing task-oriented dialogues (TOD) and open-domain dialogues (ODD). Based on …

📊 2 results
📏 Metrics: Slot Accuracy, Joint SA, Inform, Inform_mct, Success, Success_mct, BLEU, PPL, Sensibleness, Specificity, SSA

Harry Potter Dialogue Dataset

Harry Potter Dialogue is the first dialogue dataset that integrates with scene, attributes and relations which are dynamically changed as …

📊 2 results
📏 Metrics: MAUVE

PG-19

A new open-vocabulary language modelling benchmark derived from books. Source: Compressive Transformers for Long-Range Sequence Modelling

📊 1 results
📏 Metrics: Perplexity

Dialogue Rewriting

CANARD

CANARD is a dataset for question-in-context rewriting that consists of questions each given in a dialog context together with a …

📊 1 results
📏 Metrics: BLEU

Discourse Parsing

Instructional-DT (Instr-DT)

This discourse treebank includes annotated instructional texts originally assembled at the Information Technology Research Institute, University of Brighton. This dataset …

📊 12 results
📏 Metrics: Standard Parseval (Nuclearity), Standard Parseval (Span), Standard Parseval (Full), Standard Parseval (Relation)

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …

📊 4 results
📏 Metrics: Link & Rel F1, Link F1

RST-DT

The Rhetorical Structure Theory (RST) Discourse Treebank consists of 385 Wall Street Journal articles from the Penn Treebank annotated with …

📊 27 results
📏 Metrics: Standard Parseval (Full), Standard Parseval (Span), Standard Parseval (Nuclearity), Standard Parseval (Relation), RST-Parseval (Full), RST-Parseval (Span), RST-Parseval (Nuclearity), RST-Parseval (Relation)

Document AI

EPHOIE

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE …

📊 1 results
📏 Metrics: Average F1

Document Classification

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 6 results
📏 Metrics: Accuracy

HOC

The Hallmarks of Cancer (HOC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks …

📊 5 results
📏 Metrics: F1, Micro F1

Hyperpartisan News Detection

Hyperpartisan News Detection was a dataset created for PAN @ SemEval 2019 Task 4. Given a news article text, decide …

📊 1 results
📏 Metrics: Accuracy

LUN

LUN is used for unreliable news source classification. The dataset includes 17,250 satire, propaganda, and hoax articles.

📊 1 results
📏 Metrics: Accuracy

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 6 results
📏 Metrics: Accuracy, F1

Document Ranking

DaReCzech

DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated …

📊 3 results
📏 Metrics: P@10

Document Summarization

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: ROUGE-1

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers. Source: [A Discourse-Aware Attention Model for Abstractive Summarization of …

📊 1 results
📏 Metrics: ROUGE-2

Document Text Classification

Tobacco-3482

The Tobacco-3482 dataset consists of document images belonging to 10 classes such as letter, form, email, resume, memo, etc. The …

📊 3 results
📏 Metrics: Accuracy, Training time (hours)

Document-level Relation Extraction

Bc8

Bc8BioRED is built upon BioRED 2022 with the addition of directionality annotations. The training and development sets from the original …

📊 1 results
📏 Metrics: Evaluation Macro F1

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 1 results
📏 Metrics: F1

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Relation F1

Re-DocRED

The Re-DocRED Dataset resolved the following problems of DocRED: 1. Resolved the incompleteness problem by supplementing large amounts of relation …

📊 1 results
📏 Metrics: F1

Emotion Classification

CAER-Dynamic

13,201 clips from 79 TV shows. Each video clip was manually annotated with six emotion categories, including “anger”, “disgust”, “fear”, …

📊 1 results
📏 Metrics: Accuracy

CMU-MOSEI

CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) is the largest dataset of sentence-level sentiment analysis and emotion recognition in …

📊 4 results
📏 Metrics: Accuracy, Weighted Accuracy

MFA

The MFA (Many Faces of Anger) dataset includes 200 in-the-wild videos from North American and Persian cultures with fine-grained labels …

📊 2 results
📏 Metrics: F-F1 score (Comb.), F-F1 score (Persian), V-F1 score (Comb.), V-F1 score (NA), F-F1 score (NA), V-F1 score (Persian)

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

📊 2 results
📏 Metrics: F1

Emotion Recognition

EMOTIC

The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …

📊 2 results
📏 Metrics: Top-3 Accuracy (%)

Emomusic

1000 songs have been selected from the Free Music Archive (FMA). The excerpts which were annotated are available in the same …

📊 5 results
📏 Metrics: EmoA, EmoV

FER2013

FER2013 contains approximately 30,000 facial RGB images of different expressions with size restricted to 48×48, and the main labels of …

📊 1 results
📏 Metrics: 5-class test accuracy

MSP-Podcast

The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …

📊 1 results
📏 Metrics: Concordance correlation coefficient (CCC)

RAVDESS

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …

📊 2 results
📏 Metrics: Accuracy, WAR

SEED

The SEED dataset contains subjects' EEG signals recorded while they were watching film clips. The film clips are carefully selected so …

📊 1 results
📏 Metrics: Accuracy

Emotional Intelligence

EQ-Bench

This dataset contains benchmark scores for EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language …

📊 24 results
📏 Metrics: EQ-Bench Score

Entity Alignment

DBP1M FR-EN

A large-scale cross-lingual dataset for entity alignment

📊 2 results
📏 Metrics: Hit@1

DBP2.0 zh-en

The DBP2.0 dataset can be downloaded from the figshare repository. It has three entity alignment settings, i.e., ZH-EN, JA-EN and …

📊 2 results
📏 Metrics: dangling entity detection F1, Entity Alignment (Consolidated) F1

Entity Disambiguation

AQUAINT

The AQUAINT Corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic …

📊 5 results
📏 Metrics: Micro-F1

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Avg F1

Mewsli-9

A large new multilingual dataset for multilingual entity linking. Source: Entity Linking in 100 Languages

📊 2 results
📏 Metrics: Micro Precision

Entity Linking

AIDA/testc

AIDA/testc is a new challenging test set for entity linking systems containing 131 Reuters news articles published between December 5th …

📊 2 results
📏 Metrics: Micro-F1 strong

EC-FUNSD

EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark of semantic entity recognition (SER) and entity linking (EL), designed for the …

📊 8 results
📏 Metrics: F1

FIGER

The FIGER dataset is an entity recognition dataset where entities are labelled using a fine-grained system of 112 tags, such as person/doctor, …

📊 1 results
📏 Metrics: Accuracy, Macro F1, Micro F1

FUNSD

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary …

📊 6 results
📏 Metrics: F1

GUM

GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include: * Multiple …

📊 1 results
📏 Metrics: F1

MedMentions

MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical …

📊 1 results
📏 Metrics: Accuracy, Recall@64

REBEL

Wikipedia abstracts automatically annotated with WikiData entities and relations that are entailed by the text. Over 9 million triplets.

📊 1 results
📏 Metrics: Micro-F1

Rare Diseases Mentions in MIMIC-III

The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv. The data …

📊 2 results
📏 Metrics: F1

WiC-TSV

WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense …

📊 6 results
📏 Metrics: Task 1 Accuracy: all, Task 1 Accuracy: general purpose, Task 1 Accuracy: domain specific, Task 2 Accuracy: all, Task 2 Accuracy: general purpose, Task 2 Accuracy: domain specific, Task 3 Accuracy: all, Task 3 Accuracy: general purpose, Task 3 Accuracy: domain specific

Entity Resolution

Abt-Buy

The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from …

📊 9 results
📏 Metrics: F1 (%)

Amazon-Google

The Amazon-Google dataset for entity resolution derives from the online retailers Amazon.com and the product search service of Google accessible …

📊 12 results
📏 Metrics: F1 (%)

WDC Products

WDC Products is an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three …

📊 1 results
📏 Metrics: F1 (%)

Entity Typing

DocRED-IE

The DocRED Information Extraction (DocRED-IE) dataset extends the DocRED dataset for the Document-level Closed Information Extraction (DocIE) task. DocRED-IE is …

📊 1 results
📏 Metrics: Avg F1

FIGER

The FIGER dataset is an entity recognition dataset where entities are labelled using a fine-grained system of 112 tags, such as person/doctor, …

📊 1 results
📏 Metrics: Macro F1, Micro F1

Open Entity

The Open Entity dataset is a collection of about 6,000 sentences with fine-grained entity types annotations. The entity types are …

📊 13 results
📏 Metrics: F1

Event Extraction

GENIA

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. …

📊 1 results
📏 Metrics: F1

Explanation Generation

CLEVR-X

CLEVR-X is a dataset that extends the CLEVR dataset with natural language explanations in the context of VQA. It consists …

📊 2 results
📏 Metrics: B4, M, RL, C, Acc

VCR

Visual Commonsense Reasoning (VCR) is a large-scale dataset for cognition-level visual understanding. Given a challenging question about an image, machines …

📊 2 results
📏 Metrics: Human Explanation Rating

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 7 results
📏 Metrics: Human (%), Accuracy

e-SNLI-VE

e-SNLI-VE is a large VL (vision-language) dataset with NLEs (natural language explanations) with over 430k instances for which the explanations …

📊 2 results
📏 Metrics: Human Explanation Rating

Extractive Text Summarization

DebateSum

DebateSum consists of 187328 debate documents, arguments (also can be thought of as abstractive summaries, or queries), word-level extractive summaries, …

📊 3 results
📏 Metrics: ROUGE-L

GovReport

GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by …

📊 2 results
📏 Metrics: Avg. Test Rouge1, Avg. Test Rouge2, Avg. Test RougeLsum

Extreme Summarization

CiteSum

CiteSum is a large-scale scientific extreme summarization benchmark.

📊 9 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

TLDR9+

TLDR9+ is a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum. This dataset is …

📊 4 results
📏 Metrics: RG-1(%), RG-2(%), RG-L(%)

XSum

The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create …

📊 1 results
📏 Metrics: METEOR

Fact Selection

ArgSciChat

ArgSciChat is an argumentative dialogue dataset. It consists of 498 messages collected from 41 dialogues on 20 scientific papers. It …

📊 4 results
📏 Metrics: Fact-F1

Fact Verification

FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …

📊 7 results
📏 Metrics: Accuracy, FEVER

Fake News Detection

COVID-19 Fake News Dataset

Along with the COVID-19 pandemic, we are also fighting an "infodemic". Fake news and rumors are rampant on social media. Believing …

📊 1 results
📏 Metrics: F1

FNC-1

FNC-1 was designed as a stance detection dataset and it contains 75,385 labeled headline and article pairs. The pairs are …

📊 9 results
📏 Metrics: Weighted Accuracy, Per-class Accuracy (Unrelated), Per-class Accuracy (Agree), Per-class Accuracy (Disagree), Per-class Accuracy (Discuss)

LIAR

LIAR is a publicly available dataset for fake news detection. 12.8K manually labeled short statements spanning a decade were collected …

📊 4 results
📏 Metrics: Test Accuracy, Validation Accuracy

RAWFC

RAWFC was constructed from scratch by collecting the claims from Snopes and relevant raw reports by retrieving claim …

📊 6 results
📏 Metrics: F1

Weibo NER

The Weibo NER dataset is a Chinese Named Entity Recognition dataset drawn from the social media website Sina Weibo. Source: …

📊 1 results
📏 Metrics: Accuracy

Few-Shot Learning

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple choice questions to identify the …

📊 1 results
📏 Metrics: Accuracy

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 4 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy, Harmonic mean

EuroSAT

EuroSAT is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 1 results
📏 Metrics: Harmonic mean

Large COVID-19 CT scan slice dataset

"We built a large lung CT scan dataset for COVID-19 by curating data from 7 public datasets listed in the …

📊 1 results
📏 Metrics: AUC-ROC, Accuracy, Macro F1, Macro Precision, Macro Recall, Micro Precision, Specificity

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 1 results
📏 Metrics: Acc

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 results
📏 Metrics: F1-score

MedConceptsQA

MedConceptsQA is an open-source medical concepts QA benchmark. The benchmark can be found here: https://huggingface.co/datasets/ofir408/MedConceptsQA

📊 12 results
📏 Metrics: Accuracy

MedNLI

The MedNLI dataset consists of the sentence pairs developed by Physicians from the Past Medical History section of MIMIC-III clinical …

📊 1 results
📏 Metrics: Accuracy

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 2 results
📏 Metrics: Accuracy

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 1 results
📏 Metrics: Harmonic mean

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 3 results
📏 Metrics: 4-shot Accuracy, 8-shot Accuracy, 12-shot Accuracy, 16-shot Accuracy

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 1 results
📏 Metrics: Harmonic mean

Few-Shot Text Classification

RAFT

The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. …

📊 9 results
📏 Metrics: Avg, ADE, B77, NIS, OSE, Over, SOT, SRI, TAI, ToS, TEH, TC

SST-5

The SST-5, also known as the Stanford Sentiment Treebank with 5 labels, is a dataset used for sentiment analysis. The …

📊 1 results
📏 Metrics: Accuracy

French Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

GSM8K

GSM8K

GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. …

📊 3 results
📏 Metrics: Accuracy, 0-shot MRR

GermEval2024 Shared Task 1 Subtask 1

GerMS-AT

This dataset contains 7984 user comments from an Austrian online newspaper. The comments have been annotated by 4 or more …

📊 1 results
📏 Metrics: Macro F1

GermEval2024 Shared Task 1 Subtask 2

GerMS-AT

This dataset contains 7984 user comments from an Austrian online newspaper. The comments have been annotated by 4 or more …

📊 1 results
📏 Metrics: Jensen-Shannon distance

Grammatical Error Correction

FCGEC

A fine-grained corpus for detecting, identifying, and correcting Chinese grammatical errors, collected mainly from multi-choice questions in …

📊 1 results
📏 Metrics: exact match, F0.5

JFLEG

JFLEG is for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language …

📊 6 results
📏 Metrics: GLEU

MuCGEC

MuCGEC is a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three …

📊 1 results
📏 Metrics: F0.5

UA-GEC

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

📊 3 results
📏 Metrics: F0.5

WI-LOCNESS

WI-LOCNESS is part of the Building Educational Applications 2019 Shared Task for Grammatical Error Correction. It consists of two datasets: …

📊 1 results
📏 Metrics: F0.5
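
F0.5, the metric reported for most of the corpora above, is the F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of the formula, assuming edit-level counts of true positives, false positives, and false negatives are already available (the function name and the sample counts are illustrative; extracting and aligning edits is a separate step not shown here):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from counts; beta < 1 favors precision, beta > 1 favors recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system: 6 correct edits, 2 spurious edits, 6 missed edits.
score = f_beta(tp=6, fp=2, fn=6)          # precision 0.75, recall 0.5
balanced = f_beta(tp=6, fp=2, fn=6, beta=1.0)
```

With precision above recall, as in this example, F0.5 comes out higher than F1, reflecting the preference for systems that propose fewer but more reliable corrections.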

Graph-to-Sequence

WebNLG

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in …

📊 1 results
📏 Metrics: BLEU

Handwriting Verification

AND Dataset

The AND Dataset contains 13700 handwritten samples and 15 corresponding expert examined features for each sample. The dataset is released …

📊 1 results
📏 Metrics: Average F1

CEDAR Signature

CEDAR Signature is a database of off-line signatures for signature verification. Each of 55 individuals contributed 24 signatures thereby creating …

📊 2 results
📏 Metrics: FAR

Hate Speech Detection

DKhate

A corpus for offensive language and hate speech detection in Danish. The DKhate dataset contains 3600 comments from the web …

📊 1 results
📏 Metrics: F1

HatEval

Hate Speech is commonly defined as any communication that disparages a person or a group on the basis of some …

📊 2 results
📏 Metrics: Macro F1

HateMM

Hate speech has become one of the most significant issues in modern society, with implications in both the online and …

📊 2 results
📏 Metrics: TEST F1 (macro)

HateXplain

Covers multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly …

📊 11 results
📏 Metrics: AUROC, Macro F1, Accuracy, Macro-F1

OLID

The OLID is a hierarchical dataset to identify the type and the target of offensive texts in social media. The …

📊 1 results
📏 Metrics: Macro F1

SHAJ

This is an abusive/offensive language detection dataset for Albanian. The data is formatted following the OffensEval convention. Data is from …

📊 1 results
📏 Metrics: F1

ToLD-Br

The Toxic Language Detection for Brazilian Portuguese (ToLD-Br) is a dataset with tweets in Brazilian Portuguese annotated according to different …

📊 2 results
📏 Metrics: F1-score

Hope Speech Detection

KanHope

KanHope is a code mixed hope speech dataset for equality, diversity, and inclusion in Kannada, an under-resourced Dravidian language. The …

📊 1 results
📏 Metrics: F1-score (Weighted)

Hungarian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Image Captioning

BanglaLekhaImageCaptions

This dataset consists of images and annotations in Bengali. The images are human annotated in Bengali by two adult native …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr, METEOR, ROUGE-L, SPICE

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 16 results
📏 Metrics: CIDEr, BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE, ROUGE-L

COCO Captions

COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …

📊 40 results
📏 Metrics: BLEU-4, CIDEr, METEOR, SPICE, ROUGE-L, BLEU-1, BLEU-2, BLEU-3, CLIPScore

ChEBI-20

Dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test splits. The goal of the task is to retrieve the relevant …

📊 1 results
📏 Metrics: BLEU, Exact, Levenshtein, MACCS FTS, Morgan FTS, RDK FTS, Validity

Conceptual Captions

Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content …

📊 2 results
📏 Metrics: CIDEr, ROUGE-L, SPICE

FlickrStyle10K

FlickrStyle10K is collected and built on Flickr30K image caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and …

📊 1 results
📏 Metrics: BLEU-1 (Romantic)

IU X-Ray

IU X-ray (Demner-Fushman et al., 2016) is a set of chest X-ray images paired with their corresponding diagnostic reports. The …

📊 1 results
📏 Metrics: CIDEr

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe …

📊 1 results
📏 Metrics: CIDEr

MSCOCO

📊 1 results
📏 Metrics: BLEU-4

Object HalBench

Object HalBench is a benchmark used to evaluate the performance of Language Models, particularly those that are multimodal (i.e., they …

📊 3 results
📏 Metrics: chair_i, chair_s

Peir Gross

Peir Gross (Jing et al., 2018) was collected with descriptions in the Gross sub-collection from PEIR digital library, resulting in …

📊 1 results
📏 Metrics: CIDEr, METEOR, ROUGE-L

SCICAP

SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than …

📊 9 results
📏 Metrics: BLEU-4

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 6 results
📏 Metrics: BLEU-4, CIDEr

Image Generation

ARKitScenes

ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide …

📊 4 results
📏 Metrics: FID, FID (SwAV)

Binarized MNIST

A binarized version of MNIST. Source: Binarized MNIST

📊 10 results
📏 Metrics: nats, bits/dimension
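
The two units listed above express the same quantity on different scales. A minimal sketch of the conversion, assuming the nats figure is a total negative log-likelihood per image and that each image has 28×28 = 784 dimensions (the function name and the sample value are illustrative, not a leaderboard result):

```python
import math


def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats to bits per dimension."""
    if num_dims <= 0:
        raise ValueError("num_dims must be positive")
    # Divide by ln(2) to change nats to bits, then by the number of dimensions.
    return nll_nats / (num_dims * math.log(2))


# Hypothetical model scoring 80 nats per 28x28 image.
bpd = nats_to_bits_per_dim(80.0, 28 * 28)
```

Results reported in one unit can only be compared with results in the other after this conversion, and only when both refer to the same dimensionality.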

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 72 results
📏 Metrics: FID, IS, NFE

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 6 results
📏 Metrics: FID, Inception Score, Model Size (MB)

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 6 results
📏 Metrics: FID-5k-training-steps

CelebA

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels …

📊 1 results
📏 Metrics: bpd (8-bits)

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. Source: [IntroVAE: Introspective …

📊 1 results
📏 Metrics: FLD

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 6 results
📏 Metrics: FID-10k-training-steps

FFHQ

Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity …

📊 12 results
📏 Metrics: FID, Clean-FID (70k), FID-10k-training-steps

Fashion-MNIST

Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per …

📊 5 results
📏 Metrics: FID, Precision, Recall

KMNIST

📊 1 results
📏 Metrics: FID

LLVIP

Visible-infrared Paired Dataset for Low-light Vision: 30976 images (15488 pairs); 24 dark scenes, 2 daytime scenes; …

📊 1 results
📏 Metrics: PSNR, SSIM

LSUN

The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN …

📊 1 results
📏 Metrics: Average FID

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 11 results
📏 Metrics: bits/dimension, FID, Precision, Recall, PSNR, SSIM

MetFaces

MetFaces is an image dataset of human faces extracted from works of art. The dataset consists of 1336 high-quality PNG …

📊 3 results
📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

Multi-dSprites

📊 1 results
📏 Metrics: FID

NASA Perseverance

Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.

📊 1 results
📏 Metrics: MAE Signature, MAE log-signature, RMSE Signature, RMSE log-signature

ObjectsRoom

The ObjectsRoom dataset is based on the MuJoCo environment used by the Generative Query Network [4] and is a multi-object …

📊 3 results
📏 Metrics: FID

RC-49

RC-49 is a benchmark dataset for generating images conditional on a continuous scalar variable. It is made by rendering 49 …

📊 2 results
📏 Metrics: Intra-FID

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 4 results
📏 Metrics: FID, FID (SwAV)

SDSS Galaxies

This is a dataset of 306,006 galaxies whose coordinates are taken from the Sloan Digital Sky Survey Data Release 7 …

📊 1 results
📏 Metrics: FID

STL-10

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or …

📊 25 results
📏 Metrics: FID, Inception score, Model Size (MB), Recall, NFE

ShapeStacks

A simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives richly annotated regarding semantics and …

📊 3 results
📏 Metrics: FID

Stacked MNIST

The Stacked MNIST dataset is derived from the standard MNIST dataset with an increased number of discrete modes. 240,000 RGB …

📊 2 results
📏 Metrics: FID, Inception score

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 4 results
📏 Metrics: FID, Inception score

Stanford Dogs

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into …

📊 4 results
📏 Metrics: FID, Inception score

TextAtlasEval

A dense-text image benchmark for evaluating large generative models' ability to generate text.

📊 4 results
📏 Metrics: TextVisionBlend OCR (F1 Score), TextVisionBlend OCR (Accuracy), TextVisionBlend OCR (Cer), TextVisionBlend FID, TextVisionBlend Clip Score, StyledTextSynth OCR (F1 Score), StyledTextSynth OCR (Accuracy), StyledTextSynth OCR (Cer), StyledTextSynth FID, StyledTextSynth Clip Score, TextScenesHQ OCR (F1 Score), TextScenesHQ OCR (Accuracy), TextScenesHQ OCR (Cer), TextScenesHQ FID, TextScenesHQ Clip Score

VLN-CE

Vision and Language Navigation in Continuous Environments (VLN-CE) is an instruction-guided navigation task with crowdsourced instructions, realistic environments, and unconstrained …

📊 4 results
📏 Metrics: FID, FID (SwAV)

VizDoom

ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode …

📊 4 results
📏 Metrics: FID, FID (SwAV)

WISE

WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models …

📊 13 results
📏 Metrics: Overall, Cultural, Time, Space, Biology, Physics, Chemistry

Image-to-Text Retrieval

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 9 results
📏 Metrics: Recall@1, Recall@5, Recall@10

FETA Car-Manuals

FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA Car-Manuals dataset …

📊 1 results
📏 Metrics: R@1, R@10, R@5

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 11 results
📏 Metrics: Recall@1, Recall@5, Recall@10, Recall@Sum

RSICD

The Remote Sensing Image Captioning Dataset (RSICD) is a dataset for remote sensing image captioning task. It contains more than …

📊 1 results
📏 Metrics: Image to Text Recall@1

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 7 results
📏 Metrics: Specificity

Information Extraction

SemTabNet

This dataset accompanies the following paper: Statements: Universal Information Extraction from Tables with …

📊 1 results
📏 Metrics: average Tree Similarity Score

Information Retrieval

BSARD

The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native corpus for studying statutory article retrieval. BSARD consists of …

📊 3 results
📏 Metrics: Recall@100, Recall@200, Recall@500

CQADupStack

CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question …

📊 2 results
📏 Metrics: mAP@100

MS MARCO

The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 3 results
📏 Metrics: Time (ms), MRR@10

MSLR-WEB30K

The MSLR-WEB30K dataset consists of 30,000 search queries over the documents from search results. The data also contains the values …

📊 1 results
📏 Metrics: nDCG@10

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 1 results
📏 Metrics: nDCG@10

Instruction Following

IFEval

This dataset evaluates the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an …

📊 4 results
📏 Metrics: Inst-level loose-accuracy, Inst-level strict-accuracy, Prompt-level loose-accuracy, Prompt-level strict-accuracy

Intent Classification

KUAKE-QIC

KUAKE Query Intent Classification, a dataset for intent classification, is used for the KUAKE-QIC task. Given the queries of search …

📊 1 results
📏 Metrics: Accuracy

MASSIVE

MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks …

📊 3 results
📏 Metrics: Intent Accuracy

ORCAS-I

A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct …

📊 1 results
📏 Metrics: F1-score, Precision, Recall

SLURP

A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets. …

📊 5 results
📏 Metrics: Accuracy (%)

Intent Detection

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 13 results
📏 Metrics: Accuracy

BANKING77

Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents …

📊 2 results
📏 Metrics: Accuracy (%)

CAIS

We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS), and annotate them with slot tags and intent labels. The …

📊 1 results
📏 Metrics: Acc

CLINC150

This dataset is for evaluating the performance of intent classification systems in the presence of "out-of-scope" queries, i.e., queries that …

📊 1 results
📏 Metrics: Accuracy (%)

Dialogue State Tracking Challenge

The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art …

📊 1 results
📏 Metrics: Accuracy

HWU64

This project contains natural language data for human-robot interaction in the home domain, which we collected and annotated for evaluating NLU …

📊 1 results
📏 Metrics: Accuracy (%)

MixATIS

This dataset is constructed from the single-intent dataset ATIS. It is a publicly available multi-intent dataset, which can be downloaded …

📊 10 results
📏 Metrics: Accuracy

MixSNIPS

This dataset is constructed from the single-intent dataset SNIPS. It is a publicly available multi-intent dataset, which can be downloaded …

📊 11 results
📏 Metrics: Accuracy, f1 macro

ProSLU

In the paper, to bridge the research gap, we propose a new and important task, Profile-based Spoken Language Understanding (ProSLU), …

📊 1 results
📏 Metrics: Accuracy

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 8 results
📏 Metrics: Accuracy

Intent Discovery

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 1 results
📏 Metrics: ARI

Persian-ATIS

PATIS is a Persian-language dataset for intent detection and slot filling.

📊 1 results
📏 Metrics: ARI

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 1 results
📏 Metrics: ARI

Intrusion Detection

20NewsGroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 1 results
📏 Metrics: Actions Top-1 (S2)

UNSW-NB15

UNSW-NB15 is a network intrusion dataset. It contains nine different attack types, including DoS, worms, backdoors, and fuzzers. The dataset contains …

📊 1 results
📏 Metrics: AUC

Irish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

KG-to-Text Generation

AGENDA

Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper …

📊 6 results
📏 Metrics: BLEU

ENT-DESC

ENT-DESC involves retrieving abundant knowledge of various types of main entities from a large knowledge graph (KG), which makes the …

📊 1 results
📏 Metrics: BLEU

EventNarrative

EventNarrative is a knowledge graph-to-text dataset from publicly available open-world knowledge graphs. EventNarrative consists of approximately 230,000 graphs and their …

📊 8 results
📏 Metrics: BLEU, METEOR, ROUGE, BertScore, CIDEr, ChrF++

PathQuestion

Adopts two subsets of Freebase (Bollacker et al., 2008) as Knowledge Bases to construct the PathQuestion (PQ) and the PathQuestion-Large …

📊 5 results
📏 Metrics: BLEU, METEOR, ROUGE

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 5 results
📏 Metrics: BLEU, METEOR, ROUGE

WikiGraphs

WikiGraphs is a dataset of Wikipedia articles each paired with a knowledge graph, to facilitate the research in conditional text …

📊 4 results
📏 Metrics: Test perplexity, rBLEU (Test), rBLEU (Valid), rBLEU(w/title)(Test), rBLEU(w/title)(Valid)

Key Information Extraction

CORD

OCR is inevitably linked to NLP since its final output is in text. Advances in document intelligence are driving the …

📊 9 results
📏 Metrics: F1

EPHOIE

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE …

📊 1 results
📏 Metrics: Average F1

ETD500

The paper used 500 scanned Electronic Theses and Dissertation cover pages (i.e., front pages). The dataset contains several intermediate datasets, …

📊 1 results
📏 Metrics: F1 (%)

Kleister NDA

Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long …

📊 3 results
📏 Metrics: F1

SIMARA

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids …

📊 1 results
📏 Metrics: F1 (%)

SROIE

Consists of a dataset with 1000 whole scanned receipt images and annotations for the competition on scanned receipts OCR and …

📊 5 results
📏 Metrics: F1, Accuracy

Keyphrase Extraction

Inspec

Paper: Improved automatic keyword extraction given more linguistic knowledge. DOI: 10.3115/1119355.1119383

📊 3 results
📏 Metrics: F1@10

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for …

📊 7 results
📏 Metrics: Recall, F1@10

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases. Source: [KPTimes: A Large-Scale Dataset for Keyphrase Generation …

📊 4 results
📏 Metrics: Recall, F1@10

Krapivin

A dataset for benchmarking keyphrase extraction and generation techniques from long document English scientific papers. The dataset has high quality …

📊 2 results
📏 Metrics: F1@10

NUS

The dataset was constructed by first finding suitable publications and then collecting keyphrases from manual annotators. Google SOAP API was …

📊 1 results
📏 Metrics: F1@10

Keyword Extraction

Inspec

Paper: Improved automatic keyword extraction given more linguistic knowledge. DOI: 10.3115/1119355.1119383

📊 5 results
📏 Metrics: F1 score, Precision@10, Recall@10

SemEval-2017 Task-10

We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding …

📊 5 results
📏 Metrics: F1 score, Precision@10, Recall@10

Knowledge Base Population

LM-KBC 2023

A diverse set of 21 relations, each covering a different set of subject-entities and a complete list of ground truth …

📊 1 results
📏 Metrics: F1

Knowledge Distillation

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 27 results
📏 Metrics: Top-1 Accuracy (%)

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 4 results
📏 Metrics: box AP, mask AP, mAP

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 1 results
📏 Metrics: AP

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 50 results
📏 Metrics: Top-1 accuracy %, model size, CRD training setting

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile …

📊 1 results
📏 Metrics: RMSE, model size

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 2 results
📏 Metrics: mAP

Knowledge Graph Completion

DBP-5L (English)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

📊 3 results
📏 Metrics: MRR

DBP-5L (Greek)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

📊 3 results
📏 Metrics: MRR

DBP-5L (French)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

📊 3 results
📏 Metrics: MRR

FB15k-237

FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …

📊 3 results
📏 Metrics: Hits@10, Hits@1, Hits@3, MR, MRR

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …

📊 2 results
📏 Metrics: Hits@3, Hits@1, Hits@10

Language Identification

Nordic Language Identification

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine-learning …

📊 1 results
📏 Metrics: Accuracy

OpenSubtitles

OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles …

📊 1 results
📏 Metrics: Accuracy

Universal Dependencies

The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The …

📊 1 results
📏 Metrics: Accuracy

VOXLINGUA107

Language Identification Dataset

📊 2 results
📏 Metrics: Error rate

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …

📊 1 results
📏 Metrics: Accuracy

Language Modelling

2000 HUB5 English

2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English …

📊 1 results
📏 Metrics: 10-stage average accuracy

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: BPB

Books3

The Books3 dataset emerged as part of a broader effort to train AI models for natural language understanding and generation. …

📊 1 results
📏 Metrics: BPB

C4

C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. …

📊 9 results
📏 Metrics: Perplexity, TPUv3 Hours, Steps

Curation Corpus

The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. Source: …

📊 1 results
📏 Metrics: BPB

FreeLaw

Free Law Project is a leading nonprofit organization that aims to make the legal ecosystem more equitable and competitive through …

📊 1 results
📏 Metrics: BPB

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes …

📊 18 results
📏 Metrics: Bit per Character (BPC), Number of params

LAMBADA

The LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) benchmark is an open-ended cloze task which consists of about …

📊 34 results
📏 Metrics: Accuracy, Perplexity

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 12 results
📏 Metrics: eval_perplexity, eval_loss, parameters

PhilPapers

PhilPapers is a resource for the philosophical community. It is an …

📊 1 results
📏 Metrics: BPB

PubMed Cognitive Control Abstracts

A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.

📊 1 results
📏 Metrics: BPB

SALMon

The SALMon dataset and benchmark was introduced in the paper "A Suite for Acoustic Language Model Evaluation", with the goal …

📊 8 results
📏 Metrics: Sentiment Consistency, Speaker Consistency, Gender Consistency, Background (Domain) Consistency, Background (Random) Consistency, Room Consistency, Sentiment Alignment, Background Alignment

Text8

📊 22 results
📏 Metrics: Bit per Character (BPC), Number of params

The Pile

The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets …

📊 39 results
📏 Metrics: Bits per byte, Test perplexity

VietMed

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …

📊 2 results
📏 Metrics: PPL

Wiki-40B

A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families containing round …

📊 3 results
📏 Metrics: Perplexity

WikiText-103

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good …

📊 83 results
📏 Metrics: Test perplexity, Validation perplexity, Number of params

WikiText-2

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good …

📊 34 results
📏 Metrics: Test perplexity, Validation perplexity, Number of params

language-modeling-recommendation

This is the BIG-bench version of our language-based movie recommendation dataset (https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/movie_recommendation). GPT-2 reaches 48.8% accuracy; chance is 25%.

📊 1 results
📏 Metrics: 1:1 Accuracy

Latvian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Linguistic Acceptability

CoLA

The Corpus of Linguistic Acceptability (CoLA) consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by …

📊 42 results
📏 Metrics: Accuracy, MCC

DaLAJ

DaLAJ 1.0, a dataset for Linguistic Acceptability Judgments for Swedish, comprising 9,596 sentences in its first version; and the initial …

📊 1 results
📏 Metrics: Accuracy, MCC

ItaCoLA

ItaCoLA is a corpus for monolingual and cross-lingual acceptability judgments which contains almost 10,000 sentences with acceptability judgments.

📊 4 results
📏 Metrics: MCC, Accuracy

RuCoLA

The Russian Corpus of Linguistic Acceptability (RuCoLA) is built from the ground up under the well-established binary LA approach. RuCoLA …

📊 9 results
📏 Metrics: MCC, Accuracy

Link Prediction

ACM

The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB, which are divided into three classes (Database, …

📊 1 results
📏 Metrics: AP, AUC

AbstRCT - Neoplasm

The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated …

📊 1 results
📏 Metrics: F1

Aristo-v4

The Aristo Tuple KB contains a collection of high-precision, domain-targeted (subject,relation,object) tuples extracted from text using a high-precision extraction pipeline, …

📊 1 results
📏 Metrics: Hits@1, Hits@10, Hits@3, MRR

CDCP

The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of …

📊 1 results
📏 Metrics: F1

COLLAB

COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators …

📊 1 results
📏 Metrics: Hits

Citeseer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 …

📊 12 results
📏 Metrics: AUC, AP, Accuracy, ACC

CoDEx Large

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …

📊 6 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

CoDEx Medium

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …

📊 7 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

CoDEx Small

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph …

📊 6 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 …

📊 11 results
📏 Metrics: AUC, AP, Accuracy, ACC

DBLP

The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and …

📊 3 results
📏 Metrics: AUC, AP

DRI Corpus

The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of …

📊 1 results
📏 Metrics: F1

Decagon

Bio-decagon is a dataset for polypharmacy side effect identification problem framed as a multirelational link prediction problem in a two-layer …

📊 2 results
📏 Metrics: AUROC, AUPRC, mAP@50

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

📊 2 results
📏 Metrics: AUC

FB122

📊 4 results
📏 Metrics: HITS@3, Hits@5, Hits@10, MRR

FB15k

The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of …

📊 10 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10, MR, MRR raw, Hits@5

FB15k-237

FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …

📊 70 results
📏 Metrics: Hits@1, Hits@3, Hits@10, MRR, MR, training time (s), Hit@1, Hit@10

GDELT

The GDELT Project monitors the world by analyzing global news from various sources. Here are …

📊 10 results
📏 Metrics: MRR

GO21

GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate …

📊 1 results
📏 Metrics: Hit@1, Hits@10, Hits@3, MRR

KG20C

KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a …

📊 1 results
📏 Metrics: MRR, Hits@1, Hits@3, Hits@10

NELL-995

NELL-995 KG Completion Dataset

📊 3 results
📏 Metrics: Hits@1, Hits@10, MRR, Mean AP, HITS@3

PPI

protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to …

📊 1 results
📏 Metrics: AP, AUC, Accuracy

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 13 results
📏 Metrics: AUC, AP, Accuracy, ACC

SINS

SINS is a database of continuous real-life audio recordings in a home environment. The home is a vacation home and …

📊 1 results
📏 Metrics: Scaled time-delay embeddings

TSP/HCP Benchmark set

This is a benchmark set for Traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. …

📊 4 results
📏 Metrics: F1

UMLS

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding …

📊 9 results
📏 Metrics: Hits@10, MR

WN18

The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triplets. It was found …

📊 33 results
📏 Metrics: Hits@10, Hits@3, Hits@1, MRR, MR, training time (s)

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …

📊 69 results
📏 Metrics: Hits@10, Hits@3, Hits@1, MRR, MR

Wiki

📊 1 results
📏 Metrics: AUC

Wikidata5M

Wikidata5m is a million-scale knowledge graph dataset with aligned corpus. This dataset integrates the Wikidata knowledge graph and Wikipedia pages. …

📊 12 results
📏 Metrics: MRR, Hits@10, Hits@1, Hits@3

YAGO3-10

YAGO3-10 is a benchmark dataset for knowledge base completion. It is a subset of YAGO3 (which itself is an extension of …

📊 17 results
📏 Metrics: Hits@1, Hits@3, Hits@10, MRR, MR

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

📊 8 results
📏 Metrics: HR@10, AUC, nDCG@10

Long-Context Understanding

L-Eval

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a …

📊 4 results
📏 Metrics: Average Score

LongBench

📊 3 results
📏 Metrics: Average Score

MMNeedle

We introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we …

📊 11 results
📏 Metrics: 1 Image, 4*4 Stitching, Exact Accuracy; 1 Image, 8*8 Stitching, Exact Accuracy; 1 Image, 2*2 Stitching, Exact Accuracy; 10 Images, 1*1 Stitching, Exact Accuracy; 10 Images, 2*2 Stitching, Exact Accuracy; 10 Images, 4*4 Stitching, Exact Accuracy; 10 Images, 8*8 Stitching, Exact Accuracy

Machine Translation

ACES

ACES is a dataset consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based …

📊 21 results
📏 Metrics: Score

Alexa Point of View

The Alexa Point of View dataset is a point-of-view conversion dataset, a parallel corpus of messages spoken to a …

📊 1 results
📏 Metrics: BLEU

FLoRes-200

FLoRes-200 doubles the existing language coverage of FLoRes-101. Given the nature of the new languages, which have less standardization and …

📊 5 results
📏 Metrics: BLEU

IWSLT 2017

The IWSLT 2017 translation dataset.

📊 1 results
📏 Metrics: BLEU score

Itihasa

Itihasa is a large-scale corpus for Sanskrit to English translation containing 93,000 pairs of Sanskrit shlokas and their English translations. …

📊 2 results
📏 Metrics: SacreBLEU

Multi Lingual Bug Reports

The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, …

📊 1 results
📏 Metrics: BERTScore

OpenSubtitles

OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles …

📊 1 results
📏 Metrics: BLEU score, METEOR

Math Information Retrieval

ARQMath

The goal of ARQMath is to advance techniques for mathematical information retrieval, in particular, retrieving answers to mathematical questions (Task …

📊 2 results
📏 Metrics: P@10, MAP, NDCG, bpref

Mathematical Reasoning

GeoQA

GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the …

📊 2 results
📏 Metrics: Accuracy (%)

PGPS9K

A new large-scale plane geometry problem solving dataset called PGPS9K, labeled with both fine-grained diagram annotations and interpretable solution programs.

📊 6 results
📏 Metrics: Completion accuracy

Meeting Summarization

AMI Meeting Corpus

The AMI Meeting Corpus is a multi-modal data set comprising 100 hours of meeting recordings. It has been meticulously curated …

📊 1 results
📏 Metrics: ROUGE-1 F1

ICSI Meeting Corpus

ICSI Meeting Corpus in JSON format.

📊 1 results
📏 Metrics: ROUGE-1 F1

Meme Classification

Hateful Memes

The Hateful Memes data set is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new …

📊 17 results
📏 Metrics: ROC-AUC, Accuracy

MultiOFF

Introduced in "Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text".

📊 4 results
📏 Metrics: Accuracy, F1

Tamil Memes

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among …

📊 2 results
📏 Metrics: Micro-F1

Memex Question Answering

MemexQA

A large, realistic multimodal dataset consisting of real personal photos and crowd-sourced questions/answers. Source: MemexQA: Visual Memex Question Answering

📊 1 results
📏 Metrics: Accuracy

Morpheme Segmentation

UniMorph 4.0

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology in the world’s languages. …

📊 3 results
📏 Metrics: macro avg (subtask 1), f1 macro avg (subtask 2), lev dist (subtask 2)

Multi-Label Text Classification

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …

📊 4 results
📏 Metrics: Precision, Recall, F1, Accuracy, mAP

Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China

This data is for the Mis2-KDD 2021 under review paper: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of …

📊 1 results
📏 Metrics: 1:1 Accuracy, F1 - macro, Micro F1

MIMIC-III

The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. …

📊 3 results
📏 Metrics: AUC, Macro F1, Macro Precision, Macro Recall, Micro Precision, Micro Recall, Micro-F1, P@5, Precision, Recall

RCV1

The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles produced by Reuters …

📊 1 results
📏 Metrics: Macro-F1, Micro-F1

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary …

📊 5 results
📏 Metrics: Micro-F1

Multi-agent Integration

BBAI Dataset

This dataset is for evaluating the task of Black-box Multi-agent Integration which focuses on combining the capabilities of multiple black-box …

📊 1 results
📏 Metrics: P@1

Multi-label Condescension Detection

DPM

Don’t Patronize Me! (DPM) is an annotated dataset with Patronizing and Condescending Language towards vulnerable communities.

📊 2 results
📏 Metrics: Macro-F1

Multimodal Machine Translation

Multi30K

Multi30K is a large-scale multilingual multimodal dataset for interdisciplinary machine learning research. It extends the Flickr30K dataset with German translations …

📊 14 results
📏 Metrics: BLEU (EN-DE), BLEU (DE-EN), Meteor (EN-DE), Meteor (EN-FR)

Multimodal Text Prediction

MultiSubs

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. …

📊 1 results
📏 Metrics: Accuracy, Word similarity

Multimodal Text and Image Classification

CD18

📊 1 results
📏 Metrics: Accuracy, F-measure (%)

Named Entity Recognition (NER)

ACE 2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic …

📊 9 results
📏 Metrics: F1, Multi-Task Supervision

ACE 2005

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic …

📊 19 results
📏 Metrics: F1

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. A significant …

📊 1 results
📏 Metrics: NER Macro F1

BC2GM

Created by Smith et al. in 2008, the BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are …

📊 11 results
📏 Metrics: F1

BC4CHEMD

Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles BC4CHEMD is a …

📊 3 results
📏 Metrics: F1

BC5CDR

BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Source: https://www.ncbi.nlm.nih.gov/research/bionlp/Data/ Image …

📊 14 results
📏 Metrics: F1

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. …

📊 3 results
📏 Metrics: F1

CMeEE

Chinese Medical Named Entity Recognition, a dataset first released in CHIP2020, is used for the CMeEE task. Given a pre-defined schema, …

📊 2 results
📏 Metrics: F1, Micro F1

CORD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER …

📊 4 results
📏 Metrics: F1

CoNLL++

CoNLL++ is a corrected version of the CoNLL03 NER dataset where 5.38% of the test sentences have been fixed. Source: …

📊 11 results
📏 Metrics: F1

CoNLL-2020

A test dataset of articles from 2020, annotated following the CoNLL-2003 NER task.

📊 2 results
📏 Metrics: F1

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 3 results
📏 Metrics: F1-Hard

DaNE

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme. …

📊 1 results
📏 Metrics: Micro-average F1

FUNSD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER …

📊 4 results
📏 Metrics: F1

FindVehicle

The first NER dataset in the field of traffic, which aims to extract the characteristics and attributes of the vehicle …

📊 3 results
📏 Metrics: F1 Score, F1

GENIA

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. …

📊 12 results
📏 Metrics: F1

HiNER-collapsed

This dataset releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed …

📊 2 results
📏 Metrics: F1-score (Weighted)

HiNER-original

This dataset releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.

📊 2 results
📏 Metrics: F1-score (Weighted)

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created …

📊 15 results
📏 Metrics: F1

LINNAEUS

LINNAEUS is a general-purpose dictionary matching software, capable of processing multiple types of document formats in the biomedical domain (MEDLINE, …

📊 4 results
📏 Metrics: F1

NCBI Disease

The NCBI Disease corpus consists of 793 PubMed abstracts, which are separated into training (593), development (100) and test (100) …

📊 1 results
📏 Metrics: F1

NEMO-Corpus

Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, …

📊 1 results
📏 Metrics: F1

OntoNotes 5.0

OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk …

📊 2 results
📏 Metrics: Average F1, Micro F1

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 13 results
📏 Metrics: F1 (%), label-F1 (%), Text model

SciERC

The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts …

📊 6 results
📏 Metrics: F1

Species-800

Species-800 is a corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that …

📊 3 results
📏 Metrics: F1

WNUT 2017

This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis …

📊 20 results
📏 Metrics: F1, F1 (surface form), Precision, Recall

WNUT 2020

The training and development dataset for our task was taken from previous work on wet lab corpus (Kulkarni et al., …

📊 2 results
📏 Metrics: F1, Precision, Recall

i2b2 De-identification Dataset

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

📊 1 results
📏 Metrics: F1, Precision

Natural Language Inference

BioNLI

BioNLI is a dataset in biomedical natural language inference. This dataset contains abstracts from biomedical literature and mechanistic premises generated …

📊 1 results
📏 Metrics: Macro F1

CommitmentBank

The CommitmentBank is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment …

📊 20 results
📏 Metrics: Accuracy, F1

FarsTail

Natural Language Inference (NLI), also called Textual Entailment, is an important task in NLP with the goal of determining the …

📊 10 results
📏 Metrics: % Test Accuracy

HANS

The HANS (Heuristic Analysis for NLI Systems) dataset contains many examples where the heuristics fail. Source: [Right for the …

📊 1 results
📏 Metrics: 1:1 Accuracy

JamPatoisNLI

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource …

📊 2 results
📏 Metrics: Accuracy

KUAKE-QQR

KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for …

📊 1 results
📏 Metrics: Accuracy

KUAKE-QTR

KUAKE Query Title Relevance, a dataset used to estimate the relevance of the title of a query document, is used …

📊 1 results
📏 Metrics: Accuracy

LiDiRus

LiDiRus is a diagnostic dataset that covers a large volume of linguistic phenomena, while allowing you to evaluate information systems …

📊 6 results
📏 Metrics: MCC

MED

MED is a new evaluation dataset that covers a wide range of monotonicity reasoning that was created by crowdsourcing and …

📊 1 results
📏 Metrics: 1:1 Accuracy

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 1 results
📏 Metrics: Acc

MedNLI

The MedNLI dataset consists of sentence pairs developed by physicians from the Past Medical History section of MIMIC-III clinical …

📊 5 results
📏 Metrics: Accuracy, Params (M)

MultiNLI

The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433K sentence pairs. Its size and mode of collection are modeled closely …

📊 63 results
📏 Metrics: Matched, Mismatched, Accuracy, Dev Matched, Dev Mismatched

Probability words NLI

This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP), e.g. words …

📊 1 results
📏 Metrics: 1:1 Accuracy

QNLI

The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 …

📊 42 results
📏 Metrics: Accuracy

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 1 results
📏 Metrics: Accuracy

RCB

The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an …

📊 6 results
📏 Metrics: Average F1, Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 89 results
📏 Metrics: Accuracy

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 1 results
📏 Metrics: 1:1 Accuracy

SNLI

The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence-pairs manually labeled as entailment, contradiction, and neutral. Premises are …

📊 88 results
📏 Metrics: % Test Accuracy, % Train Accuracy, Parameters, Dev Accuracy, % Dev Accuracy, Accuracy

SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct …

📊 11 results
📏 Metrics: Accuracy, Dev Accuracy, % Dev Accuracy, % Test Accuracy

TERRa

Textual Entailment Recognition has been proposed recently as a generic task that captures major semantic inference needs across many NLP …

📊 6 results
📏 Metrics: Accuracy

TabFact

TabFact is a large-scale dataset which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables, their relations …

📊 1 results
📏 Metrics: Accuracy

WNLI

The WNLI dataset is a part of the GLUE benchmark used for Natural Language Inference (NLI). It contains pairs of …

📊 22 results
📏 Metrics: Accuracy

XWINO

XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense …

📊 1 results
📏 Metrics: Accuracy

e-SNLI

e-SNLI is used for various goals, such as obtaining full sentence justifications of a model's decisions, improving universal sentence representations …

📊 3 results
📏 Metrics: BLEU, Accuracy

Natural Language Queries

Ego4D

Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning …

📊 10 results
📏 Metrics: R@1 Mean(0.3 and 0.5), R@1 IoU=0.3, R@1 IoU=0.5, R@5 IoU=0.3, R@5 IoU=0.5

Natural Language Understanding

GLUE

General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and …

📊 2 results
📏 Metrics: Average

LexGLUE

Legal General Language Understanding Evaluation (LexGLUE) benchmark is a collection of datasets for evaluating model performance across a diverse set …

📊 8 results
📏 Metrics: ECtHR Task A, ECtHR Task B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, CaseHOLD

STREUSLE

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web …

📊 10 results
📏 Metrics: Tags (Full) Acc, Role F1 (Preps), Function F1 (Preps), Full F1 (Preps)

Nested Mention Recognition

ACE 2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic …

📊 7 results
📏 Metrics: F1

ACE 2005

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic …

📊 9 results
📏 Metrics: F1

Open Information Extraction

BenchIE

BenchIE: a benchmark and evaluation framework for comprehensive evaluation of OIE systems for English, Chinese and German. In contrast to …

📊 11 results
📏 Metrics: Precision, F1, Recall

CaRB

CaRB [Bhardwaj et al., 2019] is developed by re-annotating the dev and test splits of OIE2016 via crowd-sourcing. Besides improving …

📊 25 results
📏 Metrics: F1

LSOIE

LSOIE is a large-scale OpenIE dataset converted from QA-SRL 2.0 in two domains, i.e., Wikipedia and Science. It is 20 …

📊 9 results
📏 Metrics: F1

OIE2016

OIE2016 is the first large-scale OpenIE benchmark. It is created by automatic conversion from QA-SRL [He et al., 2015], a …

📊 12 results
📏 Metrics: F1, AUC

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 4 results
📏 Metrics: F1, AUC

WiRe57

We manually performed the task of Open Information Extraction on 5 short documents, elaborating tentative guidelines for the task, and …

📊 18 results
📏 Metrics: F1

Open Intent Discovery

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 1 results
📏 Metrics: ACC, ARI, NMI

BANKING77

Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents …

📊 1 results
📏 Metrics: ACC, ARI, NMI

CLINC150

This dataset is for evaluating the performance of intent classification systems in the presence of "out-of-scope" queries, i.e., queries that …

📊 1 results
📏 Metrics: ACC, ARI, NMI

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 1 results
📏 Metrics: ACC, ARI, NMI

Open-Domain Question Answering

DuReader

DuReader is a large-scale open-domain Chinese machine reading comprehension dataset. The dataset consists of 200K questions, 420K answers and 1M …

📊 2 results
📏 Metrics: EM

ELI5

ELI5 is a dataset for long-form question answering. It contains 270K complex, diverse questions that require explanatory multi-sentence answers. Web …

📊 6 results
📏 Metrics: Rouge-L, Rouge-1, Rouge-2

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 5 results
📏 Metrics: Exact Match

SearchQA

SearchQA was built using an in-production, commercial search engine. It closely reflects the full pipeline of a (hypothetical) general question-answering …

📊 12 results
📏 Metrics: EM, N-gram F1, Unigram Acc, F1

TQA

The TextbookQuestionAnswering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth …

📊 2 results
📏 Metrics: Exact Match

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 1 results
📏 Metrics: Exact Match

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 4 results
📏 Metrics: Exact Match

Optical Character Recognition (OCR)

FSNS - Test

Arabic handwriting dataset.

📊 3 results
📏 Metrics: Sequence error

I2L-140K

Introduced by Singh, Sumeet S.. “Teaching Machines to Code: Neural Markup Generation with Visual Attention.” ArXiv abs/1802.05415 (2018): n. pag. …

📊 2 results
📏 Metrics: BLEU

VideoDB's OCR Benchmark Public Collection

This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It …

📊 5 results
📏 Metrics: Average Accuracy, Character Error Rate (CER), Word Error Rate (WER)

im2latex-100k

A prebuilt dataset for OpenAI's task for an image-2-latex system. Includes a total of ~100k formulas and images split into train, validation …

📊 1 results
📏 Metrics: BLEU

Paraphrase Generation

MSCOCO

📊 1 results
📏 Metrics: BLEU, iBLEU

Paralex

Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

📊 2 results
📏 Metrics: iBLEU, BLEU

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 2 results
📏 Metrics: iBLEU, BLEU

Paraphrase Identification

AP

This is a paraphrasing dataset created using the adversarial paradigm. A task was designed called the Adversarial Paraphrasing Task (APT) …

📊 1 results
📏 Metrics: MCC

PIT

Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs. Source: [SemEval-2015 …

📊 1 results
📏 Metrics: AP

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 29 results
📏 Metrics: F1, Accuracy, Direct Intrinsic Dimension, Structure Aware Intrinsic Dimension, Dev Accuracy, Dev F1

TURL

Twitter News URL Corpus is the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmarking …

📊 1 results
📏 Metrics: AP

Translated SNLI Dataset in Marathi

A translated version of the SNLI dataset in Marathi, designed for Semantic Textual Similarity …

📊 1 results
📏 Metrics: 1:1 Accuracy

WikiHop

WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …

📊 1 results
📏 Metrics: Accuracy

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

📊 1 results
📏 Metrics: Accuracy

Part-Of-Speech Tagging

DaNE

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme. …

📊 1 results
📏 Metrics: Accuracy (%)

Morphosyntactic-analysis-dataset

This dataset is for evaluation of morphosyntactic analyzers.

📊 1 results
📏 Metrics: BLEX

Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall …

📊 16 results
📏 Metrics: Accuracy

Tweebank

📊 2 results
📏 Metrics: Acc

XGLUE

XGLUE is an evaluation benchmark composed of 11 tasks that span 19 languages. For each task, the training …

📊 1 results
📏 Metrics: Avg. F1

Passage Ranking

MS MARCO

The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: MRR@10

Passage Re-Ranking

MS MARCO

The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: MRR

Personality Recognition in Conversation

CPED

We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multisource knowledge related to empathy and …

📊 4 results
📏 Metrics: Accuracy (%), Macro-F1, Accuracy of Neuroticism, Accuracy of Extraversion, Accuracy of Openness, Accuracy of Agreeableness, Accuracy of Conscientiousness

Phrase Grounding

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators. Source: [Guiding …

📊 3 results
📏 Metrics: Pointing Game Accuracy

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 …

📊 3 results
📏 Metrics: Pointing Game Accuracy

Phrase Ranking

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for …

📊 3 results
📏 Metrics: P@5K, P@50K

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases. Source: [KPTimes: A Large-Scale Dataset for Keyphrase Generation …

📊 4 results
📏 Metrics: P@5K, P@50K

Phrase Tagging

KP20k

KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for …

📊 4 results
📏 Metrics: Precision, Recall, F1

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases. Source: [KPTimes: A Large-Scale Dataset for Keyphrase Generation …

📊 4 results
📏 Metrics: Precision, Recall, F1

Poem meters classification

PCD

The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total number of verses is …

📊 1 results
📏 Metrics: Accuracy

Polyphone disambiguation

CPP

A benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation. Source: [g2pM: A Neural Grapheme-to-Phoneme Conversion Package for …

📊 3 results
📏 Metrics: Accuracy

Prompt Engineering

Caltech-101

The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that …

📊 14 results
📏 Metrics: Harmonic mean

DTD

The Describable Textures Dataset (DTD) contains 5640 texture images in the wild. They are annotated with human-centric attributes inspired by …

📊 14 results
📏 Metrics: Harmonic mean

EuroSAT

Eurosat is a dataset and deep learning benchmark for land use and land cover classification. The dataset is based on …

📊 14 results
📏 Metrics: Harmonic mean

FGVC-Aircraft

FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which …

📊 14 results
📏 Metrics: Harmonic mean

Food-101

The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total …

📊 13 results
📏 Metrics: Harmonic mean

ImageNet

The ImageNet dataset contains 14,197,122 images annotated according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 15 results
📏 Metrics: Harmonic mean

ImageNet-A

The ImageNet-A dataset consists of real-world, unmodified, and naturally occurring examples that are misclassified by ResNet models. Source: [On Robustness …

📊 9 results
📏 Metrics: Top-1 accuracy %

ImageNet-R

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and …

📊 9 results
📏 Metrics: Top-1 accuracy %

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 9 results
📏 Metrics: Top-1 accuracy %

Oxford 102 Flower

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are flowers commonly …

📊 14 results
📏 Metrics: Harmonic mean

Oxford-IIIT Pet Dataset

The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations …

📊 14 results
📏 Metrics: Harmonic mean

SUN397

The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art …

📊 14 results
📏 Metrics: Harmonic mean

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. …

📊 14 results
📏 Metrics: Harmonic mean

UCF101

UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These …

📊 14 results
📏 Metrics: Harmonic mean

Question Answering

AviationQA

AviationQA is introduced in the paper There is No Big Brother or Small Brother: Knowledge Infusion in Language Models …

📊 1 results
📏 Metrics: Hits@1

BBH

BIG-Bench Hard (BBH) is a subset of the BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a …

📊 1 results
📏 Metrics: Accuracy

BLURB

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …

📊 3 results
📏 Metrics: Accuracy

Bamboogle

The Bamboogle dataset is a collection of questions that was constructed to investigate the ability of language models to perform …

📊 9 results
📏 Metrics: Accuracy

BioASQ

BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), …

📊 6 results
📏 Metrics: Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 65 results
📏 Metrics: Accuracy

CODAH

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of …

📊 2 results
📏 Metrics: Accuracy

COPA

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. …

📊 55 results
📏 Metrics: Accuracy

CaseHOLD

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple-choice questions to identify the …

📊 3 results
📏 Metrics: Macro F1 (10-fold)

ChAII - Hindi and Tamil Question Answering

The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions …

📊 1 results
📏 Metrics: Jaccard

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK. Motivation: the task can be …

📊 4 results
📏 Metrics: Accuracy

Children's Book Test

📊 8 results
📏 Metrics: Accuracy-CN, Accuracy-NE

CliCR

CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case …

📊 2 results
📏 Metrics: F1

CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure …

📊 9 results
📏 Metrics: In-domain, Out-of-domain, Overall

Complex-CronQuestions

A filtered version of CronQuestions that can better demonstrate the model’s inference ability for complex temporal questions.

📊 3 results
📏 Metrics: Hits@1

ComplexWebQuestions

ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set …

📊 1 results
📏 Metrics: EM

ConditionalQA

ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable …

📊 3 results
📏 Metrics: Conditional (answers), Conditional (w/ conditions), Overall (answers), Overall (w/ conditions)

ConvFinQA

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 …

📊 3 results
📏 Metrics: Execution Accuracy

CronQuestions

CRONQUESTIONS, the Temporal KGQA dataset, consists of two parts: a KG with temporal annotations, and a set of natural language …

📊 10 results
📏 Metrics: Hits@1

DROP

Discrete Reasoning Over Paragraphs (DROP) is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a …

📊 6 results
📏 Metrics: Accuracy

DaNetQA

DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring: they are generated in unprompted and …

📊 6 results
📏 Metrics: Accuracy

DuoRC

DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in …

📊 3 results
📏 Metrics: Accuracy

EgoTaskQA

The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. It …

📊 4 results
📏 Metrics: Direct

FEVER

FEVER is a publicly available dataset for fact extraction and verification against textual sources. It consists of 185,445 claims manually …

📊 7 results
📏 Metrics: EM

FQuAD

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ …

📊 6 results
📏 Metrics: EM, F1

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …

📊 4 results
📏 Metrics: F1, Rouge-L

FinQA

FinQA is a new large-scale dataset with Question-Answering pairs over Financial reports, written by financial experts. The dataset contains 8,281 …

📊 6 results
📏 Metrics: Execution Accuracy, Program Accuracy

GraphQuestions

GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …

📊 1 results
📏 Metrics: Accuracy

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 1 results
📏 Metrics: Accuracy

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 22 results
📏 Metrics: JOINT-F1, ANS-EM, ANS-F1, SUP-EM, SUP-F1, JOINT-EM

HybridQA

A new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and …

📊 3 results
📏 Metrics: ANS-EM

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on …

📊 1 results
📏 Metrics: Exact Match, F1

KQA Pro

A large-scale dataset for Complex KBQA. Source: [KQA Pro: A Large-Scale Dataset with Interpretable Programs and Accurate SPARQLs for Complex …

📊 1 results
📏 Metrics: Accuracy

MMLU

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively …

📊 1 results
📏 Metrics: Accuracy

MRQA

The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems. …

📊 2 results
📏 Metrics: Average F1

MS MARCO

The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first …

📊 4 results
📏 Metrics: Rouge-L, BLEU-1

MapEval-API

MapEval-API contains 300 question-answer pairs. The task is to answer questions by fetching the necessary information using external Map APIs.

📊 2 results
📏 Metrics: Accuracy (%)

MapEval-Textual

MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer question …

📊 1 results
📏 Metrics: Accuracy (%)

Mathematics Dataset

This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This …

📊 3 results
📏 Metrics: Accuracy

MedQA

Multiple choice question answering based on the United States Medical License Exams (USMLE). The dataset is collected from the professional …

📊 27 results
📏 Metrics: Accuracy

MetaQA

The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written …

📊 1 results
📏 Metrics: AnswerExactMatch (Question Answering)

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat …

📊 4 results
📏 Metrics: EM, F1

MultiQ

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. …

📊 4 results
📏 Metrics: Accuracy

MultiRC

MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions, i.e., questions that can be answered by …

📊 30 results
📏 Metrics: F1, EM

MultiTQ

MULTITQ is a large-scale dataset featuring ample relevant facts and multiple temporal granularities.

📊 9 results
📏 Metrics: Hits@1, Hits@10

NExT-QA (Open-ended VideoQA)

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal …

📊 6 results
📏 Metrics: Accuracy, Confidence Score

NarrativeQA

The NarrativeQA dataset includes a list of documents with Wikipedia summaries, links to full stories, and questions and answers. Source: …

📊 8 results
📏 Metrics: Rouge-L, BLEU-1, BLEU-4, METEOR

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 46 results
📏 Metrics: EM

NewsQA

The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs. * Documents are CNN news articles. …

📊 16 results
📏 Metrics: EM, F1

OTT-QA

The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to …

📊 3 results
📏 Metrics: ANS-EM

OpenBookQA

OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. …

📊 40 results
📏 Metrics: Accuracy

PIQA

PIQA is a dataset for commonsense reasoning, and was created to investigate the physical knowledge of existing models in NLP. …

📊 67 results
📏 Metrics: Accuracy

PeerQA

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …

📊 5 results
📏 Metrics: Prometheus-2 Answer Correctness, Rouge-L, AlignScore

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship …

📊 2 results
📏 Metrics: Accuracy

PubChemQA

PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question, i.e., please …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 26 results
📏 Metrics: Accuracy

QASPER

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language …

📊 1 results
📏 Metrics: Token F1

QuAC

Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer …

📊 2 results
📏 Metrics: F1, HEQD, HEQQ

QuALITY

QuALITY (Question Answering with Long Input Texts, Yes!) is a multiple-choice question answering dataset for long document comprehension. The dataset …

📊 1 results
📏 Metrics: Accuracy

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 19 results
📏 Metrics: Accuracy

RACE

The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …

📊 6 results
📏 Metrics: RACE-m, RACE-h, RACE

ReClor

Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …

📊 3 results
📏 Metrics: Accuracy, Accuracy (easy), Accuracy (hard)

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from …

📊 1 results
📏 Metrics: Accuracy

RuOpenBookQA

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts. Motivation RuOpenBookQA …

📊 4 results
📏 Metrics: Accuracy

SCDE

SCDE is a human-created sentence cloze dataset, collected from public school English examinations in China. The task requires a model …

📊 1 results
📏 Metrics: BA, PA, DE

SIQA

Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus …

📊 24 results
📏 Metrics: Accuracy

SQA3D

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it situates from an …

📊 7 results
📏 Metrics: AnswerExactMatch (Question Answering)

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 2 results
📏 Metrics: Exact Match, F1

SWAG

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate …

📊 1 results
📏 Metrics: Accuracy

SberQuAD

A large-scale analogue of Stanford SQuAD in the Russian language is a valuable resource that has not been …

📊 3 results
📏 Metrics: EM, F1

SchizzoSQUAD

The “Mental Health” forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts …

📊 1 results
📏 Metrics: Average F1, Averaged Precision

SimpleQuestions

SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding …

📊 1 results
📏 Metrics: F1

StepGame

A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

📊 1 results
📏 Metrics: 1-of-100 Accuracy

StoryCloze

Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. …

📊 20 results
📏 Metrics: Accuracy

StrategyQA

StrategyQA is a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred …

📊 11 results
📏 Metrics: Accuracy, EM

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset, aiming to stimulate progress of QA research …

📊 1 results
📏 Metrics: Exact Match (EM)

TIQ

Existing benchmarks for temporal QA focus on a single information source (either a KB or a text corpus), and include …

📊 9 results
📏 Metrics: P@1

TempQA-WD

TempQA-WD is a benchmark dataset for temporal reasoning designed to encourage research in extending the present approaches to target a …

📊 1 results
📏 Metrics: F1

TempQuestions

Here, we take a key step in this direction and release a new benchmark, TempQuestions, containing 1,271 questions, that are …

📊 4 results
📏 Metrics: Hits@1, F1

TimeQuestions

Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class …

📊 16 results
📏 Metrics: P@1

Torque

Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Source: …

📊 2 results
📏 Metrics: F1, EM, C

TrecQA

Text Retrieval Conference Question Answering (TrecQA) is a dataset created from the TREC-8 (1999) to TREC-13 (2004) Question Answering tracks. …

📊 12 results
📏 Metrics: MAP, MRR

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 51 results
📏 Metrics: EM, F1

TruthfulQA

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises …

📊 30 results
📏 Metrics: MC1, MC2, % true, % info, % true (GPT-judge), BLEURT, ROUGE, BLEU, EM, Accuracy

TweetQA

With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering …

📊 3 results
📏 Metrics: BLEU-1, ROUGE-L

UniProtQA

UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt, and consists …

📊 2 results
📏 Metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR

WebQuestions

The WebQuestions dataset is a question answering dataset using Freebase as the knowledge base and contains 6,642 question-answer pairs. It …

📊 36 results
📏 Metrics: EM, F1

WebQuestionsSP

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …

📊 1 results
📏 Metrics: Accuracy

WebSRC

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …

📊 1 results
📏 Metrics: F1

WikiHop

WikiHop is a multi-hop question-answering dataset. The query of WikiHop is constructed with entities and relations from WikiData, while supporting …

📊 9 results
📏 Metrics: Test

WikiQA

The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain …

📊 23 results
📏 Metrics: MAP, MRR

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 2 results
📏 Metrics: Exact Match (EM)

WikiTableQuestions

WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was …

📊 2 results
📏 Metrics: Accuracy, Accuracy (Test)

catbAbI LM-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: Accuracy (mean)

catbAbI QA-mode

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose …

📊 4 results
📏 Metrics: 1:1 Accuracy

Question Generation

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an …

📊 3 results
📏 Metrics: ROUGE-L

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 2 results
📏 Metrics: QAE, R-QAE

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct …

📊 2 results
📏 Metrics: QAE, R-QAE

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and …

📊 2 results
📏 Metrics: QAE, R-QAE

WeiboPolls

The dataset focuses on social media polls collected from Weibo, a …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-L, BLEU-1, BLEU-3

Question-Answer categorization

QC-Science

QC-Science contains 47832 question-answer pairs belonging to the science domain tagged with labels of the form subject - chapter - …

📊 6 results
📏 Metrics: R@5, R@10, R@15, R@20

Reading Comprehension

AdversarialQA

We have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop. We use three different models; BiDAF (Seo …

📊 3 results
📏 Metrics: Overall: F1, D(BiDAF): F1, D(BERT): F1, D(RoBERTa): F1

MuSeRC

We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple …

📊 6 results
📏 Metrics: Average F1, EM

RACE

The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 …

📊 24 results
📏 Metrics: Accuracy, Accuracy (Middle), Accuracy (High)

ReCAM

Our shared task has three subtasks. Subtasks 1 and 2 focus on evaluating machine learning models' performance with regard …

📊 1 results
📏 Metrics: Accuracy

ReClor

Logical reasoning is an important ability to examine, analyze, and critically evaluate arguments as they occur in ordinary language as …

📊 10 results
📏 Metrics: Test

Reading Order Detection

ROOR

ROOR is a reading order prediction (ROP) benchmark which annotates layout reading order as ordering relations. Layout reading order is …

📊 4 results
📏 Metrics: Segment-level F1

ReadingBank

ReadingBank is a benchmark dataset for reading order detection built with weak supervision from WORD documents, which contains 500K document …

📊 2 results
📏 Metrics: Average Relative Distance (ARD), Average Page-level BLEU

Recognizing Emotion Cause in Conversations

EmoCause

EmoCause is a dataset of annotated emotion cause words in emotional situations from the EmpatheticDialogues valid and test set. The …

📊 6 results
📏 Metrics: Top-1 Recall, Top-3 Recall, Top-5 Recall

RECCON

RECCON is a dataset for the task of recognizing emotion cause in conversations. Source: Recognizing Emotion Cause in Conversations

📊 2 results
📏 Metrics: F1, Exact Span F1, F1(Pos), F1(Neg)

Reinforcement Learning

iris

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, …

📊 1 results
📏 Metrics: 10 Images, 4*4 Stitching, Exact Accuracy

Relation Classification

AbstRCT - Neoplasm

The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated …

📊 1 results
📏 Metrics: Macro F1

CDCP

The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of …

📊 1 results
📏 Metrics: Macro F1

DRI Corpus

The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper of …

📊 1 results
📏 Metrics: Macro F1

Discovery

The Discovery dataset consists of adjacent sentence pairs (s1,s2) with a discourse marker (y) that occurred at the beginning of …

📊 1 results
📏 Metrics: 1:1 Accuracy

FewRel

The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three …

📊 5 results
📏 Metrics: F1 (10-way 1-shot), F1 (10-way 5-shot), F1 (5-way 1-shot), F1 (5-way 5-shot), F1

TACRED

TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used …

📊 17 results
📏 Metrics: F1

Relation Extraction

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

📊 1 results
📏 Metrics: Macro F1

2012 i2b2 Temporal Relations

The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on the …

📊 1 results
📏 Metrics: Macro F1

ACE 2004

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic …

📊 9 results
📏 Metrics: RE+ Micro F1, RE Micro F1, NER Micro F1, Cross Sentence

ACE 2005

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic …

📊 19 results
📏 Metrics: RE Micro F1, RE+ Micro F1, NER Micro F1, Sentence Encoder, Relation classification F1, Cross Sentence, Relation F1

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. A significant …

📊 10 results
📏 Metrics: RE+ Macro F1, RE Macro F1, NER Macro F1

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. …

📊 2 results
📏 Metrics: F1

CDR

The BioCreative V CDR task corpus is manually annotated for chemicals, diseases and chemical-induced disease (CID) relations. It contains the …

📊 9 results
📏 Metrics: F1

ChemProt

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI …

📊 12 results
📏 Metrics: F1, Micro F1

CoNLL04

The CoNLL04 dataset is a benchmark dataset used for relation extraction tasks. It contains 1,437 sentences, each of which has …

📊 13 results
📏 Metrics: RE+ Macro F1, RE+ Micro F1, NER Macro F1, NER Micro F1

DDI

The DDIExtraction 2013 task relies on the DDI corpus which contains MedLine abstracts on drug-drug interactions as well as documents …

📊 3 results
📏 Metrics: F1, Micro F1

DWIE

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation …

📊 3 results
📏 Metrics: F1-Hard

Dataset: Relationship extraction for knowledge graph creation from biomedical literature (Gene-Disease relationships)

This is the dataset used for classifying Gene-Disease relationship types from sentences. The dataset consists of 3 files: * manually_annotated_set.xlsx …

📊 2 results
📏 Metrics: F1

DocRED

DocRED (Document-Level Relation Extraction Dataset) is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset …

📊 56 results
📏 Metrics: F1, Ign F1

FUNSD

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary …

📊 9 results
📏 Metrics: F1

FewRel

The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three …

📊 2 results
📏 Metrics: F1, Precision, Recall

GAD

GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.

📊 3 results
📏 Metrics: F1, Micro F1

GDA

The gene-disease associations corpus contains 30,192 titles and abstracts from PubMed articles that have been automatically labelled for genes, diseases …

📊 9 results
📏 Metrics: F1

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created …

📊 1 results
📏 Metrics: F1

NYT10-HRL

A dataset from "A Hierarchical Framework for Relation Extraction with Reinforcement Learning".

📊 10 results
📏 Metrics: F1

NYT11-HRL

Preprocessed version of NYT11. Each relational triple is formatted as follows: rtext : relation type em1 : source entity mention …

📊 11 results
📏 Metrics: F1

PGR

Phenotype-Gene Relations (PGR) is a corpus that consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 …

📊 1 results
📏 Metrics: Macro F1

REBEL

Wikipedia abstracts automatically annotated with WikiData entities and relations that are entailed by the text. Over 9 million triplets.

📊 2 results
📏 Metrics: Triplet F1 (strict EL)

Re-TACRED

The Re-TACRED dataset is a significantly improved version of the TACRED dataset for relation extraction. Using new crowd-sourced labels, Re-TACRED …

📊 7 results
📏 Metrics: F1

SciERC

The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts …

📊 3 results
📏 Metrics: F1, NER Micro F1, RE+ Micro F1

SemEval-2010 Task-8

The dataset for the SemEval-2010 Task 8 is a dataset for multi-way classification of mutually exclusive semantic relations between pairs …

📊 22 results
📏 Metrics: F1

TACRED

TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used …

📊 36 results
📏 Metrics: F1, F1 (10% Few-Shot), F1 (5% Few-Shot), F1 (1% Few-Shot), F1 (Zero-Shot)

TACRED-Revisited

The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation extraction by relabeling the dev and test sets using expert …

📊 3 results
📏 Metrics: F1

WNUT 2020

The training and development dataset for our task was taken from previous work on wet lab corpus (Kulkarni et al., …

📊 1 results
📏 Metrics: F1, Precision, Recall

WebNLG

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in …

📊 10 results
📏 Metrics: F1, NER Micro F1

Representation Learning

Animals-10

It contains about 28K medium-quality animal images belonging to 10 categories: dog, cat, horse, spider, butterfly, chicken, sheep, cow, …

📊 1 results
📏 Metrics: 1:1 Accuracy

SciDocs

SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks. Source: Allen Institute for AI

📊 7 results
📏 Metrics: Avg.

Sports10

Games dataset containing 100,000 Gameplay Images of 175 Video Games across 10 Sports Genres - AMERICAN FOOTBALL, BASKETBALL, BIKE …

📊 1 results
📏 Metrics: Silhouette Score

Response Generation

ArgSciChat

ArgSciChat is an argumentative dialogue dataset. It consists of 498 messages collected from 41 dialogues on 20 scientific papers. It …

📊 3 results
📏 Metrics: Message-F1, BScore, Mover

MMConv

The main goal of the data collection is to acquire highly natural conversations that cover a wide variety of styles …

📊 2 results
📏 Metrics: BLEU, Comb., Inform, Success

SIMMC2.0

Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the …

📊 2 results
📏 Metrics: BLEU

Retrieval

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to …

📊 3 results
📏 Metrics: Queries per second

InfoSeek

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with …

📊 1 results
📏 Metrics: Recall@5

MVK

The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of a new Marine Video …

📊 1 results
📏 Metrics: text-to-video Mean Rank

Natural Questions

The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. …

📊 3 results
📏 Metrics: Queries per second

OK-VQA

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer. Source: [OK-VQA: A …

📊 2 results
📏 Metrics: Recall@5

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 1 results
📏 Metrics: Recall@5

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary …

📊 1 results
📏 Metrics: Accuracy (Top-1)

PubMedQA corpus with metadata

PubMedQA-MetaGen: Metadata-Enriched PubMedQA Corpus. PubMedQA-MetaGen is a metadata-enriched version of the PubMedQA biomedical question-answering dataset, created using the …

📊 1 results
📏 Metrics: Accuracy (Top-1)

Quora Question Pairs

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary …

📊 4 results
📏 Metrics: Queries per second

ToolLens

The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out …

📊 1 results
📏 Metrics: COMP@

Role-filler Entity Extraction

MUC-4

A dataset for evaluating a system's understanding of given passages.

📊 1 results
📏 Metrics: Avg. F1

Romanian Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Rumour Detection

1

111

📊 1 results
📏 Metrics: 0..5sec

Sepehr_RumTel01

The expansion of social networks has accelerated the transmission of information and news in every community. Over the past few …

📊 2 results
📏 Metrics: F-Measure

Sarcasm Detection

MUStARD++

MUStARD++ is a multimodal sarcasm detection dataset (MUStARD) pre-annotated with 9 emotions. It can be used for the task of …

📊 1 results
📏 Metrics: Precision, Recall, F1

WITS

This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV …

📊 1 results
📏 Metrics: R1

iSarcasm

iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic. Each sarcastic tweet is further labelled for …

📊 1 results
📏 Metrics: F1-Score

Scientific Document Summarization

CL-SciSumm

📊 1 results
📏 Metrics: ROUGE-2

Semantic Parsing

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 3 results
📏 Metrics: Accuracy

CFQ

A large and realistic natural language question answering dataset. Source: Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

📊 5 results
📏 Metrics: Exact Match

GraphQuestions

GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing …

📊 1 results
📏 Metrics: F1 Score

SParC

SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces …

📊 1 results
📏 Metrics: Exact

SQA

The SQA dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has …

📊 2 results
📏 Metrics: Denotation Accuracy, Accuracy

WebQuestionsSP

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base …

📊 4 results
📏 Metrics: Accuracy

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 5 results
📏 Metrics: Accuracy, Denotation accuracy (test)

WikiTableQuestions

WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was …

📊 22 results
📏 Metrics: Accuracy (Test), Accuracy (Dev), Accuracy, Test Accuracy

Semantic Retrieval

Contract Discovery

A new shared task of semantic retrieval from legal texts, in which a so-called contract discovery is to be performed, …

📊 6 results
📏 Metrics: Soft-F1

Semantic Role Labeling

CoNLL-2009

The task builds on the CoNLL-2008 task and extends it to multiple languages. The core of the task is to …

📊 1 results
📏 Metrics: F1 (Arg.), F1 (Prd.)

Semantic Similarity

BIOSSES

The BIOSSES data set comprises a total of 100 sentence pairs, all of which were selected from the "[TAC2 Biomedical Summarization Track …

📊 3 results
📏 Metrics: Pearson Correlation

CHIP-STS

CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for …

📊 1 results
📏 Metrics: Macro F1

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 5 results
📏 Metrics: MSE, Pearson Correlation, Spearman Correlation

Semantic Textual Similarity

CxC

Crisscrossed Captions (CxC) contains 247,315 human-labeled annotations including positive and negative associations between image pairs, caption pairs and image-caption pairs. …

📊 4 results
📏 Metrics: avg ± std

MRPC

Microsoft Research Paraphrase Corpus (MRPC) is a corpus consisting of 5,801 sentence pairs collected from newswire articles. Each pair is …

📊 43 results
📏 Metrics: Accuracy, F1

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 30 results
📏 Metrics: Spearman Correlation

SICK

The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of …

📊 22 results
📏 Metrics: Spearman Correlation

STS Benchmark

STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval …

📊 62 results
📏 Metrics: Pearson Correlation, Spearman Correlation, Accuracy, Dev Pearson Correlation, Dev Spearman Correlation

SentEval

SentEval is a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary …

📊 5 results
📏 Metrics: MRPC, SICK-R, SICK-E, STS

Semantic entity labeling

EC-FUNSD

EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark of semantic entity recognition (SER) and entity linking (EL), designed for the …

📊 8 results
📏 Metrics: F1

FUNSD

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary …

📊 15 results
📏 Metrics: F1

Sentence Completion

HellaSwag

HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are …

📊 86 results
📏 Metrics: Accuracy

Sentence Ordering

EconLogicQA

EconLogicQA is a benchmark designed to test the sequential reasoning skills of large language models (LLMs) in economics, business, and …

📊 18 results
📏 Metrics: Accuracy

Sentiment Analysis

BanglaBook

This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis …

📊 13 results
📏 Metrics: Weighted Average F1-score

DBRD

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly …

📊 3 results
📏 Metrics: Accuracy, F1

DynaSent

DynaSent is an English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using …

📊 12 results
📏 Metrics: Macro F1, 10 fold Cross validation

HARD

The Hotel Arabic-Reviews Dataset (HARD) contains 93,700 hotel reviews in the Arabic language. The hotel reviews were collected from the Booking.com website …

📊 1 results
📏 Metrics: Accuracy

IMDb Movie Reviews

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database …

📊 2 results
📏 Metrics: Accuracy (2 classes), F1 Macro

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 18 results
📏 Metrics: Accuracy, Training Time

SLUE

Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …

📊 8 results
📏 Metrics: Recall (%), F1 (%), Text model

SST-3

SST-5 is the Stanford Sentiment Treebank 5-way classification dataset (positive, somewhat positive, neutral, somewhat negative, negative). To create SST-3 (positive, …

📊 11 results
📏 Metrics: Macro F1

Sentiment Merged

This is a dataset for 3-way sentiment classification of reviews (negative, neutral, positive). It is a merge of [Stanford Sentiment …

📊 10 results
📏 Metrics: Macro F1

TweetEval

TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks. Source: [TweetEval: Unified Benchmark and Comparative Evaluation for …

📊 7 results
📏 Metrics: Emoji, Emotion, Hate, Irony, Offensive, Sentiment, Stance, ALL

Slot Filling

ATIS

The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking …

📊 12 results
📏 Metrics: F1

CAIS

We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS), and annotate them with slot tags and intent labels. The …

📊 1 results
📏 Metrics: F1

Dialogue State Tracking Challenge

The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art …

📊 1 results
📏 Metrics: F1 score

MASSIVE

MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks …

📊 3 results
📏 Metrics: Slot F1 Score

MixATIS

Dataset is constructed from single intent dataset ATIS. This is a publicly available multi intent dataset, which can be downloaded …

📊 10 results
📏 Metrics: Micro F1

MixSNIPS

Dataset is constructed from single intent dataset SNIPS. This is a publicly available multi intent dataset, which can be downloaded …

📊 11 results
📏 Metrics: Micro F1

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

📊 1 results
📏 Metrics: FITB

ProSLU

In the paper, to bridge the research gap, we propose a new and important task, Profile-based Spoken Language Understanding (ProSLU), …

📊 1 results
📏 Metrics: F1

SLURP

A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets. …

📊 5 results
📏 Metrics: F1

SNIPS

The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of …

📊 7 results
📏 Metrics: F1, F1 (1-shot) avg, F1 (5-shot) avg

Slovak Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Source Code Summarization

CoDesc

CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and …

📊 1 results
📏 Metrics: BLEU-4

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 1 results
📏 Metrics: F1

DeepCom-Java

The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.

📊 2 results
📏 Metrics: BLEU-4, METEOR

Java scripts

The Java dataset introduced in Hybrid-DeepCom (Deep code comment generation with hybrid lexical and syntactical information), commonly used to evaluate …

📊 1 results
📏 Metrics: BLEU-4, METEOR

ParallelCorpus-Python

The Python dataset introduced in the Parallel Corpus paper ([A Parallel Corpus of Python Functions and Documentation Strings for Automated …

📊 2 results
📏 Metrics: BLEU-4, METEOR

Spanish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Speaker Attribution in German Parliamentary Debates (GermEval 2023, subtask 1)

GePaDe

This dataset encompasses 265 speeches (over 200,000 tokens) from the German Bundestag, primarily from the 19th legislative term (2017-2021), given …

📊 1 results
📏 Metrics: F1

Speech-to-Text Translation

MuST-C

MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English …

📊 2 results
📏 Metrics: SacreBLEU

Stance Detection

ARC (AI2 Reasoning Challenge)

The AI2’s Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset, containing questions from science exams from grade 3 to …

📊 1 results
📏 Metrics: F1

Dhoroni

Climate change poses critical challenges globally, disproportionately affecting low-income countries that often lack resources and linguistic representation on the international …

📊 1 results
📏 Metrics: Accuracy, F1 Score, Precision, Recall

FNC-1

FNC-1 was designed as a stance detection dataset and it contains 75,385 labeled headline and article pairs. The pairs are …

📊 1 results
📏 Metrics: F1

MGTAB

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types …

📊 4 results
📏 Metrics: Acc, F1

P-Stance

P-Stance: A Large Dataset for Stance Detection in Political Domain 2021

📊 1 results
📏 Metrics: Average F1

Perspectrum

Perspectrum is a dataset of claims, perspectives and evidence, making use of online debate websites to create the initial data …

📊 1 results
📏 Metrics: F1

RuStance

Includes Russian tweets and news comments from multiple sources, covering multiple stories, as well as text classification approaches to stance …

📊 1 results
📏 Metrics: F1

Snopes

Fact-checking (FC) articles which contain pairs (multimodal tweet and a FC-article) from snopes.com. Source: [Where Are the Facts? Searching for …

📊 1 results
📏 Metrics: F1

VAST

VAST consists of a large range of topics covering broad themes, such as politics (e.g., ‘a Palestinian state’), education (e.g., …

📊 1 results
📏 Metrics: F1

Stereotypical Bias Analysis

CrowS-Pairs

CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. In CrowS-Pairs …

📊 4 results
📏 Metrics: Gender, Religion, Race/Color, Sexual Orientation, Age, Nationality, Disability, Physical Appearance, Socioeconomic status, Overall

Story Generation

WritingPrompts

WritingPrompts is a large dataset of 300K human-written stories paired with writing prompts from an online forum. Source: [Hierarchical Neural …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, Distinct-4

Subjectivity Analysis

Czech Subjectivity Dataset

Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions. See the paper description …

📊 5 results
📏 Metrics: Accuracy

SUBJ

Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating …

📊 16 results
📏 Metrics: Accuracy

Table-based Fact Verification

TabFact

TabFact is a large-scale dataset which consists of 117,854 manually annotated statements with regard to 16,573 Wikipedia tables, their relations …

📊 15 results
📏 Metrics: Test, Val

Table-to-Text Generation

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 2 results
📏 Metrics: METEOR, BLEU, BERT, BLEURT, Mover, TER, FactSpotter

E2E

End-to-End NLG Challenge (E2E) aims to assess whether recent end-to-end NLG systems can generate more complex output by learning from …

📊 2 results
📏 Metrics: BLEU, CIDEr, METEOR, NIST, ROUGE-L

WikiBio

This dataset gathers 728,321 biographies from English Wikipedia. It aims at evaluating text generation algorithms. For each article, we provide …

📊 4 results
📏 Metrics: BLEU, ROUGE, PARENT

Wikipedia Person and Animal Dataset

This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions based on Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

📊 2 results
📏 Metrics: BLEU, ROUGE, METEOR

Task-Oriented Dialogue Systems

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 2 results
📏 Metrics: METEOR

Temporal Information Extraction

TempEval-3

Within the SemEval-2013 evaluation exercise, the TempEval-3 shared task aims to advance research on temporal information processing. It follows on …

📊 1 results
📏 Metrics: Temporal awareness

Temporal Relation Extraction

Vinoground

A temporal counterfactual dataset composed of 1000 short and natural video-caption pairs.

📊 16 results
📏 Metrics: Text Score, Video Score, Group Score

Text Classification

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 1 results
📏 Metrics: Accuracy

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 21 results
📏 Metrics: Error

Adverse Drug Events (ADE) Corpus

Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. A significant …

📊 1 results
📏 Metrics: F1 - macro

An Amharic News Text classification Dataset

In NLP, text classification is one of the primary problems we try to solve and its uses in language analyses …

📊 2 results
📏 Metrics: Accuracy

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 1 results
📏 Metrics: Accuracy

BANKING77

Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents …

📊 1 results
📏 Metrics: Accuracy

BLURB

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, …

📊 3 results
📏 Metrics: F1

Bala-Copa

The Balanced Choice of Plausible Alternatives dataset is a benchmark for training machine learning models that are robust to superficial …

📊 3 results
📏 Metrics: Accuracy

DBpedia

DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia …

📊 19 results
📏 Metrics: Error

HateXplain

Covers multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly …

📊 4 results
📏 Metrics: Accuracy (2 classes), F1 Macro

IMDb Movie Reviews

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database …

📊 2 results
📏 Metrics: AUC, Accuracy (2 classes), F1 Macro

Lot-insts

LoT-insts contains over 25k classes whose frequencies are naturally long-tail distributed. Its test set from four different subsets: many-, medium-, …

📊 5 results
📏 Metrics: Accuracy, Macro-F1

MR

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect …

📊 9 results
📏 Metrics: Accuracy

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 31 results
📏 Metrics: Accuracy

Ohsumed

Ohsumed includes medical abstracts from the MeSH categories of the year 1991. In [Joachims, 1997] were used the first 20,000 …

📊 9 results
📏 Metrics: Accuracy

Overruling

The Overruling dataset is a law dataset corresponding to the task of determining when a sentence is overruling a prior …

📊 3 results
📏 Metrics: F1(10-fold)

RCV1

The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles produced by Reuters …

📊 3 results
📏 Metrics: Accuracy, Macro F1, Micro F1, P@1, P@3, P@5, nDCG@1, nDCG@3, nDCG@5

SILICONE Benchmark

The Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE) benchmark is a collection of resources for training, evaluating, and analyzing …

📊 1 results
📏 Metrics: 1:1 Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Accuracy

Social media attributions of YouTube comments

Data set constructed from YouTube comments (72,098 comments posted by 43,859 users on 623 videos relevant to the crisis)

📊 2 results
📏 Metrics: Accuracy (2 classes), F1 Macro

TREC-10

A question type classification dataset with 6 classes for questions about a person, location, numeric information, etc. The test split …

📊 1 results
📏 Metrics: Accuracy

Terms of Service

The Terms of Service dataset is a law dataset corresponding to the task of identifying whether contractual terms are potentially …

📊 3 results
📏 Metrics: F1(10-fold)

This is not a Dataset

We introduce a large semi-automatically generated dataset of ~400,000 descriptive sentences about commonsense knowledge that can be true or false …

📊 2 results
📏 Metrics: Accuracy, Coherence

UK Key Stage Readability

Education is increasingly data-driven, and the ability to analyse and adapt educational materials quickly and effectively is important for keeping …

📊 15 results
📏 Metrics: F1

WNUT-2020 Task 2

📊 1 results
📏 Metrics: F1

Yahoo! Answers

The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and …

📊 9 results
📏 Metrics: Accuracy

arXiv-10

Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly …

📊 3 results
📏 Metrics: Accuracy

Text Clustering

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 2 results
📏 Metrics: Accuracy

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 31 results
📏 Metrics: V-Measure

Text Generation

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN …

📊 1 results
📏 Metrics: ROUGE-L

COCO Captions

COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …

📊 4 results
📏 Metrics: BLEU-2, BLEU-3, BLEU-4, BLEU-5

CSL

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …

📊 1 results
📏 Metrics: ROUGE-L

CommonGen

CommonGen is constructed through a combination of crowdsourced and existing caption corpora, and consists of 79k commonsense descriptions over 35k unique …

📊 4 results
📏 Metrics: CIDEr, METEOR, BLEU-4, SPICE

Czech restaurant information

Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It …

📊 3 results
📏 Metrics: METEOR

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 3 results
📏 Metrics: BLEU, METEOR, FactSpotter

DailyDialog

DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4

HarmfulQA

As a part of our research efforts toward making LLMs safer for public …

📊 1 results
📏 Metrics: ASR

LCSTS

LCSTS is a large Chinese short text summarization corpus constructed from the Chinese microblogging website Sina Weibo, which …

📊 1 results
📏 Metrics: ROUGE-L

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 2 results
📏 Metrics: eval_loss

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

📊 4 results
📏 Metrics: BLEU-1, Perplexity

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

📊 4 results
📏 Metrics: Distinct-3, Distinct-4, Distinct-2, Perplexity

SciQ

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in …

📊 3 results
📏 Metrics: Accuracy

Text Simplification

ASSET

ASSET is a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification …

📊 11 results
📏 Metrics: BLEU, SARI (EASSE>=0.2.1), METEOR, FKGL, QuestEval (Reference-less, BERTScore)

DEplain-APA-doc

DEplain-APA-doc: A German Parallel Corpus for Document Simplification on News Texts DEplain is a new dataset of parallel, professionally …

📊 3 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

DEplain-APA-sent

DEplain-APA-sent: A German Parallel Corpus for Sentence Simplification on News Texts DEplain is a new dataset of parallel, professionally …

📊 2 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

DEplain-web-doc

DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts DEplain is a new dataset of parallel, professionally …

📊 3 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

DEplain-web-sent

DEplain-web-sent: A German Parallel Corpus for Sentence Simplification on Web Texts DEplain is a new dataset of parallel, professionally …

📊 2 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, BertScore (Precision), FRE (Flesch Reading Ease)

Newsela

The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus that …

📊 10 results
📏 Metrics: SARI, BLEU

TurkCorpus

TurkCorpus, a dataset with 2,359 original sentences from English Wikipedia, each with 8 manual reference simplifications. The dataset is divided …

📊 20 results
📏 Metrics: SARI (EASSE>=0.2.1), BLEU, METEOR, FKGL, QuestEval (Reference-less, BERTScore)

Text Summarization

ACI-Bench

Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 27 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

BigPatent

Consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Source: [BIGPATENT: A Large-Scale Dataset …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

BillSum

BillSum is the first dataset for summarization of US Congressional and California state bills. The BillSum dataset consists of three …

📊 1 results
📏 Metrics: rouge1

BookSum

BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such …

📊 3 results
📏 Metrics: ROUGE, ROUGE-2, ROUGE-L

CL-SciSumm

📊 1 results
📏 Metrics: ROUGE-2

DialogSum

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics. This work …

📊 3 results
📏 Metrics: Rouge1, Rouge2, RougeL, BertScore

Gazeta

Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form training, …

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BLEU, Meteor

GovReport

GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by …

📊 2 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

How2

The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), …

📊 2 results
📏 Metrics: Content F1, ROUGE-L, ROUGE-1

Klexikon

The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both …

📊 4 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

LCSTS

LCSTS is a large Chinese short text summarization corpus constructed from the Chinese microblogging website Sina Weibo, which …

📊 1 results
📏 Metrics: ROUGE-1

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

📊 26 results
📏 Metrics: Spearman Correlation

MeQSum

MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions. Source: https://www.aclweb.org/anthology/P19-1215.pdf

📊 1 results
📏 Metrics: RougeL

MeetingBank

MeetingBank, a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets. It contains …

📊 2 results
📏 Metrics: ROUGE-L, ROUGE-1, ROUGE-2

MentSum

Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms …

📊 1 results
📏 Metrics: Rouge-1, Rouge-2, Rouge-L

OrangeSum

Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model OrangeSum is a single-document extreme summarization dataset with two tasks: title and …

📊 2 results
📏 Metrics: ROUGE-1

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

📊 28 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

QMSum

QMSum is a new human-annotated benchmark for query-based multi-domain meeting summarisation task, which consists of 1,808 query-summary pairs over 232 …

📊 1 results
📏 Metrics: ROUGE-1

Reddit TIFU

Reddit TIFU dataset is a newly collected Reddit dataset, where TIFU denotes the name of /r/tifu subreddit. There are 122,933 …

📊 5 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

SAMSum

A new dataset with abstractive dialogue summaries. Source: SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

📊 11 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BertScoreF1

WikiHow

WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base …

📊 3 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L, Content F1

XSum

The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create …

📊 1 results
📏 Metrics: ROUGE-1

arXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from …

📊 1 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers. Source: [A Discourse-Aware Attention Model for Abstractive Summarization of …

📊 4 results
📏 Metrics: ROUGE-1, ROUGE-2, ROUGE-L

Text-To-SQL

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents a pioneering, cross-domain dataset that examines the impact of extensive …

📊 16 results
📏 Metrics: Execution Accuracy % (Test), Execution Accuracy % (Dev), Execution Accuracy (Human)

KaggleDBQA

KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and …

📊 2 results
📏 Metrics: Exact Match (EM)

SEDE

SEDE is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written …

📊 1 results
📏 Metrics: PCM-F1 (dev), PCM-F1 (test)

SParC

SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces …

📊 6 results
📏 Metrics: interaction match accuracy, question match accuracy

SQL-Eval

SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed based on Spider. The original link can be found …

📊 1 results
📏 Metrics: Execution Accuracy

Spider 2.0

Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various …

📊 8 results
📏 Metrics: Success Rate

Text-To-Speech Synthesis

20000 utterances

20000 utterances

📊 1 results
📏 Metrics: 10-keyword Speech Commands dataset

LJSpeech

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …

📊 15 results
📏 Metrics: Audio Quality MOS, Pleasantness MOS, Word Error Rate (WER), MOS, WER (%)

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

📊 1 results
📏 Metrics: MOS

Text-to-Image Generation

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 69 results
📏 Metrics: FID, Inception score, FID-1, FID-2, FID-4, FID-8, SOA-C, Zero shot FID

Colors

A large dataset of color names and their respective RGB values stored in CSV.

📊 1 results
📏 Metrics: Validation Accuracy

Conceptual Captions

Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content …

📊 5 results
📏 Metrics: FID

DrawBench

DrawBench is a comprehensive and challenging benchmark for text-to-image models, introduced by the Imagen research team. …

📊 8 results
📏 Metrics: Aesthetics (LAION Aesthetics Predictor), Human Preference Alignment (HPSv2), Text Alignment (SentenceBERT)

Flickr-8k

Contains 8k Flickr images with captions. …

📊 1 results
📏 Metrics: LPIPS

GenEval

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given …

📊 20 results
📏 Metrics: Overall, Single Obj., Two Obj., Color Attri., Colors, Counting, Position

LAION COCO

LAION-COCO is the world’s largest dataset of 600M generated high-quality captions for publicly available web-images. The images are extracted from …

📊 2 results
📏 Metrics: FID

T2I-CompBench

T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional textual prompts from 3 categories (attribute …

📊 2 results
📏 Metrics: Color, Shape, Texture, Complex, Non-Spatial, Spatial

Text-to-Video Generation

EvalCrafter Text-to-Video (ECTV) Dataset

This dataset contains around 10000 videos generated by various methods using the Prompt list. These videos have been evaluated using …

📊 5 results
📏 Metrics: Visual Quality, Motion Quality, Temporal Consistency, Text-to-Video Alignment, Total Score

Kinetics

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 …

📊 1 results
📏 Metrics: Accuracy

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 18 results
📏 Metrics: FVD, CLIPSIM, CLIP-FID, FID

Something-Something V2

The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with …

📊 1 results
📏 Metrics: FVD

WebVid

WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their …

📊 1 results
📏 Metrics: FVD

Topic Models

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 2 results
📏 Metrics: Test perplexity

20NewsGroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

📊 6 results
📏 Metrics: C_v

AG News

AG News (AG’s News Corpus) is a subdataset of AG's corpus of news articles constructed by assembling titles and description …

📊 6 results
📏 Metrics: C_v, NPMI

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

📊 2 results
📏 Metrics: MACC, Topic Coherence@50, Topic coherence@5

Turkish Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Video Generation

BAIR Robot Pushing

Dataset of 64x64 images of a robot pushing objects on a table top. From Berkeley AI Research (BAIR). Source: Self-Supervised …

📊 31 results
📏 Metrics: FVD score, SSIM, PSNR, LPIPS, Cond, Train, Pred, Notes

How2Sign

The How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more …

📊 1 results
📏 Metrics: FVD16

Kinetics-700

Kinetics-700 is a video dataset of 650,000 clips that covers 700 human action classes. The videos include human-object interactions such …

📊 1 results
📏 Metrics: FID, FVD

LAION-400M

LAION-400M is a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity …

📊 6 results
📏 Metrics: CLIP R-Precision, CLIP

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 …

📊 1 results
📏 Metrics: FVD16, Inception score

YouTube Driving

YouTube Driving Dataset contains a massive amount of real-world driving frames with various conditions, from different weather, different regions, to …

📊 1 results
📏 Metrics: FVD16

Vietnamese Text Diacritization

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

The dataset contains training and evaluation data for 12 languages: - Vietnamese - Romanian - Latvian - Czech - Polish …

📊 1 results
📏 Metrics: Alpha-Word accuracy

Visual Question Answering

BenchLMM

Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. …

📊 10 results
📏 Metrics: GPT-3.5 score

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 1 results
📏 Metrics: Accuracy

EarthVQA

Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects …

📊 1 results
📏 Metrics: Overall Accuracy

GQA

The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced …

📊 1 results
📏 Metrics: Accuracy

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 1 results
📏 Metrics: VQA (ablation)

MM-Vet

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

📊 222 results
📏 Metrics: GPT-4 score, Params

MM-Vet v2

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

📊 17 results
📏 Metrics: GPT-4 score, Params

MMBench

MMBench is a multi-modality benchmark. It methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element …

📊 5 results
📏 Metrics: GPT-3.5 score

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 3 results
📏 Metrics: Test Accuracy, Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 2 results
📏 Metrics: Accuracy

MapEval-Visual

MapEval-Visual contains 400 image-question-answer triplets. Each question is paired with a snapshot from the Google Maps website. The task is the …

📊 1 results
📏 Metrics: Accuracy (%)

ViP-Bench

ViP-Bench is a comprehensive benchmark designed to assess the capability of multimodal models in understanding visual prompts across multiple dimensions. …

📊 13 results
📏 Metrics: GPT-4 score (bbox), GPT-4 score (human)

VisualMRC

VisualMRC is a visual machine reading comprehension dataset that proposes a task: given a question and a document image, a …

📊 1 results
📏 Metrics: CIDEr

VizWiz

The VizWiz-VQA dataset originates from a natural visual question answering setting where blind people each took an image and recorded …

📊 1 results
📏 Metrics: Accuracy

Visual Question Answering (VQA)

A-OKVQA

A-OKVQA is a crowdsourced visual question answering dataset composed of a diverse set of about 25K questions requiring a broad base …

📊 15 results
📏 Metrics: MC Accuracy, DA VQA Score

AI2D

AI2 Diagrams (AI2D) is a dataset of over 5000 grade school science diagrams with over 150000 rich annotations, their ground …

📊 4 results
📏 Metrics: EM

ActivityNet

The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. …

📊 1 results
📏 Metrics: ClipMatch@1, ClipMatch@5, Contains, ExactMatch, Follow-up ClipMatch@1, Follow-up ClipMatch@5, Follow-up Contains, Follow-up ExactMatch

AutoHallusion

Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module …

📊 3 results
📏 Metrics: Overall Accuracy

CLEVR

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; …

📊 15 results
📏 Metrics: Accuracy

CLEVR-Humans

We collect a new dataset of human-posed free-form natural language questions about CLEVR images. Many of these questions have out-of-vocabulary …

📊 5 results
📏 Metrics: Accuracy

CORE-MM

CORE-MM is an Open-ended VQA benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. CORE-MM benchmark …

📊 1 results
📏 Metrics: Abductive, Analogical, Deductive, Overall score, Params

DocVQA

DocVQA consists of 50,000 questions defined on 12,000+ document images. Source: DocVQA: A Dataset for VQA on Document Images

📊 1 results
📏 Metrics: ANLS

EgoSchema

EgoSchema is a very long-form video question-answering dataset and benchmark to evaluate long video understanding capabilities of modern vision and language …

📊 1 results
📏 Metrics: Acc

GQA

The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced …

📊 2 results
📏 Metrics: Accuracy

GRIT

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems …

📊 2 results
📏 Metrics: VQA (ablation), VQA (test)

HallusionBench

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvement …

📊 3 results
📏 Metrics: Question Pair Acc

IconQA

Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images in the daily-life context. Icon question …

📊 12 results
📏 Metrics: Sub-tasks (Img.), Sub-tasks (Txt.), Sub-tasks (Blank), Reasoning (Geo.), Reasoning (Cou.), Reasoning (Com.), Reasoning (Spa.), Reasoning (Sce.), Reasoning (Pat.), Reasoning (Tim.), Reasoning (Fra.), Reasoning (Est.), Reasoning (Alg.), Reasoning (Mea.), Reasoning (Sen.), Reasoning (Pro.)

IllusionVQA

IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in …

📊 7 results
📏 Metrics: Accuracy

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the …

📊 1 results
📏 Metrics: ClipMatch@1, ClipMatch@5, Contains, ExactMatch, Follow-up ClipMatch@1, Follow-up ClipMatch@5, Follow-up Contains, Follow-up ExactMatch

InfiMM-Eval

Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Although many benchmarks attempt to holistically …

📊 14 results
📏 Metrics: Overall score, Deductive, Abductive, Analogical, Params

InfoSeek

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with …

📊 6 results
📏 Metrics: Accuracy

InfographicVQA

InfographicVQA is a dataset that comprises a diverse collection of infographics along with natural language questions and answers annotations. The …

📊 21 results
📏 Metrics: ANLS

MM-Vet

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

📊 1 results
📏 Metrics: Acc

MSRVTT-QA

The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video …

📊 33 results
📏 Metrics: Accuracy

MSVD-QA

The MSVD-QA dataset is a Video Question Answering (VideoQA) dataset. It is based on the existing Microsoft Research Video Description …

📊 35 results
📏 Metrics: Accuracy

MVBench

MVBench is a comprehensive Multi-modal Video understanding Benchmark. It was introduced to evaluate the comprehension capabilities of Multi-modal Large Language …

📊 1 results
📏 Metrics: Acc

OK-VQA

Outside Knowledge Visual Question Answering (OK-VQA) includes more than 14,000 questions that require external knowledge to answer. Source: [OK-VQA: A …

📊 35 results
📏 Metrics: Accuracy, Exact Match (EM), Recall@5

OVAD benchmark

Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing …

📊 1 results
📏 Metrics: Contains w. Synonyms, ExactMatch w. Synonyms

PMC-VQA

PMC-VQA is a large-scale medical visual question-answering dataset that contains 227k VQA pairs of 149k images that cover various modalities …

📊 4 results
📏 Metrics: Accuracy

QLEVR

Synthetic datasets have successfully been used to probe visual question-answering datasets for their reasoning abilities. CLEVR, for example, tests a …

📊 5 results
📏 Metrics: Overall Accuracy

RetVQA

The RetVQA dataset is a large-scale dataset designed for Retrieval-Based Visual Question Answering (RetVQA). RetVQA is a more challenging task …

📊 1 results
📏 Metrics: Accuracy, Accuracy * Fluency

TDIUC

Task Directed Image Understanding Challenge (TDIUC) dataset is a Visual Question Answering dataset which consists of 1.6M questions and 170K …

📊 2 results
📏 Metrics: Accuracy

TGIF-QA

The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. …

📊 2 results
📏 Metrics: Accuracy

TextVQA

TextVQA is a dataset to benchmark visual reasoning based on text in images. TextVQA requires models to read and reason …

📊 1 results
📏 Metrics: Acc

VLM2-Bench

VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching. VLM²-Bench is the first comprehensive benchmark designed to evaluate …

📊 9 results
📏 Metrics: GC-mat, GC-trk, OC-cpr, OC-cnt, OC-grp, PC-cpr, PC-cnt, PC-grp, PC-VID, Average Score on VLM2-bench (9 subtasks)

VQA-CE

This dataset provides a new split of VQA v2 (similarly to VQA-CP v2), which is built of questions that are …

📊 9 results
📏 Metrics: Accuracy (Counterexamples)

VQA-CP

The VQA-CP dataset was constructed by reorganizing VQA v2 such that the correlation between the question type and correct answer …

📊 10 results
📏 Metrics: Score

Visual7W

Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers. Each question starts with one …

📊 4 results
📏 Metrics: Percentage correct

WHOOPS!

WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers …

📊 6 results
📏 Metrics: Exact Match, BEM

WebSRC

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K …

📊 1 results
📏 Metrics: EM

ZS-F-VQA

The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot problem. First, we obtain the original train/test …

📊 1 results
📏 Metrics: Top-1 Accuracy

Visual Storytelling

VIST

The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on …

📊 21 results
📏 Metrics: BLEU-4, CIDEr, METEOR, BLEU-1, BLEU-2, BLEU-3, ROUGE-L, SPICE, BLEURT, MLTD

Weakly Supervised Classification

ShARe/CLEF 2014: Task 2 Disorders

📊 1 results
📏 Metrics: F1

THYME-2016

📊 1 results
📏 Metrics: F1

Word Sense Disambiguation

FEWS

FEWS (Few-shot Examples of Word Senses) is a few-shot dataset for English Word Sense Disambiguation (WSD) gathered from Wiktionary, an …

📊 2 results
📏 Metrics: F1 (Zero-shot Dev), F1 (Zero-shot Test), F1 (Few-shot Dev), F1 (Few-shot Test)

RUSSE

WiC: The Word-in-Context Dataset A reliable benchmark for the evaluation of context-sensitive word embeddings. Depending on its context, an ambiguous …

📊 5 results
📏 Metrics: Accuracy

WiC-TSV

WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense …

📊 6 results
📏 Metrics: Task 1 Accuracy: all, Task 1 Accuracy: general purpose, Task 1 Accuracy: domain specific, Task 2 Accuracy: all, Task 2 Accuracy: general purpose, Task 2 Accuracy: domain specific, Task 3 Accuracy: all, Task 3 Accuracy: general purpose, Task 3 Accuracy: domain specific

Word Similarity

WS353

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: …

📊 3 results
📏 Metrics: Spearman's Rho

Workflow Discovery

ABCD

Action-Based Conversations Dataset (ABCD) is a fully labeled, goal-oriented dialogue dataset with over 10K human-to-human dialogues containing 55 distinct user intents …

📊 1 results
📏 Metrics: In-domain EM, In-domain CE, Cross-domain EM, Cross-domain CE

Zero-shot Sentiment Classification

AfriSenti

AfriSenti is the largest sentiment analysis dataset for under-represented African languages, covering 110,000+ annotated tweets in 14 African languages (Amharic, …

📊 5 results
📏 Metrics: weighted-F1 score

answerability prediction

PeerQA

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which …

📊 5 results
📏 Metrics: Macro F1