Machine Learning Benchmarks

Browse 216 benchmarks across 22 tasks
Browse by Category

10-shot image generation

FQL-Driving

FQL-driving

📊 1 results
📏 Metrics: 0-shot MRR

FlyingThings3D

FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along …

📊 1 results
📏 Metrics: 0..5sec

MEAD

Multi-view Emotional Audio-visual Dataset

📊 1 results
📏 Metrics: 12k

Music21

Music21 is an untrimmed video dataset crawled by keyword query from Youtube. It contains music performances belonging to 21 categories. …

📊 1 results
📏 Metrics: 0..5sec

Autonomous Vehicles

ApolloCar3D

ApolloCar3DT is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with …

📊 1 results
📏 Metrics: A3DP

Chart Question Answering

ChartQA

Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that …

📊 27 results
📏 Metrics: 1:1 Accuracy

PlotQA

PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and …

📊 6 results
📏 Metrics: 1:1 Accuracy

RealCQA

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic. Available on Hugging Face: https://huggingface.co/datasets/sal4ahm/RealCQA

📊 5 results
📏 Metrics: 1:1 Accuracy

Classification

Adult

Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records …

📊 1 results
📏 Metrics: AUROC

BIOSCAN_1M_Insect Dataset

In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect …

📊 2 results
📏 Metrics: Macro F1

BiasBios

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to …

📊 1 results
📏 Metrics: 1:1 Accuracy

BoolQ

BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring – they are …

📊 2 results
📏 Metrics: Test Accuracy

Brain Tumor MRI Dataset

This dataset is a combination of the following three datasets: figshare, SARTAJ dataset and Br35H. This dataset contains 7022 …

📊 1 results
📏 Metrics: F1 score

CIFAKE: Real and AI-Generated Synthetic Images

The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness. CIFAKE is a dataset that …

📊 1 results
📏 Metrics: Validation Accuracy

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists …

📊 1 results
📏 Metrics: Accuracy

CIFAR-10C

Common corruptions dataset for CIFAR10

📊 1 results
📏 Metrics: Accuracy on Brightness Corrupted Images

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it …

📊 1 results
📏 Metrics: Accuracy

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and …

📊 1 results
📏 Metrics: 10 fold Cross validation

Chest X-Ray Images (Pneumonia)

The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia …

📊 1 results
📏 Metrics: Accuracy

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across …

📊 3 results
📏 Metrics: AUC, Accuracy

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

📊 7 results
📏 Metrics: F1-score (weighted)

HOWS

HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset especially designed for object classification on mobile …

📊 1 results
📏 Metrics: Overall accuracy after last sequence

HRF

The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. …

📊 1 results
📏 Metrics: Accuracy

IRFL: Image Recognition of Figurative Language

The IRFL dataset consists of idioms, similes, and metaphors with matching figurative and literal images, as well as two novel …

📊 1 results
📏 Metrics: 1-of-100 Accuracy

ISIC 2019

The goal for ISIC 2019 is to classify dermoscopic images among nine different diagnostic categories. 25,331 images are available for training across …

📊 1 results
📏 Metrics: Balanced Multi-Class Accuracy

ImageNet C-OOD (class-out-of-distribution)

This dataset was presented as part of the ICLR 2023 paper A framework for benchmarking Class-out-of-distribution detection and its application …

📊 5 results
📏 Metrics: Detection AUROC (severity 0), Detection AUROC (severity 5), Detection AUROC (severity 10)

InDL

In this work, we introduce the In-Diagram Logic (InDL) dataset, an innovative resource crafted to rigorously evaluate the …

📊 9 results
📏 Metrics: Average Recall

LES-AV

This data set comprises 22 fundus images with their corresponding manual annotations for the blood vessels, separated as arteries and …

📊 1 results
📏 Metrics: Accuracy

Liver-US

The Liver-US dataset is a comprehensive collection of high-quality ultrasound images of the liver, including both normal and abnormal cases. …

📊 1 results
📏 Metrics: AUC

MHIST

The minimalist histopathology image analysis dataset (MHIST) is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each …

📊 6 results
📏 Metrics: Accuracy

MedSecId

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are …

📊 1 results
📏 Metrics: 1 shot Micro-F1

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38,000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Accuracy, MCC

MuReD Dataset

Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. …

📊 1 results
📏 Metrics: ML F1, ML mAP, ML AUC

N-CARS

A large real-world event-based dataset for object classification. Source: HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification

📊 6 results
📏 Metrics: Accuracy (%), Architecture, Representation, Representation Time( ms / 100ms events), Inference Time, Params (M)

N-ImageNet

The N-ImageNet dataset is an event-camera counterpart for the ImageNet dataset. The dataset is obtained by moving an event camera …

📊 9 results
📏 Metrics: Accuracy (%)

RITE

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries …

📊 1 results
📏 Metrics: Accuracy

RSSCN7

The RSSCN7 dataset contains satellite images acquired from Google Earth, originally collected for remote sensing scene classification. We …

📊 1 results
📏 Metrics: 1:1 Accuracy

RTE

The Recognizing Textual Entailment (RTE) datasets come from a series of textual entailment challenges. Data from RTE1, RTE2, RTE3 and …

📊 2 results
📏 Metrics: Test Accuracy

SGD

The Schema-Guided Dialogue (SGD) dataset consists of over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. …

📊 1 results
📏 Metrics: F1 (Seqeval)

SHD - Adding

This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike encoded digits sampled …

📊 3 results
📏 Metrics: Accuracy (%)

SPOT-10

The SPOTS-10 dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 …

📊 9 results
📏 Metrics: Accuracy

SST-2

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the …

📊 2 results
📏 Metrics: Test Accuracy

Sentiment140

Sentiment140 is a dataset that allows you to discover the sentiment of a brand, product, or topic on Twitter. Source: …

📊 1 results
📏 Metrics: Accuracy

SimGas

This dataset consists of computer-generated images for gas leakage segmentation. It features diverse backgrounds, interfering foreground objects, and precise ground …

📊 1 results
📏 Metrics: Frame Level Accuracy

Sound-based drone fault classification using multitask learning

arXiv: https://arxiv.org/abs/2304.11708. Accepted at the 29th International Congress on Sound and Vibration (ICSV29). The drone has been used for various …

📊 1 results
📏 Metrics: macro f1 score (A(100), B(100), C(100) Avg.)

TACM12K

Table-ACM12K (TACM12K) is a relational table dataset derived from the ACM heterogeneous graph dataset. It includes four tables: papers, authors, …

📊 1 results
📏 Metrics: Accuracy

TCGA

📊 1 results
📏 Metrics: AUPRC, AUROC

TLF2K

Table-LastFm2K (TLF2K) is a relational table dataset derived from the classical LastFM2K dataset. It contains three tables: artists, user_artists, and …

📊 1 results
📏 Metrics: Accuracy

TML1M

Table-MovieLens1M (TML1M) is a relational table dataset derived from the classical MovieLens1M dataset. It consists of three tables: users, movies, …

📊 1 results
📏 Metrics: Accuracy

WSC

The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a …

📊 2 results
📏 Metrics: Test Accuracy

WiC

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each …

📊 2 results
📏 Metrics: Test Accuracy

XImageNet-12

Enlarge the dataset to understand how image background affects the Computer Vision ML model. With the following topics: Blur Background …

📊 3 results
📏 Metrics: Robustness Score

Code Completion

Defects4J

Defects4J is a collection of reproducible bugs and a supporting infrastructure with the goal of advancing software engineering research. Defects4J …

📊 2 results
📏 Metrics: Compilation Rate, Pass@1, BLEU

DotPrompts

DotPrompts is a set of testcases derived from PragmaticCode, such that each testcase consists of a prompt to a dereference …

📊 2 results
📏 Metrics: Compilation Rate

SAFIM

Syntax-Aware Fill-in-the-Middle (SAFIM) is a benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. SAFIM has …

📊 15 results
📏 Metrics: Average, Algorithmic, Control, API

Code Documentation Generation

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 7 results
📏 Metrics: Smoothed BLEU-4

Code Generation

APPS

The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS …

📊 18 results
📏 Metrics: Introductory Pass@1, Interview Pass@1, Competition Pass@1, Competition Pass@any, Interview Pass@any, Introductory Pass@any, Competition Pass@5, Interview Pass@5, Introductory Pass@5, Competition Pass@1000, Interview Pass@1000, Introductory Pass@1000, Pass@1

CONCODE

A new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new …

📊 2 results
📏 Metrics: Exact Match, BLEU, CodeBLEU

CoNaLa

CMU CoNaLa, the Code/Natural Language Challenge dataset, is a joint project of the Carnegie Mellon University NeuLab and Strudel

📊 7 results
📏 Metrics: BLEU, Exact Match Accuracy

CoNaLa-Ext

The CoNaLa Extended With Question Text is an extension to the original CoNaLa Dataset (Papers With Code Link) proposed in …

📊 5 results
📏 Metrics: BLEU

CodeContests

CodeContests is a competitive programming dataset for machine-learning. This dataset was used when training AlphaCode. It consists of programming problems, …

📊 8 results
📏 Metrics: Test Set pass@1, Test Set pass@5, Val Set pass@1, Val Set pass@5

DSEval-LeetCode

In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are …

📊 1 results
📏 Metrics: Pass Rate, w/o Intact, w/o PE

Django

The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development and 1,805 test annotations. Each …

📊 5 results
📏 Metrics: Accuracy, BLEU Score

FloCo

The FloCo dataset contains 11,884 flowchart images and their corresponding Python code.

📊 1 results
📏 Metrics: BLEU, CodeBLEU

HumanEval

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained …

📊 7 results
📏 Metrics: Pass@1

HumanEval-ET

Extension test cases of HumanEval, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, …

📊 95 results
📏 Metrics: Accuracy

MBPP-ET

Extension test cases of MBPP, as well as generated code.

📊 2 results
📏 Metrics: Pass@1

PECC

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving …

📊 8 results
📏 Metrics: Pass@3

RES-Q

RES-Q is a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks …

📊 9 results
📏 Metrics: pass@1

Shellcode_IA32

Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources; it is the largest collection of shellcodes …

📊 3 results
📏 Metrics: BLEU-4, Exact Match Accuracy

TACO-BAAI

TACO (Topics in Algorithmic Code generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more …

📊 3 results
📏 Metrics: easy pass@1

Turbulence

Turbulence is a new benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code …

📊 5 results
📏 Metrics: CorrSc

Verified Smart Contract Code Comments

Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both …

📊 2 results
📏 Metrics: BLEU score

VerilogEval

The VerilogEval Dataset is a benchmark specifically designed to assess the ability of large language models (LLMs) …

📊 1 results
📏 Metrics: Pass Rate

WebApp1K-React

Test-driven benchmark that challenges LLMs to write JavaScript React applications.

📊 8 results
📏 Metrics: pass@1

WebApp1k-Duo-React

Test-driven benchmark that challenges LLMs to write long JavaScript React applications.

📊 6 results
📏 Metrics: pass@1

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 10 results
📏 Metrics: Execution Accuracy, Exact Match Accuracy

Code Search

CoDesc

CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and …

📊 3 results
📏 Metrics: Test MRR

CoIR

CoIR (Code Information Retrieval) benchmark, is designed to evaluate code retrieval capabilities. CoIR includes 10 curated code datasets, covering 8 …

📊 1 results
📏 Metrics: nDCG@10

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 6 results
📏 Metrics: Overall, Go, Ruby, Python, Java, JS, PHP

EditCompletion

C# EditCompletion

We scraped the 53 most popular C# repositories from GitHub and extracted all commits since the beginning of the project’s …

📊 7 results
📏 Metrics: Accuracy

Motion Synthesis

AIOZ-GDANCE

AIOZ-GDANCE comprises 16.7 hours of whole-body motion and music audio of group dancing. The duration of each video in our …

📊 4 results
📏 Metrics: FID, MMC, GenDiv, PFC, GMR, GMC, TIF

AIST++

AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance …

📊 12 results
📏 Metrics: FID, Beat alignment score

BRACE

BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task: - strong music-dance correlation - …

📊 3 results
📏 Metrics: Frechet Inception Distance, Beat alignment score, Beat DTW cost, Footwork average, Powermove average, Toprock average

FineDance

📊 7 results
📏 Metrics: fid_k, BAS

HumanAct12

HumanAct12 is a new 3D human motion dataset adopted from the polar image and 3D pose dataset PHSPD, with proper …

📊 2 results
📏 Metrics: Accuracy, FID, Multimodality

HumanML3D

HumanML3D is a 3D human motion-language dataset that originates from a combination of HumanAct12 and Amass dataset. It covers a …

📊 35 results
📏 Metrics: FID, R Precision Top3, Diversity, Multimodality

Inter-X

Inter-X is a large-scale dataset containing ~11K interaction sequences, more than 8.1M frames and 34K fine-grained human textual descriptions.

📊 5 results
📏 Metrics: FID, R-Precision Top3, MMDist, MModality

InterHuman

InterHuman is a multimodal dataset. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal …

📊 8 results
📏 Metrics: FID, R-Precision Top3, MMDist, MModality

KIT Motion-Language

The KIT Motion-Language is a dataset linking human motion and natural language. Source: The KIT Motion-Language Dataset

📊 29 results
📏 Metrics: FID, R Precision Top3, Diversity, Multimodality

LaFAN1

Ubisoft La Forge Animation Dataset ("LAFAN1"): dataset and accompanying code for the SIGGRAPH 2020 paper …

📊 4 results
📏 Metrics: L2Q@5, L2Q@15, L2Q@30, L2P@5, L2P@15, L2P@30, NPSS@5, NPSS@15, NPSS@30

Motion-X

Motion-X is a large-scale 3D expressive whole-body motion dataset, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering …

📊 4 results
📏 Metrics: FID, TMR-R-Precision Top3, TMR-Matching Score, MModality, Diversity

TMD

The Text-Music-Dance (TMD) dataset establishes a pioneering benchmark comprising 2,153 text-music-motion pairs. Dance motions and corresponding text annotations are sourced …

📊 1 results
📏 Metrics: FID, BAS, MModality, MMDist

Trinity Speech-Gesture Dataset

Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …

📊 1 results
📏 Metrics: Mean Opinion Score

Nature-Inspired Optimization Algorithm

CIFAR-10

The CIFAR-10 database (Canadian Institute For Advanced Research database) is a large collection of natural color images. It has a …

📊 2 results
📏 Metrics: training time (s)

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has …

📊 2 results
📏 Metrics: training time (s)

OpenAPI code completion

OpenAPI completion refined

A human-refined dataset of OpenAPI definitions based on the APIs.guru OpenAPI directory. The dataset was collected from the APIs.guru OpenAPI …

📊 4 results
📏 Metrics: Correctness avg. (%), Correctness max. (%), Validness avg. (%), Validness max. (%)

Physical Simulations

4D-DRESS

4D-DRESS is the first real-world 4D dataset of human clothing, capturing 64 human outfits in more than 520 motion sequences. …

📊 12 results
📏 Metrics: Chamfer (cm), Stretching Energy

Program Repair

DeepFix

DeepFix consists of a program repair dataset (fix compiler errors in C programs). It enables research around automatically fixing programming …

📊 4 results
📏 Metrics: Average Success Rate

GitHub-Python

Repair AST parse (syntax) errors in Python code

📊 2 results
📏 Metrics: Accuracy (%)

HumanEvalPack

HumanEvalPack is an extension of OpenAI's HumanEval to cover 6 total languages across 3 tasks. The evaluation suite is fully …

📊 1 results
📏 Metrics: Pass@1

Reinforcement Learning (RL)

ProcGen

Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns …

📊 2 results
📏 Metrics: Mean Normalized Performance

Remote Sensing Image Classification

FireRisk

In this work, we propose a novel remote sensing dataset, FireRisk, consisting of 7 fire risk classes with a total …

📊 4 results
📏 Metrics: Accuracy (%)

SQL-to-Text

WikiSQL

WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further …

📊 2 results
📏 Metrics: BLEU-4

Semantic Segmentation

ACDC Scribbles

We release expert-made scribble annotations for the medical ACDC dataset [1]. The released data must be considered as extending the …

📊 6 results
📏 Metrics: Dice (Average)

ADE20K

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. …

📊 229 results
📏 Metrics: Validation mIoU, Test Score, Params (M), GFLOPs (512 x 512), GFLOPs, Mean IoU (class)

AI-TOD

AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in …

📊 2 results
📏 Metrics: Dice

AIRS

The AIRS (Aerial Imagery for Roof Segmentation) dataset provides a wide coverage of aerial imagery with 7.5 cm resolution and …

📊 1 results
📏 Metrics: IoU

ATLANTIS

ATLANTIS is a benchmark for semantic segmentation of waterbody images. This dataset covers a wide range of natural waterbodies such …

📊 1 results
📏 Metrics: A-acc, A-mIoU, Accuracy, mIoU

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China …

📊 2 results
📏 Metrics: mIoU

BIG

A high-resolution semantic segmentation dataset with 50 validation and 100 test objects. Image resolution in BIG ranges from 2048×1600 to …

📊 4 results
📏 Metrics: mBA, IoU

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that …

📊 4 results
📏 Metrics: mIoU

CEMS-W

The dataset includes annotations for burned area delineation and land cover segmentation, with a focus on European soil. The dataset …

📊 3 results
📏 Metrics: mIoU

COCO (Common Objects in Context)

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to …

📊 9 results
📏 Metrics: mIoU

COCO-Stuff

The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks like semantic segmentation, object detection and …

📊 1 results
📏 Metrics: F.W. IU, Per-Class Accuracy, Pixel Accuracy, mIoU

Cam2BEV

The dataset contains two subsets of synthetic, semantically segmented road-scene images, which have been created for developing and applying the …

📊 1 results
📏 Metrics: Mean IoU

CamVid

CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with …

📊 20 results
📏 Metrics: Mean IoU, Global Accuracy

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense …

📊 2 results
📏 Metrics: mIoU, Pixel Accuracy

Cityscapes 3D

Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. …

📊 1 results
📏 Metrics: mIoU

Cityscapes VIPriors subset

The training and validation data are subsets of the training split of the Cityscapes dataset. The test set is taken …

📊 1 results
📏 Metrics: Accuracy, mIoU

DADA-seg

DADA-seg is a pixel-wise annotated accident dataset, which contains a variety of critical scenarios from traffic accidents. It is used …

📊 27 results
📏 Metrics: mIoU

DDD17

DDD17 has over 12 h of a 346x260 pixel DAVIS sensor recording highway and city driving in daytime, evening, night, …

📊 9 results
📏 Metrics: mIoU

DELIVER

DELIVER is an arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, the dataset is …

📊 9 results
📏 Metrics: mIoU, test mIoU

DIVA-HisDB

The database consists of 150 annotated pages of three different medieval manuscripts with challenging layouts. Furthermore, we provide a layout …

📊 2 results
📏 Metrics: Mean IoU (class)

DSEC

DSEC is a stereo camera dataset in driving scenarios that contains data from two monochrome event cameras and two global …

📊 9 results
📏 Metrics: mIoU

Dark Zurich

Dark Zurich is an image dataset containing a total of 8779 images captured at nighttime, twilight, and daytime, along with …

📊 14 results
📏 Metrics: mIoU

DensePASS

DensePASS - a novel densely annotated dataset for panoramic segmentation under cross-domain conditions, specifically built to study the Pinhole-to-Panoramic transfer …

📊 35 results
📏 Metrics: mIoU

DroneDeploy

From DroneDeploy: We’ve collected a dataset of aerial orthomosaics and elevation images. These have been annotated into 6 different classes: …

📊 1 results
📏 Metrics: Mean IoU (test), Mean IoU (val)

Endoscapes

Cholecystectomy is a very common abdominal surgical procedure almost ubiquitously performed with a laparoscopic approach, hence guided by an endoscopic …

📊 2 results
📏 Metrics: Mean F1

FLAIR (French Land cover from Aerospace ImageRy)

The French National Institute of Geographical and Forest Information (IGN) has the mission to document and measure land-cover on French …

📊 4 results
📏 Metrics: mIoU

FMB Dataset

FMB contains 1500 well-registered infrared and visible image pairs with 14 annotated pixel-level categories. Also, it covers a wide range …

📊 13 results
📏 Metrics: mIoU

Fine-Grained Cloud Segmentation Dataset

The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat 8 OLI and TIRS, covering diverse biomes. This variety supports …

📊 3 results
📏 Metrics: mIoU

Fine-Grained Grass Segmentation Dataset

The dataset was created using high-resolution (8 m) satellite imagery from the Gaofen series (Gaofen-2 and Gaofen-6), captured in 2019 …

📊 9 results
📏 Metrics: mIoU

FoodSeg103

FoodSeg103 is a new food image dataset containing 7,118 images. Images are annotated with 104 ingredient classes and each image …

📊 7 results
📏 Metrics: mIoU

Forward-Looking Sonar Marine Debris Datasets

This dataset is made up of forward-looking sonar images containing ten classes of underwater debris. The dataset can be used …

📊 1 results
📏 Metrics: mIOU

Freiburg Forest

The Freiburg Forest dataset was collected using a Viona autonomous mobile robot platform equipped with cameras for capturing multi-spectral and …

📊 2 results
📏 Metrics: Mean IoU

HAM10000

HAM10000 is a dataset of 10000 training images for detecting pigmented skin lesions. The authors collected dermatoscopic images from different …

📊 1 results
📏 Metrics: Average Dice, Average IOU

HERA RFI Detection

This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South …

📊 2 results
📏 Metrics: AUPRC, AUROC, F1

Hypersim

For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. …

📊 5 results
📏 Metrics: mIoU, mIoU (test)

INRIA Aerial Image Labeling

The INRIA Aerial Image Labeling dataset is comprised of 360 RGB tiles of 5000×5000px with a spatial resolution of 30cm/px …

📊 6 results
📏 Metrics: IoU, mIOU

ISPRS Potsdam

The data set contains 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a …

📊 17 results
📏 Metrics: Overall Accuracy, Mean F1, Mean IoU

ISPRS Vaihingen

The data set contains 33 patches (of different sizes), each consisting of a true orthophoto (TOP) extracted from a larger …

📊 10 results
📏 Metrics: Overall Accuracy, Average F1, Category mIoU

ImageNet-S

Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two …

📊 20 results
📏 Metrics: mIoU (val), mIoU (test)

KITTI-360

KITTI-360 is a large-scale dataset that contains rich sensory information and full annotations. It is the successor of the popular …

📊 14 results
📏 Metrics: mIoU

Kvasir-Instrument

Consists of annotated frames containing GI procedure tools such as snares, balloons and biopsy forceps, etc. Besides the images, …

📊 2 results
📏 Metrics: DSC, mIoU

LOFAR RFI Detection

This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South …

📊 2 results
📏 Metrics: AUPRC, AUROC, F1

LaRS

LaRS is the largest and most diverse panoptic maritime obstacle detection dataset. Highlights: * Diverse scenes from manual capture, public …

📊 20 results
📏 Metrics: Q, F1, μ, mIoU

LoveDA

1. 5987 high spatial resolution (0.3 m) remote sensing images from Nanjing, Changzhou, and Wuhan 2. Focus on different geographical …

📊 16 results
📏 Metrics: Category mIoU

MCubeS

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four …

📊 21 results
📏 Metrics: mIoU

MCubeS (P)

Multimodal material segmentation (MCubeS) dataset contains 500 sets of images from 42 street scenes. Each scene has images for four …

📊 8 results
📏 Metrics: mIoU

MUSES: MUlti-SEnsor Semantic perception dataset

MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types …

📊 2 results
📏 Metrics: mIoU

Matterport3D

The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside …

📊 4 results
📏 Metrics: Test mIoU, Validation mIoU

Mila Simulated Floods

Mila Simulated Floods Dataset is a 1.5 square km virtual world using the Unity3D game engine including urban, suburban and …

📊 1 results
📏 Metrics: mIoU

MixedWM38

MixedWM38 Dataset (WaferMap) has more than 38,000 wafer maps, including 1 normal pattern, 8 single defect patterns, and 29 mixed defect …

📊 1 results
📏 Metrics: Dice, Mean IoU

Montgomery County X-ray Set

X-ray images in this data set have been acquired from the tuberculosis control program of the Department of Health and Human …

📊 3 results
📏 Metrics: F1-score

Nighttime Driving

Nighttime Driving is a dataset of road scenes consisting of 35,000 images ranging from daytime to twilight time and to …

📊 12 results
📏 Metrics: mIoU

OpenEDS

OpenEDS (Open Eye Dataset) is a large scale data set of eye-images captured using a virtual-reality (VR) head mounted display …

📊 1 results
📏 Metrics: mIOU

PASCAL Context

The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for …

📊 62 results
📏 Metrics: mIoU, Mean Accuracy, Pixel Accuracy

PASCAL VOC

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, …

📊 1 results
📏 Metrics: mIoU

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: Person: person …

📊 2 results
📏 Metrics: Mean IoU

PASCAL VOC 2011

PASCAL VOC 2011 is an image segmentation dataset. It contains around 2,223 images for training, consisting of 5,034 objects. Testing …

📊 1 results
📏 Metrics: Mean IoU

PASCAL VOC 2012 test

SCC Data Set

📊 51 results
📏 Metrics: Mean IoU, FLOPS, Params

PASTIS

PASTIS is a benchmark dataset for panoptic and semantic segmentation of agricultural parcels from satellite image time series. It is …

📊 3 results
📏 Metrics: Mean IoU (test), Number of Params, Overall Accuracy

PASTIS-R

Extension of the PASTIS benchmark with radar and optical image time series.

📊 1 results
📏 Metrics: IoU

PETRAW

PETRAW data set was composed of 150 sequences of peg transfer training sessions. The objective of the peg transfer session …

📊 4 results
📏 Metrics: Mean IoU (class)

PH2

The increasing incidence of melanoma has recently promoted the development of computer-aided diagnosis systems for the classification of dermoscopic images. …

📊 2 results
📏 Metrics: Average Dice, Average IOU

Pothole Mix

This dataset for the semantic segmentation of potholes and cracks on the road surface was assembled from 5 other datasets …

📊 7 results
📏 Metrics: Test Dice Multiclass, Test mIoU, Validation Dice Multiclass, Validation mIoU

Potsdam

https://paperswithcode.com/sota/semantic-segmentation-on-isprs-potsdam

📊 3 results
📏 Metrics: mIoU

RUGD

A Video Dataset for Visual Perception and Autonomous Navigation in Unstructured Environments. Website: http://rugd.vision/ The RUGD dataset focuses on semantic …

📊 1 results
📏 Metrics: AIOU, mIoU

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean …

📊 5 results
📏 Metrics: mIoU

S3DIS

The Stanford 3D Indoor Scene Dataset (S3DIS) dataset contains 6 large-scale indoor areas with 271 rooms. Each point in the …

📊 50 results
📏 Metrics: Mean IoU, mAcc, oAcc, FLOPs, Number of params, mIoU, Params (M)

SBCoseg

The SBCoseg dataset includes 889 groups of images and each group consists of 18 images with a common object, leading …

📊 1 results
📏 Metrics: Jaccard

STARE

The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) …

📊 1 results
📏 Metrics: AUC

SWIMSEG

The SWIMSEG dataset contains 1013 images of sky/cloud patches, along with their corresponding binary segmentation maps. The ground truth annotation …

📊 1 results
📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SWINSEG

The SWINSEG dataset contains 115 nighttime images of sky/cloud patches along with their corresponding binary ground truth maps. The ground …

📊 1 results
📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SWINySEG

The SWINySEG dataset contains 6768 daytime- and nighttime-images of sky/cloud patches along with their corresponding binary ground truth maps. The …

📊 1 results
📏 Metrics: Average Precision, Average Recall, F1-Score, MCC, Mean IoU

SYNTHIA

The SYNTHIA dataset is a synthetic dataset that consists of 9400 multi-viewpoint photo-realistic frames rendered from a virtual city and …

📊 2 results
📏 Metrics: mIoU

ScanNet

ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled …

📊 44 results
📏 Metrics: val mIoU, test mIoU

Semantic3D

Semantic3D is a point cloud dataset of scanned outdoor scenes with over 3 billion points. It contains 15 training and …

📊 13 results
📏 Metrics: mIoU, oAcc

SemanticPOSS

The SemanticPOSS dataset for 3D semantic segmentation contains 2988 varied and complicated LiDAR scans with a large quantity of dynamic instances. …

📊 1 results
📏 Metrics: Mean IoU

ShapeNet

ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the …

📊 4 results
📏 Metrics: Mean IoU

SpaceNet 1

SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data is comprised of 382,534 building footprints, …

📊 10 results
📏 Metrics: Mean IoU

Structured3D

Structured3D is a large-scale photo-realistic dataset containing 3.5K house designs (a) created by professional designers with a variety of ground …

📊 4 results
📏 Metrics: Test mIoU, Validation mIoU

Trans10K

A large-scale dataset for transparent object segmentation, named Trans10K, consisting of 10,428 images of real scenarios with careful manual annotations, …

📊 14 results
📏 Metrics: mIoU, GFLOPs

UAVid

UAVid is a high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving …

📊 6 results
📏 Metrics: Mean IoU

UPLight

UPLight is an underwater RGB-Polarization multimodal semantic segmentation dataset with 12 typical underwater semantic classes.

📊 6 results
📏 Metrics: mIoU

VDD

Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential semantic details to …

📊 7 results
📏 Metrics: mIoU

WildDash

WildDash is a benchmark evaluation method that uses the meta-information to calculate the robustness of a given algorithm …

📊 1 results
📏 Metrics: Mean IoU

ZJU-RGB-P

Research on semantic segmentation of traffic scenes using color and polarization information (including training and testing sets).

📊 13 results
📏 Metrics: mIoU, Frame (fps)

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID are the same as …

📊 15 results
📏 Metrics: mIoU

Source Code Summarization

CoDesc

CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and …

📊 1 results
📏 Metrics: BLEU-4

CodeSearchNet

The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and …

📊 1 results
📏 Metrics: F1

DeepCom-Java

The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.

📊 2 results
📏 Metrics: BLEU-4, METEOR

Java scripts

The Java dataset introduced in Hybrid-DeepCom (Deep code comment generation with hybrid lexical and syntactical information), commonly used to evaluate …

📊 1 results
📏 Metrics: BLEU-4, METEOR

ParallelCorpus-Python

The Python dataset introduced in the Parallel Corpus paper ([A Parallel Corpus of Python Functions and Documentation Strings for Automated …

📊 2 results
📏 Metrics: BLEU-4, METEOR

Text Generation

CNN/Daily Mail

CNN/Daily Mail is a dataset for text summarization. Human generated abstractive summary bullets were generated from news stories in CNN …

📊 1 results
📏 Metrics: ROUGE-L

COCO Captions

COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …

📊 4 results
📏 Metrics: BLEU-2, BLEU-3, BLEU-4, BLEU-5

CSL

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …

📊 1 results
📏 Metrics: ROUGE-L

CommonGen

CommonGen is constructed through a combination of crowdsourced and existing caption corpora, and consists of 79k commonsense descriptions over 35k unique …

📊 4 results
📏 Metrics: CIDEr, METEOR, BLEU-4, SPICE

Czech restaurant information

Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It …

📊 3 results
📏 Metrics: METEOR

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …

📊 3 results
📏 Metrics: BLEU, METEOR, FactSpotter

DailyDialog

DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 …

📊 1 results
📏 Metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4

HarmfulQA

As a part of our research efforts toward making LLMs safer for public …

📊 1 results
📏 Metrics: ASR

LCSTS

LCSTS is a large Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which …

📊 1 results
📏 Metrics: ROUGE-L

OpenWebText

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …

📊 2 results
📏 Metrics: eval_loss

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …

📊 4 results
📏 Metrics: BLEU-1, Perplexity

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

📊 4 results
📏 Metrics: Distinct-3, Distinct-4, Distinct-2, Perplexity

SciQ

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in …

📊 3 results
📏 Metrics: Accuracy

Text-To-SQL

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents a pioneering, cross-domain dataset that examines the impact of extensive …

📊 16 results
📏 Metrics: Execution Accuracy % (Test), Execution Accuracy % (Dev), Execution Accuracy (Human)

KaggleDBQA

KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and …

📊 2 results
📏 Metrics: Exact Match (EM)

SEDE

SEDE is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written …

📊 1 results
📏 Metrics: PCM-F1 (dev), PCM-F1 (test)

SParC

SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces …

📊 6 results
📏 Metrics: interaction match accuracy, question match accuracy

SQL-Eval

SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed based on Spider. The original link can be found …

📊 1 results
📏 Metrics: Execution Accuracy

Spider 2.0

Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various …

📊 8 results
📏 Metrics: Success Rate

Type prediction

ManyTypes4TypeScript

Type inference dataset for TypeScript.

📊 7 results
📏 Metrics: Average Accuracy, Average Precision, Average Recall, Average F1