Machine Learning Benchmarks

Browse 74 benchmarks across 20 tasks
Browse by Category

1 Image, 2*2 Stitchi

FQL-Driving

FQL-driving

1 result
Metrics: 0..5sec

10-shot image generation

FQL-Driving

FQL-driving

1 result
Metrics: 0-shot MRR

FlyingThings3D

FlyingThings3D is a synthetic dataset for optical flow, disparity and scene flow estimation. It consists of everyday objects flying along …

1 result
Metrics: 0..5sec

MEAD

Multi-view Emotional Audio-visual Dataset

1 result
Metrics: 12k

Music21

Music21 is an untrimmed video dataset crawled by keyword query from YouTube. It contains music performances belonging to 21 categories. …

1 result
Metrics: 0..5sec

2D Human Pose Estimation

COCO-WholeBody

COCO-WholeBody is an extension of the COCO dataset with whole-body annotations. There are 4 types of bounding boxes (person box, face …

14 results
Metrics: WB, body, foot, face, hand

Human-Art

Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It includes twenty high-quality human …

10 results
Metrics: AP, AP (gt bbox), Validation AP

OCHuman

This dataset focuses on heavily occluded humans with comprehensive annotations including bounding boxes, human poses, and instance masks. This dataset contains …

10 results
Metrics: Test AP, Validation AP

3D Absolute Human Pose Estimation

Human3.6M

The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding …

4 results
Metrics: MRPE, Average MPJPE (mm), PA-MPJPE

Breast Cancer Histology Image Classification

BreakHis

The Breast Cancer Histopathological Image Classification (BreakHis) is composed of 9,109 microscopic images of breast tumor tissue collected from 82 …

3 results
Metrics: Accuracy (%), 1:1 Accuracy, Accuracy (Inter-Patient)

Causal Inference

IHDP

The Infant Health and Development Program (IHDP) is a randomized controlled study designed to evaluate the effect of home visit …

9 results
Metrics: Average Treatment Effect Error

Jobs

The Jobs dataset by LaLonde [36] is a widely used benchmark in the causal inference community, where the treatment is …

3 results
Metrics: Average Treatment Effect on the Treated Error

Entity Alignment

DBP1M FR-EN

A large-scale cross-lingual dataset for entity alignment

2 results
Metrics: Hit@1

DBP2.0 zh-en

The DBP2.0 dataset can be downloaded from the figshare repository. It has three entity alignment settings, i.e., ZH-EN, JA-EN and …

2 results
Metrics: dangling entity detection F1, Entity Alignment (Consolidated) F1

Explainable Artificial Intelligence (XAI)

ADNI

Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment …

1 result
Metrics: AD-Related Brain Areas Identified

Knowledge Base Population

LM-KBC 2023

A diverse set of 21 relations, each covering a different set of subject-entities and a complete list of ground truth …

1 result
Metrics: F1

Knowledge Graph Completion

DBP-5L (English)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

3 results
Metrics: MRR

DBP-5L (Greek)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

3 results
Metrics: MRR

DBP-5L (French)

DBP-5L is a multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used …

3 results
Metrics: MRR

FB15k-237

FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, …

3 results
Metrics: Hits@10, Hits@1, Hits@3, MR, MRR

WN18RR

WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations …

2 results
Metrics: Hits@3, Hits@1, Hits@10

Knowledge Graph Embedding

FB15k

The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of …

1 result
Metrics: MRR

Knowledge Graphs

JerichoWorld

JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive …

5 results
Metrics: Set accuracy

MARS (Multimodal Analogical Reasoning dataSet)

Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus …

8 results
Metrics: MRR

MMLU

MMLU-Pro

The MMLU-Pro dataset is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark. It's designed to be more …

1 result
Metrics: 0-shot MRR

Mathematical Question Answering

GeoS

GeoS is a dataset for automatic math problem solving. It is a dataset of SAT plane geometry questions where every …

1 result
Metrics: Accuracy (%)

Geometry3K

A new large-scale geometry problem-solving dataset - 3,002 multi-choice geometry problems - dense annotations in formal language for the diagrams …

8 results
Metrics: Accuracy (%)

Mathematical Reasoning

GeoQA

GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the …

2 results
Metrics: Accuracy (%)

PGPS9K

PGPS9K is a new large-scale plane geometry problem-solving dataset, labeled with both fine-grained diagram annotations and interpretable solution programs.

6 results
Metrics: Completion accuracy

Multi-hop Question Answering

MuSiQue-Ans

MuSiQue-Ans is a new multihop QA dataset with ~25K 2-4 hop questions using seed questions from 5 existing single-hop datasets.

1 result
Metrics: An, Sp

Multi-modal Entity Alignment

MMKG

MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph …

1 result
Metrics: H@1

Recommendation Systems

Amazon Beauty

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links …

5 results
Metrics: Hit@10, nDCG@10, NDCG

Amazon Fashion

This dataset is a subset of the Amazon reviews dataset that contains fashion-related products.

4 results
Metrics: HitRatio@10 (100 Neg. Samples), nDCG@10 (100 Neg. Samples), AUC, nDCG@10 (500 Neg. Samples), Hit@10, NDCG

Amazon Men

This dataset is a subset of the Amazon reviews dataset that contains men-related products.

3 results
Metrics: Hit@10, nDCG@10, NDCG

Amazon Product Data

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This …

1 result
Metrics: AUC, F1

Amazon-Book

N/A

15 results
Metrics: nDCG@20, Recall@20, HR@10, NDCG@10, HR@50, NDCG@50

Ciao

The Ciao dataset contains ratings given by users to items, and also contains item category information. The data comes …

1 result
Metrics: Hits@10, Hits@20, nDCG@10, nDCG@20

Delicious

Delicious: This data set contains tagged web pages retrieved from the website delicious.com. Source: [Text segmentation on multilabel documents: …

1 result
Metrics: NDCG, Recall@20

Douban

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based …

5 results
Metrics: RMSE, NDCG, Recall@20, AUC, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Epinions

The Epinions dataset is built from a who-trusts-whom online social network of the general consumer review site Epinions.com. Members of …

4 results
Metrics: MAE, RMSE, MAP@20, MRR@20, NDCG@20

Gowalla

Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and …

13 results
Metrics: nDCG@20, Recall@20, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Pinterest

The Pinterest dataset contains more than 1 million images associated with the Pinterest users who have “pinned” them. Source: https://openaccess.thecvf.com/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf

1 result
Metrics: nDCG@10, Hits@10, Hits@20, nDCG@20

PixelRec

An image cover dataset for short video recommendation.

1 result
Metrics: Hit@10

Polyvore

This dataset contains 21,889 outfits from polyvore.com, in which 17,316 are for training, 1,497 for validation and 3,076 for testing. …

3 results
Metrics: AUC, Accuracy

ReDial

ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …

7 results
Metrics: Recall@1, Recall@10, Recall@50

WeChat

The WeChat dataset for fake news detection contains more than 20k news items, each labelled as fake news or not.

2 results
Metrics: AUC, P@10

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world …

2 results
Metrics: NDCG, NDCG@20, Recall@20

Yelp2018

The Yelp2018 dataset is adopted from the 2018 edition of the Yelp challenge, wherein local businesses like restaurants and bars …

11 results
Metrics: NDCG@20, Recall@20, HR@10, HR@100, PSP@10, nDCG@10, nDCG@100

Reinforcement Learning (RL)

ProcGen

Procgen Benchmark includes 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns …

2 results
Metrics: Mean Normalized Performance

Text Summarization

ACI-Bench

Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

1 result
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a …

27 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

BigPatent

Consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Source: [BIGPATENT: A Large-Scale Dataset …

2 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

BillSum

BillSum is the first dataset for summarization of US Congressional and California state bills. The BillSum dataset consists of three …

1 result
Metrics: rouge1

BookSum

BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such …

3 results
Metrics: ROUGE, ROUGE-2, ROUGE-L

CL-SciSumm

1 result
Metrics: ROUGE-2

DialogSum

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics. This work …

3 results
Metrics: Rouge1, Rouge2, RougeL, BertScore

Gazeta

Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form training, …

1 result
Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BLEU, Meteor

GovReport

GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by …

2 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

How2

The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), …

2 results
Metrics: Content F1, ROUGE-L, ROUGE-1

Klexikon

The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both …

4 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

LCSTS

LCSTS is a large Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which …

1 result
Metrics: ROUGE-1

MTEB

MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 …

26 results
Metrics: Spearman Correlation

MeQSum

MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions. Source: https://www.aclweb.org/anthology/P19-1215.pdf

1 result
Metrics: RougeL

MeetingBank

MeetingBank is a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets. It contains …

2 results
Metrics: ROUGE-L, Rouge-1, ROUGE-2

MentSum

Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms …

1 result
Metrics: Rouge-1, Rouge-2, Rouge-L

OrangeSum

Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model OrangeSum is a single-document extreme summarization dataset with two tasks: title and …

2 results
Metrics: ROUGE-1

Pubmed

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. …

28 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

QMSum

QMSum is a new human-annotated benchmark for query-based multi-domain meeting summarisation task, which consists of 1,808 query-summary pairs over 232 …

1 result
Metrics: ROUGE-1

Reddit TIFU

The Reddit TIFU dataset is a newly collected Reddit dataset, where TIFU denotes the name of the /r/tifu subreddit. There are 122,933 …

5 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

SAMSum

A new dataset with abstractive dialogue summaries. Source: SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

11 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L, BertScoreF1

WikiHow

WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base …

3 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L, Content F1

XSum

The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create …

1 result
Metrics: ROUGE-1

arXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from …

1 result
Metrics: ROUGE-1, ROUGE-2, ROUGE-L

arXiv Summarization Dataset

This is a dataset for evaluating summarisation methods for research papers. Source: [A Discourse-Aware Attention Model for Abstractive Summarization of …

4 results
Metrics: ROUGE-1, ROUGE-2, ROUGE-L