CommonsenseQA

Common Sense Reasoning Benchmark

Performance Over Time

📊 Showing 38 results | 📏 Metric: Accuracy

Top Performing Models

Rank	Model	Paper	Accuracy (%)	Date	Code
1	GPT-4o (HPT)	Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles	92.54	2024-06-18	📦 devichand579/HPT
2	DeBERTaV3-large+KEAR 📚	Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention	91.20	2021-12-06	📦 microsoft/DEKCOR-CommonsenseQA 📦 microsoft/kear
3	PaLM 2 (few-shot, CoT, SC) 📚	PaLM 2 Technical Report	90.40	2023-05-17	📦 eternityyw/tram-benchmark
4	KEAR 📚	Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention	89.40	2021-12-06	📦 microsoft/DEKCOR-CommonsenseQA 📦 microsoft/kear
5	DEKCOR 📚	Fusing Context Into Knowledge Graph for Commonsense Question Answering	83.30	2020-12-09	📦 microsoft/kear 📦 microsoft/DEKCOR-CommonsenseQA
6	Unicorn 11B (fine-tuned)	UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark	79.30	2021-03-24	📦 allenai/rainbow
7	MUPPET Roberta Large 📚	Muppet: Massive Multi-task Representations with Pre-Finetuning	79.20	2021-01-26	📦 facebook/muppet-roberta-base 📦 facebook/muppet-roberta-large
8	UnifiedQA 11B (fine-tuned) 📚	UnifiedQA: Crossing Format Boundaries With a Single QA System	79.10	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
9	DRAGON	Deep Bidirectional Language-Knowledge Graph Pretraining	78.20	2022-10-17	📦 michiyasunaga/dragon 📦 HaochenLiu2000/QAP
10	T5-XXL 11B (fine-tuned)	UnifiedQA: Crossing Format Boundaries With a Single QA System	78.10	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl

All Papers (38)

Model	Paper	Accuracy (%)	Date	Code
GPT-4o (HPT)	Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles	92.54	2024-06-18	📦 devichand579/HPT
DeBERTaV3-large+KEAR	Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention	91.20	2021-12-06	📦 microsoft/DEKCOR-CommonsenseQA 📦 microsoft/kear
PaLM 2 (few-shot, CoT, SC)	PaLM 2 Technical Report	90.40	2023-05-17	📦 eternityyw/tram-benchmark
KEAR	Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention	89.40	2021-12-06	📦 microsoft/DEKCOR-CommonsenseQA 📦 microsoft/kear
DEKCOR	Fusing Context Into Knowledge Graph for Commonsense Question Answering	83.30	2020-12-09	📦 microsoft/kear 📦 microsoft/DEKCOR-CommonsenseQA
Unicorn 11B (fine-tuned)	UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark	79.30	2021-03-24	📦 allenai/rainbow
MUPPET Roberta Large	Muppet: Massive Multi-task Representations with Pre-Finetuning	79.20	2021-01-26	📦 facebook/muppet-roberta-base 📦 facebook/muppet-roberta-large
UnifiedQA 11B (fine-tuned)	UnifiedQA: Crossing Format Boundaries With a Single QA System	79.10	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
DRAGON	Deep Bidirectional Language-Knowledge Graph Pretraining	78.20	2022-10-17	📦 michiyasunaga/dragon 📦 HaochenLiu2000/QAP
T5-XXL 11B (fine-tuned)	UnifiedQA: Crossing Format Boundaries With a Single QA System	78.10	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
Albert Lan et al. (2020) (ensemble)	ALBERT: A Lite BERT for Self-supervised Learning of Language Representations	76.50	2019-09-26	📦 huggingface/transformers 📦 tensorflow/models
UnifiedQA 11B (zero-shot)	UnifiedQA: Crossing Format Boundaries With a Single QA System	76.20	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
QA-GNN	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	76.10	2021-04-13	📦 michiyasunaga/qagnn 📦 rucaibox/safe
XLNet+GraphReason	Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering	75.30	2019-09-09	📦 DecstionBack/AAAI_2020_CommonsenseQA
GrapeQA: PEGA	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	73.50	2023-03-22
RoBERTa+HyKAS Ma et al. (2019)	Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering	73.20	2019-10-30
GPT-3 Direct Finetuned	Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention	73.00	2021-12-06	📦 microsoft/DEKCOR-CommonsenseQA 📦 microsoft/kear
STaR (on GPT-J)	STaR: Bootstrapping Reasoning With Reasoning	72.30	2022-03-28	📦 ezelikman/STaR
RoBERTa-Large 355M	RoBERTa: A Robustly Optimized BERT Pretraining Approach	72.10	2019-07-26	📦 huggingface/transformers 📦 pytorch/fairseq
STaR without Rationalization (on GPT-J)	STaR: Bootstrapping Reasoning With Reasoning	68.80	2022-03-28	📦 ezelikman/STaR
OPT 66B (1-shot)	BloombergGPT: A Large Language Model for Finance	66.40	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
Bloomberg GPT 50B (1-shot)	BloombergGPT: A Large Language Model for Finance	65.50	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
CAGE-reasoning	Explain Yourself! Leveraging Language Models for Commonsense Reasoning	64.70	2019-06-06	📦 salesforce/cos-e
BLOOM 176B (1-shot)	BloombergGPT: A Large Language Model for Finance	64.20	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
UnifiedQA 440M (fine-tuned)	UnifiedQA: Crossing Format Boundaries With a Single QA System	64.00	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
BART-large 440M (fine-tuned)	UnifiedQA: Crossing Format Boundaries With a Single QA System	62.50	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
BERT_CSlarge	Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models	62.20	2019-08-19
GPT-NeoX 20B (1-shot)	BloombergGPT: A Large Language Model for Finance	60.40	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
GPT-J Direct Finetuned	STaR: Bootstrapping Reasoning With Reasoning	60.00	2022-03-28	📦 ezelikman/STaR
KagNet	KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning	58.90	2019-09-04	📦 INK-USC/KagNet 📦 INK-USC/MHGRN
BERT-LARGE	CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge	55.90	2018-11-02	📦 jonathanherzig/commonsenseqa 📦 xlang-ai/batch-prompting
UL2 20B (chain-of-thought + self-consistency)	UL2: Unifying Language Learning Paradigms	55.70	2022-05-10	📦 google-research/google-research 📦 opennlg/openba-v2
Few-shot CoT LaMDA 137B	STaR: Bootstrapping Reasoning With Reasoning	55.60	2022-03-28	📦 ezelikman/STaR
UL2 20B (chain-of-thought)	UL2: Unifying Language Learning Paradigms	51.40	2022-05-10	📦 google-research/google-research 📦 opennlg/openba-v2
Few-shot CoT GPT-J	STaR: Bootstrapping Reasoning With Reasoning	36.60	2022-03-28	📦 ezelikman/STaR
UL2 20B (zero-shot)	UL2: Unifying Language Learning Paradigms	34.20	2022-05-10	📦 google-research/google-research 📦 opennlg/openba-v2
Chain of thought ASDiv	Chain-of-Thought Prompting Elicits Reasoning in Large Language Models	28.60	2022-01-28	📦 microsoft/guidance 📦 guidance-ai/guidance
Few-shot Direct GPT-J	STaR: Bootstrapping Reasoning With Reasoning	20.90	2022-03-28	📦 ezelikman/STaR