CommonsenseQA

Common Sense Reasoning Benchmark

Performance Over Time

Showing 38 results | Metric: Accuracy (%)

Top Performing Models

| Rank | Model | Paper | Accuracy | Date | Code |
|------|-------|-------|----------|------|------|
| 1 | GPT-4o (HPT) | Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | 92.54 | 2024-06-18 | devichand579/HPT |
| 2 | DeBERTaV3-large+KEAR | Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention | 91.20 | 2021-12-06 | microsoft/DEKCOR-CommonsenseQA, microsoft/kear |
| 3 | PaLM 2 (few-shot, CoT, SC) | PaLM 2 Technical Report | 90.40 | 2023-05-17 | eternityyw/tram-benchmark |
| 4 | KEAR | Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention | 89.40 | 2021-12-06 | microsoft/DEKCOR-CommonsenseQA, microsoft/kear |
| 5 | DEKCOR | Fusing Context Into Knowledge Graph for Commonsense Question Answering | 83.30 | 2020-12-09 | microsoft/kear, microsoft/DEKCOR-CommonsenseQA |
| 6 | Unicorn 11B (fine-tuned) | UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark | 79.30 | 2021-03-24 | allenai/rainbow |
| 7 | MUPPET Roberta Large | Muppet: Massive Multi-task Representations with Pre-Finetuning | 79.20 | 2021-01-26 | facebook/muppet-roberta-base, facebook/muppet-roberta-large |
| 8 | UnifiedQA 11B (fine-tuned) | UnifiedQA: Crossing Format Boundaries With a Single QA System | 79.10 | 2020-05-02 | allenai/unifiedqa, facebookresearch/metaicl |
| 9 | DRAGON | Deep Bidirectional Language-Knowledge Graph Pretraining | 78.20 | 2022-10-17 | michiyasunaga/dragon, HaochenLiu2000/QAP |
| 10 | T5-XXL 11B (fine-tuned) | UnifiedQA: Crossing Format Boundaries With a Single QA System | 78.10 | 2020-05-02 | allenai/unifiedqa, facebookresearch/metaicl |

All Papers (38)

STaR: Bootstrapping Reasoning With Reasoning

2022
STaR without Rationalization (on GPT-J)