
WinoGrande

Common Sense Reasoning Benchmark

Performance Over Time

73 results | Metric: Accuracy
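WinoGrande is a fill-in-the-blank task: each item is a sentence containing a blank ("_") and two candidate fillers, and accuracy is the fraction of items on which the model prefers the correct filler. A common zero-shot scoring recipe picks the option whose filled-in sentence the model assigns the higher likelihood. Below is a minimal sketch of that recipe, assuming the dataset is available via the Hugging Face `datasets` library under the `winogrande` name with `sentence`/`option1`/`option2`/`answer` fields; `score` is a hypothetical stand-in for a real model's log-likelihood.

```python
# Minimal sketch of WinoGrande accuracy scoring (hypothetical `score` function).
from datasets import load_dataset

# "winogrande_xl" is the standard full-size config on the Hugging Face hub;
# newer `datasets` versions may additionally require trust_remote_code=True.
ds = load_dataset("winogrande", "winogrande_xl", split="validation")

def score(text: str) -> float:
    # Hypothetical stand-in for a real model's log-likelihood of `text`;
    # a dummy heuristic is used here so the sketch runs end to end.
    return -len(text)

correct = 0
for ex in ds:
    # Fill the blank with each candidate and keep the higher-scoring sentence.
    candidates = [ex["sentence"].replace("_", opt)
                  for opt in (ex["option1"], ex["option2"])]
    pred = max(range(2), key=lambda i: score(candidates[i]))
    correct += int(str(pred + 1) == ex["answer"])  # gold answer is "1" or "2"

print(f"Accuracy: {100 * correct / len(ds):.2f}%")
```

With a real language model substituted for `score`, this loop reproduces the accuracy numbers reported in the table below.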

Top Performing Models

| Rank | Model | Paper | Accuracy (%) | Date | Code |
|---|---|---|---|---|---|
| 1 | ST-MoE-32B 269B (fine-tuned) | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 96.10 | 2022-02-17 | tensorflow/mesh, xuefuzhao/openmoe, yikangshen/megablocks |
| 2 | Unicorn 11B (fine-tuned) | UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark | 91.30 | 2021-03-24 | allenai/rainbow |
| 3 | CompassMTL 567M with Tailor | Task Compass: Scaling Multi-task Pre-training with Task Prefix | 90.50 | 2022-10-12 | cooelf/compassmtl |
| 4 | CompassMTL 567M | Task Compass: Scaling Multi-task Pre-training with Task Prefix | 89.60 | 2022-10-12 | cooelf/compassmtl |
| 5 | UnifiedQA 11B (fine-tuned) | UnifiedQA: Crossing Format Boundaries With a Single QA System | 89.40 | 2020-05-02 | allenai/unifiedqa, facebookresearch/metaicl |
| 6 | GPT-4 (5-shot) | GPT-4 Technical Report | 87.50 | 2023-03-15 | openai/evals, shmsw25/factscore, unispac/visual-adversarial-examples-jailbreak-large-language-models |
| 7 | ExDeBERTa 567M | Task Compass: Scaling Multi-task Pre-training with Task Prefix | 87.00 | 2022-10-12 | cooelf/compassmtl |
| 8 | LLaMA-2 13B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 86.30 | 2024-04-22 | TUDB-Labs/MixLoRA, mikecovlee/mLoRA |
| 9 | LLaMA3 8B + MoSLoRA | Mixture-of-Subspaces in Low-Rank Adaptation | 85.80 | 2024-06-16 | wutaiqiang/moslora |
| 10 | PaLM 2-L (1-shot) | PaLM 2 Technical Report | 83.00 | 2023-05-17 | eternityyw/tram-benchmark |

All Papers (73)

- Language Models are Few-Shot Learners (2020): GPT-3 Large 760M (0-shot)
- Efficient Language Modeling with Sparse all-MLP (2022): sMLP – deterministic 9.4B (0-shot)
- Efficient Language Modeling with Sparse all-MLP (2022): Switch Transformer 9B (0-shot)