| Model | Paper | Accuracy (%) | Date |
|---|---|---|---|
| Unicorn 11B (fine-tuned) | UNICORN on RAINBOW: A Universal Commonsense Reaso… | 90.10 | 2021-03-24 |
| LLaMA3 8B+MoSLoRA | Mixture-of-Subspaces in Low-Rank Adaptation | 89.70 | 2024-06-16 |
| CompassMTL 567M with Tailor | Task Compass: Scaling Multi-task Pre-training wit… | 88.30 | 2022-10-12 |
| LLaMA-3 8B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tun… | 87.60 | 2024-04-22 |
| DeBERTa-Large 304M | Two is Better than Many? Binary Classification as… | 87.40 | 2022-10-29 |
| CompassMTL 567M | Task Compass: Scaling Multi-task Pre-training wit… | 87.30 | 2022-10-12 |
| LLaMA-2 13B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tun… | 86.80 | 2024-04-22 |
| Shakti-LLM (2.5B) | SHAKTI: A 2.5 Billion Parameter Small Language Mo… | 86.20 | 2024-10-15 |
| DeBERTa-Large 304M (classification-based) | Two is Better than Many? Binary Classification as… | 85.90 | 2022-10-29 |
| ExDeBERTa 567M | Task Compass: Scaling Multi-task Pre-training wit… | 85.50 | 2022-10-12 |
| UnifiedQA 3B | UnifiedQA: Crossing Format Boundaries With a Sing… | 85.30 | 2020-05-02 |
| PaLM 2-L (1-shot) | PaLM 2 Technical Report | 85.00 | 2023-05-17 |
| Mixtral 8x7B (0-shot) | Mixtral of Experts | 83.60 | 2024-01-08 |
| PaLM 2-M (1-shot) | PaLM 2 Technical Report | 83.20 | 2023-05-17 |
| LLaMA-2 7B + MixLoRA | MixLoRA: Enhancing Large Language Models Fine-Tun… | 83.20 | 2024-04-22 |
| Mistral 7B (0-shot) | Mistral 7B | 83.00 | 2023-10-10 |
| LLaMA 65B (0-shot) | LLaMA: Open and Efficient Foundation Language Mod… | 82.80 | 2023-02-27 |
| LLaMA 2 70B (0-shot) | Llama 2: Open Foundation and Fine-Tuned Chat Mode… | 82.80 | 2023-07-18 |
| Camelidae-8×34B | Parameter-Efficient Sparsity Crafting from Dense … | 82.70 | 2024-01-05 |
| LLaMA 33B (0-shot) | LLaMA: Open and Efficient Foundation Language Mod… | 82.30 | 2023-02-27 |
| PaLM 2-S (1-shot) | PaLM 2 Technical Report | 82.20 | 2023-05-17 |
| Mistral 7B (0-shot) | Mixtral of Experts | 82.20 | 2024-01-08 |
| MT-NLG 530B (0-shot) | Megatron-LM: Training Multi-Billion Parameter Lan… | 82.00 | 2019-09-17 |
| LLaMA 2 34B (0-shot) | Llama 2: Open Foundation and Fine-Tuned Chat Mode… | 81.90 | 2023-07-18 |
| Gopher 280B (0-shot) | Scaling Language Models: Methods, Analysis & Insi… | 81.80 | 2021-12-08 |
| Chinchilla 70B (0-shot) | Training Compute-Optimal Large Language Models | 81.80 | 2022-03-29 |
| FLAN 137B (few-shot, k=10) | Finetuned Language Models Are Zero-Shot Learners | 81.70 | 2021-09-03 |
| OPT-175B | SparseGPT: Massive Language Models Can Be Accurat… | 81.07 | 2023-01-02 |
| GPT-3 175B (0-shot) | Language Models are Few-Shot Learners | 81.00 | 2020-05-28 |
| SparseGPT 175B (50% Sparsity) | SparseGPT: Massive Language Models Can Be Accurat… | 80.63 | 2023-01-02 |
| FLAN 137B (0-shot) | Finetuned Language Models Are Zero-Shot Learners | 80.50 | 2021-09-03 |
| LLaMA 2 13B (0-shot) | Llama 2: Open Foundation and Fine-Tuned Chat Mode… | 80.50 | 2023-07-18 |
| LLaMA 13B (0-shot) | LLaMA: Open and Efficient Foundation Language Mod… | 80.10 | 2023-02-27 |
| LLaMA 7B (0-shot) | LLaMA: Open and Efficient Foundation Language Mod… | 79.80 | 2023-02-27 |
| SparseGPT 175B (4:8 Sparsity) | SparseGPT: Massive Language Models Can Be Accurat… | 79.54 | 2023-01-02 |
| SparseGPT 175B (2:4 Sparsity) | SparseGPT: Massive Language Models Can Be Accurat… | 79.54 | 2023-01-02 |
| RoBERTa-Large 355M | RoBERTa: A Robustly Optimized BERT Pretraining Ap… | 79.40 | 2019-07-26 |
| LLaMA 2 7B (0-shot) | Llama 2: Open Foundation and Fine-Tuned Chat Mode… | 78.80 | 2023-07-18 |
| BloombergGPT 50B (1-shot) | BloombergGPT: A Large Language Model for Finance | 77.90 | 2023-03-30 |
| OPT 66B (1-shot) | BloombergGPT: A Large Language Model for Finance | 77.60 | 2023-03-30 |
| RoBERTa-large 355M (fine-tuned) | PIQA: Reasoning about Physical Commonsense in Nat… | 77.10 | 2019-11-26 |
| phi-1.5-web (1.3B) | Textbooks Are All You Need II: phi-1.5 technical … | 77.00 | 2023-09-11 |
| BLOOM 176B (1-shot) | BloombergGPT: A Large Language Model for Finance | 77.00 | 2023-03-30 |
| Pythia 12B (5-shot) | Pythia: A Suite for Analyzing Large Language Mode… | 76.70 | 2023-04-03 |
| Open-LLaMA-3B-v2 | Sheared LLaMA: Accelerating Language Model Pre-tr… | 76.20 | 2023-10-10 |
| Pythia 12B (0-shot) | Pythia: A Suite for Analyzing Large Language Mode… | 76.00 | 2023-04-03 |
| Sheared-LLaMA-2.7B | Sheared LLaMA: Accelerating Language Model Pre-tr… | 75.80 | 2023-10-10 |
| GPT-NeoX 20B (1-shot) | BloombergGPT: A Large Language Model for Finance | 75.80 | 2023-03-30 |
| Pythia 6.9B (0-shot) | Pythia: A Suite for Analyzing Large Language Mode… | 75.20 | 2023-04-03 |
| Sheared-LLaMA-1.3B | Sheared LLaMA: Accelerating Language Model Pre-tr… | 73.40 | 2023-10-10 |
| sMLP - deterministic 9.4B (0-shot) | Efficient Language Modeling with Sparse all-MLP | 73.00 | 2022-03-14 |
| GPT-3 Large 760M (0-shot) | Language Models are Few-Shot Learners | 72.90 | 2020-05-28 |
| FLAN-T5-Large 783M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 72.20 | 2023-04-27 |
| LaMini-GPT 1.5B | LaMini-LM: A Diverse Herd of Distilled Models fro… | 71.30 | 2023-04-27 |
| LaMini-F-T5 783M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 70.60 | 2023-04-27 |
| GPT-2-XL 1.5B | LaMini-LM: A Diverse Herd of Distilled Models fro… | 70.50 | 2023-04-27 |
| Pythia 1B (5-shot) | Pythia: A Suite for Analyzing Large Language Mode… | 70.40 | 2023-04-03 |
| GPT-2-small 124M (fine-tuned) | PIQA: Reasoning about Physical Commonsense in Nat… | 69.20 | 2019-11-26 |
| GShard 9B | Efficient Language Modeling with Sparse all-MLP | 68.10 | 2022-03-14 |
| LaMini-T5 738M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 67.20 | 2023-04-27 |
| BERT-large 340M (fine-tuned) | PIQA: Reasoning about Physical Commonsense in Nat… | 66.80 | 2019-11-26 |
| BERT-Large 340M | BERT: Pre-training of Deep Bidirectional Transfor… | 66.70 | 2018-10-11 |
| Base Layers 10B (0-shot) | Efficient Language Modeling with Sparse all-MLP | 63.80 | 2022-03-14 |
| HASH Layers 10B (0-shot) | Efficient Language Modeling with Sparse all-MLP | 63.80 | 2022-03-14 |
| T5-Large 738M | LaMini-LM: A Diverse Herd of Distilled Models fro… | 55.90 | 2023-04-27 |
| OPT-175B (50% Sparsity) | SparseGPT: Massive Language Models Can Be Accurat… | 54.73 | 2023-01-02 |
| Random chance baseline | PIQA: Reasoning about Physical Commonsense in Nat… | 50.00 | 2019-11-26 |