
SIQA

Question Answering Benchmark

Showing 24 results | Metric: Accuracy (%)

Top Performing Models

Rank	Model	Paper	Accuracy (%)	Date	Code
1	Unicorn 11B (fine-tuned)	UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark	83.20	2021-03-24	allenai/rainbow
2	LLaMA-2 13B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	82.50	2024-04-22	TUDB-Labs/MixLoRA, mikecovlee/mLoRA
3	CompassMTL 567M with Tailor	Task Compass: Scaling Multi-task Pre-training with Task Prefix	82.20	2022-10-12	cooelf/compassmtl
4	CompassMTL 567M	Task Compass: Scaling Multi-task Pre-training with Task Prefix	81.70	2022-10-12	cooelf/compassmtl
5	LLaMA-3 8B+MoSLoRA (fine-tuned)	Mixture-of-Subspaces in Low-Rank Adaptation	81.00	2024-06-16	wutaiqiang/moslora
6	DeBERTa-Large 304M	Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering	80.20	2022-10-29	declare-lab/team
7	DeBERTa-Large 304M (classification-based)	Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering	79.90	2022-10-29	declare-lab/team
8	UnifiedQA 3B	UnifiedQA: Crossing Format Boundaries With a Single QA System	79.80	2020-05-02	allenai/unifiedqa, facebookresearch/metaicl
9	ExDeBERTa 567M	Task Compass: Scaling Multi-task Pre-training with Task Prefix	79.60	2022-10-12	cooelf/compassmtl
10	LLaMA-3 8B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	78.80	2024-04-22	TUDB-Labs/MixLoRA, mikecovlee/mLoRA

All Papers (24)

Paper	Year	Model	Code
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark	2021	Unicorn 11B (fine-tuned)	allenai/rainbow
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	2024	LLaMA-2 13B + MixLoRA	TUDB-Labs/MixLoRA, mikecovlee/mLoRA
Task Compass: Scaling Multi-task Pre-training with Task Prefix	2022	CompassMTL 567M with Tailor	cooelf/compassmtl
Task Compass: Scaling Multi-task Pre-training with Task Prefix	2022	CompassMTL 567M	cooelf/compassmtl
Mixture-of-Subspaces in Low-Rank Adaptation	2024	LLaMA-3 8B+MoSLoRA (fine-tuned)	wutaiqiang/moslora
Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering	2022	DeBERTa-Large 304M	declare-lab/team
Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering	2022	DeBERTa-Large 304M (classification-based)	declare-lab/team
UnifiedQA: Crossing Format Boundaries With a Single QA System	2020	UnifiedQA 3B	allenai/unifiedqa, facebookresearch/metaicl
Task Compass: Scaling Multi-task Pre-training with Task Prefix	2022	ExDeBERTa 567M	cooelf/compassmtl
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	2024	LLaMA-3 8B + MixLoRA	TUDB-Labs/MixLoRA, mikecovlee/mLoRA
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	2024	LLaMA-2 7B + MixLoRA	TUDB-Labs/MixLoRA, mikecovlee/mLoRA
RoBERTa: A Robustly Optimized BERT Pretraining Approach	2019	RoBERTa-Large 355M (fine-tuned)	huggingface/transformers, pytorch/fairseq
SocialIQA: Commonsense Reasoning about Social Interactions	2019	BERT-large 340M (fine-tuned)	clear-nus/llm-human-model
SocialIQA: Commonsense Reasoning about Social Interactions	2019	BERT-base 110M (fine-tuned)	clear-nus/llm-human-model
SocialIQA: Commonsense Reasoning about Social Interactions	2019	GPT-1 117M (fine-tuned)	clear-nus/llm-human-model
Textbooks Are All You Need II: phi-1.5 technical report	2023	phi-1.5-web 1.3B (zero-shot)	knowlab/bi-weekly-paper-presentation
Textbooks Are All You Need II: phi-1.5 technical report	2023	phi-1.5 1.3B (zero-shot)	knowlab/bi-weekly-paper-presentation
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 65B (zero-shot)	huggingface/transformers, ggml-org/llama.cpp
Training Compute-Optimal Large Language Models	2022	Chinchilla (zero-shot)	karpathy/llama2.c, nkluge-correa/teenytinyllama
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021	Gopher (zero-shot)	allenai/dolma, rvlopes/gloria, bramiozo/PubScience
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 13B (zero-shot)	huggingface/transformers, ggml-org/llama.cpp
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 33B (zero-shot)	huggingface/transformers, ggml-org/llama.cpp
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 7B (zero-shot)	huggingface/transformers, ggml-org/llama.cpp
SocialIQA: Commonsense Reasoning about Social Interactions	2019	Random chance baseline	clear-nus/llm-human-model

Model	Paper	Accuracy	Date
Unicorn 11B (fine-tuned)	UNICORN on RAINBOW: A Universal Commonsense Reaso…	83.20	2021-03-24
LLaMA-2 13B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tun…	82.50	2024-04-22
CompassMTL 567M with Tailor	Task Compass: Scaling Multi-task Pre-training wit…	82.20	2022-10-12
CompassMTL 567M	Task Compass: Scaling Multi-task Pre-training wit…	81.70	2022-10-12
LLaMA-3 8B+MoSLoRA (fine-tuned)	Mixture-of-Subspaces in Low-Rank Adaptation	81.00	2024-06-16
DeBERTa-Large 304M	Two is Better than Many? Binary Classification as…	80.20	2022-10-29
DeBERTa-Large 304M (classification-based)	Two is Better than Many? Binary Classification as…	79.90	2022-10-29
UnifiedQA 3B	UnifiedQA: Crossing Format Boundaries With a Sing…	79.80	2020-05-02
ExDeBERTa 567M	Task Compass: Scaling Multi-task Pre-training wit…	79.60	2022-10-12
LLaMA-3 8B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tun…	78.80	2024-04-22
LLaMA-2 7B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tun…	78.00	2024-04-22
RoBERTa-Large 355M (fine-tuned)	RoBERTa: A Robustly Optimized BERT Pretraining Ap…	76.70	2019-07-26
BERT-large 340M (fine-tuned)	SocialIQA: Commonsense Reasoning about Social Int…	64.50	2019-04-22
BERT-base 110M (fine-tuned)	SocialIQA: Commonsense Reasoning about Social Int…	63.10	2019-04-22
GPT-1 117M (fine-tuned)	SocialIQA: Commonsense Reasoning about Social Int…	63.00	2019-04-22
phi-1.5-web 1.3B (zero-shot)	Textbooks Are All You Need II: phi-1.5 technical …	53.00	2023-09-11
phi-1.5 1.3B (zero-shot)	Textbooks Are All You Need II: phi-1.5 technical …	52.60	2023-09-11
LLaMA 65B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	52.30	2023-02-27
Chinchilla (zero-shot)	Training Compute-Optimal Large Language Models	51.30	2022-03-29
Gopher (zero-shot)	Scaling Language Models: Methods, Analysis & Insi…	50.60	2021-12-08
LLaMA 13B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	50.40	2023-02-27
LLaMA 33B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	50.40	2023-02-27
LLaMA 7B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	48.90	2023-02-27
Random chance baseline	SocialIQA: Commonsense Reasoning about Social Int…	33.30	2019-04-22
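
The Accuracy column is the percentage of questions answered correctly. The sketch below shows that calculation and where the 33.30 random-chance row comes from. It assumes three answer choices per question (an assumption consistent with that baseline row), and the label lists are hypothetical toy data, not benchmark outputs.

```python
def accuracy(predictions, gold):
    """Return the percentage of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def chance_accuracy(num_choices):
    """Expected accuracy (%) of uniform random guessing over num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 100.0 / num_choices


# Hypothetical toy labels, for illustration only.
gold = ["A", "C", "B", "A"]
predictions = ["A", "B", "B", "A"]

toy_accuracy = accuracy(predictions, gold)  # 3 of 4 answers match
baseline = chance_accuracy(3)               # assumes three choices per question

print(f"toy accuracy: {toy_accuracy:.2f}")
print(f"chance baseline: {baseline:.2f}")
```

Under this reading, the zero-shot entries near 50 sit roughly 17 points above uniform guessing, while the fine-tuned entries near 80 sit roughly 47 points above it.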