OpenBookQA

Question Answering Benchmark

Metric: Accuracy (%) | 40 results

Top Performing Models

Rank	Model	Paper	Accuracy	Date	Code
1	PaLM 540B (Self Improvement, Self Consistency)	Large Language Models Can Self-Improve	94.40	2022-10-20	-
2	PaLM 540B (Self Improvement, CoT Prompting)	Large Language Models Can Self-Improve	93.00	2022-10-20	-
3	PaLM 540B (Self Improvement, Standard-Prompting)	Large Language Models Can Self-Improve	92.00	2022-10-20	-
4	PaLM 540B (Self Consistency)	Large Language Models Can Self-Improve	90.00	2022-10-20	-
5	GrapeQA: PEGA+CANP	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	90.00	2023-03-22	-
6	GenMC 11B	Clues Before Answers: Generation-Enhanced Multiple-Choice QA	89.80	2022-04-30	📦 nju-websoft/genmc
7	AristoRoBERTa + Graph Soft Counter	GNN is a Counter? Revisiting GNN for Question Answering	87.40	2021-10-07	-
8	UnifiedQA 11B	UnifiedQA: Crossing Format Boundaries With a Single QA System	87.20	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
9	LLaMA-3 8B+MoSLoRA	Mixture-of-Subspaces in Low-Rank Adaptation	86.80	2024-06-16	📦 wutaiqiang/moslora
10	PaLM 540B (CoT Prompting)	Large Language Models Can Self-Improve	86.40	2022-10-20	-

All Papers (40)

Model	Paper	Accuracy	Date	Code
PaLM 540B (Self Improvement, Self Consistency)	Large Language Models Can Self-Improve	94.40	2022-10-20	-
PaLM 540B (Self Improvement, CoT Prompting)	Large Language Models Can Self-Improve	93.00	2022-10-20	-
PaLM 540B (Self Improvement, Standard-Prompting)	Large Language Models Can Self-Improve	92.00	2022-10-20	-
PaLM 540B (Self Consistency)	Large Language Models Can Self-Improve	90.00	2022-10-20	-
GrapeQA: PEGA+CANP	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	90.00	2023-03-22	-
GenMC 11B	Clues Before Answers: Generation-Enhanced Multiple-Choice QA	89.80	2022-04-30	📦 nju-websoft/genmc
AristoRoBERTa + Graph Soft Counter	GNN is a Counter? Revisiting GNN for Question Answering	87.40	2021-10-07	-
UnifiedQA 11B	UnifiedQA: Crossing Format Boundaries With a Single QA System	87.20	2020-05-02	📦 allenai/unifiedqa 📦 facebookresearch/metaicl
LLaMA-3 8B+MoSLoRA	Mixture-of-Subspaces in Low-Rank Adaptation	86.80	2024-06-16	📦 wutaiqiang/moslora
PaLM 540B (CoT Prompting)	Large Language Models Can Self-Improve	86.40	2022-10-20	-
LLaMA-3 8B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	84.80	2024-04-22	📦 TUDB-Labs/MixLoRA 📦 mikecovlee/mLoRA
PaLM 540B (Standard-Prompting)	Large Language Models Can Self-Improve	84.40	2022-10-20	-
TTTTT 3B	Fusing Context Into Knowledge Graph for Commonsense Question Answering	83.20	2020-12-09	📦 microsoft/kear 📦 microsoft/DEKCOR-CommonsenseQA
LLaMA-2 13B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	83.00	2024-04-22	📦 TUDB-Labs/MixLoRA 📦 mikecovlee/mLoRA
AristoRoBERTa + QA-GNN	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	82.80	2021-04-13	📦 michiyasunaga/qagnn 📦 rucaibox/safe
QA-GNN	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	82.80	2021-04-13	📦 michiyasunaga/qagnn 📦 rucaibox/safe
DEKCOR	Fusing Context Into Knowledge Graph for Commonsense Question Answering	82.40	2020-12-09	📦 microsoft/kear 📦 microsoft/DEKCOR-CommonsenseQA
GrapeQA: PEGA	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	82.00	2023-03-22	-
LLaMA-2 7B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	81.60	2024-04-22	📦 TUDB-Labs/MixLoRA 📦 mikecovlee/mLoRA
AristoRoBERTa	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	77.80	2021-04-13	📦 michiyasunaga/qagnn 📦 rucaibox/safe
BiLSTM max-out question-match (science fact + common knowledge fact)	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	76.90	2018-09-08	📦 allenai/arc-solvers
Careful Selection	Careful Selection of Knowledge to solve Open Book Question Answering	72.00	2019-07-24	-
GrapeQA: CANP	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	66.20	2023-03-22	-
GPT-3 175B (few-shot, k=32)	Language Models are Few-Shot Learners	65.40	2020-05-28	📦 ggml-org/llama.cpp 📦 ggerganov/llama.cpp
PaLM 2-L (1-shot)	PaLM 2 Technical Report	58.50	2023-05-17	📦 eternityyw/tram-benchmark
OPT 66B (one-shot)	BloombergGPT: A Large Language Model for Finance	58.00	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
PaLM 2-S (1-shot)	PaLM 2 Technical Report	57.40	2023-05-17	📦 eternityyw/tram-benchmark
BiLSTM max-out question-match (WordNet + science fact)	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	56.30	2018-09-08	📦 allenai/arc-solvers
PaLM 2-M (1-shot)	PaLM 2 Technical Report	56.20	2023-05-17	📦 eternityyw/tram-benchmark
BiLSTM max-out question-match (with a science fact)	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	55.80	2018-09-08	📦 allenai/arc-solvers
Bloomberg GPT 50B (1-shot)	BloombergGPT: A Large Language Model for Finance	51.60	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
BLOOM 176B (2-shot)	BloombergGPT: A Large Language Model for Finance	47.20	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
GPT-NeoX 50B (2-shot)	BloombergGPT: A Large Language Model for Finance	44.20	2023-03-30	📦 yangletliu/finlora 📦 open-finance-lab/finlora
LaMini-GPT 1.5B	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	39.80	2023-04-27	📦 mbzuai-nlp/lamini-lm
LaMini-T5 738M	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	36.00	2023-04-27	📦 mbzuai-nlp/lamini-lm
LaMini-F-T5 783M	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	34.00	2023-04-27	📦 mbzuai-nlp/lamini-lm
T5-Large 738M	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	32.80	2023-04-27	📦 mbzuai-nlp/lamini-lm
GPT-2-XL 1.5B	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	32.00	2023-04-27	📦 mbzuai-nlp/lamini-lm
FLAN-T5-Large 783M	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	31.20	2023-04-27	📦 mbzuai-nlp/lamini-lm
Random chance baseline	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	25.00	2018-09-08	📦 allenai/arc-solvers
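
For readers unfamiliar with the metric, the sketch below shows one plausible way the Accuracy column could be computed for a four-option multiple-choice benchmark such as OpenBookQA, and why uniform guessing gives the 25.00 reported for the random chance baseline. The function names and the toy labels are hypothetical and invented for illustration; they are not taken from any listed paper or repository, and individual papers may score predictions differently.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of questions whose predicted choice label equals the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def random_chance_accuracy(num_choices: int = 4) -> float:
    """Expected accuracy (in percent) of guessing uniformly among num_choices options."""
    return 100.0 / num_choices


# Toy answer labels, made up for this example (not real benchmark data).
gold = ["A", "C", "B", "D", "A"]
pred = ["A", "C", "D", "D", "B"]

print(accuracy(pred, gold))        # 3 of 5 correct -> 60.0
print(random_chance_accuracy())    # 4 options -> 25.0
```

Because every question has the same number of options, the expected score of uniform guessing does not depend on the test set, which is why it serves as a fixed floor for the leaderboard.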
