BoolQ

Question Answering Benchmark

Showing 65 results. Metric: Accuracy (%)

Top Performing Models

Rank	Model	Paper	Accuracy	Date	Code
1	Mistral-Nemo 12B (HPT)	Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles	99.87	2024-06-18	📦 devichand579/HPT
2	ST-MoE-32B 269B (fine-tuned)	ST-MoE: Designing Stable and Transferable Sparse Expert Models	92.40	2022-02-17	📦 tensorflow/mesh 📦 xuefuzhao/openmoe 📦 yikangshen/megablocks
3	PaLM 540B (fine-tuned)	PaLM: Scaling Language Modeling with Pathways	92.20	2022-04-05	📦 lucidrains/CoCa-pytorch 📦 lucidrains/PaLM-pytorch 📦 google/paxml
4	Turing NLR v5 XXL 5.4B (fine-tuned)	Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE	92.00	2022-12-04	-
5	T5-XXL 11B (fine-tuned)	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	91.20	2019-10-23	📦 huggingface/transformers 📦 PaddlePaddle/PaddleNLP 📦 google-research/text-to-text-transfer-transformer
6	PaLM 2-L (1-shot)	PaLM 2 Technical Report	90.90	2023-05-17	📦 eternityyw/tram-benchmark
7	UL2 20B (fine-tuned)	UL2: Unifying Language Learning Paradigms	90.80	2022-05-10	📦 google-research/google-research 📦 opennlg/openba-v2
8	Vega v2 6B (fine-tuned)	Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE	90.50	2022-12-04	-
9	DeBERTa-1.5B	DeBERTa: Decoding-enhanced BERT with Disentangled Attention	90.40	2020-06-05	📦 huggingface/transformers 📦 microsoft/DeBERTa 📦 osu-nlp-group/mind2web
10	PaLM 2-M (1-shot)	PaLM 2 Technical Report	88.60	2023-05-17	📦 eternityyw/tram-benchmark

All Papers (65)

Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles

2024

Mistral-Nemo 12B (HPT)

devichand579/HPT

ST-MoE: Designing Stable and Transferable Sparse Expert Models

2022

ST-MoE-32B 269B (fine-tuned)

tensorflow/mesh xuefuzhao/openmoe yikangshen/megablocks

PaLM: Scaling Language Modeling with Pathways

2022

PaLM 540B (fine-tuned)

lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch

Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE

2022

Turing NLR v5 XXL 5.4B (fine-tuned)

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

2019

T5-XXL 11B (fine-tuned)

huggingface/transformers PaddlePaddle/PaddleNLP

PaLM 2 Technical Report

2023

PaLM 2-L (1-shot)

eternityyw/tram-benchmark

UL2: Unifying Language Learning Paradigms

2022

UL2 20B (fine-tuned)

google-research/google-research opennlg/openba-v2

Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE

2022

Vega v2 6B (fine-tuned)

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

2020

DeBERTa-1.5B

huggingface/transformers microsoft/DeBERTa

PaLM 2 Technical Report

2023

PaLM 2-M (1-shot)

eternityyw/tram-benchmark

ST-MoE: Designing Stable and Transferable Sparse Expert Models

2022

ST-MoE-L 4.1B (fine-tuned)

tensorflow/mesh xuefuzhao/openmoe yikangshen/megablocks

PaLM 2 Technical Report

2023

PaLM 2-S (1-shot)

eternityyw/tram-benchmark

Muppet: Massive Multi-task Representations with Pre-Finetuning

2021

MUPPET Roberta Large

facebook/muppet-roberta-base facebook/muppet-roberta-large

Finetuned Language Models Are Zero-Shot Learners

2021

FLAN 137B (prompt-tuned)

hiyouga/llama-efficient-tuning bigcode-project/starcoder

Entailment as Few-Shot Learner

2021

RoBERTa-large 355M + Entailment as Few-shot Learner

PaddlePaddle/PaddleNLP sunyilgdx/prompts4keras cactilab/hateguard

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

2019

T5-Large 770M (fine-tuned)

huggingface/transformers PaddlePaddle/PaddleNLP

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 65B (0-shot)

huggingface/transformers ggml-org/llama.cpp

Llama 2: Open Foundation and Fine-Tuned Chat Models

2023

LLaMA 2 70B (0-shot)

facebookresearch/llama llamafamily/llama-chinese

Finetuned Language Models Are Zero-Shot Learners

2021

FLAN 137B (4-shot)

hiyouga/llama-efficient-tuning bigcode-project/starcoder

Muppet: Massive Multi-task Representations with Pre-Finetuning

2021

MUPPET Roberta Base

facebook/muppet-roberta-base facebook/muppet-roberta-large

Training Compute-Optimal Large Language Models

2022

Chinchilla 70B (0-shot)

karpathy/llama2.c nkluge-correa/teenytinyllama

Llama 2: Open Foundation and Fine-Tuned Chat Models

2023

LLaMA 2 34B (0-shot)

facebookresearch/llama llamafamily/llama-chinese

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 33B (0-shot)

huggingface/transformers ggml-org/llama.cpp

Finetuned Language Models Are Zero-Shot Learners

2021

FLAN 137B (0-shot)

hiyouga/llama-efficient-tuning bigcode-project/starcoder

Llama 2: Open Foundation and Fine-Tuned Chat Models

2023

LLaMA 2 13B (0-shot)

facebookresearch/llama llamafamily/llama-chinese

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

2019

T5-Base 220M (fine-tuned)

huggingface/transformers PaddlePaddle/PaddleNLP

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

2019

BERT-MultiNLI 340M (fine-tuned)

google-research-datasets/boolean-questions

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher (zero-shot)

allenai/dolma rvlopes/gloria bramiozo/PubScience

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 13B (zero-shot)

huggingface/transformers ggml-org/llama.cpp

Llama 2: Open Foundation and Fine-Tuned Chat Models

2023

LLaMA 2 7B (zero-shot)

facebookresearch/llama llamafamily/llama-chinese

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

2024

LLaMA-2 13B + MixLoRA

TUDB-Labs/MixLoRA mikecovlee/mLoRA

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 7B (zero-shot)

huggingface/transformers ggml-org/llama.cpp

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

2019

T5-Small 60M (fine-tuned)

huggingface/transformers PaddlePaddle/PaddleNLP

Language Models are Few-Shot Learners

2020

GPT-3 175B (few-shot, k=32)

ggml-org/llama.cpp ggerganov/llama.cpp

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

2019

BiDAF-MultiNLI (fine-tuned)

google-research-datasets/boolean-questions

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

2024

LLaMA-3 8B + MixLoRA

TUDB-Labs/MixLoRA mikecovlee/mLoRA

BloombergGPT: A Large Language Model for Finance

2023

BloombergGPT 50B (1-shot)

yangletliu/finlora open-finance-lab/finlora

Mixture-of-Subspaces in Low-Rank Adaptation

2024

LLaMA3+MoSLoRA

wutaiqiang/moslora

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

2019

GPT-1 117M (fine-tuned)

google-research-datasets/boolean-questions

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

2024

LLaMA-2 7B + MixLoRA

TUDB-Labs/MixLoRA mikecovlee/mLoRA

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

2019

BiDAF + ELMo (fine-tuned)

google-research-datasets/boolean-questions

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2022

OPT-IML 175B

tanyuqian/cappy

AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

2022

AlexaTM 20B

amazon-science/alexa-teacher-models

Ask Me Anything: A simple strategy for prompting language models

2022

Neo-6B (QA + WS)

hazyresearch/ama_prompting simran-arora/privacy_fm simran-arora/focus

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2022

OPT-IML 30B

tanyuqian/cappy

Ask Me Anything: A simple strategy for prompting language models

2022

Neo-6B (few-shot)

hazyresearch/ama_prompting simran-arora/privacy_fm simran-arora/focus

N-Grammer: Augmenting Transformers with latent n-grams

2022

N-Grammer 343M

tensorflow/lingvo yiyixuxu/n-grammer-flax

Ask Me Anything: A simple strategy for prompting language models

2022

Neo-6B (QA)

hazyresearch/ama_prompting simran-arora/privacy_fm simran-arora/focus

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2022

OPT 30B (0-shot)

tanyuqian/cappy

UL2: Unifying Language Learning Paradigms

2022

UL2 20B (0-shot)

google-research/google-research opennlg/openba-v2

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

2019

Majority baseline

google-research-datasets/boolean-questions

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

2022

Hybrid H3 1.3B (0-shot, logit scoring)

hazyresearch/safari hazyresearch/h3 lindermanlab/S5

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2022

OPT-IML 1.3B (0-shot)

tanyuqian/cappy

SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments

2024

Shakti-LLM (2.5B)

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

2022

Hybrid H3 2.7B (3-shot, logit scoring)

hazyresearch/safari hazyresearch/h3 lindermanlab/S5

Language Models are Few-Shot Learners

2020

GPT-3 175B (0-shot)

ggml-org/llama.cpp ggerganov/llama.cpp

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2022

OPT 1.3B (zero-shot)

tanyuqian/cappy

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

2022

OPT 175B

tanyuqian/cappy

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

2022

Hybrid H3 125M (0-shot, logit scoring)

hazyresearch/safari hazyresearch/h3 lindermanlab/S5

BloombergGPT: A Large Language Model for Finance

2023

OPT 66B (1-shot)

yangletliu/finlora open-finance-lab/finlora

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

2022

Hybrid H3 125M (3-shot, logit scoring)

hazyresearch/safari hazyresearch/h3 lindermanlab/S5

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

2022

Hybrid H3 125M (3-shot, rank classification)

hazyresearch/safari hazyresearch/h3 lindermanlab/S5

BloombergGPT: A Large Language Model for Finance

2023

BLOOM 176B (1-shot)

yangletliu/finlora open-finance-lab/finlora

Hyena Hierarchy: Towards Larger Convolutional Language Models

2023

Hyena

hazyresearch/safari togethercomputer/stripedhyena

BloombergGPT: A Large Language Model for Finance

2023

GPT-NeoX 20B (1-shot)

yangletliu/finlora open-finance-lab/finlora

Model	Paper	Accuracy	Date
Mistral-Nemo 12B (HPT)	Hierarchical Prompting Taxonomy: A Universal Eval…	99.87	2024-06-18
ST-MoE-32B 269B (fine-tuned)	ST-MoE: Designing Stable and Transferable Sparse …	92.40	2022-02-17
PaLM 540B (fine-tuned)	PaLM: Scaling Language Modeling with Pathways	92.20	2022-04-05
Turing NLR v5 XXL 5.4B (fine-tuned)	Toward Efficient Language Model Pretraining and D…	92.00	2022-12-04
T5-XXL 11B (fine-tuned)	Exploring the Limits of Transfer Learning with a …	91.20	2019-10-23
PaLM 2-L (1-shot)	PaLM 2 Technical Report	90.90	2023-05-17
UL2 20B (fine-tuned)	UL2: Unifying Language Learning Paradigms	90.80	2022-05-10
Vega v2 6B (fine-tuned)	Toward Efficient Language Model Pretraining and D…	90.50	2022-12-04
DeBERTa-1.5B	DeBERTa: Decoding-enhanced BERT with Disentangled…	90.40	2020-06-05
PaLM 2-M (1-shot)	PaLM 2 Technical Report	88.60	2023-05-17
ST-MoE-L 4.1B (fine-tuned)	ST-MoE: Designing Stable and Transferable Sparse …	88.60	2022-02-17
PaLM 2-S (1-shot)	PaLM 2 Technical Report	88.10	2023-05-17
MUPPET Roberta Large	Muppet: Massive Multi-task Representations with P…	87.50	2021-01-26
FLAN 137B (prompt-tuned)	Finetuned Language Models Are Zero-Shot Learners	86.30	2021-09-03
RoBERTa-large 355M + Entailment as Few-shot Learner	Entailment as Few-Shot Learner	86.00	2021-04-29
T5-Large 770M (fine-tuned)	Exploring the Limits of Transfer Learning with a …	85.40	2019-10-23
LLaMA 65B (0-shot)	LLaMA: Open and Efficient Foundation Language Mod…	85.30	2023-02-27
LLaMA 2 70B (0-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Mode…	85.00	2023-07-18
FLAN 137B (4-shot)	Finetuned Language Models Are Zero-Shot Learners	84.60	2021-09-03
MUPPET Roberta Base	Muppet: Massive Multi-task Representations with P…	83.80	2021-01-26
Chinchilla 70B (0-shot)	Training Compute-Optimal Large Language Models	83.70	2022-03-29
LLaMA 2 34B (0-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Mode…	83.70	2023-07-18
LLaMA 33B (0-shot)	LLaMA: Open and Efficient Foundation Language Mod…	83.10	2023-02-27
FLAN 137B (0-shot)	Finetuned Language Models Are Zero-Shot Learners	82.90	2021-09-03
LLaMA 2 13B (0-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Mode…	81.70	2023-07-18
T5-Base 220M (fine-tuned)	Exploring the Limits of Transfer Learning with a …	81.40	2019-10-23
BERT-MultiNLI 340M (fine-tuned)	BoolQ: Exploring the Surprising Difficulty of Nat…	80.40	2019-05-24
Gopher (zero-shot)	Scaling Language Models: Methods, Analysis & Insi…	79.30	2021-12-08
LLaMA 13B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	78.10	2023-02-27
LLaMA 2 7B (zero-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Mode…	77.40	2023-07-18
LLaMA-2 13B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tun…	77.10	2024-04-22
LLaMA 7B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	76.50	2023-02-27
T5-Small 60M (fine-tuned)	Exploring the Limits of Transfer Learning with a …	76.40	2019-10-23
GPT-3 175B (few-shot, k=32)	Language Models are Few-Shot Learners	76.40	2020-05-28
BiDAF-MultiNLI (fine-tuned)	BoolQ: Exploring the Surprising Difficulty of Nat…	75.57	2019-05-24
LLaMA-3 8B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tun…	75.00	2024-04-22
BloombergGPT 50B (1-shot)	BloombergGPT: A Large Language Model for Finance	74.60	2023-03-30
LLaMA3+MoSLoRA	Mixture-of-Subspaces in Low-Rank Adaptation	74.60	2024-06-16
GPT-1 117M (fine-tuned)	BoolQ: Exploring the Surprising Difficulty of Nat…	72.87	2019-05-24
LLaMA-2 7B + MixLoRA	MixLoRA: Enhancing Large Language Models Fine-Tun…	72.70	2024-04-22
BiDAF + ELMo (fine-tuned)	BoolQ: Exploring the Surprising Difficulty of Nat…	71.41	2019-05-24
OPT-IML 175B	OPT-IML: Scaling Language Model Instruction Meta …	71.40	2022-12-22
AlexaTM 20B	AlexaTM 20B: Few-Shot Learning Using a Large-Scal…	69.40	2022-08-02
Neo-6B (QA + WS)	Ask Me Anything: A simple strategy for prompting …	67.20	2022-10-05
OPT-IML 30B	OPT-IML: Scaling Language Model Instruction Meta …	66.90	2022-12-22
Neo-6B (few-shot)	Ask Me Anything: A simple strategy for prompting …	66.50	2022-10-05
N-Grammer 343M	N-Grammer: Augmenting Transformers with latent n-…	65.00	2022-07-13
Neo-6B (QA)	Ask Me Anything: A simple strategy for prompting …	64.90	2022-10-05
OPT 30B (0-shot)	OPT-IML: Scaling Language Model Instruction Meta …	64.00	2022-12-22
UL2 20B (0-shot)	UL2: Unifying Language Learning Paradigms	63.10	2022-05-10
Majority baseline	BoolQ: Exploring the Surprising Difficulty of Nat…	62.17	2019-05-24
Hybrid H3 1.3B (0-shot, logit scoring)	Hungry Hungry Hippos: Towards Language Modeling w…	61.70	2022-12-28
OPT-IML 1.3B (0-shot)	OPT-IML: Scaling Language Model Instruction Meta …	61.50	2022-12-22
Shakti-LLM (2.5B)	SHAKTI: A 2.5 Billion Parameter Small Language Mo…	61.10	2024-10-15
Hybrid H3 2.7B (3-shot, logit scoring)	Hungry Hungry Hippos: Towards Language Modeling w…	60.60	2022-12-28
GPT-3 175B (0-shot)	Language Models are Few-Shot Learners	60.50	2020-05-28
OPT 1.3B (zero-shot)	OPT-IML: Scaling Language Model Instruction Meta …	60.50	2022-12-22
OPT 175B	OPT-IML: Scaling Language Model Instruction Meta …	60.10	2022-12-22
Hybrid H3 125M (0-shot, logit scoring)	Hungry Hungry Hippos: Towards Language Modeling w…	59.60	2022-12-28
OPT 66B (1-shot)	BloombergGPT: A Large Language Model for Finance	57.50	2023-03-30
Hybrid H3 125M (3-shot, logit scoring)	Hungry Hungry Hippos: Towards Language Modeling w…	56.10	2022-12-28
Hybrid H3 125M (3-shot, rank classification)	Hungry Hungry Hippos: Towards Language Modeling w…	56.10	2022-12-28
BLOOM 176B (1-shot)	BloombergGPT: A Large Language Model for Finance	52.90	2023-03-30
Hyena	Hyena Hierarchy: Towards Larger Convolutional Lan…	51.80	2023-02-21
GPT-NeoX 20B (1-shot)	BloombergGPT: A Large Language Model for Finance	46.40	2023-03-30
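
Every row above is scored by accuracy, and the table includes a "Majority baseline" row at 62.17. As a rough sketch of how numbers of this kind are typically computed for yes/no answers, the snippet below scores predictions against gold labels and derives a majority-class baseline. The function names and the toy labels are hypothetical and chosen for illustration; this is not the benchmark's official evaluation script, and the toy values are not BoolQ data.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Percentage of predictions that match the gold yes/no labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def majority_baseline(train_labels, eval_labels):
    """Accuracy obtained by always answering with the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    constant_predictions = [most_common_label] * len(eval_labels)
    return accuracy(constant_predictions, eval_labels)


# Toy labels invented for illustration only (assumption, not BoolQ data).
toy_train = [True, True, True, False, False]
toy_eval = [True, False, True, False]
toy_preds = [True, False, False, False]

toy_accuracy = accuracy(toy_preds, toy_eval)           # 3 of 4 answers match
toy_majority = majority_baseline(toy_train, toy_eval)  # always answers True

print(f"accuracy: {toy_accuracy:.2f}")
print(f"majority baseline: {toy_majority:.2f}")
```

A model is only informative on this metric to the extent that it beats the constant-answer baseline, which is why several rows near the bottom of the table sit close to or below the majority figure.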
