
Natural Questions

Question Answering Benchmark

Performance Over Time

Showing 46 results | Metric: EM

Top Performing Models

Rank	Model	Paper	EM	Date	Code
1	Atlas (full, Wiki-dec-2018 index)	Atlas: Few-shot Learning with Retrieval Augmented Language Models	64.00	2022-08-05	facebookresearch/atlas thunlp/clueanchor
2	Atlas (full, Wiki-dec-2021+CC index)	Atlas: Few-shot Learning with Retrieval Augmented Language Models	60.40	2022-08-05	facebookresearch/atlas thunlp/clueanchor
3	DPA-RAG	Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation	59.19	2024-06-26	dongguanting/dpa-rag
4	FiE	FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering	58.40	2022-11-18	-
5	R2-D2 (full)	R2-D2: A Modular Baseline for Open-Domain Question Answering	55.90	2021-09-08	KNOT-FIT-BUT/R2-D2
6	ReAtt	Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer	54.70	2022-12-05	jzbjyb/reatt
7	FiD-KD (full)	Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering	54.70	2020-07-02	jhyuklee/DensePhrases princeton-nlp/DensePhrases facebookresearch/FiD
8	RankRAG-llama3-70b (Zero-Shot, KILT)	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	54.20	2024-07-02	-
9	EMDR^2	End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering	52.50	2021-06-09	DevSinghSachan/emdr2 DevSinghSachan/art
10	FiD (full)	Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering	51.40	2020-07-02	jhyuklee/DensePhrases princeton-nlp/DensePhrases facebookresearch/FiD

All Papers (46)

Atlas: Few-shot Learning with Retrieval Augmented Language Models

2022

Atlas (full, Wiki-dec-2018 index)

facebookresearch/atlas thunlp/clueanchor

Atlas: Few-shot Learning with Retrieval Augmented Language Models

2022

Atlas (full, Wiki-dec-2021+CC index)

facebookresearch/atlas thunlp/clueanchor

Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation

2024

DPA-RAG

dongguanting/dpa-rag

FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering

2022

FiE

R2-D2: A Modular Baseline for Open-Domain Question Answering

2021

R2-D2 (full)

KNOT-FIT-BUT/R2-D2

Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer

2022

ReAtt

jzbjyb/reatt

Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

2020

FiD-KD (full)

jhyuklee/DensePhrases princeton-nlp/DensePhrases

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

2024

RankRAG-llama3-70b (Zero-Shot, KILT)

End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering

2021

EMDR^2

DevSinghSachan/emdr2 DevSinghSachan/art

Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

2020

FiD (full)

jhyuklee/DensePhrases princeton-nlp/DensePhrases

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

2024

RankRAG-llama3-8b (Zero-Shot, KILT)

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

2024

RankRAG-llama3-70b (Zero-Shot, DPR)

ChatQA: Surpassing GPT-4 on Conversational QA and RAG

2024

ChatQA-1.5-llama3-70b (Zero-Shot, KILT)

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

2024

RankRAG-llama3-8b (Zero-Shot, DPR)

Improving language models by retrieving from trillions of tokens

2021

RETRO + DPR (full)

labmlai/annotated_deep_learning_paper_implementations lucidrains/RETRO-pytorch

REPLUG: Retrieval-Augmented Black-Box Language Models

2023

code-davinci-002 175B + REPLUG LSR (few-shot)

ruc-nlpir/flashrag intellabs/fastrag liano3/RAG-fairness

Atlas: Few-shot Learning with Retrieval Augmented Language Models

2022

Atlas (few-shot, k=64, Wiki-dec-2018 index)

facebookresearch/atlas thunlp/clueanchor

REPLUG: Retrieval-Augmented Black-Box Language Models

2023

code-davinci-002 175B + REPLUG (few-shot)

ruc-nlpir/flashrag intellabs/fastrag liano3/RAG-fairness

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

2020

RAG

huggingface/transformers assafelovic/gpt-researcher

ChatQA: Surpassing GPT-4 on Conversational QA and RAG

2024

ChatQA-1.5-llama3-8b (Zero-Shot, KILT)

Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers

2024

Blended RAG

ibm-ecosystem-engineering/blended-rag

Atlas: Few-shot Learning with Retrieval Augmented Language Models

2022

Atlas (few-shot, k=64, Wiki-dec-2021+CC index)

facebookresearch/atlas thunlp/clueanchor

Dense Passage Retrieval for Open-Domain Question Answering

2020

DPR

huggingface/transformers deepset-ai/haystack

REALM: Retrieval-Augmented Language Model Pre-Training

2020

REALM

deepset-ai/haystack google-research/language

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 65B (few-shot, k=64)

huggingface/transformers ggml-org/llama.cpp

PaLM: Scaling Language Modeling with Pathways

2022

PaLM-540B (Few-Shot, k=64)

lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch

PaLM 2 Technical Report

2023

PaLM 2-L (one-shot)

eternityyw/tram-benchmark

Training Compute-Optimal Large Language Models

2022

Chinchilla (few-shot, k=64)

karpathy/llama2.c nkluge-correa/teenytinyllama

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 65B (few-shot, k=5)

huggingface/transformers ggml-org/llama.cpp

Search-o1: Agentic Search-Enhanced Large Reasoning Models

2025

Search-o1

sunnynexus/search-o1 terrierteam/pyterrier_rag

Llama 2: Open Foundation and Fine-Tuned Chat Models

2023

LLaMA 2 70B (one-shot)

facebookresearch/llama llamafamily/llama-chinese

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

2021

GLaM 62B/64E (Few-Shot)

PaLM 2 Technical Report

2023

PaLM 2-M (one-shot)

eternityyw/tram-benchmark

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 65B (one-shot)

huggingface/transformers ggml-org/llama.cpp

Language Models are Few-Shot Learners

2020

GPT-3 175B (Few-Shot, k=64)

ggml-org/llama.cpp ggerganov/llama.cpp

PaLM: Scaling Language Modeling with Pathways

2022

PaLM-540B (One-Shot)

lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch

Mistral 7B

2023

Mistral 7B (5-shot)

mistralai/mistral-src facebookresearch/fairseq2

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher (few-shot, k=64)

allenai/dolma rvlopes/gloria bramiozo/PubScience

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

2021

GLaM 62B/64E (One-Shot)

PaLM 2 Technical Report

2023

PaLM 2-S (one-shot)

eternityyw/tram-benchmark

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 33B (zero-shot)

huggingface/transformers ggml-org/llama.cpp

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

2021

GLaM 62B/64E (Zero-Shot)

PaLM: Scaling Language Modeling with Pathways

2022

PaLM-540B (Zero-Shot)

lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch

Ask Me Anything: A simple strategy for prompting language models

2022

Neo-6B (QA)

hazyresearch/ama_prompting simran-arora/privacy_fm simran-arora/focus

Ask Me Anything: A simple strategy for prompting language models

2022

Neo-6B (QA + WS)

hazyresearch/ama_prompting simran-arora/privacy_fm simran-arora/focus

Ask Me Anything: A simple strategy for prompting language models

2022

Neo-6B (Few-Shot)

hazyresearch/ama_prompting simran-arora/privacy_fm simran-arora/focus

Model	Paper	EM	Date
Atlas (full, Wiki-dec-2018 index)	Atlas: Few-shot Learning with Retrieval Augmented…	64.00	2022-08-05
Atlas (full, Wiki-dec-2021+CC index)	Atlas: Few-shot Learning with Retrieval Augmented…	60.40	2022-08-05
DPA-RAG	Understand What LLM Needs: Dual Preference Alignm…	59.19	2024-06-26
FiE	FiE: Building a Global Probability Space by Lever…	58.40	2022-11-18
R2-D2 (full)	R2-D2: A Modular Baseline for Open-Domain Questio…	55.90	2021-09-08
ReAtt	Retrieval as Attention: End-to-end Learning of Re…	54.70	2022-12-05
FiD-KD (full)	Leveraging Passage Retrieval with Generative Mode…	54.70	2020-07-02
RankRAG-llama3-70b (Zero-Shot, KILT)	RankRAG: Unifying Context Ranking with Retrieval-…	54.20	2024-07-02
EMDR^2	End-to-End Training of Multi-Document Reader and …	52.50	2021-06-09
FiD (full)	Leveraging Passage Retrieval with Generative Mode…	51.40	2020-07-02
RankRAG-llama3-8b (Zero-Shot, KILT)	RankRAG: Unifying Context Ranking with Retrieval-…	50.60	2024-07-02
RankRAG-llama3-70b (Zero-Shot, DPR)	RankRAG: Unifying Context Ranking with Retrieval-…	50.00	2024-07-02
ChatQA-1.5-llama3-70b (Zero-Shot, KILT)	ChatQA: Surpassing GPT-4 on Conversational QA and…	47.00	2024-01-18
RankRAG-llama3-8b (Zero-Shot, DPR)	RankRAG: Unifying Context Ranking with Retrieval-…	46.10	2024-07-02
RETRO + DPR (full)	Improving language models by retrieving from tril…	45.50	2021-12-08
code-davinci-002 175B + REPLUG LSR (few-shot)	REPLUG: Retrieval-Augmented Black-Box Language Mo…	45.50	2023-01-30
Atlas (few-shot, k=64, Wiki-dec-2018 index)	Atlas: Few-shot Learning with Retrieval Augmented…	45.10	2022-08-05
code-davinci-002 175B + REPLUG (few-shot)	REPLUG: Retrieval-Augmented Black-Box Language Mo…	44.70	2023-01-30
RAG	Retrieval-Augmented Generation for Knowledge-Inte…	44.50	2020-05-22
ChatQA-1.5-llama3-8b (Zero-Shot, KILT)	ChatQA: Surpassing GPT-4 on Conversational QA and…	42.70	2024-01-18
Blended RAG	Blended RAG: Improving RAG (Retriever-Augmented G…	42.63	2024-03-22
Atlas (few-shot, k=64, Wiki-dec-2021+CC index)	Atlas: Few-shot Learning with Retrieval Augmented…	42.40	2022-08-05
DPR	Dense Passage Retrieval for Open-Domain Question …	41.50	2020-04-10
REALM	REALM: Retrieval-Augmented Language Model Pre-Tra…	40.40	2020-02-10
LLaMA 65B (few-shot, k=64)	LLaMA: Open and Efficient Foundation Language Mod…	39.90	2023-02-27
PaLM-540B (Few-Shot, k=64)	PaLM: Scaling Language Modeling with Pathways	39.60	2022-04-05
PaLM 2-L (one-shot)	PaLM 2 Technical Report	37.50	2023-05-17
Chinchilla (few-shot, k=64)	Training Compute-Optimal Large Language Models	35.50	2022-03-29
LLaMA 65B (few-shot, k=5)	LLaMA: Open and Efficient Foundation Language Mod…	35.00	2023-02-27
Search-o1	Search-o1: Agentic Search-Enhanced Large Reasonin…	34.00	2025-01-09
LLaMA 2 70B (one-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Mode…	33.00	2023-07-18
GLaM 62B/64E (Few-Shot)	GLaM: Efficient Scaling of Language Models with M…	32.50	2021-12-13
PaLM 2-M (one-shot)	PaLM 2 Technical Report	32.00	2023-05-17
LLaMA 65B (one-shot)	LLaMA: Open and Efficient Foundation Language Mod…	31.00	2023-02-27
GPT-3 175B (Few-Shot, k=64)	Language Models are Few-Shot Learners	29.90	2020-05-28
PaLM-540B (One-Shot)	PaLM: Scaling Language Modeling with Pathways	29.30	2022-04-05
Mistral 7B (5-shot)	Mistral 7B	28.80	2023-10-10
Gopher (few-shot, k=64)	Scaling Language Models: Methods, Analysis & Insi…	28.20	2021-12-08
GLaM 62B/64E (One-Shot)	GLaM: Efficient Scaling of Language Models with M…	26.30	2021-12-13
PaLM 2-S (one-shot)	PaLM 2 Technical Report	25.30	2023-05-17
LLaMA 33B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	24.90	2023-02-27
GLaM 62B/64E (Zero-Shot)	GLaM: Efficient Scaling of Language Models with M…	24.70	2021-12-13
PaLM-540B (Zero-Shot)	PaLM: Scaling Language Modeling with Pathways	21.20	2022-04-05
Neo-6B (QA)	Ask Me Anything: A simple strategy for prompting …	19.70	2022-10-05
Neo-6B (QA + WS)	Ask Me Anything: A simple strategy for prompting …	19.60	2022-10-05
Neo-6B (Few-Shot)	Ask Me Anything: A simple strategy for prompting …	13.70	2022-10-05
