TriviaQA

Question Answering Benchmark

Performance Over Time

Showing 51 results | Metric: EM (exact match)

Top Performing Models

Rank	Model	Paper	EM	Date	Code
1	RankRAG-llama3-70b (Zero-Shot, KILT) 📚	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	86.50	2024-07-02	-
2	PaLM 2-L (one-shot) 📚	PaLM 2 Technical Report	86.10	2023-05-17	📦 eternityyw/tram-benchmark
3	ChatQA-1.5-llama3-70b (Zero-Shot, KILT) 📚	ChatQA: Surpassing GPT-4 on Conversational QA and RAG	85.60	2024-01-18	-
4	LLaMA 2 70B (one-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Models	85.00	2023-07-18	📦 facebookresearch/llama 📦 llamafamily/llama-chinese 📦 flagalpha/llama2-chinese
5	GPT-4-0613 (Zero-shot)	GPT-4 Technical Report	84.80	2023-03-15	📦 openai/evals 📦 shmsw25/factscore 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models
6	SpanBERT	SpanBERT: Improving Pre-training by Representing and Predicting Spans	83.60	2019-07-24	📦 facebookresearch/SpanBERT 📦 mandarjoshi90/coref 📦 zixinzeng-jennifer/spanbert_trans
7	RankRAG-llama3-8b (Zero-Shot, KILT) 📚	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	82.90	2024-07-02	-
8	PaLM 2-M (one-shot)	PaLM 2 Technical Report	81.70	2023-05-17	📦 eternityyw/tram-benchmark
9	PaLM-540B (Few-Shot) 📚	PaLM: Scaling Language Modeling with Pathways	81.40	2022-04-05	📦 lucidrains/CoCa-pytorch 📦 lucidrains/PaLM-pytorch 📦 google/paxml
10	PaLM-540B (One-Shot)	PaLM: Scaling Language Modeling with Pathways	81.40	2022-04-05	📦 lucidrains/CoCa-pytorch 📦 lucidrains/PaLM-pytorch 📦 google/paxml

All Papers (51)

Paper	Year	Model	Code
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024	RankRAG-llama3-70b (Zero-Shot, KILT)	-
PaLM 2 Technical Report	2023	PaLM 2-L (one-shot)	eternityyw/tram-benchmark
ChatQA: Surpassing GPT-4 on Conversational QA and RAG	2024	ChatQA-1.5-llama3-70b (Zero-Shot, KILT)	-
Llama 2: Open Foundation and Fine-Tuned Chat Models	2023	LLaMA 2 70B (one-shot)	facebookresearch/llama llamafamily/llama-chinese
GPT-4 Technical Report	2023	GPT-4-0613 (Zero-shot)	openai/evals shmsw25/factscore
SpanBERT: Improving Pre-training by Representing and Predicting Spans	2019	SpanBERT	facebookresearch/SpanBERT mandarjoshi90/coref
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024	RankRAG-llama3-8b (Zero-Shot, KILT)	-
PaLM 2 Technical Report	2023	PaLM 2-M (one-shot)	eternityyw/tram-benchmark
PaLM: Scaling Language Modeling with Pathways	2022	PaLM-540B (Few-Shot)	lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch
PaLM: Scaling Language Modeling with Pathways	2022	PaLM-540B (One-Shot)	lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch
ChatQA: Surpassing GPT-4 on Conversational QA and RAG	2024	ChatQA-1.5-llama3-8B (Zero-Shot, KILT)	-
Big Bird: Transformers for Longer Sequences	2020	BigBird-etc	huggingface/transformers tensorflow/models
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation	2024	DPA-RAG	dongguanting/dpa-rag
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling	2024	GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct)	yaoching0/gac
LinkBERT: Pretraining Language Models with Document Links	2022	LinkBERT (large)	michiyasunaga/LinkBERT
DyREx: Dynamic Query Representation for Extractive Question Answering	2022	DyREX	urchade/dyrex
REPLUG: Retrieval-Augmented Black-Box Language Models	2023	code-davinci-002 175B + REPLUG LSR (Few-Shot)	ruc-nlpir/flashrag intellabs/fastrag liano3/RAG-fairness
PaLM: Scaling Language Modeling with Pathways	2022	PaLM-540B (Zero-Shot)	lucidrains/CoCa-pytorch lucidrains/PaLM-pytorch
REPLUG: Retrieval-Augmented Black-Box Language Models	2023	code-davinci-002 175B + REPLUG (Few-Shot)	ruc-nlpir/flashrag intellabs/fastrag liano3/RAG-fairness
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	2021	GLaM 62B/64E (One-shot)	-
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	2021	GLaM 62B/64E (Few-shot)	-
RA-DIT: Retrieval-Augmented Dual Instruction Tuning	2023	RA-DIT (Zero-Shot)	-
PaLM 2 Technical Report	2023	PaLM 2-S (one-shot)	eternityyw/tram-benchmark
Search-o1: Agentic Search-Enhanced Large Reasoning Models	2025	Search-o1	sunnynexus/search-o1 terrierteam/pyterrier_rag
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 65B (few-shot, k=64)	huggingface/transformers ggml-org/llama.cpp
FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering	2022	FiE+PAQ	-
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 65B (few-shot, k=5)	huggingface/transformers ggml-org/llama.cpp
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024	RankRAG-llama3-70b (Zero-Shot, DPR)	-
Distilling Knowledge from Reader to Retriever for Question Answering	2020	FiD+Distil	facebookresearch/FiD lucidrains/marge-pytorch
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 65B (one-shot)	huggingface/transformers ggml-org/llama.cpp
End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering	2021	EMDR2	DevSinghSachan/emdr2 DevSinghSachan/art
Simple and Effective Multi-Paragraph Reading Comprehension	2017	S-Norm	allenai/document-qa
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	2021	GLaM 62B/64E (Zero-shot)	-
Language Models are Few-Shot Learners	2020	GPT-3 175B (Few-Shot)	ggml-org/llama.cpp ggerganov/llama.cpp
UnitedQA: A Hybrid Approach for Open Domain Question Answering	2021	UnitedQA (Hybrid reader)	-
Mistral 7B	2023	Mistral 7B (5-shot)	mistralai/mistral-src facebookresearch/fairseq2
ChatQA: Surpassing GPT-4 on Conversational QA and RAG	2024	ChatQA-1.5-llama3-70b (Zero-Shot, DPR)	-
LLaMA: Open and Efficient Foundation Language Models	2023	LLaMA 65B (zero-shot)	huggingface/transformers ggml-org/llama.cpp
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering	2020	Fusion-in-Decoder (large)	jhyuklee/DensePhrases princeton-nlp/DensePhrases
Mention Memory: incorporating textual knowledge into Transformers through entity mention attention	2021	TOME-2	google-research/language
SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments	2024	Shakti-LLM (2.5B)	-
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM	2024	Branch-Train-MiX 4x7B (sampling top-2 experts)	Leeroo-AI/mergoo
Dense Passage Retrieval for Open-Domain Question Answering	2020	DPR	huggingface/transformers deepset-ai/haystack
Dynamic Integration of Background Knowledge in Neural NLU Systems	2017	Reading Twice for NLU	-
Finetuned Language Models Are Zero-Shot Learners	2021	FLAN 137B (zero-shot)	hiyouga/llama-efficient-tuning bigcode-project/starcoder
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks	2020	RAG	huggingface/transformers assafelovic/gpt-researcher
Reinforced Mnemonic Reader for Machine Reading Comprehension	2017	Mnemonic Reader	HKUST-KnowComp/MnemonicReader ewrfcas/Reinforced-Mnemonic-Reader yly-revive/chainer-mreader
MEMEN: Multi-layer Embedding with Memory Networks for Machine Comprehension	2017	MEMEN	-
ReasonBERT: Pre-trained to Reason with Distant Supervision	2021	ReasonBERTR	sunlab-osu/reasonbert
Latent Retrieval for Weakly Supervised Open Domain Question Answering	2019	ORQA	google-research/language mia-workshop/mia-shared-task-2022 okanvk/Question-Answering-Project
ReasonBERT: Pre-trained to Reason with Distant Supervision	2021	ReasonBERTB	sunlab-osu/reasonbert

Model	Paper	EM	Date
RankRAG-llama3-70b (Zero-Shot, KILT)	RankRAG: Unifying Context Ranking with Retrieval-…	86.50	2024-07-02
PaLM 2-L (one-shot)	PaLM 2 Technical Report	86.10	2023-05-17
ChatQA-1.5-llama3-70b (Zero-Shot, KILT)	ChatQA: Surpassing GPT-4 on Conversational QA and…	85.60	2024-01-18
LLaMA 2 70B (one-shot)	Llama 2: Open Foundation and Fine-Tuned Chat Mode…	85.00	2023-07-18
GPT-4-0613 (Zero-shot)	GPT-4 Technical Report	84.80	2023-03-15
SpanBERT	SpanBERT: Improving Pre-training by Representing …	83.60	2019-07-24
RankRAG-llama3-8b (Zero-Shot, KILT)	RankRAG: Unifying Context Ranking with Retrieval-…	82.90	2024-07-02
PaLM 2-M (one-shot)	PaLM 2 Technical Report	81.70	2023-05-17
PaLM-540B (Few-Shot)	PaLM: Scaling Language Modeling with Pathways	81.40	2022-04-05
PaLM-540B (One-Shot)	PaLM: Scaling Language Modeling with Pathways	81.40	2022-04-05
ChatQA-1.5-llama3-8B (Zero-Shot, KILT)	ChatQA: Surpassing GPT-4 on Conversational QA and…	81.00	2024-01-18
BigBird-etc	Big Bird: Transformers for Longer Sequences	80.90	2020-07-28
DPA-RAG	Understand What LLM Needs: Dual Preference Alignm…	80.10	2024-06-26
GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct)	Breaking the Ceiling of the LLM Community by Trea…	79.29	2024-06-18
LinkBERT (large)	LinkBERT: Pretraining Language Models with Docume…	78.20	2022-03-29
DyREX	DyREx: Dynamic Query Representation for Extractiv…	77.37	2022-10-26
code-davinci-002 175B + REPLUG LSR (Few-Shot)	REPLUG: Retrieval-Augmented Black-Box Language Mo…	77.30	2023-01-30
PaLM-540B (Zero-Shot)	PaLM: Scaling Language Modeling with Pathways	76.90	2022-04-05
code-davinci-002 175B + REPLUG (Few-Shot)	REPLUG: Retrieval-Augmented Black-Box Language Mo…	76.80	2023-01-30
GLaM 62B/64E (One-shot)	GLaM: Efficient Scaling of Language Models with M…	75.80	2021-12-13
GLaM 62B/64E (Few-shot)	GLaM: Efficient Scaling of Language Models with M…	75.80	2021-12-13
RA-DIT (Zero-Shot)	RA-DIT: Retrieval-Augmented Dual Instruction Tuni…	75.40	2023-10-02
PaLM 2-S (one-shot)	PaLM 2 Technical Report	75.20	2023-05-17
Search-o1	Search-o1: Agentic Search-Enhanced Large Reasonin…	74.10	2025-01-09
LLaMA 65B (few-shot, k=64)	LLaMA: Open and Efficient Foundation Language Mod…	73.00	2023-02-27
FiE+PAQ	FiE: Building a Global Probability Space by Lever…	72.60	2022-11-18
LLaMA 65B (few-shot, k=5)	LLaMA: Open and Efficient Foundation Language Mod…	72.60	2023-02-27
RankRAG-llama3-70b (Zero-Shot, DPR)	RankRAG: Unifying Context Ranking with Retrieval-…	72.60	2024-07-02
FiD+Distil	Distilling Knowledge from Reader to Retriever for…	72.10	2020-12-08
LLaMA 65B (one-shot)	LLaMA: Open and Efficient Foundation Language Mod…	71.60	2023-02-27
EMDR2	End-to-End Training of Multi-Document Reader and …	71.40	2021-06-09
S-Norm	Simple and Effective Multi-Paragraph Reading Comp…	71.32	2017-10-29
GLaM 62B/64E (Zero-shot)	GLaM: Efficient Scaling of Language Models with M…	71.30	2021-12-13
GPT-3 175B (Few-Shot)	Language Models are Few-Shot Learners	71.20	2020-05-28
UnitedQA (Hybrid reader)	UnitedQA: A Hybrid Approach for Open Domain Quest…	70.30	2021-01-01
Mistral 7B (5-shot)	Mistral 7B	69.90	2023-10-10
ChatQA-1.5-llama3-70b (Zero-Shot, DPR)	ChatQA: Surpassing GPT-4 on Conversational QA and…	69.00	2024-01-18
LLaMA 65B (zero-shot)	LLaMA: Open and Efficient Foundation Language Mod…	68.20	2023-02-27
Fusion-in-Decoder (large)	Leveraging Passage Retrieval with Generative Mode…	67.60	2020-07-02
TOME-2	Mention Memory: incorporating textual knowledge i…	65.80	2021-10-12
Shakti-LLM (2.5B)	SHAKTI: A 2.5 Billion Parameter Small Language Mo…	58.20	2024-10-15
Branch-Train-MiX 4x7B (sampling top-2 experts)	Branch-Train-MiX: Mixing Expert LLMs into a Mixtu…	57.10	2024-03-12
DPR	Dense Passage Retrieval for Open-Domain Question …	56.80	2020-04-10
Reading Twice for NLU	Dynamic Integration of Background Knowledge in Ne…	56.73	2017-06-08
FLAN 137B (zero-shot)	Finetuned Language Models Are Zero-Shot Learners	56.70	2021-09-03
RAG	Retrieval-Augmented Generation for Knowledge-Inte…	56.10	2020-05-22
Mnemonic Reader	Reinforced Mnemonic Reader for Machine Reading Co…	52.85	2017-05-08
MEMEN	MEMEN: Multi-layer Embedding with Memory Networks…	46.90	2017-07-28
ReasonBERTR	ReasonBERT: Pre-trained to Reason with Distant Su…	45.50	2021-09-10
ORQA	Latent Retrieval for Weakly Supervised Open Domai…	45.00	2019-06-01
ReasonBERTB	ReasonBERT: Pre-trained to Reason with Distant Su…	37.20	2021-09-10
