TruthfulQA

Question Answering Benchmark

Showing 30 results | Metric: MC1

Top Performing Models

Rank	Model	Paper	MC1	Date	Code
1	Shakti-LLM (2.5B)	SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments	68.40	2024-10-15	-
2	CoA	Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models	67.30	2024-03-26	📦 MAGICS-LAB/Chain-of-Actions
3	ToT	Tree of Thoughts: Deliberate Problem Solving with Large Language Models	66.60	2023-05-17	📦 ysymyth/tree-of-thought-llm 📦 princeton-nlp/tree-of-thought-llm 📦 codelion/optillm
4	CoA w/o actions	Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models	63.30	2024-03-26	📦 MAGICS-LAB/Chain-of-Actions
5	LLaMA 65B	LLaMA: Open and Efficient Foundation Language Models	53.00	2023-02-27	📦 huggingface/transformers 📦 ggml-org/llama.cpp 📦 ggerganov/llama.cpp
6	LLaMA 33B	LLaMA: Open and Efficient Foundation Language Models	48.00	2023-02-27	📦 huggingface/transformers 📦 ggml-org/llama.cpp 📦 ggerganov/llama.cpp
7	Auto-CoT	Automatic Chain of Thought Prompting in Large Language Models	42.20	2022-10-07	📦 microsoft/guidance 📦 guidance-ai/guidance 📦 amazon-research/auto-cot 📦 amazon-science/auto-cot 📦 lastmile-ai/aiconfig
8	LLaMA 13B	LLaMA: Open and Efficient Foundation Language Models	41.00	2023-02-27	📦 huggingface/transformers 📦 ggml-org/llama.cpp 📦 ggerganov/llama.cpp
9	LLaMA 7B	LLaMA: Open and Efficient Foundation Language Models	29.00	2023-02-27	📦 huggingface/transformers 📦 ggml-org/llama.cpp 📦 ggerganov/llama.cpp
10	GPT-4 (RLHF)	GPT-4 Technical Report	0.59	2023-03-15	📦 openai/evals 📦 shmsw25/factscore 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models
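
The MC1 column is single-true multiple-choice accuracy: each question has exactly one correct answer among its choices, and the question counts as correct when the model gives that answer its highest score (the log-probability of the completion in the original TruthfulQA setup). A minimal Python sketch, assuming the per-choice scores are already computed. The function name and input layout are illustrative and not taken from any repository listed here:

```python
from typing import Sequence


def mc1_accuracy(scores: Sequence[Sequence[float]], correct: Sequence[int]) -> float:
    """Fraction of questions whose top-scoring choice is the correct one.

    scores[i][j] is the model's score for choice j of question i (hypothetical layout);
    correct[i] is the index of the single true answer for question i.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if not scores:
        raise ValueError("need at least one question")
    hits = 0
    for choice_scores, answer in zip(scores, correct):
        # The model's pick is the choice with the highest score.
        pick = max(range(len(choice_scores)), key=lambda j: choice_scores[j])
        hits += int(pick == answer)
    return hits / len(scores)


# Two toy questions: the first is answered correctly, the second is not.
example_scores = [[-1.2, -3.4, -2.0], [-0.5, -0.4, -2.2]]
example_correct = [0, 0]
print(mc1_accuracy(example_scores, example_correct))  # 0.5
```

The result is a fraction between 0 and 1; multiplying by 100 gives the percentage form used by the higher-ranked rows.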

All Papers (30)

SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments

2024

Shakti-LLM (2.5B)

Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models

2024

CoA

MAGICS-LAB/Chain-of-Actions

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

2023

ToT

ysymyth/tree-of-thought-llm princeton-nlp/tree-of-thought-llm

Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models

2024

CoA w/o actions

MAGICS-LAB/Chain-of-Actions

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 65B

huggingface/transformers ggml-org/llama.cpp

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 33B

huggingface/transformers ggml-org/llama.cpp

Automatic Chain of Thought Prompting in Large Language Models

2022

Auto-CoT

microsoft/guidance guidance-ai/guidance

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 13B

huggingface/transformers ggml-org/llama.cpp

LLaMA: Open and Efficient Foundation Language Models

2023

LLaMA 7B

huggingface/transformers ggml-org/llama.cpp

GPT-4 Technical Report

2023

GPT-4 (RLHF)

openai/evals shmsw25/factscore

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

2024

Mistral-7B-Instruct-v0.2 + TruthX

ictnlp/truthx

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

2024

LLaMA-2-7B-Chat + TruthX

ictnlp/truthx

Representation Engineering: A Top-Down Approach to AI Transparency

2023

LLaMA-2-Chat-13B + Representation Control (Contrast Vector)

andyzoujm/representation-engineering steering-vectors/steering-vectors

Representation Engineering: A Top-Down Approach to AI Transparency

2023

LLaMA-2-Chat-7B + Representation Control (Contrast Vector)

andyzoujm/representation-engineering steering-vectors/steering-vectors

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher 280B (zero-shot, Our Prompt + Choices)

allenai/dolma rvlopes/gloria bramiozo/PubScience

Galactica: A Large Language Model for Science

2022

GAL 120B

paperswithcode/galai

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher 7.1B (zero-shot, QA prompts)

allenai/dolma rvlopes/gloria bramiozo/PubScience

Galactica: A Large Language Model for Science

2022

GAL 30B

paperswithcode/galai

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher 7.1B (zero-shot, Our Prompt + Choices)

allenai/dolma rvlopes/gloria bramiozo/PubScience

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher 1.4B (zero-shot, QA prompts)

allenai/dolma rvlopes/gloria bramiozo/PubScience

TruthfulQA: Measuring How Models Mimic Human Falsehoods

2021

GPT-2 1.5B

sylinrl/truthfulqa yizhongw/truthfulqa_reeval lurosenb/sass

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher 1.4B (zero-shot, Our Prompt + Choices)

allenai/dolma rvlopes/gloria bramiozo/PubScience

TruthfulQA: Measuring How Models Mimic Human Falsehoods

2021

GPT-3 175B

sylinrl/truthfulqa yizhongw/truthfulqa_reeval lurosenb/sass

Galactica: A Large Language Model for Science

2022

OPT 175B

paperswithcode/galai

TruthfulQA: Measuring How Models Mimic Human Falsehoods

2021

GPT-J 6B

sylinrl/truthfulqa yizhongw/truthfulqa_reeval lurosenb/sass

TruthfulQA: Measuring How Models Mimic Human Falsehoods

2021

UnifiedQA 3B

sylinrl/truthfulqa yizhongw/truthfulqa_reeval lurosenb/sass

Galactica: A Large Language Model for Science

2022

GAL 125M

paperswithcode/galai

Galactica: A Large Language Model for Science

2022

GAL 1.3B

paperswithcode/galai

Galactica: A Large Language Model for Science

2022

GAL 6.7B

paperswithcode/galai

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

2021

Gopher 280B (zero-shot, QA prompts)

allenai/dolma rvlopes/gloria bramiozo/PubScience

Model	Paper	MC1	Date
Shakti-LLM (2.5B)	SHAKTI: A 2.5 Billion Parameter Small Language Mo…	68.40	2024-10-15
CoA	Chain-of-Action: Faithful and Multimodal Question…	67.30	2024-03-26
ToT	Tree of Thoughts: Deliberate Problem Solving with…	66.60	2023-05-17
CoA w/o actions	Chain-of-Action: Faithful and Multimodal Question…	63.30	2024-03-26
LLaMA 65B	LLaMA: Open and Efficient Foundation Language Mod…	53.00	2023-02-27
LLaMA 33B	LLaMA: Open and Efficient Foundation Language Mod…	48.00	2023-02-27
Auto-CoT	Automatic Chain of Thought Prompting in Large Lan…	42.20	2022-10-07
LLaMA 13B	LLaMA: Open and Efficient Foundation Language Mod…	41.00	2023-02-27
LLaMA 7B	LLaMA: Open and Efficient Foundation Language Mod…	29.00	2023-02-27
GPT-4 (RLHF)	GPT-4 Technical Report	0.59	2023-03-15
Mistral-7B-Instruct-v0.2 + TruthX	TruthX: Alleviating Hallucinations by Editing Lar…	0.56	2024-02-27
LLaMA-2-7B-Chat + TruthX	TruthX: Alleviating Hallucinations by Editing Lar…	0.54	2024-02-27
LLaMA-2-Chat-13B + Representation Control (Contrast Vector)	Representation Engineering: A Top-Down Approach t…	0.54	2023-10-02
LLaMA-2-Chat-7B + Representation Control (Contrast Vector)	Representation Engineering: A Top-Down Approach t…	0.48	2023-10-02
Gopher 280B (zero-shot, Our Prompt + Choices)	Scaling Language Models: Methods, Analysis & Insi…	0.30	2021-12-08
GAL 120B	Galactica: A Large Language Model for Science	0.26	2022-11-16
Gopher 7.1B (zero-shot, QA prompts)	Scaling Language Models: Methods, Analysis & Insi…	0.25	2021-12-08
GAL 30B	Galactica: A Large Language Model for Science	0.24	2022-11-16
Gopher 7.1B (zero-shot, Our Prompt + Choices)	Scaling Language Models: Methods, Analysis & Insi…	0.23	2021-12-08
Gopher 1.4B (zero-shot, QA prompts)	Scaling Language Models: Methods, Analysis & Insi…	0.23	2021-12-08
GPT-2 1.5B	TruthfulQA: Measuring How Models Mimic Human Fals…	0.22	2021-09-08
Gopher 1.4B (zero-shot, Our Prompt + Choices)	Scaling Language Models: Methods, Analysis & Insi…	0.22	2021-12-08
GPT-3 175B	TruthfulQA: Measuring How Models Mimic Human Fals…	0.21	2021-09-08
OPT 175B	Galactica: A Large Language Model for Science	0.21	2022-11-16
GPT-J 6B	TruthfulQA: Measuring How Models Mimic Human Fals…	0.20	2021-09-08
UnifiedQA 3B	TruthfulQA: Measuring How Models Mimic Human Fals…	0.19	2021-09-08
GAL 125M	Galactica: A Large Language Model for Science	0.19	2022-11-16
GAL 1.3B	Galactica: A Large Language Model for Science	0.19	2022-11-16
GAL 6.7B	Galactica: A Large Language Model for Science	0.19	2022-11-16
Gopher 280B (zero-shot, QA prompts)	Scaling Language Models: Methods, Analysis & Insi…	-	2021-12-08
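
The MC1 column appears to mix two scales: values such as 68.40 read as percentages, while values such as 0.59 read as fractions. The sketch below puts rows on one scale before comparing them. It assumes that any value at or below 1.0 is a fraction; this is inferred from the table and is not stated by the source. The helper name `to_percent` is hypothetical:

```python
def to_percent(mc1: float) -> float:
    """Return an MC1 value on a 0-100 scale.

    Assumption (inferred, not confirmed by the source): values <= 1.0 are
    fractions, and larger values are already percentages.
    """
    value = mc1 * 100.0 if mc1 <= 1.0 else mc1
    return round(value, 2)


# Three rows copied from the table above.
rows = [("LLaMA 65B", 53.00), ("GPT-4 (RLHF)", 0.59), ("LLaMA 7B", 29.00)]

# Sort by the normalized score, highest first.
ranked = sorted(
    ((name, to_percent(value)) for name, value in rows),
    key=lambda row: row[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

Under that assumption, the sub-1.0 rows would sit higher than their position in the table suggests, so the listed order is best read with the scale difference in mind.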