PeerQA

Question Answering Benchmark

Performance Over Time

5 results | Metric: Prometheus-2 Answer Correctness

Rank	Model	Paper	Prometheus-2 Answer Correctness	Paper Date	Code
1	GPT-3.5-Turbo-0613-16k	Language Models are Few-Shot Learners	0.24	2020-05-28	ggml-org/llama.cpp, ggerganov/llama.cpp, karpathy/llm.c
2	Llama-3-IT-8B-8k	The Llama 3 Herd of Models	0.23	2024-07-31	zhuzilin/ring-flash-attention, wenet-e2e/west, zechenli03/sensorllm, ziye2chen/LLMs-for-Mathematical-Analysis, willemsenbram/mention-detection-vgd
3	Llama-3-IT-8B-32k	The Llama 3 Herd of Models	0.23	2024-07-31	zhuzilin/ring-flash-attention, wenet-e2e/west, zechenli03/sensorllm, ziye2chen/LLMs-for-Mathematical-Analysis, willemsenbram/mention-detection-vgd
4	GPT-4o-2024-08-06-128k	GPT-4 Technical Report	0.23	2023-03-15	openai/evals, shmsw25/factscore, unispac/visual-adversarial-examples-jailbreak-large-language-models
5	Mistral-v02-7B-32k	Mistral 7B	0.19	2023-10-10	mistralai/mistral-src, facebookresearch/fairseq2, mgmalek/efficient_cross_entropy