PubMedQA

Question Answering Benchmark

Performance Over Time

Showing 26 results | Metric: Accuracy

Top Performing Models

| Rank | Model | Paper | Accuracy (%) | Date | Code |
|---|---|---|---|---|---|
| 1 | Meditron-70B (CoT + SC) | MEDITRON-70B: Scaling Medical Pretraining for Large Language Models | 81.60 | 2023-11-27 | epfllm/meditron |
| 2 | BioGPT-Large(1.5B) | BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining | 81.00 | 2022-10-19 | huggingface/transformers, microsoft/biogpt, 2024-MindSpore-1/Code2, TaoQin/taoqin.github.io |
| 3 | RankRAG-llama3-70B (Zero-Shot) | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | 79.80 | 2024-07-02 | - |
| 4 | Med-PaLM 2 (5-shot) | Towards Expert-Level Medical Question Answering with Large Language Models | 79.20 | 2023-05-16 | m42-health/med42 |
| 5 | Flan-PaLM (540B, Few-shot) | Large Language Models Encode Clinical Knowledge | 79.00 | 2022-12-26 | dmis-lab/olaph |
| 6 | BioGPT(345M) | BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining | 78.20 | 2022-10-19 | huggingface/transformers, microsoft/biogpt, 2024-MindSpore-1/Code2, TaoQin/taoqin.github.io |
| 7 | Codex 5-shot CoT | Can large language models reason about medical questions? | 78.20 | 2022-07-17 | vlievin/medical-reasoning |
| 8 | Human Performance (single annotator) | PubMedQA: A Dataset for Biomedical Research Question Answering | 78.00 | 2019-09-13 | open-dataflow/rare, okanvk/Medical-Specific-Electra-Med-Electra-, okanvk/Medical-Electra, okanvk/Question-Answering-Project, 8023looker/med-rr |
| 9 | MetaGen Blended RAG (zero-shot) | MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning | 77.90 | 2025-05-23 | ibm-self-serve-assets/metagen-blended-rag |
| 10 | GAL 120B (zero-shot) | Galactica: A Large Language Model for Science | 77.60 | 2022-11-16 | paperswithcode/galai |

All Papers (26)