PeerQA

Dataset Information

Modalities: Texts
Languages: English
Introduced: 2025
License:

Overview

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions are sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining a scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, the majority from ML and NLP, with a smaller subset from other scientific communities such as Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval: even simple decontextualization approaches consistently improve retrieval performance across architectures. For answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average length of 12k tokens.
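To make the decontextualization point concrete, here is a minimal sketch in Python. It assumes one simple form of decontextualization, prepending the paper title to each passage before scoring, and uses a toy token-overlap scorer in place of a real retriever. The function names (`decontextualize`, `overlap_score`, `rank`) and the example strings are hypothetical illustrations, not part of the PeerQA release or its baselines.

```python
from collections import Counter

# Punctuation stripped from token edges; a deliberately crude tokenizer.
_PUNCT = ".,:;?!()\"'"


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip edge punctuation."""
    tokens = (tok.strip(_PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def decontextualize(passage: str, title: str) -> str:
    """Assumed strategy: prepend the paper title so the passage carries
    document-level context that the passage text alone may lack."""
    return f"{title}. {passage}"


def overlap_score(query: str, passage: str) -> int:
    """Toy lexical scorer: number of distinct query tokens found in the passage."""
    passage_counts = Counter(tokenize(passage))
    return sum(1 for tok in set(tokenize(query)) if passage_counts[tok] > 0)


def rank(query: str, passages: list[str]) -> list[int]:
    """Return passage indices by descending score (ties keep input order)."""
    scores = [overlap_score(query, p) for p in passages]
    return sorted(range(len(passages)), key=lambda i: -scores[i])


if __name__ == "__main__":
    title = "PeerQA"  # hypothetical document title
    query = "How many pairs does PeerQA contain?"
    raw = "The dataset has 579 QA pairs."
    print(overlap_score(query, raw))
    print(overlap_score(query, decontextualize(raw, title)))
```

In this toy setup the passage never names the dataset, so a query that does name it gains a matching token only after the title is prepended. A dense or cross-encoder retriever would replace `overlap_score`, with `decontextualize` applied to passages before indexing.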

Variants: PeerQA

Associated Benchmarks

This dataset is used in 2 benchmarks:

Recent Benchmark Submissions

Task                     | Model                  | Paper                                 | Date
answerability prediction | Llama-3-IT-8B-32k      | The Llama 3 Herd of …                 | 2024-07-31
answerability prediction | Llama-3-IT-8B-8k       | The Llama 3 Herd of …                 | 2024-07-31
Question Answering       | Llama-3-IT-8B-32k      | The Llama 3 Herd of …                 | 2024-07-31
Question Answering       | Llama-3-IT-8B-8k       | The Llama 3 Herd of …                 | 2024-07-31
answerability prediction | Mistral-IT-v02-7B-32k  | Mistral 7B                            | 2023-10-10
Question Answering       | Mistral-v02-7B-32k     | Mistral 7B                            | 2023-10-10
Question Answering       | GPT-4o-2024-08-06-128k | GPT-4 Technical Report                | 2023-03-15
answerability prediction | GPT-4o-2024-08-06      | GPT-4 Technical Report                | 2023-03-15
answerability prediction | GPT-3.5-Turbo-0613-16k | Language Models are Few-Shot Learners | 2020-05-28
Question Answering       | GPT-3.5-Turbo-0613-16k | Language Models are Few-Shot Learners | 2020-05-28

Research Papers

Recent papers with results on this dataset: