We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions are sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, the majority from ML and NLP, along with a subset from other scientific communities such as Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval: even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average length of 12k tokens.
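A minimal sketch of the decontextualization idea mentioned above, not the authors' actual pipeline: each paragraph is prefixed with the paper title and its section heading before retrieval, so passages stay interpretable outside their document context. TF-IDF is used here only as a stand-in retriever, and the `section`/`text` field names and example records are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def decontextualize(paper_title: str, paragraphs: list[dict]) -> list[str]:
    """Prefix each paragraph with the paper title and its section heading (assumed fields)."""
    return [f"{paper_title}. {p['section']}. {p['text']}" for p in paragraphs]


def retrieve(question: str, passages: list[str], top_k: int = 5) -> list[int]:
    """Rank passages against the question and return indices of the top-k matches."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(passages + [question])
    # Last row is the question; compare it against all passage vectors.
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return scores.argsort()[::-1][:top_k].tolist()


# Illustrative usage with made-up passage records.
paper_title = "PeerQA: A Scientific Question Answering Dataset from Peer Reviews"
paragraphs = [
    {"section": "Experiments", "text": "We evaluate dense and sparse retrievers ..."},
    {"section": "Dataset", "text": "Questions are sourced from peer reviews ..."},
]
passages = decontextualize(paper_title, paragraphs)
print(retrieve("Where do the questions come from?", passages, top_k=1))
```

The same prefixing step can be applied before encoding passages with a dense retriever instead of TF-IDF; the decontextualization itself is independent of the retrieval architecture.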
Variants: PeerQA
This dataset is used in 2 benchmarks:
| Task | Model | Paper | Date |
|---|---|---|---|
| Answerability Prediction | Llama-3-IT-8B-32k | The Llama 3 Herd of … | 2024-07-31 |
| Answerability Prediction | Llama-3-IT-8B-8k | The Llama 3 Herd of … | 2024-07-31 |
| Question Answering | Llama-3-IT-8B-32k | The Llama 3 Herd of … | 2024-07-31 |
| Question Answering | Llama-3-IT-8B-8k | The Llama 3 Herd of … | 2024-07-31 |
| Answerability Prediction | Mistral-IT-v02-7B-32k | Mistral 7B | 2023-10-10 |
| Question Answering | Mistral-v02-7B-32k | Mistral 7B | 2023-10-10 |
| Question Answering | GPT-4o-2024-08-06-128k | GPT-4 Technical Report | 2023-03-15 |
| Answerability Prediction | GPT-4o-2024-08-06 | GPT-4 Technical Report | 2023-03-15 |
| Answerability Prediction | GPT-3.5-Turbo-0613-16k | Language Models are Few-Shot Learners | 2020-05-28 |
| Question Answering | GPT-3.5-Turbo-0613-16k | Language Models are Few-Shot Learners | 2020-05-28 |