PubChemQA

Dataset Information
Languages: English
Introduced: 2023
License: MIT
Homepage

Overview

PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question: "Please describe the molecule." We remove molecules that RDKit [Landrum et al., 2021] cannot process into 2D molecular graphs. We also remove texts with fewer than 4 words and crop descriptions with more than 256 words. After filtering, we obtain 325,754 unique molecules and 365,129 molecule-text pairs. On average, each text description contains 17 words.
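
The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code. It assumes that molecules are given as SMILES strings, that "cannot be processed by RDKit" means `Chem.MolFromSmiles` returns `None`, and that words are counted by splitting on whitespace. The names `filter_pairs`, `rdkit_can_parse`, `MIN_WORDS`, and `MAX_WORDS` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

MIN_WORDS = 4    # descriptions with fewer words are dropped
MAX_WORDS = 256  # longer descriptions are cropped to this many words


def rdkit_can_parse(smiles: str) -> bool:
    """Assumed validity check: RDKit returns None for SMILES it cannot parse."""
    from rdkit import Chem  # imported lazily so the filter itself runs without RDKit
    return Chem.MolFromSmiles(smiles) is not None


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    is_valid: Callable[[str], bool] = rdkit_can_parse,
) -> List[Tuple[str, str]]:
    """Keep (smiles, description) pairs that pass the molecule and length checks."""
    kept = []
    for smiles, text in pairs:
        if not is_valid(smiles):
            continue  # molecule cannot be processed
        words = text.split()  # assumption: whitespace-delimited words
        if len(words) < MIN_WORDS:
            continue  # description too short
        # Crop overly long descriptions to the first MAX_WORDS words.
        kept.append((smiles, " ".join(words[:MAX_WORDS])))
    return kept
```

Passing a custom `is_valid` callable makes it possible to try the length rules without RDKit installed, or to substitute a stricter molecule check.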

Variants: PubChemQA

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Question Answering | BioMedGPT-10B | BioMedGPT: Open Multimodal Generative Pre-trained … | 2023-08-18
Question Answering | Llama2-7B-chat | Llama 2: Open Foundation and … | 2023-07-18

Research Papers

Recent papers with results on this dataset: