PubChemQA consists of molecules and their corresponding textual descriptions from PubChem. It contains a single type of question: "Please describe the molecule." We remove molecules that cannot be processed by RDKit [Landrum et al., 2021] to generate 2D molecular graphs. We also remove texts with fewer than 4 words and crop descriptions with more than 256 words. Finally, we obtain 325,754 unique molecules and 365,129 molecule-text pairs. On average, each text description contains 17 words.
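The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' actual pipeline: it assumes molecules arrive as SMILES strings, that "words" means whitespace-separated tokens, and that cropping keeps the first 256 words. The names `clean_description`, `rdkit_can_parse`, and `filter_pairs` are hypothetical helpers introduced here.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

MIN_WORDS = 4    # texts with fewer words are dropped
MAX_WORDS = 256  # longer texts are cropped to this many words


def clean_description(text: str,
                      min_words: int = MIN_WORDS,
                      max_words: int = MAX_WORDS) -> Optional[str]:
    """Return the cropped text, or None if it is too short.

    Assumption: words are whitespace-separated tokens.
    """
    words = text.split()
    if len(words) < min_words:
        return None
    return " ".join(words[:max_words])


def rdkit_can_parse(smiles: str) -> bool:
    """Check whether RDKit can build a molecule from a SMILES string.

    Assumption: the input format is SMILES. RDKit is imported lazily so the
    text-filtering helpers above work without it installed.
    """
    from rdkit import Chem
    return Chem.MolFromSmiles(smiles) is not None


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    is_valid_molecule: Callable[[str], bool] = rdkit_can_parse,
) -> Iterator[Tuple[str, str]]:
    """Yield (molecule, description) pairs that survive both filters."""
    for smiles, text in pairs:
        if not is_valid_molecule(smiles):
            continue  # molecule cannot be processed
        cleaned = clean_description(text)
        if cleaned is not None:
            yield smiles, cleaned
```

Passing a custom `is_valid_molecule` callable makes it possible to swap in a different validity check, or to exercise the text filters on their own.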
Variants: PubChemQA
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Question Answering | BioMedGPT-10B | BioMedGPT: Open Multimodal Generative Pre-trained … | 2023-08-18 |
Question Answering | Llama2-7B-chat | Llama 2: Open Foundation and … | 2023-07-18 |