ChEBI-20

Dataset Information
Modalities: Texts, Graphs, Biomedical
Languages: English
Introduced: 2021
License: Unknown

Overview

The dataset contains 33,010 molecule-description pairs, split 80%/10%/10% into train/val/test sets. The goal of the task is to retrieve the molecule that matches a natural-language description. It is defined as follows:

To push the boundaries of multimodal models, we present a new IR task: Text2Mol.

Given a text query and a list of molecules without any reference textual information (represented, for example, as SMILES strings, graphs, or other equivalent representations), retrieve the molecule corresponding to the query. From a text description of a molecule, the model must incorporate the information in the description into a semantic representation that can be used to directly retrieve the molecule. This requires the integration of two very different types of information: the structured knowledge represented by text and the chemical properties present in molecular graphs. We assume there is only one correct (relevant) molecule for each description, so we consider two measures for this task: Hits@1 and mean reciprocal rank (MRR).
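The two measures named above can be sketched as follows. This is a minimal illustration, assuming each query comes with a ranked list of candidate molecule identifiers and exactly one relevant identifier; the function names and the toy identifiers are hypothetical and are not taken from the Text2Mol code.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant item (ranks start at 1), or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == relevant_id:
            return 1.0 / position
    return 0.0


def hits_at_k(ranked_ids, relevant_id, k=1):
    """Return 1.0 if the relevant item appears in the top k positions, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def evaluate(rankings, gold_ids):
    """Average Hits@1 and reciprocal rank over all queries."""
    n = len(gold_ids)
    hits1 = sum(hits_at_k(r, g, k=1) for r, g in zip(rankings, gold_ids)) / n
    mrr = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold_ids)) / n
    return {"hits@1": hits1, "mrr": mrr}


# Toy example (made-up identifiers): the relevant molecule "a" is ranked
# first, second, and third for three different queries.
toy_rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
toy_gold = ["a", "a", "a"]
scores = evaluate(toy_rankings, toy_gold)
```

In the toy example, Hits@1 is 1/3 (only the first query places the relevant molecule at rank 1) and MRR is (1 + 1/2 + 1/3) / 3. Because each description has a single relevant molecule, the reciprocal rank needs no averaging over multiple relevant items.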

80% of the data is used for training. Retrieval is performed against the entire corpus of molecules (train, val, and test combined).
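As a rough sketch of this setup, the snippet below splits 33,010 indices into nominal 80/10/10 portions and ranks a query against a candidate pool drawn from all three splits. The shuffling seed, the cosine-similarity scoring, and the two-dimensional toy embeddings are assumptions made for illustration; the released split files and the scoring function of any particular model may differ.

```python
import math
import random


def split_indices(n_items, seed=0):
    """Shuffle indices and cut them into nominal 80/10/10 train/val/test portions."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = n_items // 10
    n_test = n_items // 10
    n_train = n_items - n_val - n_test
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_pool(query_vec, pool):
    """Rank every molecule in the pool, most similar first.

    `pool` maps molecule id -> embedding and is meant to cover train, val,
    and test molecules together, not only the split the query comes from.
    """
    return sorted(pool, key=lambda mol_id: cosine(query_vec, pool[mol_id]), reverse=True)


splits = split_indices(33010)
sizes = {name: len(ids) for name, ids in splits.items()}

# Hypothetical embeddings for one molecule from each split.
toy_pool = {"m_train": [1.0, 0.0], "m_val": [0.0, 1.0], "m_test": [0.6, 0.8]}
toy_ranking = rank_pool([0.5, 0.9], toy_pool)
```

With these nominal fractions the split sizes come out to 26,408 / 3,301 / 3,301. Ranking against the combined pool makes the task harder than ranking within the test split alone, since every test query competes with molecules seen during training.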

Variants: ChEBI-20

Associated Benchmarks

This dataset is used in 4 benchmarks:

Recent Benchmark Submissions

Task | Model | Paper | Date
Cross-Modal Retrieval | CLASS (ORMA) | CLASS: Enhancing Cross-Modal Text-Molecule Retrieval … | 2025-02-17
Cross-Modal Retrieval | CLASS (AMAN) | CLASS: Enhancing Cross-Modal Text-Molecule Retrieval … | 2025-02-17
Molecule Captioning | LaMolT5-Base | Automatic Annotation Augmentation Boosts Translation … | 2025-02-10
Molecule Captioning | LaMolT5-Small | Automatic Annotation Augmentation Boosts Translation … | 2025-02-10
Molecule Captioning | LaMolT5-Large | Automatic Annotation Augmentation Boosts Translation … | 2025-02-10
Molecule Captioning | Mol-LLM (SELFIES) | Mol-LLM: Multimodal Generalist Molecular LLM … | 2025-02-05
Molecule Captioning | Mol-LLM | Mol-LLM: Multimodal Generalist Molecular LLM … | 2025-02-05
Molecule Captioning | PEIT-GEN | Property Enhanced Instruction Tuning for … | 2024-12-24
Cross-Modal Retrieval | ORMA | Exploring Optimal Transport-Based Multi-Grained Alignments … | 2024-11-04
Cross-Modal Retrieval | Song et al. | Towards Cross-Modal Text-Molecule Retrieval with … | 2024-10-31
Cross-Modal Retrieval | DSOKR | Deep Sketched Output Kernel Regression … | 2024-06-13
Text-based de novo Molecule Generation | LDMol | LDMol: Text-to-Molecule Diffusion Model with … | 2024-05-28
Text-based de novo Molecule Generation | BioT5+ | BioT5+: Towards Generalized Biological Understanding … | 2024-02-27
Molecule Captioning | BioT5+ | BioT5+: Towards Generalized Biological Understanding … | 2024-02-27
Text-based de novo Molecule Generation | TGM-DLM w/o corr | Text-Guided Molecule Generation with Diffusion … | 2024-02-20
Text-based de novo Molecule Generation | TGM-DLM | Text-Guided Molecule Generation with Diffusion … | 2024-02-20
Molecule Captioning | InstructMol-GS | InstructMol: Multi-Modal Integration for Building … | 2023-11-27
Molecule Captioning | InstructMol-G | InstructMol: Multi-Modal Integration for Building … | 2023-11-27
Molecule Captioning | MolCA, Galac125M | MolCA: Molecular Graph-Language Modeling with … | 2023-10-19
Molecule Captioning | MolCA, Galac1.3B | MolCA: Molecular Graph-Language Modeling with … | 2023-10-19

Research Papers

Recent papers with results on this dataset: