ChEBI-20

Name: ChEBI-20
Published: 2021-11-01
License: Unknown

Dataset Information

Modalities

Texts, Graphs, Biomedical

Languages

English

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

The dataset contains 33,010 molecule-description pairs, split 80%/10%/10% into train/validation/test sets. The task is to retrieve the relevant molecule for a natural-language description. The authors define it as follows:

To push the boundaries of multimodal models, we present a new IR task: Text2Mol.

Given a text query and a list of molecules without any reference textual information (represented, for example, as SMILES strings, graphs, or other equivalent representations), retrieve the molecule corresponding to the query. From a text description of a molecule, the model must incorporate the information in the description into a semantic representation which can be used to directly retrieve the molecule. This requires the integration of two very different types of information: the structured knowledge represented by text and the chemical properties present in molecular graphs. We assume there is only one correct (relevant) molecule for each description, so we consider two measures for this task: Hits@1 and mean reciprocal rank (MRR).
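The two measures named above can be sketched in a few lines of Python. This is a minimal illustration, assuming each query has exactly one relevant molecule (as the task states) and that a model has already produced a ranked list of candidate IDs per query. The function names and the toy molecule IDs are hypothetical and do not come from any benchmark code.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the single relevant item (1-based), or 0.0 if absent."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == relevant_id:
            return 1.0 / position
    return 0.0


def hits_at_k(ranked_ids, relevant_id, k=1):
    """Return 1.0 if the relevant item is among the top k candidates, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def evaluate(rankings, gold_ids):
    """Average both measures over all queries."""
    n = len(gold_ids)
    mrr = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold_ids)) / n
    hits1 = sum(hits_at_k(r, g, k=1) for r, g in zip(rankings, gold_ids)) / n
    return {"MRR": mrr, "Hits@1": hits1}


# Toy example: three queries, each ranked against the same three molecules.
rankings = [
    ["m1", "m2", "m3"],  # gold m1 at rank 1 -> RR = 1
    ["m3", "m2", "m1"],  # gold m2 at rank 2 -> RR = 1/2
    ["m2", "m3", "m1"],  # gold m1 at rank 3 -> RR = 1/3
]
gold_ids = ["m1", "m2", "m1"]
scores = evaluate(rankings, gold_ids)
print(scores)
```

In the toy example only the first query places its gold molecule at rank 1, so Hits@1 is 1/3, while MRR averages 1, 1/2 and 1/3 to roughly 0.61.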

80% of the data is used for training. Retrieval is performed against the entire corpus of molecules (train, val, and test combined).
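A short sketch may help make the setup concrete. The first part derives the split sizes implied by an 80/10/10 partition of 33,010 pairs. The second part ranks one query against a candidate pool that mixes molecules from all three splits. The embeddings, the use of cosine similarity, and all names below are assumptions made for illustration; the text above does not specify how candidates are scored.

```python
import math

TOTAL_PAIRS = 33_010  # pair count given in the overview


def split_sizes(total, percents=(80, 10, 10)):
    """Integer (train, val, test) sizes; any rounding remainder goes to train."""
    val = total * percents[1] // 100
    test = total * percents[2] // 100
    return total - val - test, val, test


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_corpus(query_vec, corpus):
    """Sort every molecule ID in the corpus by similarity to the query, best first."""
    return sorted(corpus, key=lambda mol_id: cosine(query_vec, corpus[mol_id]),
                  reverse=True)


train_n, val_n, test_n = split_sizes(TOTAL_PAIRS)
print(train_n, val_n, test_n)  # 26408 3301 3301

# Hypothetical 2-d molecule embeddings; the pool spans all three splits,
# so a test query competes against train and val molecules as well.
corpus = {
    "train_mol": [1.0, 0.0],
    "val_mol": [0.0, 1.0],
    "test_mol": [0.6, 0.8],
}
query_vec = [0.5, 0.9]  # hypothetical embedding of a test description
ranking = rank_corpus(query_vec, corpus)
rank_of_gold = ranking.index("test_mol") + 1
print(ranking, rank_of_gold)
```

Because the candidate pool is the full corpus, a test query is scored against all 33,010 molecules rather than only the 3,301 test molecules, which makes the ranking problem considerably harder.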

Variants: ChEBI-20

Associated Benchmarks

This dataset is used in 4 benchmarks:

Image Captioning - Metrics: BLEU, Exact, Levenshtein, MACCS FTS, Morgan FTS, RDK FTS, Validity
Molecule Captioning - Metrics: BLEU-2, BLEU-4, METEOR, ROUGE-1, ROUGE-2, ROUGE-L, Text2Mol
Cross-Modal Retrieval - Metrics: Hits@1, Hits@10, Mean Rank, Test MRR
Text-based de novo Molecule Generation - Metrics: BLEU, Exact Match, Frechet ChemNet Distance (FCD), Levenshtein, MACCS FTS, Morgan FTS, RDK FTS, Text2Mol, Validity, Parameter Count
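Two of the string-level generation metrics listed above, exact match and Levenshtein distance, are simple enough to sketch directly. The version below is a minimal illustration that compares raw SMILES strings; whether a given benchmark canonicalizes SMILES before comparison is not stated here, so treat that as an open assumption. The function names and example strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_match_rate(predictions, references):
    """Fraction of predictions identical to their reference string."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


def mean_levenshtein(predictions, references):
    """Average edit distance over all prediction/reference pairs."""
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total / len(references)


# Toy predictions against reference SMILES strings.
references = ["CCO", "c1ccccc1", "CC(=O)O"]
predictions = ["CCO", "c1ccccc1", "CC(=O)N"]
print(exact_match_rate(predictions, references))  # 2 of 3 strings match
print(mean_levenshtein(predictions, references))  # one substitution in total
```

Lower Levenshtein distance is better, while a higher exact-match rate is better. The fingerprint-based scores (MACCS, Morgan, RDK FTS) and Validity require a cheminformatics toolkit and are left out of this sketch.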

Recent Benchmark Submissions

Task	Model	Paper	Date
Cross-Modal Retrieval	CLASS (ORMA)	CLASS: Enhancing Cross-Modal Text-Molecule Retrieval …	2025-02-17
Cross-Modal Retrieval	CLASS (AMAN)	CLASS: Enhancing Cross-Modal Text-Molecule Retrieval …	2025-02-17
Molecule Captioning	LaMolT5-Base	Automatic Annotation Augmentation Boosts Translation …	2025-02-10
Molecule Captioning	LaMolT5-Small	Automatic Annotation Augmentation Boosts Translation …	2025-02-10
Molecule Captioning	LaMolT5-Large	Automatic Annotation Augmentation Boosts Translation …	2025-02-10
Molecule Captioning	Mol-LLM (SELFIES)	Mol-LLM: Multimodal Generalist Molecular LLM …	2025-02-05
Molecule Captioning	Mol-LLM	Mol-LLM: Multimodal Generalist Molecular LLM …	2025-02-05
Molecule Captioning	PEIT-GEN	Property Enhanced Instruction Tuning for …	2024-12-24
Cross-Modal Retrieval	ORMA	Exploring Optimal Transport-Based Multi-Grained Alignments …	2024-11-04
Cross-Modal Retrieval	Song et al.	Towards Cross-Modal Text-Molecule Retrieval with …	2024-10-31
Cross-Modal Retrieval	DSOKR	Deep Sketched Output Kernel Regression …	2024-06-13
Text-based de novo Molecule Generation	LDMol	LDMol: Text-to-Molecule Diffusion Model with …	2024-05-28
Text-based de novo Molecule Generation	BioT5+	BioT5+: Towards Generalized Biological Understanding …	2024-02-27
Molecule Captioning	BioT5+	BioT5+: Towards Generalized Biological Understanding …	2024-02-27
Text-based de novo Molecule Generation	TGM-DLM w/o corr	Text-Guided Molecule Generation with Diffusion …	2024-02-20
Text-based de novo Molecule Generation	TGM-DLM	Text-Guided Molecule Generation with Diffusion …	2024-02-20
Molecule Captioning	InstructMol-GS	InstructMol: Multi-Modal Integration for Building …	2023-11-27
Molecule Captioning	InstructMol-G	InstructMol: Multi-Modal Integration for Building …	2023-11-27
Molecule Captioning	MolCA, Galac125M	MolCA: Molecular Graph-Language Modeling with …	2023-10-19
Molecule Captioning	MolCA, Galac1.3B	MolCA: Molecular Graph-Language Modeling with …	2023-10-19

Research Papers

Recent papers with results on this dataset:

External Links:

ChEBI-20
