Venue
Neural Information Processing Systems
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms the previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
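The self-instruct step described in the abstract can be pictured with a short sketch: a text-only GPT-4 prompt receives a figure caption (and, in the paper, sentences from the article that mention the figure) and is asked to write a multi-turn conversation about the image as if it could see it. The prompt wording and the helper names (`build_messages`, `call_gpt4`) below are illustrative assumptions, not the authors' exact prompt or pipeline.

```python
# Illustrative sketch of self-instruct data generation from figure captions.
# The system prompt and helper names are assumptions; the paper's exact
# prompt and sampling setup may differ.

SYSTEM_PROMPT = (
    "You are an AI assistant specialized in biomedical topics. You are given "
    "the caption of a figure (and sentences that mention it) from a biomedical "
    "paper. Generate a multi-turn conversation between a person asking about "
    "the figure and an assistant answering, as if the assistant can see the "
    "image. Only ask questions that can be answered confidently from the text."
)

def build_messages(caption: str, mentions: list[str]) -> list[dict]:
    """Assemble the chat messages sent to a text-only GPT-4 endpoint."""
    context = caption + "\n" + "\n".join(mentions)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": context},
    ]

def call_gpt4(messages: list[dict]) -> str:
    """Placeholder for the actual API call (omitted here)."""
    raise NotImplementedError("plug in your preferred GPT-4 client")

if __name__ == "__main__":
    caption = "Chest X-ray showing a right-sided pleural effusion."
    mentions = ["Figure 2 demonstrates blunting of the right costophrenic angle."]
    print(build_messages(caption, mentions))
```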
This paper presents LLaVA-Med, a multimodal conversational AI model for the biomedical domain that can interpret and converse about biomedical images. The authors propose a training approach built on PMC-15M, a large biomedical figure-caption dataset of 15 million image-text pairs extracted from PubMed Central. GPT-4 is used to generate diverse instruction-following data from the captions, and the model is fine-tuned with a curriculum learning method. LLaVA-Med answers open-ended research questions about biomedical images and was trained in less than 15 hours on eight A100 GPUs. The paper also discusses limitations of existing multimodal biomedical systems and notes that the approach can be generalized to other vertical domains. Experimental results show that LLaVA-Med outperforms previous supervised state-of-the-art methods on certain metrics across three established biomedical visual question answering datasets, and the authors plan to release their instruction-following data and model to foster further research in the field.
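The curriculum described above has two stages: stage 1 trains only the vision-to-language projection on figure-caption pairs (biomedical concept alignment), and stage 2 additionally updates the language model on the GPT-4 generated instruction-following data. The minimal PyTorch sketch below illustrates that freeze/unfreeze schedule with stub modules; the module names and dimensions are assumptions, not LLaVA-Med's actual architecture code.

```python
# Minimal PyTorch sketch of the two-stage curriculum (assumed module names,
# stub dimensions; not the authors' implementation).
import torch.nn as nn

class TinyLLaVALike(nn.Module):
    """Stub model: vision encoder, vision-to-LLM projection, language model."""
    def __init__(self, vis_dim=256, txt_dim=512):
        super().__init__()
        self.vision_encoder = nn.Linear(512, vis_dim)    # stand-in for a CLIP-style encoder
        self.projection = nn.Linear(vis_dim, txt_dim)    # vision-to-language adapter
        self.language_model = nn.Linear(txt_dim, txt_dim)  # stand-in for the LLM

    def forward(self, image_features):
        return self.language_model(self.projection(self.vision_encoder(image_features)))

def configure_stage(model: TinyLLaVALike, stage: int) -> None:
    """Stage 1: biomedical concept alignment -> train the projection only.
    Stage 2: instruction-tuning -> train projection and language model."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projection.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.language_model.parameters():
            p.requires_grad = True

model = TinyLLaVALike()
configure_stage(model, stage=1)  # fed figure-caption alignment data
configure_stage(model, stage=2)  # fed GPT-4 generated instruction-following data
```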
This paper employs the following methods:
- Curriculum Learning
- Instruction-Tuning
The following datasets were used in this research:
- PMC-15M
- VQA-RAD
- SLAKE
- PathVQA
The paper reports the following results:
- LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions about biomedical images.
- Fine-tuned LLaVA-Med outperforms the previous supervised state-of-the-art on certain metrics across the three VQA datasets listed above (a sketch of typical scoring follows this list).
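Evaluation on these benchmarks commonly separates closed-set questions (e.g. yes/no), scored by exact-match accuracy, from open-set questions, often scored by token-level recall of the ground-truth answer. The functions below are a generic sketch of those two metrics under that assumption, not the authors' evaluation script.

```python
# Generic sketch of common biomedical VQA metrics (assumed scoring rules,
# not the authors' evaluation code).

def closed_set_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy for closed-set (e.g. yes/no) questions."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / max(len(answers), 1)

def open_set_recall(prediction: str, answer: str) -> float:
    """Fraction of ground-truth answer tokens that appear in the prediction."""
    pred_tokens = set(prediction.lower().split())
    ans_tokens = answer.lower().split()
    if not ans_tokens:
        return 0.0
    return sum(t in pred_tokens for t in ans_tokens) / len(ans_tokens)

print(closed_set_accuracy(["Yes", "no"], ["yes", "yes"]))                    # 0.5
print(open_set_recall("right-sided pleural effusion", "pleural effusion"))   # 1.0
```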
The authors identified the following limitations:
- Hallucinations and weak in-depth reasoning common to many LMMs.
The following compute resources were used:
- Number of GPUs: 8
- GPU Type: A100
Keywords
- multimodal AI
- vision-language models
- biomedical images
- instruction tuning
- chatbots