LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li*, Cliff Wong*, Sheng Zhang*, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao (Microsoft, 2023). *Equal contribution.

Paper Information
arXiv ID
2306.00890
Venue
Neural Information Processing Systems
Domain
biomedicine
SOTA Claim
Yes
Code
Reproducibility
8/10

Abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.

Summary

This paper presents LLaVA-Med, a multimodal conversational AI model designed for the biomedical domain, emphasizing its ability to interpret and converse about biomedical images. The authors propose a training approach that starts from PMC-15M, a large biomedical figure-caption dataset of 15 million image-text pairs extracted from PubMed Central. GPT-4 is used to generate diverse instruction-following data from the captions, and a curriculum learning method is employed for fine-tuning: the model first aligns biomedical vocabulary on figure-caption pairs, then learns open-ended conversational semantics from the generated instructions. LLaVA-Med can answer open-ended research questions about biomedical images and was trained in under 15 hours on eight A100 GPUs. The paper also discusses limitations of existing multimodal biomedical systems and notes that the data-generation approach can generalize to other domains. Experimental results show that fine-tuned LLaVA-Med outperforms previous state-of-the-art methods on certain metrics across established biomedical visual question answering datasets, and the authors plan to release their dataset and model to foster further research in the field.
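The caption-to-instruction step can be illustrated with a short sketch. The prompt wording, model name, and sampling settings below are illustrative assumptions rather than the authors' exact setup; only the overall pattern (caption in, multi-turn conversation out) follows the paper.

```python
# Minimal sketch of caption-based instruction generation.
# Assumptions: the prompt text, model name, and temperature are illustrative.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "You are given the caption of a biomedical figure. Without seeing the "
    "image, generate a multi-turn conversation between a user asking about "
    "the image and an assistant answering, grounded only in the caption.\n\n"
    "Caption: {caption}"
)

def generate_instruction_data(caption: str) -> str:
    """Ask GPT-4 to turn a figure caption into instruction-following dialogue."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(caption=caption)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```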

Methods

This paper employs the following methods (a training-stage sketch follows the list):

  • Curriculum Learning
  • Instruction-Tuning
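The curriculum proceeds in two stages: first, biomedical concept alignment on figure-caption pairs with most of the network frozen; second, end-to-end instruction-tuning on the GPT-4-generated conversations. The PyTorch sketch below assumes a LLaVA-style model object with `mm_projector` and `language_model` submodules and two prepared dataloaders; the names, learning rates, and epoch counts are illustrative, not the authors' training code.

```python
# Two-stage curriculum sketch. Assumes `model` is a LLaVA-style multimodal
# model with `mm_projector` and `language_model` submodules, and that
# `caption_loader` / `instruct_loader` yield tokenized batches.
import torch

def run_stage(model, dataloader, trainable_params, lr, epochs):
    """Train only the given parameters; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    for p in trainable_params:
        p.requires_grad = True
    optimizer = torch.optim.AdamW(trainable_params, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss  # autoregressive LM loss on answer tokens
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: align biomedical vocabulary -- update only the projection layer.
run_stage(model, caption_loader,
          trainable_params=list(model.mm_projector.parameters()),
          lr=2e-3, epochs=1)

# Stage 2: instruction-tuning -- also unfreeze the language model.
run_stage(model, instruct_loader,
          trainable_params=list(model.mm_projector.parameters())
                           + list(model.language_model.parameters()),
          lr=2e-5, epochs=3)
```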

Models Used

  • GPT-4
  • LLaVA

Datasets

The following datasets were used in this research (a data-formatting sketch follows the list):

  • PMC-15M
  • VQA-RAD
  • SLAKE
  • PathVQA
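For fine-tuning and evaluation, VQA examples are typically cast into the same single-turn conversational format as the instruction-following data. A minimal sketch, assuming JSON-like records with `image`, `question`, and `answer` fields (the concrete schemas of VQA-RAD, SLAKE, and PathVQA differ in detail, and the example record is invented):

```python
# Sketch: cast a VQA record into a LLaVA-style conversation.
# Field names and the example record are illustrative assumptions.
def vqa_to_conversation(record: dict) -> dict:
    return {
        "image": record["image"],
        "conversations": [
            {"from": "human", "value": "<image>\n" + record["question"]},
            {"from": "gpt", "value": record["answer"]},
        ],
    }

example = {
    "image": "chest_xray_0001.jpg",
    "question": "Is there evidence of a pleural effusion?",
    "answer": "yes",
}
print(vqa_to_conversation(example))
```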

Evaluation Metrics

The evaluation reports the following metrics (a scoring sketch follows the list):

  • Accuracy
  • Recall
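Closed-set questions (e.g. yes/no) are scored with exact-match accuracy, while open-set questions are scored by the recall of ground-truth answer tokens in the generated response. A minimal sketch of both, with whitespace tokenization as a simplification:

```python
# Minimal metric sketch: exact-match accuracy for closed-set questions,
# token-level recall of the ground-truth answer for open-set questions.
# (Whitespace tokenization here is a simplification.)
def closed_accuracy(preds: list[str], golds: list[str]) -> float:
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds))
    return hits / len(golds)

def open_recall(pred: str, gold: str) -> float:
    gold_tokens = set(gold.lower().split())
    pred_tokens = set(pred.lower().split())
    return len(gold_tokens & pred_tokens) / len(gold_tokens)

print(closed_accuracy(["Yes", "no"], ["yes", "yes"]))            # 0.5
print(open_recall("an axial CT of the chest", "chest CT scan"))  # ~0.67
```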

Results

  • LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions about biomedical images.
  • Fine-tuned LLaVA-Med outperforms the previous supervised state-of-the-art on certain metrics across VQA-RAD, SLAKE, and PathVQA.

Limitations

The authors identified the following limitations:

  • Hallucinations and weak in-depth reasoning, shortcomings common to many LMMs.

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A100
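The reported under-15-hour training time corresponds to standard multi-GPU data-parallel training on this hardware; a typical launch would be `torchrun --nproc_per_node=8 train.py ...`, where the script name and arguments are placeholders rather than the authors' actual command.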

Keywords

multimodal AI, vision-language models, biomedical images, instruction tuning, chatbots

Papers Using Similar Methods

External Resources