Venue
Neural Information Processing Systems
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms the previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
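The self-instruct step described in the abstract can be pictured with a short sketch: a text-only GPT-4 prompt receives a figure caption (and, in the paper, sentences from the article that mention the figure) and is asked to write a multi-turn conversation about the image as if it could see it. The prompt wording and the helper names (`build_messages`, `call_gpt4`) below are illustrative assumptions, not the authors' exact prompt or pipeline.

```python
# Illustrative sketch of self-instruct data generation from figure captions.
# The system prompt and helper names are assumptions; the paper's exact
# prompt and sampling setup may differ.

SYSTEM_PROMPT = (
    "You are an AI assistant specialized in biomedical topics. You are given "
    "the caption of a figure (and sentences that mention it) from a biomedical "
    "paper. Generate a multi-turn conversation between a person asking about "
    "the figure and an assistant answering, as if the assistant can see the "
    "image. Only ask questions that can be answered confidently from the text."
)

def build_messages(caption: str, mentions: list[str]) -> list[dict]:
    """Assemble the chat messages sent to a text-only GPT-4 endpoint."""
    context = caption + "\n" + "\n".join(mentions)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": context},
    ]

def call_gpt4(messages: list[dict]) -> str:
    """Placeholder for the actual API call (omitted here)."""
    raise NotImplementedError("plug in your preferred GPT-4 client")

if __name__ == "__main__":
    caption = "Chest X-ray showing a right-sided pleural effusion."
    mentions = ["Figure 2 demonstrates blunting of the right costophrenic angle."]
    print(build_messages(caption, mentions))
```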
This paper presents LLaVA-Med, a multimodal conversational AI model for the biomedical domain that can interpret and converse about biomedical images. The authors propose a training approach built on PMC-15M, a large biomedical figure-caption dataset of 15 million image-text pairs extracted from PubMed Central. GPT-4 is used to generate diverse instruction-following data from the captions, and the model is fine-tuned with a curriculum learning method. LLaVA-Med answers open-ended research questions about biomedical images and was trained in less than 15 hours on eight A100 GPUs. The paper also discusses limitations of existing multimodal biomedical systems and notes that the approach can be generalized to other vertical domains. Experimental results show that LLaVA-Med outperforms previous supervised state-of-the-art methods on certain metrics across three established biomedical visual question answering datasets, and the authors plan to release their instruction-following data and model to foster further research in the field.
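The curriculum described above has two stages: stage 1 trains only the vision-to-language projection on figure-caption pairs (biomedical concept alignment), and stage 2 additionally updates the language model on the GPT-4 generated instruction-following data. The minimal PyTorch sketch below illustrates that freeze/unfreeze schedule with stub modules; the module names and dimensions are assumptions, not LLaVA-Med's actual architecture code.

```python
# Minimal PyTorch sketch of the two-stage curriculum (assumed module names,
# stub dimensions; not the authors' implementation).
import torch.nn as nn

class TinyLLaVALike(nn.Module):
    """Stub model: vision encoder, vision-to-LLM projection, language model."""
    def __init__(self, vis_dim=256, txt_dim=512):
        super().__init__()
        self.vision_encoder = nn.Linear(512, vis_dim)    # stand-in for a CLIP-style encoder
        self.projection = nn.Linear(vis_dim, txt_dim)    # vision-to-language adapter
        self.language_model = nn.Linear(txt_dim, txt_dim)  # stand-in for the LLM

    def forward(self, image_features):
        return self.language_model(self.projection(self.vision_encoder(image_features)))

def configure_stage(model: TinyLLaVALike, stage: int) -> None:
    """Stage 1: biomedical concept alignment -> train the projection only.
    Stage 2: instruction-tuning -> train projection and language model."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projection.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.language_model.parameters():
            p.requires_grad = True

model = TinyLLaVALike()
configure_stage(model, stage=1)  # fed figure-caption alignment data
configure_stage(model, stage=2)  # fed GPT-4 generated instruction-following data
```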
This paper employs the following methods:
- Curriculum Learning
- Instruction-Tuning
The following datasets were used in this research:
- PMC-15M
- VQA-RAD
- SLAKE
- PathVQA
The paper reports the following results:
- LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions about biomedical images.
- Fine-tuned LLaVA-Med outperforms the previous supervised state-of-the-art on certain metrics across the three VQA datasets listed above (a sketch of typical scoring follows this list).
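Evaluation on these benchmarks commonly separates closed-set questions (e.g. yes/no), scored by exact-match accuracy, from open-set questions, often scored by token-level recall of the ground-truth answer. The functions below are a generic sketch of those two metrics under that assumption, not the authors' evaluation script.

```python
# Generic sketch of common biomedical VQA metrics (assumed scoring rules,
# not the authors' evaluation code).

def closed_set_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy for closed-set (e.g. yes/no) questions."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / max(len(answers), 1)

def open_set_recall(prediction: str, answer: str) -> float:
    """Fraction of ground-truth answer tokens that appear in the prediction."""
    pred_tokens = set(prediction.lower().split())
    ans_tokens = answer.lower().split()
    if not ans_tokens:
        return 0.0
    return sum(t in pred_tokens for t in ans_tokens) / len(ans_tokens)

print(closed_set_accuracy(["Yes", "no"], ["yes", "yes"]))                    # 0.5
print(open_set_recall("right-sided pleural effusion", "pleural effusion"))   # 1.0
```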
The authors identified the following limitations:
- Hallucinations and weak in-depth reasoning common to many LMMs.
The following compute resources were used:
- Number of GPUs: 8
- GPU Type: A100
Keywords
- multimodal AI
- vision-language models
- biomedical images
- instruction tuning
- chatbots