
BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

(2024)

Paper Information

arXiv ID: 2402.10373
Venue: Annual Meeting of the Association for Computational Linguistics

Abstract

Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated and evaluated this benchmark into 7 other languages. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.

Summary

The paper introduces BioMistral, an open-source large language model (LLM) specifically designed for the biomedical domain. Built upon the Mistral 7B Instruct model and further pre-trained on PubMed Central, BioMistral demonstrates superior performance across a benchmark of 10 established medical question-answering tasks in English. The study also explores lightweight model variants obtained via quantization and model merging, showing that BioMistral remains competitive against both open-source and proprietary medical models. Additionally, the paper addresses the challenge of multilingual evaluation by automatically translating the benchmark into 7 additional languages and evaluating the models on them, the first large-scale multilingual evaluation of LLMs in the medical domain.
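
As a usage illustration (not taken from the paper), the released checkpoint can be loaded with Hugging Face `transformers`. The repository id `BioMistral/BioMistral-7B` and the 4-bit configuration below are assumptions reflecting the quantization approaches the paper mentions, not its exact setup.

```python
# Minimal sketch: load BioMistral 7B, optionally in 4-bit, and run a medical QA prompt.
# The repo id "BioMistral/BioMistral-7B" is assumed; adjust if the released checkpoint
# uses a different name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "BioMistral/BioMistral-7B"  # assumed Hugging Face repo id

# Optional 4-bit quantization, in the spirit of the lightweight variants the paper explores.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # drop this argument for full-precision loading
    device_map="auto",
)

prompt = "Question: What is the first-line treatment for type 2 diabetes?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```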

Methods

This paper employs the following methods:

  • Mistral 7B Instruct
  • Few-shot learning
  • Supervised fine-tuning
  • Model merging (a minimal sketch follows this list)
  • Quantization
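
The paper merges BioMistral with the original Mistral 7B Instruct weights using dedicated strategies (e.g., SLERP, TIES, DARE). The snippet below is only a minimal linear-interpolation sketch of the general idea of parameter-space merging, not a reimplementation of those methods; the repo ids and the output directory are assumptions.

```python
# Minimal sketch of parameter-space model merging: linearly interpolate the weights of two
# models sharing the same architecture (a base model and a domain-adapted model). This is
# the simplest possible variant, not the paper's SLERP/TIES/DARE strategies.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.bfloat16
)
domain = AutoModelForCausalLM.from_pretrained(
    "BioMistral/BioMistral-7B", torch_dtype=torch.bfloat16
)

alpha = 0.5  # interpolation weight between base and domain-adapted parameters
domain_state = domain.state_dict()
merged_state = {
    name: (1 - alpha) * param + alpha * domain_state[name]
    for name, param in base.state_dict().items()
}

base.load_state_dict(merged_state)
base.save_pretrained("mistral-biomistral-merged")  # hypothetical output directory
```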

Models Used

  • BioMistral 7B
  • Mistral 7B Instruct
  • MedPaLM-2
  • GPT-4
  • BioGPT
  • ClinicalGPT
  • MedAlpaca
  • PMC-LLaMA
  • MediTron-7B

Datasets

The following datasets were used in this research:

  • PubMed Central
  • MedQA
  • MMLU

Evaluation Metrics

  • Accuracy
  • Expected Calibration Error (ECE)
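
Expected Calibration Error measures the gap between a model's confidence and its accuracy. The function below is a standard equal-width-bin ECE computation, a generic sketch rather than the paper's exact evaluation code.

```python
# Generic sketch of Expected Calibration Error (ECE) with equal-width confidence bins:
# ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)|, where B_b is the b-th confidence bin.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    confidences = np.asarray(confidences)  # probability assigned to the predicted answer
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            bin_weight = in_bin.mean()  # |B_b| / N
            ece += bin_weight * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Example: three multiple-choice predictions with their confidences.
print(expected_calibration_error([0.9, 0.6, 0.8], ["A", "C", "B"], ["A", "B", "B"]))
```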

Results

  • BioMistral 7B shows superior performance on medical QA tasks compared to existing open-source biomedical models.
  • Introduces a benchmark for multilingual evaluation of LLMs in the medical domain.
  • BioMistral demonstrates improvements in few-shot learning and fine-tuning scenarios.

Limitations

The authors identified the following limitations:

  • Significant computational resources required for model training and evaluation.
  • Performance degradation in non-English contexts due to limited data availability.
  • Difficulty generalizing to medical terminology not present in the training dataset.

Technical Requirements

  • Number of GPUs: 32
  • GPU Type: NVIDIA A100 80GB
  • Compute Requirements: pre-training for 1.5 epochs within the 20-hour job limit of the Jean Zay HPC cluster, with a batch size of 16 per GPU and a total batch size of 1,024
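
For reference, the reported numbers are mutually consistent if gradient accumulation accounts for the gap between the per-GPU batch and the total batch. The split below is an inference, not a figure stated in this summary.

```python
# Sanity check of the reported batch configuration (assuming "batch size of 16" is per GPU
# and the remaining factor comes from gradient accumulation; this split is an assumption).
per_gpu_batch = 16
num_gpus = 32
grad_accumulation_steps = 2  # inferred: 1024 / (16 * 32)
total_batch = per_gpu_batch * num_gpus * grad_accumulation_steps
assert total_batch == 1024
```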
