MeDi: Metadata-Guided Diffusion Models for Mitigating Biases in Tumor Classification

(2025)

Paper Information
arXiv ID: 2506.17140

Abstract

Deep learning models have made significant advances in histological prediction tasks in recent years. However, for adaptation in clinical practice, their lack of robustness to varying conditions such as staining, scanner, hospital, and demographics is still a limiting factor: if trained on overrepresented subpopulations, models regularly struggle with less frequent patterns, leading to shortcut learning and biased predictions. Large-scale foundation models have not fully eliminated this issue. Therefore, we propose a novel approach explicitly modeling such metadata into a Metadata-guided generative Diffusion model framework (MeDi). MeDi allows for a targeted augmentation of underrepresented subpopulations with synthetic data, which balances limited training data and mitigates biases in downstream models. We experimentally show that MeDi generates high-quality histopathology images for unseen subpopulations in TCGA, boosts the overall fidelity of the generated images, and enables improvements in performance for downstream classifiers on datasets with subpopulation shifts. Our work is a proof-of-concept towards better mitigating data biases with generative models.

Summary

This paper presents MeDi (Metadata-Guided Diffusion Models), an approach that mitigates biases in tumor classification by incorporating metadata into the training of generative models. It targets the limited robustness of deep learning models in histopathology, where training data varies across demographics, staining protocols, scanners, and clinical sites. MeDi conditions a diffusion model on such metadata, enabling targeted synthesis of images for underrepresented subpopulations, which are then used to rebalance the training data of downstream classifiers. Experimentally, MeDi outperforms class-label-only conditioning: it generates images that adhere more closely to the real data distribution and improves classifier accuracy under subpopulation shifts.
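To make the conditioning interface concrete, here is a minimal sketch of metadata-guided sampling using the Hugging Face diffusers library. Everything in it is illustrative rather than taken from the paper: the joint (tumor class, tissue source site) label embedding, the class and site counts, the 64x64 resolution, and the 50 inference steps are all assumptions.

```python
# Illustrative sketch of metadata-guided conditioning with diffusers.
# MeDi's actual architecture and conditioning mechanism may differ.
import torch
from diffusers import UNet2DModel, DDPMScheduler

NUM_CLASSES = 32   # hypothetical number of tumor classes
NUM_SITES = 200    # hypothetical number of tissue source sites

# One embedding slot per (class, site) pair; the paper's scheme may
# embed class and metadata separately instead.
model = UNet2DModel(
    sample_size=64, in_channels=3, out_channels=3,
    num_class_embeds=NUM_CLASSES * NUM_SITES,
)
scheduler = DDPMScheduler(num_train_timesteps=1000)

@torch.no_grad()
def sample_for_subpopulation(tumor_class: int, site: int, n: int = 4):
    """Generate n synthetic patches for a chosen (possibly rare) subpopulation."""
    cond = torch.full((n,), tumor_class * NUM_SITES + site, dtype=torch.long)
    x = torch.randn(n, 3, 64, 64)    # start from pure noise
    scheduler.set_timesteps(50)
    for t in scheduler.timesteps:    # reverse diffusion, conditioned on metadata
        eps = model(x, t, class_labels=cond).sample
        x = scheduler.step(eps, t, x).prev_sample
    return x                         # images roughly in [-1, 1]

synthetic = sample_for_subpopulation(tumor_class=3, site=17)
```

With an untrained model this only produces noise, of course; the point is the interface: a single conditioning label per image that jointly encodes the tumor class and the metadata, so generation can be steered toward any subpopulation, including underrepresented ones.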

Methods

This paper employs the following methods:

  • Generative Models
  • Diffusion Models

Models Used

  • MeDi

Datasets

The following datasets were used in this research:

  • TCGA-UT

Evaluation Metrics

  • Fréchet Inception Distance (FID)
  • Balanced Accuracy (see the metric sketch below)
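
Both metrics are standard; the snippet below is a minimal sketch of how they are typically computed with scikit-learn and torchmetrics. The 2048-dimensional Inception features and the toy tensors are assumptions, not settings from the paper.

```python
import torch
from sklearn.metrics import balanced_accuracy_score
from torchmetrics.image.fid import FrechetInceptionDistance

# Balanced accuracy: the mean of per-class recalls, so rare classes
# weigh as much as frequent ones.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2]
print(balanced_accuracy_score(y_true, y_pred))  # (3/4 + 1/1 + 1/1) / 3 ≈ 0.917

# FID: distance between Inception feature statistics of real and
# generated patches (uint8 tensors shaped N x 3 x H x W).
fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())
```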

Results

  • Improved fidelity of synthetic images to the real data distribution
  • Significantly improved downstream performance on unseen tissue source sites in low-data regimes (see the augmentation sketch below)
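
One plausible way to operationalize the targeted augmentation behind these results is to count real training samples per (tumor class, site) pair and generate synthetic images until every pair matches the largest one. The equal-count rule below is an assumption for illustration, not the paper's exact recipe.

```python
from collections import Counter

def augmentation_plan(labels):
    """labels: (tumor_class, site) pairs of the real training set.
    Returns how many synthetic images to request per underrepresented pair."""
    counts = Counter(labels)
    target = max(counts.values())
    return {pair: target - n for pair, n in counts.items() if n < target}

plan = augmentation_plan([(0, 1), (0, 1), (0, 2), (1, 1)])
print(plan)  # {(0, 2): 1, (1, 1): 1} -> feed into sample_for_subpopulation(...)
```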

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
  • Compute Requirements: two distinct diffusion models trained on the TCGA-UT dataset for 800,000 optimization steps each, with a learning rate of 10^-4 and a batch size of 64 (see the training sketch below)
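
The reported configuration maps onto a standard epsilon-prediction diffusion training loop. The sketch below fixes only the stated hyperparameters (800,000 steps, learning rate 10^-4, batch size 64); the AdamW optimizer, the 64x64 resolution, and the placeholder dataloader are assumptions.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

NUM_CONDITIONS = 32 * 200  # hypothetical (tumor class, site) pairs, as above

model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3,
                    num_class_embeds=NUM_CONDITIONS)
noise_sched = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # reported learning rate

def next_batch(batch_size=64):                               # reported batch size
    """Placeholder for a TCGA-UT dataloader yielding patches plus metadata labels."""
    images = torch.randn(batch_size, 3, 64, 64)              # stand-in patches
    cond = torch.randint(0, NUM_CONDITIONS, (batch_size,))   # stand-in labels
    return images, cond

for step in range(800_000):                                  # reported optimization steps
    images, cond = next_batch()
    noise = torch.randn_like(images)
    t = torch.randint(0, noise_sched.config.num_train_timesteps, (images.size(0),))
    noisy = noise_sched.add_noise(images, noise, t)
    pred = model(noisy, t, class_labels=cond).sample
    loss = F.mse_loss(pred, noise)                           # predict the added noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```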
