
Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala (Stanford University, 2023)

Paper Information

  • arXiv ID: 2302.05543
  • Venue: IEEE/CVF International Conference on Computer Vision (ICCV)
  • Domain: computer vision, machine learning

Abstract

Figure 1: Controlling Stable Diffusion with learned conditions. ControlNet lets users add conditions such as Canny edges (top) and human pose (bottom) to control the image generation of large pretrained diffusion models. The default results use the prompt "a high-quality, detailed, and professional image"; users can optionally give prompts such as "chef in kitchen". Other panel prompts include "masterpiece of fairy tale, giant deer, golden antlers", "…, quaint city Galic", and "Lincoln statue".

Summary

This paper introduces ControlNet, a neural network architecture that adds spatially localized conditioning to large pretrained text-to-image diffusion models. Text prompts alone offer limited control over image composition; ControlNet addresses this by letting users supply an additional conditioning image, such as a Canny edge map or a human pose skeleton, that directly steers generation. The architecture locks the parameters of the pretrained model and creates a trainable copy of its encoding layers, connecting the two through zero-initialized convolution layers ("zero convolutions") so that no harmful noise disturbs the pretrained features at the start of finetuning. Experiments demonstrate that ControlNet learns conditional control robustly even with small training sets, achieving competitive performance on a variety of tasks while preserving the quality and generality of the backbone. The paper also includes extensive comparisons with existing models and reports user studies and qualitative experiments highlighting ControlNet's ability to handle diverse input conditions.
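As a minimal illustration of this locked-copy design, the PyTorch sketch below wraps a single pretrained block: the original weights are frozen, a trainable clone receives the conditioning signal through a zero-initialized 1x1 convolution, and the clone's output is gated by a second zero convolution before being added back. Class and function names are illustrative, channel shapes are assumed to match, and this is not the authors' released code.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution initialized to zero: at the start of training the
    # control branch contributes nothing, so the pretrained behavior is
    # preserved exactly (the paper's "zero convolution").
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBlock(nn.Module):
    # Illustrative wrapper: assumes `block` maps (B, C, H, W) -> (B, C, H, W)
    # and that the condition has already been encoded to the same shape.
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.trainable_copy = copy.deepcopy(block)  # learnable clone
        self.locked = block
        for p in self.locked.parameters():
            p.requires_grad_(False)                 # freeze pretrained weights
        self.zero_in = zero_conv(channels)          # injects the condition
        self.zero_out = zero_conv(channels)         # gates the copy's output

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        frozen = self.locked(x)
        control = self.zero_out(self.trainable_copy(x + self.zero_in(cond)))
        return frozen + control
```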

Methods

This paper employs the following methods:

  • ControlNet

Models Used

  • Stable Diffusion

Datasets

The following datasets were used in this research:

  • LAION-5B
  • ADE20K

Evaluation Metrics

  • IoU
  • FID
  • Average Human Ranking (AHR)

Results

  • ControlNet can control Stable Diffusion with a variety of conditioning inputs, including Canny edges, Hough lines, user scribbles, human keypoints, segmentation maps, shape normals, and depth maps; a usage sketch follows below.
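As an illustration of how such a condition is produced and consumed, the sketch below builds a Canny edge map with OpenCV and runs it through the Hugging Face diffusers integration of ControlNet. This reflects common community usage rather than the paper's own code; the file names and Canny thresholds are illustrative assumptions.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1. Build a Canny edge condition from an input photo
#    ("input.png" and the thresholds 100/200 are illustrative).
image = cv2.imread("input.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
condition = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel map

# 2. Load a Canny-conditioned ControlNet and attach it to Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")  # assumes a CUDA GPU, e.g. the RTX 3090 Ti noted below

# 3. Generate: the edge map fixes the composition, the prompt sets content.
result = pipe("chef in kitchen", image=condition, num_inference_steps=20).images[0]
result.save("output.png")
```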

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 1
  • GPU Type: NVIDIA RTX 3090Ti

Keywords

text-to-image diffusion models, ControlNet, conditional control, Stable Diffusion, generative models
