Lvmin Zhang, Stanford University; Anyi Rao, [email protected], Stanford University; Maneesh Agrawala, [email protected], Stanford University (2023)
This paper introduces ControlNet, a neural network architecture that adds spatially localized conditioning to large pretrained text-to-image diffusion models. It addresses the limited control these models offer over image composition by allowing users to supply additional conditioning images, such as Canny edge maps and human poses, which directly guide the generation process. ControlNet locks the parameters of the pretrained model and trains a copy of it that learns to respond to the added conditions. Experiments show that ControlNet controls image generation effectively even with limited training data, achieving competitive performance across a range of tasks. The architecture preserves the robustness of the pretrained model while learning the conditional inputs efficiently. The paper also reports extensive comparisons with existing models and discusses user studies and qualitative experiments that highlight ControlNet's ability to handle diverse input conditions.
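The locked-copy/trainable-copy mechanism is compact enough to sketch in code. Below is a minimal PyTorch illustration, not the authors' implementation: `ControlledBlock` and `zero_conv` are hypothetical names, and the wrapped block stands in for one encoder layer of the diffusion model. The zero-initialized 1x1 convolutions follow the paper's "zero convolution" design, which ensures the trainable branch contributes nothing before training starts.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution with weights and bias initialized to zero
    # (the paper's "zero convolution"): the trainable branch adds
    # nothing until training updates these parameters.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """Hypothetical wrapper around one pretrained block: the original
    weights stay frozen, while a trainable copy processes the features
    plus an encoded spatial condition (e.g. a Canny edge map)."""

    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        # Copy first, then freeze the original, so the copy stays trainable.
        self.trainable = copy.deepcopy(pretrained_block)
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad_(False)          # lock the pretrained weights
        self.zero_in = zero_conv(channels)   # injects the condition
        self.zero_out = zero_conv(channels)  # gates the copy's output

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)                               # frozen path
        y_ctrl = self.trainable(x + self.zero_in(cond))  # controlled path
        return y + self.zero_out(y_ctrl)                 # residual merge


# Usage sketch: a single conv layer stands in for a pretrained block.
block = nn.Conv2d(64, 64, kernel_size=3, padding=1)
ctrl = ControlledBlock(block, channels=64)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 64, 32, 32)  # already-encoded condition map
out = ctrl(x, cond)                # equals block(x) before any training
```

Because both zero convolutions start at zero, the wrapped block's output is initially identical to the frozen block's output, which is one way to read the paper's claim that the pretrained model's robustness is preserved while the conditional inputs are learned.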
This paper employs the following methods:

- ControlNet
The following datasets were used in this research:
The authors identified the following limitations: