Lvmin Zhang, Stanford University; Anyi Rao, [email protected], Stanford University; Maneesh Agrawala, [email protected], Stanford University (2023)
This paper introduces ControlNet, a neural network architecture that adds spatially localized conditioning to large pretrained text-to-image diffusion models. It addresses the limited control these models offer over image composition by allowing users to supply additional conditioning images, such as Canny edge maps and human poses, which directly guide the generation process. ControlNet locks the parameters of the pretrained model and trains a copy of it that learns to respond to the added conditions. Experiments show that ControlNet controls image generation effectively even with limited training data, achieving competitive performance across a range of tasks. The architecture preserves the robustness of the pretrained model while learning the conditional inputs efficiently. The paper also reports extensive comparisons with existing models and discusses user studies and qualitative experiments that highlight ControlNet's ability to handle diverse input conditions.
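The locked-copy/trainable-copy mechanism is compact enough to sketch in code. Below is a minimal PyTorch illustration, not the authors' implementation: `ControlledBlock` and `zero_conv` are hypothetical names, and the wrapped block stands in for one encoder layer of the diffusion model. The zero-initialized 1x1 convolutions follow the paper's "zero convolution" design, which ensures the trainable branch contributes nothing before training starts.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution with weights and bias initialized to zero
    # (the paper's "zero convolution"): the trainable branch adds
    # nothing until training updates these parameters.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """Hypothetical wrapper around one pretrained block: the original
    weights stay frozen, while a trainable copy processes the features
    plus an encoded spatial condition (e.g. a Canny edge map)."""

    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        # Copy first, then freeze the original, so the copy stays trainable.
        self.trainable = copy.deepcopy(pretrained_block)
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad_(False)          # lock the pretrained weights
        self.zero_in = zero_conv(channels)   # injects the condition
        self.zero_out = zero_conv(channels)  # gates the copy's output

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)                               # frozen path
        y_ctrl = self.trainable(x + self.zero_in(cond))  # controlled path
        return y + self.zero_out(y_ctrl)                 # residual merge


# Usage sketch: a single conv layer stands in for a pretrained block.
block = nn.Conv2d(64, 64, kernel_size=3, padding=1)
ctrl = ControlledBlock(block, channels=64)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 64, 32, 32)  # already-encoded condition map
out = ctrl(x, cond)                # equals block(x) before any training
```

Because both zero convolutions start at zero, the wrapped block's output is initially identical to the frozen block's output, which is one way to read the paper's claim that the pretrained model's robustness is preserved while the conditional inputs are learned.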
This paper employs the following methods:

- ControlNet
The following datasets were used in this research:
The authors identified the following limitations: