Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee. University of Wisconsin-Madison; Columbia University (2023)
GLIGEN (Grounded-Language-to-Image Generation) extends pre-trained text-to-image diffusion models with grounded inputs, such as bounding boxes, alongside text captions. This addresses a key limitation of existing models, whose controllability is largely restricted to text alone. By freezing the weights of the pre-trained model and adding new trainable layers, GLIGEN injects the additional grounding information and generates images that faithfully follow the specified spatial configurations. The model achieves strong zero-shot performance on the COCO and LVIS datasets, substantially outperforming existing baselines in both grounding accuracy and image quality. The key innovation is a gated self-attention mechanism that introduces the grounding conditions gradually, so the original pre-trained knowledge is preserved. The paper also demonstrates other grounding conditions, including keypoints, reference images, and spatially-aligned condition maps, showing the model's broad applicability to grounded image generation tasks.
This paper employs the following methods:
- Freezing the weights of a pre-trained text-to-image diffusion model
- New trainable gated self-attention layers that inject grounding tokens
- Grounding via bounding boxes, keypoints, reference images, and spatially-aligned condition maps
The following datasets were used in this research:
- COCO (zero-shot evaluation)
- LVIS (zero-shot evaluation)
The authors identified the following limitations: