
GLIGEN: Open-Set Grounded Text-to-Image Generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee. University of Wisconsin-Madison; Columbia University (2023)

Paper Information
  • arXiv ID: 2301.07093
  • Venue: Computer Vision and Pattern Recognition
  • Domain: Computer vision and natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

Caption: "a baby girl / monkey / Hormer Simpson / is scratching her/its head" Grounded keypoints: plotted dots on the left image Caption: "A dog / bird / helmet / backpack is on the grass" Grounded image: red inset Caption: "Elon Musk and Emma Watson on a movie poster" Grounded text: Elon Musk, Emma Watson; Grounded style image: blue inset Caption: "A vibrant colorful bird sitting on tree branch" Grounded depth map: the left image Caption: "A young boy with white powder on his face looks away" Grounded HED map: the left image Caption: "Cars park on the snowy street" Grounded normal map: the left image Caption: "A living room filled with lots of furniture and plants" Grounded semantic map: the left image

Summary

GLIGEN (Grounded-Language-to-Image Generation) enhances pretrained text-to-image diffusion models by accepting grounding inputs, such as bounding boxes, alongside the text caption. The goal is to improve controllability, which existing models largely restrict to text alone. By freezing the pretrained weights and adding new trainable layers, GLIGEN injects the grounding information so that generated images follow the specified spatial layout. The model shows strong zero-shot performance on COCO and LVIS, substantially outperforming supervised layout-to-image baselines in both grounding accuracy and image quality. A key ingredient is a gated mechanism that gradually introduces the grounding conditions without erasing the original pretrained knowledge. The paper also demonstrates other grounding modalities, including keypoints, reference images, and spatially-aligned maps (depth, HED edge, normal, and semantic maps), showing the approach's broad applicability to grounded image generation.
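The gated mechanism described above can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendition of a GLIGEN-style gated self-attention layer, not the authors' implementation: grounding tokens (e.g. bounding-box coordinates fused with text embeddings) are concatenated with the visual tokens, self-attention runs over the joint sequence, and the result is added back through a tanh gate whose learnable parameter starts at zero, so the frozen pretrained model is initially unchanged.

```python
# Minimal sketch of a GLIGEN-style gated self-attention layer (hypothetical
# code, not the authors' implementation). Only this layer would be trained;
# the surrounding diffusion model stays frozen.
import torch
import torch.nn as nn


class GatedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate, initialized to zero so the pretrained model's
        # behaviour is untouched at the start of training.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, visual, grounding, beta: float = 1.0):
        # visual:    (B, N_v, dim) visual/latent tokens of the frozen model
        # grounding: (B, N_g, dim) embedded grounding tokens (e.g. boxes + text)
        n_visual = visual.shape[1]
        x = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(x, x, x)
        out = out[:, :n_visual]            # keep only the visual-token outputs
        # Gated residual; beta can be set to 0 at inference to disable grounding.
        return visual + beta * torch.tanh(self.gamma) * out
```

In the paper, the new layer sits between the original self-attention and cross-attention of each Transformer block of the frozen latent diffusion model, so the pretrained layers never see a changed interface.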

Methods

This paper employs the following methods:

  • Gated Transformer
  • Scheduled Sampling (inference-time; see the sketch after this list)
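
Scheduled sampling, listed above, is an inference-time trick: the gated grounding layers stay active for an initial fraction of the denoising steps and are then switched off, so the layout is fixed early while the frozen pretrained model refines visual quality afterwards. Below is a minimal sketch; the model callable and its beta argument are hypothetical, and the scheduler is assumed to follow a diffusers-style interface.

```python
# Hypothetical sketch of scheduled sampling at inference (not the authors' code).
# For the first tau fraction of denoising steps the gated grounding layers are
# active (beta = 1); afterwards they are disabled (beta = 0).
def denoise_with_scheduled_sampling(model, scheduler, latents, caption_emb,
                                    grounding_tokens, tau: float = 0.3):
    num_steps = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        beta = 1.0 if i < tau * num_steps else 0.0   # gate the grounding layers
        noise_pred = model(latents, t, caption_emb, grounding_tokens, beta=beta)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```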

Models Used

  • Stable Diffusion
  • LDM

Datasets

The following datasets were used in this research:

  • COCO
  • LVIS

Evaluation Metrics

  • FID
  • YOLO score (detection AP of a pretrained YOLO detector on generated images, measured against the grounding boxes; see the sketch below)
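
The YOLO score measures how well objects in the generated images can be detected at the grounding boxes. As a rough, hypothetical proxy (the paper reports AP from a YOLO detector; here a torchvision detector and a plain IoU match rate stand in), one could do:

```python
# Simplified, hypothetical proxy for a YOLO-style grounding score (not the
# paper's exact protocol).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import box_iou

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def grounding_match_rate(images, gt_boxes, iou_thr: float = 0.5):
    """images: list of (3, H, W) float tensors in [0, 1];
    gt_boxes: list of (N_i, 4) xyxy tensors used to condition generation."""
    matched, total = 0, 0
    for img, boxes in zip(images, gt_boxes):
        pred = detector([img])[0]["boxes"]       # detected boxes, xyxy
        if len(pred) and len(boxes):
            ious = box_iou(boxes, pred)          # (N_gt, N_pred)
            matched += (ious.max(dim=1).values >= iou_thr).sum().item()
        total += len(boxes)
    return matched / max(total, 1)
```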

Results

  • GLIGEN outperforms existing layout-to-image baselines on COCO and LVIS
  • Achieved state-of-the-art performance in grounded image generation
  • Zero-shot performance is significantly better than that of supervised baselines

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

grounded text-to-image diffusion models, controllable generation, open-set grounding
