
GLIGEN: Open-Set Grounded Text-to-Image Generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee. University of Wisconsin-Madison; Columbia University (2023)

Paper Information
  • arXiv ID: 2301.07093
  • Venue: Computer Vision and Pattern Recognition
  • Domain: Computer vision and natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

Caption: "a baby girl / monkey / Hormer Simpson / is scratching her/its head" Grounded keypoints: plotted dots on the left image Caption: "A dog / bird / helmet / backpack is on the grass" Grounded image: red inset Caption: "Elon Musk and Emma Watson on a movie poster" Grounded text: Elon Musk, Emma Watson; Grounded style image: blue inset Caption: "A vibrant colorful bird sitting on tree branch" Grounded depth map: the left image Caption: "A young boy with white powder on his face looks away" Grounded HED map: the left image Caption: "Cars park on the snowy street" Grounded normal map: the left image Caption: "A living room filled with lots of furniture and plants" Grounded semantic map: the left image

Summary

GLIGEN (Grounded-Language-to-Image Generation) enhances pretrained text-to-image diffusion models by accepting grounding inputs, such as bounding boxes, alongside the text caption. The goal is to improve controllability, which existing models largely restrict to text alone. By freezing the pretrained weights and adding new trainable layers, GLIGEN injects the grounding information so that generated images follow the specified spatial layout. The model shows strong zero-shot performance on COCO and LVIS, substantially outperforming supervised layout-to-image baselines in both grounding accuracy and image quality. A key ingredient is a gated mechanism that gradually introduces the grounding conditions without erasing the original pretrained knowledge. The paper also demonstrates other grounding modalities, including keypoints, reference images, and spatially-aligned maps (depth, HED edge, normal, and semantic maps), showing the approach's broad applicability to grounded image generation.
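The gated mechanism described above can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendition of a GLIGEN-style gated self-attention layer, not the authors' implementation: grounding tokens (e.g. bounding-box coordinates fused with text embeddings) are concatenated with the visual tokens, self-attention runs over the joint sequence, and the result is added back through a tanh gate whose learnable parameter starts at zero, so the frozen pretrained model is initially unchanged.

```python
# Minimal sketch of a GLIGEN-style gated self-attention layer (hypothetical
# code, not the authors' implementation). Only this layer would be trained;
# the surrounding diffusion model stays frozen.
import torch
import torch.nn as nn


class GatedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate, initialized to zero so the pretrained model's
        # behaviour is untouched at the start of training.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, visual, grounding, beta: float = 1.0):
        # visual:    (B, N_v, dim) visual/latent tokens of the frozen model
        # grounding: (B, N_g, dim) embedded grounding tokens (e.g. boxes + text)
        n_visual = visual.shape[1]
        x = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(x, x, x)
        out = out[:, :n_visual]            # keep only the visual-token outputs
        # Gated residual; beta can be set to 0 at inference to disable grounding.
        return visual + beta * torch.tanh(self.gamma) * out
```

In the paper, the new layer sits between the original self-attention and cross-attention of each Transformer block of the frozen latent diffusion model, so the pretrained layers never see a changed interface.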

Methods

This paper employs the following methods:

  • Gated Transformer
  • Scheduled Sampling (inference-time; see the sketch after this list)
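
Scheduled sampling, listed above, is an inference-time trick: the gated grounding layers stay active for an initial fraction of the denoising steps and are then switched off, so the layout is fixed early while the frozen pretrained model refines visual quality afterwards. Below is a minimal sketch; the model callable and its beta argument are hypothetical, and the scheduler is assumed to follow a diffusers-style interface.

```python
# Hypothetical sketch of scheduled sampling at inference (not the authors' code).
# For the first tau fraction of denoising steps the gated grounding layers are
# active (beta = 1); afterwards they are disabled (beta = 0).
def denoise_with_scheduled_sampling(model, scheduler, latents, caption_emb,
                                    grounding_tokens, tau: float = 0.3):
    num_steps = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        beta = 1.0 if i < tau * num_steps else 0.0   # gate the grounding layers
        noise_pred = model(latents, t, caption_emb, grounding_tokens, beta=beta)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```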

Models Used

  • Stable Diffusion
  • LDM

Datasets

The following datasets were used in this research:

  • COCO
  • LVIS

Evaluation Metrics

  • FID
  • YOLO score (detection AP of a pretrained YOLO detector on generated images, measured against the grounding boxes; see the sketch below)
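
The YOLO score measures how well objects in the generated images can be detected at the grounding boxes. As a rough, hypothetical proxy (the paper reports AP from a YOLO detector; here a torchvision detector and a plain IoU match rate stand in), one could do:

```python
# Simplified, hypothetical proxy for a YOLO-style grounding score (not the
# paper's exact protocol).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import box_iou

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def grounding_match_rate(images, gt_boxes, iou_thr: float = 0.5):
    """images: list of (3, H, W) float tensors in [0, 1];
    gt_boxes: list of (N_i, 4) xyxy tensors used to condition generation."""
    matched, total = 0, 0
    for img, boxes in zip(images, gt_boxes):
        pred = detector([img])[0]["boxes"]       # detected boxes, xyxy
        if len(pred) and len(boxes):
            ious = box_iou(boxes, pred)          # (N_gt, N_pred)
            matched += (ious.max(dim=1).values >= iou_thr).sum().item()
        total += len(boxes)
    return matched / max(total, 1)
```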

Results

  • GLIGEN outperforms existing layout-to-image baselines on COCO and LVIS
  • Achieved state-of-the-art performance in grounded image generation
  • Zero-shot performance is significantly better than that of supervised baselines

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

grounded text-to-image diffusion models, controllable generation, open-set grounding
