Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
Peking University and Bytedance Inc. (2024)
This paper introduces Visual AutoRegressive modeling (VAR), a novel approach for image generation that utilizes a coarse-to-fine methodology known as next-scale prediction, as opposed to traditional next-token prediction. VAR employs autoregressive transformers to learn visual distributions effectively, leading to significant improvements in image generation quality, speed, and efficiency compared to conventional methods and diffusion transformers. The paper presents results indicating that VAR achieves a Fréchet inception distance (FID) of 1.73 and an inception score (IS) of 350.2 on the ImageNet 256×256 benchmark, outperforming existing models in multiple dimensions including image quality and inference speed. Furthermore, VAR exhibits scaling laws similar to large language models (LLMs), and demonstrates capabilities for zero-shot generalization in various downstream tasks like image in-painting and out-painting. These findings underscore the potential of VAR to integrate best practices from LLMs into the field of computer vision, paving the way for advancements in multimodal AI.
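To make the coarse-to-fine "next-scale prediction" idea concrete, here is a minimal sketch of the generation loop the abstract describes: token maps are produced scale by scale, each conditioned on all coarser scales upsampled to the current resolution. The scale sizes, vocabulary size, and the random stand-in for the autoregressive transformer (`predict_scale`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def upsample(tok_map, size):
    # Nearest-neighbor upsample of a 2-D token map (illustrative only).
    h, w = tok_map.shape
    rows = np.repeat(np.arange(h), size // h)[:size]
    cols = np.repeat(np.arange(w), size // w)[:size]
    return tok_map[np.ix_(rows, cols)]

def predict_scale(cond, size, vocab, rng):
    # Stand-in for the transformer: VAR would predict all tokens of the
    # next scale in one forward pass, conditioned on `cond`. Here we just
    # sample uniformly to keep the sketch self-contained.
    return rng.integers(0, vocab, size=(size, size))

def var_generate(scales=(1, 2, 4, 8, 16), vocab=4096, seed=0):
    # Coarse-to-fine loop: each new scale sees all previous (coarser) maps,
    # combined here by a simple sum after upsampling (an assumption).
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        cond = sum(upsample(m, s) for m in maps) if maps else None
        maps.append(predict_scale(cond, s, vocab, rng))
    return maps
```

Each call to `var_generate` returns one token map per scale, from 1×1 up to 16×16 in this toy configuration; decoding those maps back to pixels would be the job of a multi-scale VQ decoder, which is outside the scope of this sketch.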