← ML Research Wiki / 2404.02905

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang (Peking University and Bytedance Inc, 2024)

Paper Information

  • arXiv ID: 2404.02905
  • Venue: Neural Information Processing Systems
  • Domain: computer vision
  • Code: Available
  • Reproducibility: 8/10

Abstract

Abstract not available.

Summary

This paper introduces Visual AutoRegressive modeling (VAR), an approach to image generation that replaces traditional next-token prediction with a coarse-to-fine strategy called next-scale prediction. VAR trains autoregressive transformers to learn visual distributions over progressively finer token maps, yielding substantial gains in image quality, speed, and efficiency over conventional autoregressive models and diffusion transformers. On the ImageNet 256×256 benchmark, VAR achieves a Fréchet inception distance (FID) of 1.73 and an inception score (IS) of 350.2, outperforming existing models in both image quality and inference speed. VAR also exhibits scaling laws similar to those of large language models (LLMs) and demonstrates zero-shot generalization to downstream tasks such as image in-painting and out-painting. These findings underscore VAR's potential to bring best practices from LLMs into computer vision, paving the way for advances in multimodal AI.
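The key idea above — one autoregressive step per scale instead of one per token — can be made concrete with a small, illustrative sketch. The power-of-two scale schedule and helper names below are assumptions for illustration, not the paper's exact configuration:

```python
def scale_schedule(final_size=16):
    """Token-map resolutions from coarse to fine (assumed powers of two)."""
    sizes, s = [], 1
    while s < final_size:
        sizes.append(s)
        s *= 2
    sizes.append(final_size)
    return sizes  # e.g. [1, 2, 4, 8, 16]

def step_counts(final_size=16):
    """Autoregressive steps needed: next-scale prediction takes one step per
    scale (each step emits a whole token map), while next-token prediction
    takes one step per token of the final map."""
    scales = scale_schedule(final_size)
    return len(scales), final_size * final_size

var_steps, token_steps = step_counts(16)
```

For a 16×16 token map this is 5 sequential steps instead of 256, which is the intuition behind the reported inference speedup.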

Methods

This paper employs the following methods:

  • Visual AutoRegressive modeling (VAR)
  • next-scale prediction
  • multi-scale VQVAE

Models Used

  • VAR

Datasets

The following datasets were used in this research:

  • ImageNet

Evaluation Metrics

  • Fréchet inception distance (FID)
  • inception score (IS)
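For reference, FID is the Fréchet distance between two Gaussians fitted to Inception-v3 activations of real and generated images. A minimal numpy sketch of that distance (the symmetric matrix-square-root form is a standard numerical choice, not taken from this paper):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # clamp tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}), using the equivalent
    symmetric form Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) for stability."""
    diff = mu1 - mu2
    s2_half = _sqrtm_psd(sigma2)
    covmean = _sqrtm_psd(s2_half @ sigma1 @ s2_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical distributions give a distance of zero; lower FID therefore means generated statistics sit closer to the real ones.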

Results

  • Improved FID from 18.65 to 1.73
  • Improved IS from 80.4 to 350.2
  • 20x faster inference speed
  • Zero-shot generalization capabilities

Limitations

The authors identified the following limitations:

  • Performance may still lag behind diffusion models in certain aspects
  • VQVAE architecture and training might require further advancements to enhance VAR's effectiveness

Technical Requirements

  • Number of GPUs: not specified
  • GPU Type: not specified

Keywords

autoregressive modeling, image generation, multi-scale prediction, transformers, scaling laws
