Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
Peking University and Bytedance Inc. (2024)
This paper introduces Visual AutoRegressive modeling (VAR), a novel approach for image generation that utilizes a coarse-to-fine methodology known as next-scale prediction, as opposed to traditional next-token prediction. VAR employs autoregressive transformers to learn visual distributions effectively, leading to significant improvements in image generation quality, speed, and efficiency compared to conventional methods and diffusion transformers. The paper presents results indicating that VAR achieves a Fréchet inception distance (FID) of 1.73 and an inception score (IS) of 350.2 on the ImageNet 256×256 benchmark, outperforming existing models in multiple dimensions including image quality and inference speed. Furthermore, VAR exhibits scaling laws similar to large language models (LLMs), and demonstrates capabilities for zero-shot generalization in various downstream tasks like image in-painting and out-painting. These findings underscore the potential of VAR to integrate best practices from LLMs into the field of computer vision, paving the way for advancements in multimodal AI.
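To make the coarse-to-fine "next-scale prediction" idea concrete, here is a minimal sketch of the generation loop the abstract describes: token maps are produced scale by scale, each conditioned on all coarser scales upsampled to the current resolution. The scale sizes, vocabulary size, and the random stand-in for the autoregressive transformer (`predict_scale`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def upsample(tok_map, size):
    # Nearest-neighbor upsample of a 2-D token map (illustrative only).
    h, w = tok_map.shape
    rows = np.repeat(np.arange(h), size // h)[:size]
    cols = np.repeat(np.arange(w), size // w)[:size]
    return tok_map[np.ix_(rows, cols)]

def predict_scale(cond, size, vocab, rng):
    # Stand-in for the transformer: VAR would predict all tokens of the
    # next scale in one forward pass, conditioned on `cond`. Here we just
    # sample uniformly to keep the sketch self-contained.
    return rng.integers(0, vocab, size=(size, size))

def var_generate(scales=(1, 2, 4, 8, 16), vocab=4096, seed=0):
    # Coarse-to-fine loop: each new scale sees all previous (coarser) maps,
    # combined here by a simple sum after upsampling (an assumption).
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        cond = sum(upsample(m, s) for m in maps) if maps else None
        maps.append(predict_scale(cond, s, vocab, rng))
    return maps
```

Each call to `var_generate` returns one token map per scale, from 1×1 up to 16×16 in this toy configuration; decoding those maps back to pixels would be the job of a multi-scale VQ decoder, which is outside the scope of this sketch.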