Xin Ma (Department of Data Science & AI, Faculty of Information Technology, Monash University, Australia; Shanghai Artificial Intelligence Laboratory, China), Yaohui Wang (Shanghai Artificial Intelligence Laboratory, China), Gengyun Jia (Nanjing University of Posts and Telecommunications, China), Xinyuan Chen (Shanghai Artificial Intelligence Laboratory, China), Ziwei Liu (S-Lab, Nanyang Technological University, Singapore), Yuan-Fang Li (Department of Data Science & AI, Faculty of Information Technology, Monash University, Australia), Cunjian Chen (Department of Data Science & AI, Faculty of Information Technology, Monash University, Australia), Yu Qiao (Shanghai Artificial Intelligence Laboratory, China) (2024)
The paper introduces Latte, a Latent Diffusion Transformer designed for video generation, which extracts spatio-temporal tokens from input videos and uses Transformer blocks to model the video distribution in latent space. Four efficient model variants are proposed, differing in how their Transformer blocks decompose the spatial and temporal dimensions of the input. The authors provide a thorough evaluation demonstrating that Latte achieves state-of-the-art performance across four video generation datasets: FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. Additionally, Latte is adapted for text-to-video generation, achieving results competitive with existing models. The research also highlights best practices for improving video generation quality, including variations in model architecture and learning strategies.
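The sketch below is not the authors' implementation; it is a minimal illustration of the factorized design the summary describes: a video latent is split into spatio-temporal tokens, and alternating spatial and temporal Transformer blocks model it. The class name `SpatioTemporalBlocks` and all sizes (embedding dimension, head count, depth, token counts) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of alternating spatial/temporal
# Transformer blocks over latent video tokens, as described in the summary.
import torch
import torch.nn as nn


class SpatioTemporalBlocks(nn.Module):
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # One spatial and one temporal block per depth step.
        self.spatial = nn.ModuleList(make_layer() for _ in range(depth))
        self.temporal = nn.ModuleList(make_layer() for _ in range(depth))

    def forward(self, x):
        # x: (batch, frames, tokens_per_frame, dim) -- latent video tokens.
        b, f, n, d = x.shape
        for spa, tem in zip(self.spatial, self.temporal):
            # Spatial attention: tokens within each frame attend to each other.
            x = spa(x.reshape(b * f, n, d)).reshape(b, f, n, d)
            # Temporal attention: each spatial location attends across frames.
            x = x.transpose(1, 2).reshape(b * n, f, d)
            x = tem(x).reshape(b, n, f, d).transpose(1, 2)
        return x


if __name__ == "__main__":
    tokens = torch.randn(1, 16, 64, 256)          # 16 frames, 8x8 latent patches
    print(SpatioTemporalBlocks()(tokens).shape)   # torch.Size([1, 16, 64, 256])
```

In the full model these blocks would sit inside a diffusion denoiser operating on VAE latents; the sketch only shows how spatial and temporal attention can be interleaved over the token grid.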
This paper employs the following methods: a latent diffusion framework, Transformer-based modeling of spatio-temporal video tokens, four architectural variants for handling spatial and temporal information, and an adaptation to text-to-video generation.
The following datasets were used in this research: FaceForensics, SkyTimelapse, UCF101, and Taichi-HD.