Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang ([email protected])
Tsinghua University; † Zhipu AI (2024)
The paper presents CogVideoX, a text-to-video diffusion model that generates coherent, long-duration, high-quality videos from text prompts. It pairs a 3D Variational Autoencoder (VAE), which compresses videos along both spatial and temporal dimensions, with an expert Transformer architecture. To address the difficulty of generating videos with consistent dynamics from textual prompts, the authors introduce a progressive training pipeline, a video data filtering and captioning system, and explicit sampling techniques for diffusion timesteps. The model is trained on a dataset of 35 million video clips and outperforms existing models across multiple evaluation metrics. CogVideoX is publicly released to advance the field of video generation.
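The "explicit sampling techniques" refer to the paper's explicit uniform sampling of diffusion timesteps: the timestep range is partitioned across data-parallel ranks, and each rank samples only within its own interval, so the union of all ranks' samples covers the range more evenly per step and reduces loss variance. Below is a minimal sketch of that idea, assuming a standard discrete-timestep diffusion setup; the function name and signature are illustrative, not taken from the released code.

```python
import torch

def explicit_uniform_timesteps(
    batch_size: int,
    num_train_timesteps: int,
    rank: int,
    world_size: int,
    device: str = "cpu",
) -> torch.Tensor:
    """Sample diffusion timesteps uniformly within this rank's interval.

    Partitioning [0, num_train_timesteps) across data-parallel ranks means
    the combined samples of all ranks cover the timestep range more evenly
    per step than fully independent sampling, reducing loss variance.
    """
    interval = num_train_timesteps // world_size
    low = rank * interval
    # The last rank absorbs the remainder when T is not divisible by world_size.
    high = num_train_timesteps if rank == world_size - 1 else low + interval
    return torch.randint(low, high, (batch_size,), device=device)

# Example: 4 data-parallel ranks over 1000 timesteps; rank 2 draws from [500, 750).
t = explicit_uniform_timesteps(batch_size=8, num_train_timesteps=1000, rank=2, world_size=4)
```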
This paper employs the following methods:
- A 3D Variational Autoencoder (VAE) that compresses videos along both spatial and temporal dimensions (a minimal sketch of the compression geometry follows this list)
- An expert Transformer architecture for aligning text and video
- A progressive training pipeline for long-duration generation
- A video data filtering and captioning system
- Explicit uniform sampling of diffusion timesteps (sketched after the summary above)
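The 3D VAE's headline property is joint spatial-temporal compression (reported in the paper as 4x in time and 8x per spatial dimension). The sketch below illustrates only that compression geometry using plain 3D convolutions; the class name, channel widths, and activations are invented for brevity, and the actual encoder is causal and far larger.

```python
import torch
from torch import nn

class Tiny3DVAEEncoder(nn.Module):
    """Illustrative 3D convolutional encoder compressing video 4x in time
    and 8x in each spatial dimension, matching the compression ratios
    described for CogVideoX's 3D VAE. Not the real architecture."""

    def __init__(self, in_channels: int = 3, latent_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            # Each of the first two blocks halves time and space: stride (2, 2, 2).
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            # Final block keeps time, halves space: stride (1, 2, 2),
            # giving 4x temporal and 8x spatial compression overall.
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        return self.net(video)

# Example: 16 frames at 256x256 compress to a 4x32x32 latent grid.
latents = Tiny3DVAEEncoder()(torch.randn(1, 3, 16, 256, 256))
print(latents.shape)  # torch.Size([1, 16, 4, 32, 32])
```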
The following datasets were used in this research: a curated training set of approximately 35 million video clips, filtered and captioned with the paper's data pipeline.