
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang ([email protected]); Tsinghua University and Zhipu AI (2024)

Paper Information
arXiv ID
2408.06072
Venue
arXiv.org
Domain
Computer vision, Generative AI, Natural language processing
SOTA Claim
Yes
Reproducibility
7/10

Abstract

Figure 1: CogVideoX can generate long-duration, high-resolution videos with coherent actions and rich semantics.

Summary

The paper presents CogVideoX, a state-of-the-art text-to-video diffusion model that generates coherent, long-duration, high-quality videos by combining a 3D Variational Autoencoder (VAE) with an expert Transformer architecture. It addresses the challenge of producing temporally consistent videos from textual prompts through a progressive training pipeline, a video data filtering and captioning system, and an explicit uniform sampling technique for diffusion timesteps. The model was trained on a dataset of 35 million video clips and, when evaluated against existing models, achieved superior performance on both automated metrics and human evaluation. CogVideoX is publicly released to advance the field of video generation.
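To make the 3D VAE concrete, the sketch below computes the latent shape for a video clip. The compression ratios (4x temporal, 8x8 spatial) match those reported for CogVideoX's 3D causal VAE; the latent channel count and the causal handling of the first frame are assumptions for illustration, not a definitive description of the implementation.

```python
def latent_shape(frames, height, width, t_down=4, s_down=8, latent_channels=16):
    """Shape of a video latent after a causal 3D VAE encoder.

    Assumes the compression ratios reported for CogVideoX's 3D VAE
    (4x temporal, 8x8 spatial). latent_channels and the causal
    first-frame handling are illustrative assumptions: the first
    frame is encoded on its own, so the temporal length becomes
    1 + (frames - 1) // t_down.
    """
    t = 1 + (frames - 1) // t_down
    return (latent_channels, t, height // s_down, width // s_down)

# A 49-frame 480x720 clip compresses to a (16, 13, 60, 90) latent.
print(latent_shape(49, 480, 720))
```

This is why a few dozen video frames fit into a Transformer-sized token sequence: the VAE shrinks the spatio-temporal volume by roughly 256x before patchification.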

Methods

This paper employs the following methods:

  • 3D Variational Autoencoder
  • Expert Transformer
  • Explicit Uniform Sampling
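Of the methods above, Explicit Uniform Sampling is the easiest to illustrate: instead of every data-parallel rank drawing a diffusion timestep i.i.d. from the whole range, the range is partitioned across ranks so each training step covers timesteps spread evenly over [0, T). The function below is a minimal sketch of that idea; the function name and interval scheme are illustrative assumptions, not the paper's exact implementation.

```python
import random

def explicit_uniform_timestep(rank, n_ranks, T=1000):
    """Draw one diffusion timestep for a given data-parallel rank.

    Sketch of explicit uniform sampling: [0, T) is partitioned into
    n_ranks contiguous intervals, and each rank samples only within
    its own interval. Across all ranks, every step then sees
    timesteps spread uniformly over the range instead of clustered
    i.i.d. draws, which reduces variance in the training loss.
    """
    lo = rank * T // n_ranks
    hi = (rank + 1) * T // n_ranks
    return random.randrange(lo, hi)

# Rank 0 of 8 draws from [0, 125); rank 7 draws from [875, 1000).
```

With plain i.i.d. sampling, a step can by chance concentrate all its timesteps in one noise regime; pinning each rank to its own interval guarantees coverage of the full noise schedule every step.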

Models Used

  • CogVideoX-5B
  • CogVideoX-2B

Datasets

The following datasets were used in this research:

  • LAION-5B
  • COYO-700M

Evaluation Metrics

  • Dynamic Quality
  • GPT4o-MTScore
  • Human evaluation scores

Results

  • CogVideoX-5B outperforms top-performing video models
  • CogVideoX-2B is competitive across most dimensions

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Text-to-video, Diffusion models, Transformer, Video VAE, Video captioning, Progressive training
