
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, Jie Tang ([email protected]); Tsinghua University and Zhipu AI (2024)

Paper Information
arXiv ID
2408.06072
Venue
arXiv.org
Domain
Computer vision, Generative AI, Natural language processing
SOTA Claim
Yes
Reproducibility
7/10

Abstract

Figure 1: CogVideoX can generate long-duration, high-resolution videos with coherent actions and rich semantics.

Summary

The paper presents CogVideoX, a state-of-the-art text-to-video diffusion model that generates coherent, long-duration, high-quality videos by combining a 3D Variational Autoencoder (VAE) with an expert Transformer architecture. It addresses the challenge of producing temporally consistent videos from textual prompts through a progressive training pipeline, a video data filtering and captioning system, and an explicit uniform sampling technique for diffusion timesteps. The model was trained on a dataset of 35 million video clips and, when evaluated against existing models, achieved superior performance on both automated metrics and human evaluation. CogVideoX is publicly released to advance the field of video generation.
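To make the 3D VAE concrete, the sketch below computes the latent shape for a video clip. The compression ratios (4x temporal, 8x8 spatial) match those reported for CogVideoX's 3D causal VAE; the latent channel count and the causal handling of the first frame are assumptions for illustration, not a definitive description of the implementation.

```python
def latent_shape(frames, height, width, t_down=4, s_down=8, latent_channels=16):
    """Shape of a video latent after a causal 3D VAE encoder.

    Assumes the compression ratios reported for CogVideoX's 3D VAE
    (4x temporal, 8x8 spatial). latent_channels and the causal
    first-frame handling are illustrative assumptions: the first
    frame is encoded on its own, so the temporal length becomes
    1 + (frames - 1) // t_down.
    """
    t = 1 + (frames - 1) // t_down
    return (latent_channels, t, height // s_down, width // s_down)

# A 49-frame 480x720 clip compresses to a (16, 13, 60, 90) latent.
print(latent_shape(49, 480, 720))
```

This is why a few dozen video frames fit into a Transformer-sized token sequence: the VAE shrinks the spatio-temporal volume by roughly 256x before patchification.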

Methods

This paper employs the following methods:

  • 3D Variational Autoencoder
  • Expert Transformer
  • Explicit Uniform Sampling
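Of the methods above, Explicit Uniform Sampling is the easiest to illustrate: instead of every data-parallel rank drawing a diffusion timestep i.i.d. from the whole range, the range is partitioned across ranks so each training step covers timesteps spread evenly over [0, T). The function below is a minimal sketch of that idea; the function name and interval scheme are illustrative assumptions, not the paper's exact implementation.

```python
import random

def explicit_uniform_timestep(rank, n_ranks, T=1000):
    """Draw one diffusion timestep for a given data-parallel rank.

    Sketch of explicit uniform sampling: [0, T) is partitioned into
    n_ranks contiguous intervals, and each rank samples only within
    its own interval. Across all ranks, every step then sees
    timesteps spread uniformly over the range instead of clustered
    i.i.d. draws, which reduces variance in the training loss.
    """
    lo = rank * T // n_ranks
    hi = (rank + 1) * T // n_ranks
    return random.randrange(lo, hi)

# Rank 0 of 8 draws from [0, 125); rank 7 draws from [875, 1000).
```

With plain i.i.d. sampling, a step can by chance concentrate all its timesteps in one noise regime; pinning each rank to its own interval guarantees coverage of the full noise schedule every step.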

Models Used

  • CogVideoX-5B
  • CogVideoX-2B

Datasets

The following datasets were used in this research:

  • LAION-5B
  • COYO-700M

Evaluation Metrics

  • Dynamic Quality
  • GPT4o-MTScore
  • Human evaluation scores

Results

  • CogVideoX-5B outperforms top-performing video models
  • CogVideoX-2B is competitive across most dimensions

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Text-to-video, Diffusion models, Transformer, Video VAE, Video captioning, Progressive training
