
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Haoxin Chen Tencent AI Lab, Yong Zhang Tencent AI Lab, Xiaodong Cun Tencent AI Lab, Menghan Xia Tencent AI Lab, Xintao Wang Tencent AI Lab, Chao Weng Tencent AI Lab, Ying Shan Tencent AI Lab (2024)

Paper Information
arXiv ID: 2401.09047
Venue: Computer Vision and Pattern Recognition
Domain: Not specified

Abstract

Figure 1. Given a text prompt, our method can generate a video with high visual quality and accurate text-video alignment. Note that it is trained with only low-quality videos and high-quality images; no high-quality videos are required. Example prompts: "In cyberpunk, neonpunk style, Kung Fu Panda, jump and kick"; "Cinematic photo: melting pistachio ice cream dripping down the cone, 35mm photograph, film, bokeh"; "Large motion, surrounded by butterflies, a girl walks through a lush garden".

Summary

The paper presents VideoCrafter2, a method for training high-quality video diffusion models using only low-quality videos and high-quality images. It addresses a core data limitation of video generation: large, high-quality video datasets are largely inaccessible, often because of copyright constraints. The authors analyze the spatial-temporal connection in existing video diffusion models and observe that fully trained models exhibit a stronger coupling between temporal and spatial modules than partially trained ones. Building on this observation, the proposed approach disentangles motion from appearance at the data level: low-quality videos supply motion supervision, while high-quality images are used to improve picture quality. The paper describes the training and fine-tuning pipeline, evaluates the model on multiple metrics, and compares it with state-of-the-art systems. Results indicate that the method achieves competitive visual quality and favorable motion consistency, addressing previous limitations of training under data constraints.
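
To make the pipeline concrete, below is a minimal PyTorch-style sketch of the data-level disentanglement described above: a first stage trains all modules on low-quality videos, and a second stage fine-tunes only the spatial modules on high-quality images while the temporal modules stay frozen. The UNet call signature, the name-based split into "temporal" versus spatial parameters, and the step counts are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of data-level
# motion/appearance disentanglement for a latent video diffusion UNet.
import torch
import torch.nn.functional as F

def set_trainable(unet: torch.nn.Module, spatial: bool, temporal: bool) -> None:
    """Freeze/unfreeze parameter groups; assumes temporal layers carry 'temporal' in their names."""
    for name, p in unet.named_parameters():
        p.requires_grad = temporal if "temporal" in name else spatial

def diffusion_loss(unet, x0, text_emb, alphas_cumprod):
    """Standard epsilon-prediction objective; x0 are latents of shape (B, C, T, H, W)."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    pred = unet(x_t, t, text_emb)                 # assumed call signature
    return F.mse_loss(pred, noise)

def train(unet, alphas_cumprod, lq_video_loader, hq_image_loader,
          video_steps=10_000, image_steps=2_000, lr=1e-5):
    opt = torch.optim.AdamW(unet.parameters(), lr=lr)

    # Stage 1: learn motion and appearance jointly from low-quality videos.
    set_trainable(unet, spatial=True, temporal=True)
    for _, (latents, text_emb) in zip(range(video_steps), lq_video_loader):
        loss = diffusion_loss(unet, latents, text_emb, alphas_cumprod)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: refine appearance from high-quality images, keeping motion intact.
    set_trainable(unet, spatial=True, temporal=False)
    for _, (image_latents, text_emb) in zip(range(image_steps), hq_image_loader):
        frames = image_latents.unsqueeze(2)       # treat an image as a one-frame video
        loss = diffusion_loss(unet, frames, text_emb, alphas_cumprod)
        opt.zero_grad(); loss.backward(); opt.step()
```

The key design choice mirrored here is that only the second stage touches the spatial parameters, so the motion learned from low-quality videos is preserved while appearance is upgraded from high-quality images.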

Methods

This paper employs the following methods:

  • Diffusion models
  • Disentangling motion from appearance at the data level (spatial vs. temporal modules; see the sketch after this list)
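
For orientation, the sketch below shows the spatial/temporal factorization typical of latent video diffusion UNets: spatial attention mixes pixels within each frame, temporal attention mixes frames at each spatial position. The layer choices and dimensions are illustrative assumptions, not the exact VideoCrafter2 block design.

```python
# Hedged sketch of a factorized spatial/temporal block (illustrative, not the paper's architecture).
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention across frames."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video latents
        b, c, t, h, w = x.shape

        # Spatial attention: tokens are the H*W positions of each frame.
        xs = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        q = self.norm_s(xs)
        xs = xs + self.spatial_attn(q, q, q, need_weights=False)[0]

        # Temporal attention: tokens are the T frames at each spatial position.
        xt = xs.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        q = self.norm_t(xt)
        xt = xt + self.temporal_attn(q, q, q, need_weights=False)[0]

        return xt.reshape(b, h * w, t, c).permute(0, 3, 2, 1).reshape(b, c, t, h, w)
```

A block like `SpatialTemporalBlock(320)` maps a `(B, 320, T, H, W)` latent to the same shape; in the two-stage recipe above, the temporal attention is the part that would stay frozen during image fine-tuning.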

Models Used

  • Stable Diffusion
  • SDXL
  • Midjourney
  • VideoCrafter1
  • Gen-2
  • Pika Labs
  • Show-1
  • AnimateDiff

Datasets

The following datasets were used in this research:

  • WebVid-10M
  • JDB (JourneyDB)
  • LAION-COCO

Evaluation Metrics

  • EvalCrafter benchmark (text-video alignment and video quality scores; see the sketch below)
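
EvalCrafter aggregates multiple scores, including text-video alignment. As a rough illustration of how such an alignment score can be computed, here is a frame-averaged CLIP similarity between a prompt and sampled frames; this is a generic sketch, not EvalCrafter's implementation, and the checkpoint name is just an example.

```python
# Generic frame-averaged CLIP alignment score (illustrative; not EvalCrafter's code).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_video_score(prompt: str, frames) -> float:
    """frames: list of PIL.Image frames sampled from one generated clip."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    # Cosine similarity of each frame to the prompt, averaged over frames.
    return (image_emb @ text_emb.T).mean().item()
```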

Results

  • Competitive visual quality compared with models trained on high-quality videos
  • Improved text-video alignment
  • Motion quality superior to several of the compared models

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 32
  • GPU Type: NVIDIA A100
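
A run on 32 A100s is commonly organized as multi-node data parallelism (e.g., 4 nodes x 8 GPUs). The snippet below is generic PyTorch DDP boilerplate for such a setup, not the authors' training code.

```python
# Generic DDP setup for a multi-GPU run launched one process per GPU (e.g., via torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Initialize NCCL process group and wrap the model for data-parallel training."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```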
