
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Haoxin Chen Tencent AI Lab, Yong Zhang Tencent AI Lab, Xiaodong Cun Tencent AI Lab, Menghan Xia Tencent AI Lab, Xintao Wang Tencent AI Lab, Chao Weng Tencent AI Lab, Ying Shan Tencent AI Lab (2024)

Paper Information
arXiv ID: 2401.09047
Venue: Computer Vision and Pattern Recognition
Domain: Not specified

Abstract

Figure 1. Given a text prompt, our method can generate a video with high visual quality and accurate text-video alignment. Note that it is trained with only low-quality videos and high-quality images; no high-quality videos are required. Example prompts: "In cyberpunk, neonpunk style, Kung Fu Panda, jump and kick"; "Cinematic photo: melting pistachio ice cream dripping down the cone, 35mm photograph, film, bokeh"; "Large motion, surrounded by butterflies, a girl walks through a lush garden".

Summary

The paper presents VideoCrafter2, a method for training high-quality video diffusion models using only low-quality videos and high-quality images. It addresses a core data limitation of video generation: large, high-quality video datasets are largely inaccessible, often because of copyright constraints. The authors analyze the spatial-temporal connection in existing video diffusion models and observe that fully trained models exhibit a stronger coupling between temporal and spatial modules than partially trained ones. Building on this observation, the proposed approach disentangles motion from appearance at the data level: low-quality videos supply motion supervision, while high-quality images are used to improve picture quality. The paper describes the training and fine-tuning pipeline, evaluates the model on multiple metrics, and compares it with state-of-the-art systems. Results indicate that the method achieves competitive visual quality and favorable motion consistency, addressing previous limitations of training under data constraints.
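
To make the pipeline concrete, below is a minimal PyTorch-style sketch of the data-level disentanglement described above: a first stage trains all modules on low-quality videos, and a second stage fine-tunes only the spatial modules on high-quality images while the temporal modules stay frozen. The UNet call signature, the name-based split into "temporal" versus spatial parameters, and the step counts are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of data-level
# motion/appearance disentanglement for a latent video diffusion UNet.
import torch
import torch.nn.functional as F

def set_trainable(unet: torch.nn.Module, spatial: bool, temporal: bool) -> None:
    """Freeze/unfreeze parameter groups; assumes temporal layers carry 'temporal' in their names."""
    for name, p in unet.named_parameters():
        p.requires_grad = temporal if "temporal" in name else spatial

def diffusion_loss(unet, x0, text_emb, alphas_cumprod):
    """Standard epsilon-prediction objective; x0 are latents of shape (B, C, T, H, W)."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    pred = unet(x_t, t, text_emb)                 # assumed call signature
    return F.mse_loss(pred, noise)

def train(unet, alphas_cumprod, lq_video_loader, hq_image_loader,
          video_steps=10_000, image_steps=2_000, lr=1e-5):
    opt = torch.optim.AdamW(unet.parameters(), lr=lr)

    # Stage 1: learn motion and appearance jointly from low-quality videos.
    set_trainable(unet, spatial=True, temporal=True)
    for _, (latents, text_emb) in zip(range(video_steps), lq_video_loader):
        loss = diffusion_loss(unet, latents, text_emb, alphas_cumprod)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: refine appearance from high-quality images, keeping motion intact.
    set_trainable(unet, spatial=True, temporal=False)
    for _, (image_latents, text_emb) in zip(range(image_steps), hq_image_loader):
        frames = image_latents.unsqueeze(2)       # treat an image as a one-frame video
        loss = diffusion_loss(unet, frames, text_emb, alphas_cumprod)
        opt.zero_grad(); loss.backward(); opt.step()
```

The key design choice mirrored here is that only the second stage touches the spatial parameters, so the motion learned from low-quality videos is preserved while appearance is upgraded from high-quality images.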

Methods

This paper employs the following methods:

  • Diffusion models
  • Disentangling motion from appearance at the data level (spatial vs. temporal modules; see the sketch after this list)
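
For orientation, the sketch below shows the spatial/temporal factorization typical of latent video diffusion UNets: spatial attention mixes pixels within each frame, temporal attention mixes frames at each spatial position. The layer choices and dimensions are illustrative assumptions, not the exact VideoCrafter2 block design.

```python
# Hedged sketch of a factorized spatial/temporal block (illustrative, not the paper's architecture).
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention across frames."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video latents
        b, c, t, h, w = x.shape

        # Spatial attention: tokens are the H*W positions of each frame.
        xs = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        q = self.norm_s(xs)
        xs = xs + self.spatial_attn(q, q, q, need_weights=False)[0]

        # Temporal attention: tokens are the T frames at each spatial position.
        xt = xs.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        q = self.norm_t(xt)
        xt = xt + self.temporal_attn(q, q, q, need_weights=False)[0]

        return xt.reshape(b, h * w, t, c).permute(0, 3, 2, 1).reshape(b, c, t, h, w)
```

A block like `SpatialTemporalBlock(320)` maps a `(B, 320, T, H, W)` latent to the same shape; in the two-stage recipe above, the temporal attention is the part that would stay frozen during image fine-tuning.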

Models Used

  • Stable Diffusion
  • SDXL
  • Midjourney
  • VideoCrafter1
  • Gen-2
  • Pika Labs
  • Show-1
  • AnimateDiff

Datasets

The following datasets were used in this research:

  • WebVid-10M
  • JDB (JourneyDB)
  • LAION-COCO

Evaluation Metrics

  • EvalCrafter benchmark (text-video alignment and video quality scores; see the sketch below)
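
EvalCrafter aggregates multiple scores, including text-video alignment. As a rough illustration of how such an alignment score can be computed, here is a frame-averaged CLIP similarity between a prompt and sampled frames; this is a generic sketch, not EvalCrafter's implementation, and the checkpoint name is just an example.

```python
# Generic frame-averaged CLIP alignment score (illustrative; not EvalCrafter's code).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_video_score(prompt: str, frames) -> float:
    """frames: list of PIL.Image frames sampled from one generated clip."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    # Cosine similarity of each frame to the prompt, averaged over frames.
    return (image_emb @ text_emb.T).mean().item()
```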

Results

  • Competitive visual quality compared with models trained on high-quality videos
  • Improved text-video alignment
  • Motion quality superior to several of the compared models

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 32
  • GPU Type: NVIDIA A100
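
A run on 32 A100s is commonly organized as multi-node data parallelism (e.g., 4 nodes x 8 GPUs). The snippet below is generic PyTorch DDP boilerplate for such a setup, not the authors' training code.

```python
# Generic DDP setup for a multi-GPU run launched one process per GPU (e.g., via torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Initialize NCCL process group and wrap the model for data-parallel training."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```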
