
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach (Stability AI, 2023)

Paper Information
arXiv ID: 2311.15127
Venue: arXiv.org
Domain: computer vision, machine learning, generative models
Reproducibility: 6/10

Abstract

We present Stable Video Diffusion, a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptation to camera-motion-specific LoRA modules, and that it yields a strong multi-view prior that can be finetuned into a multi-view diffusion model.

Summary

This paper presents Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. It observes that current training recipes for video models vary widely and often combine datasets without a standardized curation methodology. The authors propose a systematic three-stage training framework: text-to-image pretraining, video pretraining on a large low-resolution dataset, and finetuning on a smaller set of high-quality videos. They demonstrate that a well-curated pretraining dataset significantly improves generated video quality, and they describe concrete curation strategies, such as filtering out low-quality and near-static clips, whose effect is validated through human preference studies. The resulting models outperform existing methods in text-to-video and image-to-video generation, provide a strong prior for multi-view generation, and support explicit motion control.
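The paper's actual curation pipeline is not reproduced in this entry. As a rough illustration of the kind of filtering described above, the sketch below drops near-static clips by thresholding their mean optical-flow magnitude; the frame stride and the `MIN_FLOW` threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of one curation filter: dropping near-static clips by
# mean optical-flow magnitude. Stride and MIN_FLOW are illustrative assumptions.
import cv2
import numpy as np

MIN_FLOW = 1.0  # assumed threshold on mean flow magnitude (pixels per frame pair)

def mean_flow_magnitude(video_path: str, stride: int = 8) -> float:
    """Average dense optical-flow magnitude over frame pairs sampled every `stride` frames."""
    cap = cv2.VideoCapture(video_path)
    prev, mags, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                mags.append(np.linalg.norm(flow, axis=-1).mean())
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(mags)) if mags else 0.0

def keep_clip(video_path: str) -> bool:
    # Static clips teach the model "no motion", so they are filtered out.
    return mean_flow_magnitude(video_path) >= MIN_FLOW
```

In a full pipeline, several such filters (cut detection, caption quality, aesthetic scoring) would be combined before the video pretraining stage.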

Methods

This paper employs the following methods:

  • Latent Video Diffusion Models (see the temporal-layer sketch below)
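The core architectural idea behind this method is to extend a 2D image latent-diffusion UNet with temporal layers that mix information across frames. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the `spatial_block` module, channel count, and head count are assumptions for illustration.

```python
# Minimal sketch: wrap a pretrained spatial block from an image LDM UNet with a
# newly inserted temporal attention layer that attends over the frame axis.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently per spatial location."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (B*T, C, H, W) -> tokens of shape (B*H*W, T, C)
        bt, c, h, w = x.shape
        b = bt // num_frames
        tokens = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2).reshape(b * h * w, num_frames, c)
        y, _ = self.attn(self.norm(tokens), self.norm(tokens), self.norm(tokens))
        tokens = tokens + y  # residual; in practice initialised close to identity
        return tokens.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1).reshape(bt, c, h, w)

class VideoBlock(nn.Module):
    """Pretrained spatial block followed by a newly inserted temporal layer."""
    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block      # weights from the image model
        self.temporal = TemporalAttention(channels)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        return self.temporal(self.spatial(x), num_frames)
```

During training, the spatial weights can be initialised from the pretrained image model while the temporal layers are trained (or finetuned jointly) on video data, matching the "insert temporal layers and finetune" recipe described in the abstract.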

Models Used

  • Stable Video Diffusion

Datasets

The following datasets were used in this research:

  • Large Video Dataset (LVD)
  • WebVid-10M

Evaluation Metrics

  • FVD (Fréchet Video Distance)
  • PSNR (Peak Signal-to-Noise Ratio)
  • LPIPS (Learned Perceptual Image Patch Similarity); a short sketch of the latter two follows this list
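As a concrete reference for two of the reported metrics, the sketch below computes PSNR directly and LPIPS via the publicly available `lpips` package; FVD additionally requires a pretrained I3D video network and is omitted. The tensor value ranges and the AlexNet backbone are assumptions for illustration.

```python
# Hedged sketch of PSNR and LPIPS for a pair of image/frame tensors.
# Assumes inputs of shape (N, 3, H, W) with values in [0, 1] and `pip install lpips`.
import torch
import lpips

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio: 10 * log10(max_val^2 / MSE)."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

# LPIPS expects inputs scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net='alex')

def lpips_distance(pred: torch.Tensor, target: torch.Tensor) -> float:
    return float(lpips_fn(pred * 2 - 1, target * 2 - 1).mean())
```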

Results

  • Significant performance improvements from well-curated datasets
  • Outperformed state-of-the-art models in text-to-video and image-to-video generation

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA A100 80GB
