Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach (Stability AI, 2023)
This paper presents Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. It identifies a shortcoming of current video-model training pipelines, which often combine heterogeneous datasets without a standardized curation methodology. The authors propose a systematic three-stage training framework: text-to-image pretraining, video pretraining on a large low-resolution dataset, and finetuning on a small, high-quality video dataset. They demonstrate that well-curated pretraining data significantly improves the quality of the generated videos, and they describe data curation strategies for filtering out low-quality clips, with quality assessed via human preference studies. The results show that the model outperforms existing methods in generating high-quality videos, and that it can also produce multi-view representations of objects and supports explicit motion control.