Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan (Tencent AI Lab, 2024)
The paper presents VideoCrafter2, a method for training high-quality video diffusion models using only low-quality videos and high-quality images. It highlights the data limitations in video generation: large, high-quality video datasets are often inaccessible, largely due to copyright constraints. The authors analyze the spatial-temporal connection in existing video diffusion models and find that fully trained models exhibit stronger coupling between temporal and spatial modules than partially trained ones. Building on this observation, the proposed approach disentangles motion from appearance at the data level: low-quality videos supply motion supervision during training, while high-quality images are used to improve picture quality. The paper describes the training and fine-tuning pipeline, evaluates the model on a range of metrics, and compares it with state-of-the-art models. Results indicate that the method achieves competitive visual quality and advantageous motion consistency, addressing previous limitations of training under data constraints.
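The two-stage recipe described above can be illustrated compactly. Below is a minimal, hypothetical PyTorch sketch, not the authors' released code: `VideoDiffusionUNet`, its `spatial`/`temporal` blocks, and `denoising_loss` are placeholder names and the tensors are dummy data. It only shows the overall pattern of training all modules on low-quality videos and then fine-tuning the spatial modules alone on high-quality images treated as single-frame videos.

```python
import torch
import torch.nn as nn


class VideoDiffusionUNet(nn.Module):
    """Placeholder denoiser with separate spatial and temporal blocks (hypothetical)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Spatial block operates per frame (appearance); temporal block mixes frames (motion).
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))


def denoising_loss(model: nn.Module, latents: torch.Tensor) -> torch.Tensor:
    """Simplified epsilon-prediction objective; stands in for the full diffusion loss."""
    noise = torch.randn_like(latents)
    pred = model(latents + noise)
    return torch.mean((pred - noise) ** 2)


model = VideoDiffusionUNet()

# Stage 1: fully train spatial AND temporal modules on low-quality videos,
# so the two sets of weights become strongly coupled (as the paper observes).
opt_full = torch.optim.AdamW(model.parameters(), lr=1e-4)
lq_video_latents = torch.randn(2, 64, 8, 32, 32)  # (batch, channels, frames, H, W) dummy data
loss = denoising_loss(model, lq_video_latents)
loss.backward()
opt_full.step()
model.zero_grad(set_to_none=True)

# Stage 2: fine-tune ONLY the spatial modules on high-quality images
# (single-frame "videos"), improving appearance without disturbing learned motion.
for p in model.temporal.parameters():
    p.requires_grad_(False)
opt_spatial = torch.optim.AdamW(model.spatial.parameters(), lr=1e-5)
hq_image_latents = torch.randn(2, 64, 1, 32, 32)  # images as single-frame dummy latents
loss = denoising_loss(model, hq_image_latents)
loss.backward()
opt_spatial.step()
```

The key design choice mirrored here is that motion and appearance are separated at the data level rather than the architecture level: the same network is used in both stages, and only which parameters receive gradients and which data they see changes between stages.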
This paper employs the following methods: video diffusion models, analysis of spatial-temporal module coupling, data-level disentanglement of motion and appearance (training on low-quality videos for motion), and fine-tuning of spatial modules with high-quality images.
The following datasets were used in this research: large-scale low-quality video data for learning motion and high-quality image data for improving visual quality.
The authors identified the following limitations: