https://cvlab-kaist.github.io/DiffTrack

[Teaser figure: prompt, starting point(s), trajectory, and attention visualizations for two generated videos. Prompts: "A red fox, its russet fur illuminated by the golden hues of dawn, stands on the edge of a meadow…" and "A panorama revealing a majestic mountain range cascading into a tranquil sea, dotted with islets…"]

In summary, our contributions are:
- We identify the importance of understanding temporal correspondence in video DiTs and introduce DiffTrack, a novel framework that quantitatively analyzes and identifies temporal matching information within DiTs during video generation.
- We provide a detailed analysis of open-source video DiT models, including CogVideoX [80], HunyuanVideo [47], and CogVideoX-I2V [80], revealing key insights into their internal mechanisms.
- We demonstrate the effectiveness of DiffTrack in zero-shot point tracking, achieving state-of-the-art performance among existing vision foundation and self-supervised video models.
- We present motion-enhanced video generation with CAG, a novel guidance method that improves the motion consistency of generated videos without auxiliary models or supervision.
The paper 'Emergent Temporal Correspondences from Video Diffusion Transformers' introduces DiffTrack, a framework for understanding how video Diffusion Transformers (DiTs) establish temporal correspondences during video generation. The authors analyze state-of-the-art video DiT models such as CogVideoX and find that specific layers, through their query-key similarities, play a critical role in temporal matching, particularly at certain steps of the denoising process. Building on this analysis, DiffTrack enables zero-shot point tracking that outperforms existing vision foundation and self-supervised video models, and the authors propose Cross-Attention Guidance (CAG), a guidance method that enhances motion consistency without additional training. The findings deepen our understanding of DiTs' internal mechanisms and pave the way for future applications in tracking and video generation.
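To make the matching mechanism concrete, here is a minimal sketch of how query-key similarity at a single attention layer can be read out as point tracking. It assumes access to per-frame query/key activations from one self-attention layer of the DiT at one denoising step; the tensor layout, the `track_point` helper, and the toy usage are illustrative assumptions, not the paper's implementation.

```python
import torch

def track_point(queries, keys, start_idx, grid_hw):
    """Zero-shot point tracking via cross-frame query-key similarity.

    queries, keys: [T, N, D] per-frame token features taken from one
    self-attention layer of the video DiT at a chosen denoising step
    (hypothetical extraction; the analysis identifies which layer/step
    is most discriminative).
    start_idx: token index of the query point in frame 0.
    grid_hw: (H, W) of the latent token grid, used to map indices to (x, y).
    """
    H, W = grid_hw
    T, N, D = keys.shape
    assert N == H * W, "token count must match the latent grid"

    q = queries[0, start_idx]             # feature of the start point, [D]
    track = []
    for t in range(T):
        # Cross-frame attention logits between the query point and every
        # token of frame t (scaled dot product, as in standard attention).
        sim = keys[t] @ q / D ** 0.5      # [N]
        idx = sim.argmax().item()         # hard argmax = predicted match
        track.append((idx % W, idx // W)) # token index -> (x, y) on the grid
    return track

# Toy usage with random features standing in for real DiT activations.
T, H, W, D = 8, 30, 45, 64
qk = torch.randn(2, T, H * W, D)
print(track_point(qk[0], qk[1], start_idx=5 * W + 7, grid_hw=(H, W)))
```

A real tracker would operate at the layer and denoising timestep identified by the analysis (see the per-model settings at the end of this section) and would refine the hard argmax to sub-token precision.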
This paper employs the following methods:
- DiffTrack
- Cross-Attention Guidance (CAG)
The paper reports the following key results:
- State-of-the-art performance in zero-shot point tracking
- Enhanced motion consistency in video generation using CAG (see the sketch after this list)
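The exact formulation of CAG is not reproduced here; the sketch below only illustrates the general shape of attention-based guidance in the style of classifier-free or perturbed-attention guidance, which extrapolates from a prediction computed with weakened cross-frame attention toward the standard prediction. The `TinyDiT` stand-in, the `perturb_attention` flag, and the `scale` default are all assumptions, not the paper's API.

```python
import torch

class TinyDiT(torch.nn.Module):
    """Stand-in for a video DiT; real backbones expose richer interfaces."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x_t, t, cond, perturb_attention: bool = False):
        # A real implementation would weaken or mask cross-frame attention
        # inside the transformer when perturb_attention is set; the scalar
        # damping here is only a placeholder for that effect.
        out = self.proj(x_t) + cond
        return out * (0.5 if perturb_attention else 1.0)

def cag_step(model, x_t, t, cond, scale: float = 2.0):
    eps = model(x_t, t, cond)                               # standard prediction
    eps_weak = model(x_t, t, cond, perturb_attention=True)  # degraded temporal cues
    # CFG-style extrapolation: amplify what the perturbation removed,
    # i.e., the cross-frame (temporal matching) signal.
    return eps_weak + scale * (eps - eps_weak)

model = TinyDiT()
x_t = torch.randn(1, 8, 16)   # [batch, frames * tokens, dim] toy latent
cond = torch.randn(1, 8, 16)
print(cag_step(model, x_t, t=0, cond=cond).shape)
```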
The authors identified the following limitations:
- Relies on pre-trained video diffusion transformers; advancements in video backbones could enhance performance
- Does not directly support motion manipulation
- Number of GPUs: 1
- GPU Type: A6000
- Hyperparameter settings (matching layer l and denoising timestep t): l = 17, t = 1 for CogVideoX-2B; l = 16, t = 1 for CogVideoX-5B; and l = 16, t = 1 for HunyuanVideo.
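One possible way to keep these per-model settings organized, e.g. for use with the tracking sketch above (a hypothetical layout, not code shipped with the paper):

```python
# Hypothetical config mapping each backbone to the (layer, timestep) pair
# listed above; layer indexing conventions may differ between codebases.
DIFFTRACK_SETTINGS = {
    "CogVideoX-2B": {"layer": 17, "timestep": 1},
    "CogVideoX-5B": {"layer": 16, "timestep": 1},
    "HunyuanVideo": {"layer": 16, "timestep": 1},
}
```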