https://cvlab-kaist.github.io/DiffTrack

[Teaser figure: prompt, starting point(s), trajectory, and attention visualizations for two generated videos. Prompts: "A red fox, its russet fur illuminated by the golden hues of dawn, stands on the edge of a meadow…" and "A panorama revealing a majestic mountain range cascading into a tranquil sea, dotted with islets…"]

In summary, our contributions are:
- We identify the importance of understanding temporal correspondence in video DiTs and introduce DiffTrack, a novel framework that quantitatively analyzes and identifies temporal matching information within DiTs during video generation.
- We provide a detailed analysis of open-source video DiT models, including CogVideoX [80], HunyuanVideo [47], and CogVideoX-I2V [80], revealing key insights into their internal mechanisms.
- We demonstrate the effectiveness of DiffTrack in zero-shot point tracking, achieving state-of-the-art performance among existing vision foundation and self-supervised video models.
- We present motion-enhanced video generation with CAG, a novel guidance method that improves the motion consistency of generated videos without auxiliary models or supervision.
The paper 'Emergent Temporal Correspondences from Video Diffusion Transformers' introduces DiffTrack, a framework for understanding how video Diffusion Transformers (DiTs) establish temporal correspondences during video generation. The authors analyze state-of-the-art video DiT models such as CogVideoX and find that specific layers, through their query-key similarities, play a critical role in temporal matching, particularly at certain steps of the denoising process. Building on this analysis, DiffTrack enables zero-shot point tracking that outperforms existing vision foundation and self-supervised video models, and the authors propose Cross-Attention Guidance (CAG), a guidance method that enhances motion consistency without additional training. The findings deepen our understanding of DiTs' internal mechanisms and pave the way for future applications in tracking and video generation.
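To make the matching mechanism concrete, here is a minimal sketch of how query-key similarity at a single attention layer can be read out as point tracking. It assumes access to per-frame query/key activations from one self-attention layer of the DiT at one denoising step; the tensor layout, the `track_point` helper, and the toy usage are illustrative assumptions, not the paper's implementation.

```python
import torch

def track_point(queries, keys, start_idx, grid_hw):
    """Zero-shot point tracking via cross-frame query-key similarity.

    queries, keys: [T, N, D] per-frame token features taken from one
    self-attention layer of the video DiT at a chosen denoising step
    (hypothetical extraction; the analysis identifies which layer/step
    is most discriminative).
    start_idx: token index of the query point in frame 0.
    grid_hw: (H, W) of the latent token grid, used to map indices to (x, y).
    """
    H, W = grid_hw
    T, N, D = keys.shape
    assert N == H * W, "token count must match the latent grid"

    q = queries[0, start_idx]             # feature of the start point, [D]
    track = []
    for t in range(T):
        # Cross-frame attention logits between the query point and every
        # token of frame t (scaled dot product, as in standard attention).
        sim = keys[t] @ q / D ** 0.5      # [N]
        idx = sim.argmax().item()         # hard argmax = predicted match
        track.append((idx % W, idx // W)) # token index -> (x, y) on the grid
    return track

# Toy usage with random features standing in for real DiT activations.
T, H, W, D = 8, 30, 45, 64
qk = torch.randn(2, T, H * W, D)
print(track_point(qk[0], qk[1], start_idx=5 * W + 7, grid_hw=(H, W)))
```

A real tracker would operate at the layer and denoising timestep identified by the analysis (see the per-model settings at the end of this section) and would refine the hard argmax to sub-token precision.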
This paper employs the following methods:
- DiffTrack
- Cross-Attention Guidance (CAG)
The paper reports the following key results:
- State-of-the-art performance in zero-shot point tracking
- Enhanced motion consistency in video generation using CAG (see the sketch after this list)
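The exact formulation of CAG is not reproduced here; the sketch below only illustrates the general shape of attention-based guidance in the style of classifier-free or perturbed-attention guidance, which extrapolates from a prediction computed with weakened cross-frame attention toward the standard prediction. The `TinyDiT` stand-in, the `perturb_attention` flag, and the `scale` default are all assumptions, not the paper's API.

```python
import torch

class TinyDiT(torch.nn.Module):
    """Stand-in for a video DiT; real backbones expose richer interfaces."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x_t, t, cond, perturb_attention: bool = False):
        # A real implementation would weaken or mask cross-frame attention
        # inside the transformer when perturb_attention is set; the scalar
        # damping here is only a placeholder for that effect.
        out = self.proj(x_t) + cond
        return out * (0.5 if perturb_attention else 1.0)

def cag_step(model, x_t, t, cond, scale: float = 2.0):
    eps = model(x_t, t, cond)                               # standard prediction
    eps_weak = model(x_t, t, cond, perturb_attention=True)  # degraded temporal cues
    # CFG-style extrapolation: amplify what the perturbation removed,
    # i.e., the cross-frame (temporal matching) signal.
    return eps_weak + scale * (eps - eps_weak)

model = TinyDiT()
x_t = torch.randn(1, 8, 16)   # [batch, frames * tokens, dim] toy latent
cond = torch.randn(1, 8, 16)
print(cag_step(model, x_t, t=0, cond=cond).shape)
```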
The authors identified the following limitations:
- Relies on pre-trained video diffusion transformers; advancements in video backbones could enhance performance
- Does not directly support motion manipulation
- Number of GPUs: 1
- GPU Type: A6000
- Hyperparameter settings (matching layer l and denoising timestep t): l = 17, t = 1 for CogVideoX-2B; l = 16, t = 1 for CogVideoX-5B; and l = 16, t = 1 for HunyuanVideo.
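One possible way to keep these per-model settings organized, e.g. for use with the tracking sketch above (a hypothetical layout, not code shipped with the paper):

```python
# Hypothetical config mapping each backbone to the (layer, timestep) pair
# listed above; layer indexing conventions may differ between codebases.
DIFFTRACK_SETTINGS = {
    "CogVideoX-2B": {"layer": 17, "timestep": 1},
    "CogVideoX-5B": {"layer": 16, "timestep": 1},
    "HunyuanVideo": {"layer": 16, "timestep": 1},
}
```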