Figure 1. Hunyuan-GameCraft can create high-dynamic interactive game video content from a single image and a corresponding prompt. We simulate a series of action signals. The left and right frames depict key moments from game video sequences generated in response to different inputs. Hunyuan-GameCraft accurately produces content aligned with each interaction, supports long-term video generation with temporal and 3D consistency, and effectively preserves historical scene information throughout the sequence. In this case, W, A, S, D denote translational movement and ↑, ←, ↓, → denote changes in view angle.
Hunyuan-GameCraft is a novel framework for high-dynamic interactive game video generation that uses a hybrid history conditioning strategy to improve user interaction and long-term consistency in gameplay footage. Built upon the text-to-video foundation model Hunyuan-Video, it integrates standard game control actions into a unified camera representation, enabling smooth user-driven interaction. A hybrid history-conditioned training strategy maintains the fidelity of extended video sequences, while model distillation reduces computational overhead for near-real-time use. Extensive experiments demonstrate superiority over existing models in generation quality, interactive capability, and computational efficiency. The paper also details its architectural innovations in camera control and long-video generation, and shows that training on both curated game scenes and synthetic data yields robust performance across diverse contexts.
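The "unified camera representation" maps discrete game controls onto continuous camera motion. The paper's exact parameterisation is not reproduced here; the sketch below is a minimal illustration, assuming a 6-DoF pose delta (translation plus Euler-angle rotation) and hypothetical per-action step sizes:

```python
import numpy as np

# Hypothetical mapping from action keys to a 6-DoF camera delta:
# (dx, dy, dz, pitch, yaw, roll). Translations in arbitrary units,
# rotations in radians. Values are illustrative, not from the paper.
ACTION_TO_DELTA = {
    "W":     np.array([0.0, 0.0,  1.0, 0.0,  0.0, 0.0]),  # move forward
    "S":     np.array([0.0, 0.0, -1.0, 0.0,  0.0, 0.0]),  # move backward
    "A":     np.array([-1.0, 0.0, 0.0, 0.0,  0.0, 0.0]),  # strafe left
    "D":     np.array([1.0, 0.0,  0.0, 0.0,  0.0, 0.0]),  # strafe right
    "UP":    np.array([0.0, 0.0,  0.0, -0.1, 0.0, 0.0]),  # pitch up
    "DOWN":  np.array([0.0, 0.0,  0.0,  0.1, 0.0, 0.0]),  # pitch down
    "LEFT":  np.array([0.0, 0.0,  0.0,  0.0, -0.1, 0.0]), # yaw left
    "RIGHT": np.array([0.0, 0.0,  0.0,  0.0,  0.1, 0.0]), # yaw right
}

def actions_to_trajectory(actions, speed=0.2):
    """Accumulate per-frame action deltas into a camera pose trajectory."""
    pose = np.zeros(6)
    trajectory = [pose.copy()]
    for a in actions:
        pose = pose + speed * ACTION_TO_DELTA[a]
        trajectory.append(pose.copy())
    return np.stack(trajectory)

traj = actions_to_trajectory(["W", "W", "RIGHT", "W"])
```

A trajectory like this could then serve as the continuous conditioning signal for the video model, in place of raw keystrokes.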
This paper employs the following methods:
- Hybrid history conditioning
- Model distillation
- Action representation
- Autoregressive video extension
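Autoregressive extension with hybrid history conditioning can be sketched as a loop that generates one chunk at a time, each step conditioning on a variable amount of history. The coin-flip schedule and the `generate_chunk` stand-in below are assumptions for illustration, not the paper's actual model:

```python
import random

def generate_chunk(history, action, chunk_len=4):
    # Stand-in for the diffusion model: each "frame" is just a tag recording
    # the action that produced it. A real model would denoise video latents
    # conditioned on the history frames and the action/camera signal.
    return [f"{action}@{len(history) + i}" for i in range(chunk_len)]

def extend_video(first_frame, actions, chunk_len=4, seed=0):
    """Autoregressively extend a video chunk by chunk.

    Hybrid history conditioning: each step conditions on either the single
    last frame or the full previous chunk, chosen at random here (an
    assumed schedule; the paper's exact mixing strategy is not reproduced).
    """
    rng = random.Random(seed)
    video = [first_frame]
    for action in actions:
        if rng.random() < 0.5:
            history = video[-1:]          # condition on the last frame only
        else:
            history = video[-chunk_len:]  # condition on the previous clip
        video.extend(generate_chunk(history, action, chunk_len))
    return video
```

Mixing short and long history during training is what lets the model both start from a single image and stay consistent over long rollouts.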
The following evaluation metrics were used in this research:
- Fréchet Video Distance (FVD)
- Relative pose error (RPE-trans and RPE-rot)
- Image quality
- Aesthetic score
- Temporal consistency
- Dynamic average
- Significant improvements over existing models in generation quality, dynamic capability, control accuracy, and temporal consistency.
- Achieves up to 20× speedup in inference time, reaching near real-time rendering rates.
- Demonstrated high user satisfaction in qualitative evaluations.
- Number of GPUs: 192
- GPU Type: None specified
- Compute Requirements: The first phase trains for 30,000 iterations at a learning rate of 3 × 10⁻⁵; the second phase trains for a further 20,000 iterations at a learning rate of 1 × 10⁻⁵.