Figure 1. Hunyuan-GameCraft can create high-dynamic interactive game video content from a single image and a corresponding prompt. We simulate a series of action signals. The left and right frames depict key moments from game video sequences generated in response to different inputs. Hunyuan-GameCraft accurately produces content aligned with each interaction, supports long-term video generation with temporal and 3D consistency, and effectively preserves historical scene information throughout the sequence. In this case, W, A, S, D denote translational movement and ↑, ←, ↓, → denote changes in view angle.
Hunyuan-GameCraft is a novel framework for high-dynamic interactive game video generation that uses a hybrid history conditioning strategy to improve user interaction and long-term consistency in gameplay footage. Built upon the text-to-video foundation model Hunyuan-Video, it integrates standard game control actions into a unified camera representation, enabling smooth user-driven interaction. A hybrid history-conditioned training strategy maintains the fidelity of extended video sequences, while model distillation reduces computational overhead for near-real-time use. Extensive experiments demonstrate superiority over existing models in generation quality, interactive capability, and computational efficiency. The paper also details its architectural innovations in camera control and long-video generation, and shows that training on both curated game scenes and synthetic data yields robust performance across diverse contexts.
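The "unified camera representation" maps discrete game controls onto continuous camera motion. The paper's exact parameterisation is not reproduced here; the sketch below is a minimal illustration, assuming a 6-DoF pose delta (translation plus Euler-angle rotation) and hypothetical per-action step sizes:

```python
import numpy as np

# Hypothetical mapping from action keys to a 6-DoF camera delta:
# (dx, dy, dz, pitch, yaw, roll). Translations in arbitrary units,
# rotations in radians. Values are illustrative, not from the paper.
ACTION_TO_DELTA = {
    "W":     np.array([0.0, 0.0,  1.0, 0.0,  0.0, 0.0]),  # move forward
    "S":     np.array([0.0, 0.0, -1.0, 0.0,  0.0, 0.0]),  # move backward
    "A":     np.array([-1.0, 0.0, 0.0, 0.0,  0.0, 0.0]),  # strafe left
    "D":     np.array([1.0, 0.0,  0.0, 0.0,  0.0, 0.0]),  # strafe right
    "UP":    np.array([0.0, 0.0,  0.0, -0.1, 0.0, 0.0]),  # pitch up
    "DOWN":  np.array([0.0, 0.0,  0.0,  0.1, 0.0, 0.0]),  # pitch down
    "LEFT":  np.array([0.0, 0.0,  0.0,  0.0, -0.1, 0.0]), # yaw left
    "RIGHT": np.array([0.0, 0.0,  0.0,  0.0,  0.1, 0.0]), # yaw right
}

def actions_to_trajectory(actions, speed=0.2):
    """Accumulate per-frame action deltas into a camera pose trajectory."""
    pose = np.zeros(6)
    trajectory = [pose.copy()]
    for a in actions:
        pose = pose + speed * ACTION_TO_DELTA[a]
        trajectory.append(pose.copy())
    return np.stack(trajectory)

traj = actions_to_trajectory(["W", "W", "RIGHT", "W"])
```

A trajectory like this could then serve as the continuous conditioning signal for the video model, in place of raw keystrokes.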
This paper employs the following methods:
- Hybrid history conditioning
- Model distillation
- Action representation
- Autoregressive video extension
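Autoregressive extension with hybrid history conditioning can be sketched as a loop that generates one chunk at a time, each step conditioning on a variable amount of history. The coin-flip schedule and the `generate_chunk` stand-in below are assumptions for illustration, not the paper's actual model:

```python
import random

def generate_chunk(history, action, chunk_len=4):
    # Stand-in for the diffusion model: each "frame" is just a tag recording
    # the action that produced it. A real model would denoise video latents
    # conditioned on the history frames and the action/camera signal.
    return [f"{action}@{len(history) + i}" for i in range(chunk_len)]

def extend_video(first_frame, actions, chunk_len=4, seed=0):
    """Autoregressively extend a video chunk by chunk.

    Hybrid history conditioning: each step conditions on either the single
    last frame or the full previous chunk, chosen at random here (an
    assumed schedule; the paper's exact mixing strategy is not reproduced).
    """
    rng = random.Random(seed)
    video = [first_frame]
    for action in actions:
        if rng.random() < 0.5:
            history = video[-1:]          # condition on the last frame only
        else:
            history = video[-chunk_len:]  # condition on the previous clip
        video.extend(generate_chunk(history, action, chunk_len))
    return video
```

Mixing short and long history during training is what lets the model both start from a single image and stay consistent over long rollouts.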
The following evaluation metrics were used in this research:
- Fréchet Video Distance (FVD)
- Relative pose error (RPE-trans and RPE-rot)
- Image quality
- Aesthetic score
- Temporal consistency
- Dynamic average
- Significant improvements over existing models in generation quality, dynamic capability, control accuracy, and temporal consistency.
- Achieves up to 20× speedup in inference time, reaching near real-time rendering rates.
- Demonstrated high user satisfaction in qualitative evaluations.
- Number of GPUs: 192
- GPU Type: None specified
- Compute Requirements: The first phase trains for 30,000 iterations at a learning rate of 3 × 10⁻⁵; the second phase trains for a further 20,000 iterations at a learning rate of 1 × 10⁻⁵.