DreamCube: 3D Panorama Generation via Multi-plane Synchronization

(2025)

Paper Information
arXiv ID: 2506.17206

Abstract

Figure 1. In this work, we introduce Multi-plane Synchronization to generalize 2D diffusion models to multi-plane omnidirectional representations (i.e., cubemaps), and DreamCube for RGB-D cubemap generation. The proposed approaches can be applied to different tasks including RGB-D panorama generation, panorama depth estimation, and 3D scene generation.
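A cubemap stores an omnidirectional view as six square faces, each covering a 90° field of view. The sketch below illustrates how a pixel on a face can be mapped to a unit viewing ray, to longitude/latitude on the panorama sphere, and (given depth) to a 3D point. The axis convention (x right, y up, z forward), the face names, the helper names, and the choice to treat depth as distance along the ray are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

# Assumed convention: x = right, y = up, z = forward.
# Each face is described by (forward, right, up) unit vectors.
FACES = {
    "front": ((0, 0, 1), (1, 0, 0), (0, 1, 0)),
    "right": ((1, 0, 0), (0, 0, -1), (0, 1, 0)),
    "back": ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "left": ((-1, 0, 0), (0, 0, 1), (0, 1, 0)),
    "up": ((0, 1, 0), (1, 0, 0), (0, 0, -1)),
    "down": ((0, -1, 0), (1, 0, 0), (0, 0, 1)),
}

def face_rays(face, size):
    """Unit viewing rays, shape (size, size, 3), for the pixel centers of one face."""
    forward, right, up = (np.array(v, dtype=np.float64) for v in FACES[face])
    # Pixel centers in [-1, 1]; a 90-degree FoV means tan(45 deg) = 1 at the edge.
    coords = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(coords, -coords)  # row 0 is the top of the image
    rays = forward + u[..., None] * right + v[..., None] * up
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def rays_to_lonlat(rays):
    """Longitude in (-pi, pi] and latitude in [-pi/2, pi/2] for unit rays."""
    lon = np.arctan2(rays[..., 0], rays[..., 2])
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))
    return lon, lat

def unproject(face, depth):
    """3D points for one face, assuming depth is the distance along each ray."""
    depth = np.asarray(depth, dtype=np.float64)
    return face_rays(face, depth.shape[0]) * depth[..., None]
```

Calling `unproject` on all six faces and concatenating the results would give a point cloud of the scene around the camera, under the same assumptions.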

Summary

The paper presents DreamCube, a framework that generates RGB-D cubemaps from single-view inputs using a technique called Multi-plane Synchronization. The technique addresses the difficulties that existing 2D diffusion models face when applied to multi-plane panoramic representations: it adapts their spatial operators to maintain translation equivariance, so that multiple views join seamlessly without relying on overlapping fields of view (FoV), which can degrade image quality. The contributions include an analysis of the limitations of existing methods and a synchronized generation approach that improves the quality of generated RGB-D scenes. Experiments cover RGB-D panorama generation, panorama depth estimation, and 3D scene reconstruction, and the authors report better performance than existing models.

Methods

This paper employs the following methods:

  • Multi-plane Synchronization
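To make the idea of synchronizing an operator across planes concrete, the sketch below uses group normalization, a layer found in the Stable Diffusion U-Net. Normalizing each cubemap face on its own removes brightness differences between faces, whereas computing the statistics jointly over all six faces preserves them. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the learned scale and bias are omitted, and the code is not the authors' implementation.

```python
import numpy as np

def group_norm_per_face(x, num_groups, eps=1e-5):
    """Baseline: each face is normalized with its own statistics.

    x has shape (faces, channels, height, width).
    """
    f, c, h, w = x.shape
    g = x.reshape(f, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(f, c, h, w)

def group_norm_synced(x, num_groups, eps=1e-5):
    """Synchronized variant: statistics are shared across all faces."""
    f, c, h, w = x.shape
    g = x.reshape(f, num_groups, c // num_groups, h, w)
    # Reducing over axis 0 as well makes every face use the same mean and variance.
    mean = g.mean(axis=(0, 2, 3, 4), keepdims=True)
    var = g.var(axis=(0, 2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(f, c, h, w)

# Six faces whose features differ by a constant offset per face.
rng = np.random.default_rng(0)
faces = rng.normal(size=(6, 8, 16, 16)) + np.arange(6).reshape(6, 1, 1, 1)

per_face = group_norm_per_face(faces, num_groups=4)
synced = group_norm_synced(faces, num_groups=4)

# Per-face means after each variant: the baseline flattens them to zero,
# the synchronized variant keeps the relative offsets between faces.
per_face_means = per_face.mean(axis=(1, 2, 3))
synced_means = synced.mean(axis=(1, 2, 3))
```

The design point is that any operator whose output depends on statistics or context from a single plane can produce mismatches at face boundaries; sharing that context across planes is one way to avoid them.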

Models Used

  • DreamCube
  • Stable Diffusion v2

Datasets

The following datasets were used in this research:

  • Structured3D
  • SUN360

Evaluation Metrics

  • FID
  • IS
  • δ-1.25
  • AbsRel
  • RMSE
  • MAE
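The four depth metrics in this list have standard per-pixel definitions, sketched below in NumPy. FID and IS are image-quality metrics that require a pretrained Inception network and are not sketched here. The function name, the dictionary keys, and the rule of masking out pixels with non-positive ground truth are assumptions for illustration; the paper's exact evaluation protocol (for example, any scale alignment before scoring) is not described in this summary.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Standard depth-estimation metrics over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > eps  # assumed rule: ignore pixels without ground-truth depth
    p, g = pred[valid], gt[valid]
    err = p - g
    # Symmetric ratio max(pred/gt, gt/pred); the clamp only guards the division.
    ratio = np.maximum(p / g, g / np.maximum(p, eps))
    return {
        "delta_1.25": float(np.mean(ratio < 1.25)),  # higher is better
        "abs_rel": float(np.mean(np.abs(err) / g)),  # lower is better
        "rmse": float(np.sqrt(np.mean(err ** 2))),   # lower is better
        "mae": float(np.mean(np.abs(err))),          # lower is better
    }

# Example call on a tiny depth map; the zero ground-truth pixel is masked out.
metrics = depth_metrics(pred=[1.0, 2.2, 6.0, 3.0], gt=[1.0, 2.0, 4.0, 0.0])
```

δ-1.25 is an accuracy (the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth), while AbsRel, RMSE, and MAE are error measures.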

Results

  • Improved RGB-D panorama generation
  • Enhanced depth estimation accuracy
  • Effective 3D scene reconstruction

Limitations

The authors identified the following limitations:

  • High computational cost
  • Restricted input conditions

Technical Requirements

  • Number of GPUs: 4
  • GPU Type: Nvidia L40S
  • Compute Requirements: batch size of 4; RGB images and depth maps at 512 × 512 resolution; training took approximately two days on four Nvidia L40S GPUs.

External Resources