Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)
This paper introduces Depth Anything V2, a foundation model for monocular depth estimation (MDE) that targets robust, fine-grained depth prediction in complex scenes. To achieve this, the authors propose a training strategy that first trains on synthetic images with precise labels and then scales the model's capabilities through large-scale pseudo-labeled real images. They critically analyze the limitations of conventional real-image datasets, such as label noise and lack of detail, and argue that synthetic data is superior in these respects. The paper also presents a new evaluation benchmark, DA-2K, designed to overcome the shortcomings of existing benchmarks by providing high-resolution images with precise sparse depth annotations. The results demonstrate significant improvements over previous models, confirming the effectiveness of the proposed training strategy and the importance of integrating large-scale unlabeled real data into MDE.
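A minimal sketch of the three-stage flow described above (train a teacher on synthetic labels, pseudo-label unlabeled real images, train a student on the pseudo-labels), assuming a PyTorch setup. `TinyDepthNet`, `train_step`, the L1 loss, and all data here are hypothetical stand-ins for illustration, not the paper's actual architecture, losses, or datasets:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a dense depth regressor; the real model is a
# large encoder-decoder network, but any dense regressor shows the flow.
class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, opt, images, targets, loss_fn):
    # One supervised optimization step on a batch.
    opt.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    opt.step()
    return loss.item()

loss_fn = nn.L1Loss()  # placeholder; the paper's training losses differ

# Stage 1: train a teacher on synthetic images with precise depth labels.
teacher = TinyDepthNet()
t_opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
synthetic_imgs = torch.rand(8, 3, 64, 64)   # dummy synthetic batch
synthetic_depth = torch.rand(8, 1, 64, 64)  # dummy precise labels
train_step(teacher, t_opt, synthetic_imgs, synthetic_depth, loss_fn)

# Stage 2: the frozen teacher pseudo-labels unlabeled real images.
real_imgs = torch.rand(8, 3, 64, 64)  # dummy unlabeled real batch
with torch.no_grad():
    pseudo_depth = teacher(real_imgs)

# Stage 3: train the student on the pseudo-labeled real images.
student = TinyDepthNet()
s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
train_step(student, s_opt, real_imgs, pseudo_depth, loss_fn)
```

The design point the paper argues is that stage 2's pseudo-labels are more reliable supervision than noisy real annotations, which is why stage 3 can train on real images without using their original labels.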
This paper employs the following methods:
- Training a capable teacher model on synthetic images with precise depth labels
- Using the teacher to pseudo-label large-scale unlabeled real images
- Training student models on the pseudo-labeled real images
The following datasets were used in this research:
- Synthetic image datasets with precise depth labels (for teacher training)
- Large-scale unlabeled real images (pseudo-labeled by the teacher)
- DA-2K, the proposed benchmark of high-resolution images with sparse depth annotations (for evaluation)
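If DA-2K's "precise sparse depth annotations" take the form of annotated point pairs with a relative depth ordering (a common design for sparse relative-depth benchmarks, and an assumption here), evaluation reduces to pairwise ordering accuracy. A hedged sketch, with `pairwise_ordering_accuracy` and the tensor layout purely illustrative:

```python
import torch

def pairwise_ordering_accuracy(pred_depth, pairs, labels):
    """Fraction of annotated point pairs whose predicted depth ordering
    matches the annotation.

    pred_depth: (H, W) predicted depth map.
    pairs:      (N, 4) integer tensor of (y1, x1, y2, x2) point pairs.
    labels:     (N,) tensor, 1 if point 1 is closer than point 2, else 0.
    """
    d1 = pred_depth[pairs[:, 0], pairs[:, 1]]
    d2 = pred_depth[pairs[:, 2], pairs[:, 3]]
    pred_closer = (d1 < d2).long()  # assumes smaller value = closer
    return (pred_closer == labels).float().mean().item()

# Toy usage with random data.
depth = torch.rand(128, 128)
pairs = torch.randint(0, 128, (10, 4))
labels = torch.randint(0, 2, (10,))
print(pairwise_ordering_accuracy(depth, pairs, labels))
```

Because monocular predictions are typically only defined up to scale and shift, an ordering-based metric like this sidesteps absolute-depth calibration entirely.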
The authors identified the following limitations:
- Conventional real-image depth datasets suffer from label noise and a lack of fine-grained detail
- Existing evaluation benchmarks inherit these labeling problems, which motivated the construction of DA-2K