Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (TikTok, ZJU, 2024)
This paper introduces Depth Anything, a foundation model for monocular depth estimation (MDE) that produces accurate depth from any image under diverse conditions by leveraging large-scale unlabeled data. The authors advocate monocular unlabeled images because they are cheap to collect and cover a broad range of scenes. The model is trained with a self-training setup, in which unlabeled images receive pseudo labels from a pre-trained MDE teacher, jointly with supervised learning on labeled images. Two ingredients make the unlabeled data effective: challenging the student with harder optimization targets by strongly perturbing the pseudo-labeled images, and inheriting semantic priors from a frozen pre-trained encoder through feature alignment. Empirical results show that Depth Anything significantly outperforms MiDaS and other models in zero-shot evaluation across diverse datasets, and that it also serves as a strong backbone for metric depth estimation and semantic segmentation.
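The training recipe summarized above (teacher pseudo-labels, a student trained on perturbed unlabeled images jointly with labeled data, and a semantic feature-alignment term) can be illustrated with a minimal PyTorch-style step. This is a hedged sketch, not the authors' implementation: the `(depth, features)` output of `student`, the names `teacher`, `frozen_encoder`, and `strong_aug`, the tolerance margin value, and the unweighted sum of the three loss terms are all assumptions for illustration.

```python
# Hedged sketch of one joint training step, assuming PyTorch.
import torch
import torch.nn.functional as F


def affine_invariant_loss(pred, target, eps=1e-6):
    """MiDaS-style scale-and-shift-invariant loss: normalize each depth map by
    its median and mean absolute deviation before comparing."""
    def normalize(d):
        d = d.flatten(1)                                   # (B, H*W)
        median = d.median(dim=1, keepdim=True).values
        scale = (d - median).abs().mean(dim=1, keepdim=True).clamp_min(eps)
        return (d - median) / scale
    return F.l1_loss(normalize(pred), normalize(target))


def feature_alignment_loss(student_feat, frozen_feat, alpha=0.85):
    """Cosine alignment of student features to a frozen semantic encoder.
    Pixels whose similarity already exceeds the tolerance margin `alpha`
    are ignored (the margin value here is an assumption)."""
    cos = F.cosine_similarity(student_feat, frozen_feat, dim=1)  # (B, h, w)
    mask = (cos < alpha).float()
    return ((1.0 - cos) * mask).sum() / mask.sum().clamp_min(1.0)


def train_step(student, teacher, frozen_encoder, strong_aug, opt, labeled, unlabeled):
    """Supervised loss on labeled images + pseudo-label loss on strongly
    perturbed unlabeled images + semantic feature alignment."""
    img_l, depth_l = labeled
    img_u = unlabeled

    # 1) A frozen pre-trained MDE teacher pseudo-labels the clean unlabeled images.
    with torch.no_grad():
        pseudo_depth = teacher(img_u)

    # 2) The student only sees a strongly perturbed view, which makes the
    #    pseudo-label targets harder to fit (photometric perturbation assumed
    #    here, so the clean-image pseudo labels stay geometrically valid).
    img_u_aug = strong_aug(img_u)

    pred_l, _ = student(img_l)
    pred_u, feat_u = student(img_u_aug)

    # 3) A frozen semantic encoder (e.g. DINOv2) provides the alignment target.
    with torch.no_grad():
        sem_feat = frozen_encoder(img_u_aug)

    loss = (affine_invariant_loss(pred_l, depth_l)
            + affine_invariant_loss(pred_u, pseudo_depth)
            + feature_alignment_loss(feat_u, sem_feat))

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.detach()
```

Spatial perturbations such as CutMix would additionally require mixing the pseudo-depth targets with the same boxes; a sketch of that is given after the methods list below.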
This paper employs the following methods: self-training, in which a pre-trained MDE teacher generates pseudo depth labels for unlabeled images; joint training of the student on labeled and pseudo-labeled data; strong perturbations of the unlabeled images (color distortion and CutMix) to create harder optimization targets; and a feature-alignment loss that transfers semantic priors from a frozen pre-trained encoder (DINOv2), as sketched below.
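As a concrete example of how the harder optimization targets can be constructed, the sketch below applies CutMix jointly to a batch of unlabeled images and their teacher-produced pseudo-depth maps so the perturbed input and its target stay consistent. This is a minimal illustration assuming PyTorch; the function name, the shared-box sampling, and the Beta parameter are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of CutMix over (image, pseudo-depth) pairs.
import torch


def cutmix_with_pseudo_depth(images, pseudo_depth, beta=0.5):
    """Paste a random rectangle from a shuffled copy of the batch into every
    image, and apply the same rectangle to the pseudo-depth maps."""
    b, _, h, w = images.shape            # images: (B, 3, H, W), pseudo_depth: (B, H, W)
    perm = torch.randperm(b)

    # Sample one box for the whole batch (kept simple for the sketch).
    lam = torch.distributions.Beta(beta, beta).sample().item()
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(0, h, (1,)).item(), torch.randint(0, w, (1,)).item()
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed_img, mixed_depth = images.clone(), pseudo_depth.clone()
    mixed_img[:, :, y0:y1, x0:x1] = images[perm, :, y0:y1, x0:x1]
    mixed_depth[:, y0:y1, x0:x1] = pseudo_depth[perm, y0:y1, x0:x1]
    return mixed_img, mixed_depth
```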
The following datasets were used in this research: roughly 1.5M labeled images drawn from six public depth datasets and about 62M unlabeled images collected from eight public datasets (including SA-1B, Open Images, BDD100K, Google Landmarks, ImageNet-21K, LSUN, Objects365, and Places365); zero-shot relative depth estimation is evaluated on six unseen datasets: KITTI, NYUv2, Sintel, DDAD, ETH3D, and DIODE.
The authors identified the following limitations: