
Depth Anything V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)

Paper Information
arXiv ID: 2406.09414
Venue: Neural Information Processing Systems
Domain: Not specified

Abstract

Figure 1: Depth Anything V2 significantly outperforms V1 [89] in robustness and fine-grained details. Compared with SD-based models [31, 25], it enjoys faster inference speed, fewer parameters, and higher depth accuracy.

Summary

This paper introduces Depth Anything V2, a foundation model for monocular depth estimation (MDE) that aims to produce robust and fine-grained depth predictions in complex scenes. To achieve this, the authors train a capable teacher model on synthetic images with precise labels and then transfer its capability to student models through large-scale pseudo-labeled real images. They critically analyze the limitations of conventional real datasets, such as label noise and missing detail, arguing that synthetic data provides better supervision in these respects. The paper also presents a new evaluation benchmark, DA-2K, designed to overcome the limitations of existing benchmarks by providing high-resolution images with precise sparse depth annotations. The results show significant improvements over previous models, confirming the effectiveness of the proposed training strategy and the importance of integrating large-scale unlabeled real data in MDE.
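To make the training recipe concrete, the sketch below illustrates the teacher-student pipeline in PyTorch-style code: a teacher trained on synthetic data pseudo-labels unlabeled real images, and a student is then trained on those pseudo-labels. The network, loss, and data here are simplified placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the DINOv2 encoder + DPT decoder used in the paper
# (hypothetical architecture, kept small so the sketch runs anywhere).
class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x).squeeze(1)  # (B, H, W) relative depth

def affine_invariant_loss(pred, target, eps=1e-6):
    # Scale-and-shift-invariant comparison in the spirit of MiDaS-style
    # training: normalize each map by its median and mean absolute deviation.
    # A simplified stand-in for the paper's actual loss terms.
    def norm(d):
        t = d.flatten(1)
        med = t.median(dim=1, keepdim=True).values
        s = (t - med).abs().mean(dim=1, keepdim=True) + eps
        return (t - med) / s
    return (norm(pred) - norm(target)).abs().mean()

# Stage 1: the teacher would be trained on synthetic images with exact labels
# (training loop omitted; here it is simply randomly initialized).
teacher = TinyDepthNet().eval()

# Stage 2: pseudo-label unlabeled real images with the frozen teacher.
real_images = torch.rand(4, 3, 64, 64)            # stand-in for real photos
with torch.no_grad():
    pseudo_depth = teacher(real_images)

# Stage 3: distill into a student by supervising it with the pseudo-labels.
student = TinyDepthNet()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
loss = affine_invariant_loss(student(real_images), pseudo_depth)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```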

Methods

This paper employs the following methods:

  • Monocular Depth Estimation (MDE)
  • Discriminative Models
  • Generative Models
  • Knowledge Distillation

Models Used

  • Depth Anything V1
  • DINOv2
  • Marigold
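For reference, released Depth Anything V2 checkpoints can typically be run through the Hugging Face transformers depth-estimation pipeline. The model identifier below is an assumption and should be checked against the official release; the rest uses only the standard pipeline API.

```python
from PIL import Image
import numpy as np
from transformers import pipeline

# Depth-estimation pipeline; the checkpoint id is an assumption (check the
# official release for the exact identifier and available model sizes).
depth = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

image = Image.open("example.jpg")       # any RGB image
result = depth(image)                   # {"predicted_depth": tensor, "depth": PIL image}
depth_map = np.array(result["depth"])   # 8-bit visualization of relative depth
print(depth_map.shape, depth_map.dtype)
```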

Datasets

The following datasets were used in this research:

  • Hypersim
  • Virtual KITTI
  • ImageNet-21K
  • Objects365
  • Open Images V7
  • Places365
  • BDD100K
  • Google Landmarks
  • SA-1B
  • DIML

Evaluation Metrics

  • AbsRel
  • δ1
  • F1-score
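AbsRel and δ1 are the standard depth metrics: AbsRel is the mean absolute relative error |pred − gt| / gt, and δ1 is the fraction of pixels whose prediction/ground-truth ratio (in either direction) is below 1.25. A minimal NumPy sketch, assuming predictions have already been scale-and-shift aligned to the ground truth:

```python
import numpy as np

def absrel_and_delta1(pred, gt, valid=None):
    # AbsRel = mean(|pred - gt| / gt); delta1 = fraction of pixels with
    # max(pred/gt, gt/pred) < 1.25, both over valid ground-truth pixels.
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0
    p, g = pred[valid], gt[valid]
    absrel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return absrel, delta1

# Toy example: predictions within +/-10% of ground truth.
gt = np.random.uniform(0.5, 10.0, size=(480, 640))
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(absrel_and_delta1(pred, gt))
```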

Results

  • Depth Anything V2 outperforms previous models in depth accuracy and inference speed.
  • Achieved a competitive score of 83.6% in the Transparent Surface Challenge.
  • Significantly better performance on the proposed DA-2K evaluation benchmark compared to other models.
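DA-2K is scored with sparse point pairs: for each annotated pair, a prediction counts as correct if it orders the two points' depths the same way as the human label. The sketch below computes that pairwise accuracy; the pair format and the larger-means-closer convention are illustrative assumptions, not the benchmark's actual file layout.

```python
import numpy as np

def pairwise_accuracy(depth_map, pairs):
    # pairs: list of ((y1, x1), (y2, x2), closer) where closer is 0 if the
    # first point is nearer to the camera, 1 if the second is. Assumes the
    # depth_map stores relative depth where larger values mean closer
    # (disparity-like); flip the comparison for metric depth.
    correct = 0
    for (y1, x1), (y2, x2), closer in pairs:
        pred_closer = 0 if depth_map[y1, x1] > depth_map[y2, x2] else 1
        correct += int(pred_closer == closer)
    return correct / len(pairs)

# Toy example: a smoothly varying depth map and two annotated pairs.
depth_map = np.linspace(1.0, 0.0, 100 * 100).reshape(100, 100)
pairs = [((10, 10), (90, 90), 0), ((50, 50), (5, 5), 1)]
print(pairwise_accuracy(depth_map, pairs))  # 1.0 for this toy case
```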

Limitations

The authors identified the following limitations:

  • Heavy computational burden due to the use of 62M unlabeled images.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
