
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao (2024)

Paper Information

  • arXiv ID: 2401.10891
  • Venue: Computer Vision and Pattern Recognition (CVPR)
  • Domain: Computer Vision, Deep Learning, Semi-Supervised Learning
  • SOTA Claim: Yes
  • Reproducibility: 8/10

Abstract

Project page: https://depth-anything.github.io

Figure 1 (caption): Our model exhibits impressive generalization ability across extensive unseen scenes. Left two columns: COCO [36]. Middle two: SA-1B [27] (a hold-out unseen set). Right two: photos captured by ourselves. Our model works robustly in low-light environments (1st and 3rd columns), complex scenes (2nd and 5th columns), foggy weather (5th column), and at ultra-remote distances (5th and 6th columns).

Summary

This paper introduces Depth Anything, a foundation model for monocular depth estimation (MDE) that produces robust relative depth for any image under diverse conditions by leveraging large-scale unlabeled data. The authors advocate monocular unlabeled images because they are cheap to collect and cover a broad range of scenes. Training combines supervised learning on labeled images with a self-training setup in which a pre-trained teacher MDE model assigns pseudo labels to the unlabeled images. Two ingredients make the unlabeled data effective: the student is challenged with harder optimization targets by applying strong perturbations (color distortions and CutMix-style spatial mixing) to the pseudo-labeled images, and rich semantic priors are inherited from a frozen pre-trained encoder (DINOv2) through a feature alignment loss. Empirical results show that Depth Anything significantly outperforms MiDaS and other models in zero-shot evaluation across various datasets, and that it transfers effectively to both relative and metric depth estimation as well as semantic segmentation.
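
To make the training recipe concrete, below is a minimal sketch of the three loss terms described above: supervised regression on labeled data, pseudo-labeled self-training on a strongly perturbed unlabeled image, and tolerance-masked feature alignment to a frozen semantic encoder. It assumes PyTorch; `TinyDepthNet`, the noise perturbation, and the margin value are illustrative stand-ins, not the authors' architecture or exact hyperparameters.

```python
# Minimal sketch of the joint training objective, assuming PyTorch.
# TinyDepthNet, the noise perturbation, and the margin are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDepthNet(nn.Module):
    """Stand-in for an MDE network: RGB image -> (1-channel depth map, features)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        feat = F.relu(self.backbone(x))
        return self.head(feat), feat

def affine_invariant_loss(pred, target, eps=1e-6):
    """Scale-and-shift-invariant regression loss commonly used for relative depth."""
    def norm(d):
        t = d.flatten(1)
        med = t.median(dim=1, keepdim=True).values
        scale = (t - med).abs().mean(dim=1, keepdim=True) + eps
        return (t - med) / scale
    return (norm(pred) - norm(target)).abs().mean()

teacher = TinyDepthNet().eval()                        # stand-in for the pretrained teacher
student = TinyDepthNet()                               # student being trained
semantic_enc = nn.Conv2d(3, 16, 3, padding=1).eval()   # stand-in for a frozen DINOv2-like encoder
for p in list(teacher.parameters()) + list(semantic_enc.parameters()):
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
margin = 0.15   # tolerance: pixels already well aligned are excluded (illustrative value)

labeled_img, gt_depth = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
unlabeled_img = torch.rand(2, 3, 64, 64)

# 1) Supervised loss on labeled data.
pred_l, _ = student(labeled_img)
loss_sup = affine_invariant_loss(pred_l, gt_depth)

# 2) Self-training: pseudo labels come from the teacher on the clean image, while the
#    student sees a strongly perturbed version (noise stands in for color jitter / CutMix).
with torch.no_grad():
    pseudo, _ = teacher(unlabeled_img)
perturbed = (unlabeled_img + 0.1 * torch.randn_like(unlabeled_img)).clamp(0, 1)
pred_u, feat_u = student(perturbed)
loss_unsup = affine_invariant_loss(pred_u, pseudo)

# 3) Feature alignment: pull student features toward the frozen semantic encoder, but
#    only where cosine similarity is below the tolerance threshold, so the student can
#    still deviate on depth-specific details.
with torch.no_grad():
    sem_feat = F.relu(semantic_enc(unlabeled_img))
cos = F.cosine_similarity(feat_u, sem_feat, dim=1)
mask = (cos < 1 - margin).float()
loss_align = ((1 - cos) * mask).sum() / mask.sum().clamp(min=1)

loss = loss_sup + loss_unsup + loss_align
opt.zero_grad()
loss.backward()
opt.step()
```

The sketch only mirrors the structure of the objective; in the paper the strong perturbations are color distortions and CutMix, and the frozen encoder is DINOv2.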

Methods

This paper employs the following methods:

  • Self-training
  • Feature alignment
  • Challenging optimization targets (strong perturbations on unlabeled inputs; see the sketch after this list)
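
One of the strong perturbations behind the third item is a CutMix-style spatial mixing of two unlabeled images. Below is a hedged sketch assuming PyTorch; `cutmix_unlabeled` and the fixed box fraction are illustrative choices, not the authors' implementation.

```python
# Hedged sketch of CutMix-style spatial mixing used as a "challenging optimization
# target" on unlabeled images. Assumes PyTorch; names and box size are illustrative.
import torch

def cutmix_unlabeled(img_a, img_b, box_frac=0.5):
    """Paste a random rectangle from img_b into img_a and return the mixed image
    together with the binary mask marking the pasted region."""
    _, _, h, w = img_a.shape
    bh, bw = int(h * box_frac), int(w * box_frac)
    y = torch.randint(0, h - bh + 1, (1,)).item()
    x = torch.randint(0, w - bw + 1, (1,)).item()
    mask = torch.zeros(1, 1, h, w)
    mask[:, :, y:y + bh, x:x + bw] = 1.0
    mixed = img_a * (1 - mask) + img_b * mask
    return mixed, mask

# Toy usage: the student sees the mixed image; the unsupervised loss is then computed
# region-wise against the teacher's pseudo labels for img_a (outside the box) and
# img_b (inside the box).
img_a, img_b = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
mixed, mask = cutmix_unlabeled(img_a, img_b)
```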

Models Used

  • MiDaS
  • DINOv2
  • ZoeDepth

Datasets

The following datasets were used in this research:

  • SA-1B
  • Open Images
  • BDD100K

Evaluation Metrics

  • AbsRel
  • δ1
  • mIoU
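
The two relative-depth metrics above (AbsRel and δ1) are typically computed after aligning the prediction to the ground truth in scale and shift. Below is a minimal sketch assuming NumPy; `align_scale_shift` and the toy data are illustrative, not the evaluation code used in the paper.

```python
# Minimal sketch of the zero-shot relative-depth metrics AbsRel and δ1, assuming NumPy.
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares fit of s, t so that s * pred + t best matches gt."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, gt.ravel(), rcond=None)[0]
    return s * pred + t

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return np.mean(np.abs(pred - gt) / gt)

def delta1(pred, gt, thresh=1.25):
    """δ1: fraction of pixels where max(pred/gt, gt/pred) < 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < thresh)

# Toy usage with random positive depths.
gt = np.random.uniform(0.5, 10.0, size=(240, 320))
pred = align_scale_shift(0.7 * gt + 0.3 + 0.05 * np.random.randn(240, 320), gt)
pred = np.clip(pred, 1e-3, None)  # keep depths positive before computing ratios
print(f"AbsRel: {abs_rel(pred, gt):.4f}  delta1: {delta1(pred, gt):.4f}")
```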

Results

  • Depth Anything surpasses MiDaS in zero-shot relative depth estimation.
  • The model generalizes strongly to unseen datasets.
  • Improvements are reported on the relative depth metrics AbsRel and δ1.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Depth Estimation, Unlabeled Data, Foundation Models, Self-Training, Semantic Priors, Vision Transformers

Papers Using Similar Methods

External Resources