Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. Meta AI Research; Inria, Université Paris Saclay (2023)
DINOv2 is a framework for learning robust visual features without supervision, transferring the foundation-model recipe that has proven successful in natural language processing to vision. The authors build a large, curated dataset from uncurated images using an automatic data curation pipeline that increases diversity and quality, and use it to train a Vision Transformer (ViT) with 1 billion parameters. This model is then distilled into smaller variants that achieve state-of-the-art performance on a range of benchmarks. DINOv2 features are competitive with those of weakly-supervised models across tasks including image classification, semantic segmentation, and video classification. The findings support self-supervised pretraining over text-guided approaches while addressing scalability concerns in both data and model size.
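As a usage note, the released distilled checkpoints are available through PyTorch Hub. Below is a minimal feature-extraction sketch using the public facebookresearch/dinov2 entry point; the `dinov2_vits14` model name and its 384-dimensional output follow the official repository, but treat the exact interface as an assumption of this sketch rather than a guarantee.

```python
import torch

# Load one of the distilled DINOv2 backbones from the official hub entry
# (facebookresearch/dinov2); dinov2_vits14 is the smallest released variant.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# DINOv2 uses a patch size of 14, so spatial dimensions should be multiples
# of 14 (224 = 16 * 14). Real inputs should be normalized with the usual
# ImageNet mean/std; a random tensor stands in for an image here.
dummy = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    features = model(dummy)  # CLS-token embedding, shape (1, 384) for ViT-S/14

print(features.shape)
```

The larger variants (`dinov2_vitb14`, `dinov2_vitl14`, `dinov2_vitg14`) load the same way and differ only in embedding dimension and compute cost.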
This paper employs the following methods: discriminative self-distillation for self-supervised pretraining (building on the DINO and iBOT objectives; a loss sketch follows below), an automatic data curation pipeline based on embedding deduplication and retrieval, large-scale ViT pretraining at roughly 1 billion parameters, and knowledge distillation of the large model into smaller variants.
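The self-distillation objective pairs a student network with a teacher whose weights are an exponential moving average of the student's. The following is a minimal, simplified sketch of that loss and the teacher update; the temperature, centering, and momentum values are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.04, center=0.0):
    # Teacher targets: centered and sharpened softmax, with no gradient flow.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # Student prediction: softened log-probabilities.
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    # Cross-entropy between the teacher and student distributions.
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher weights track an exponential moving average of the student's;
    # assumes the two modules share an identical architecture.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```

In practice the loss is applied across multiple image crops (and, via the iBOT objective, to masked patch tokens), which this single-pair sketch omits for brevity.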
The following datasets were used in this research: LVD-142M, a curated collection of 142 million images assembled automatically from uncurated web data, with curated sources such as ImageNet-22k serving as retrieval seeds; standard benchmarks including ImageNet-1k and ADE20K are used for evaluation.
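The curation pipeline retrieves uncurated images that lie close, in embedding space, to images from the curated seed sets (the paper performs this at scale with Faiss-based nearest-neighbor search). Below is a minimal sketch of the retrieval step; the similarity threshold is hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_similar(uncurated_emb, curated_emb, threshold=0.5):
    # L2-normalize so the dot product equals cosine similarity.
    u = F.normalize(uncurated_emb, dim=-1)
    c = F.normalize(curated_emb, dim=-1)
    # For each uncurated image, find its best match among the curated seeds.
    best_sim, _ = (u @ c.T).max(dim=-1)
    # Keep uncurated images that resemble the curated distribution.
    return torch.nonzero(best_sim >= threshold).squeeze(-1)
```

Keeping only retrieved neighbors of curated seeds is what lets the pipeline inherit the diversity of web-scale data while filtering out low-quality or off-distribution images.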
The authors identified the following limitations: