Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. Meta AI Research; Inria, Université Paris Saclay (2023)
DINOv2 is a framework for learning robust visual features without supervision, transferring the foundation-model recipe that has proven successful in natural language processing to vision. The authors build a large, curated dataset from uncurated images using an automatic data curation pipeline that increases diversity and quality, and use it to train a Vision Transformer (ViT) with 1 billion parameters. This model is then distilled into smaller variants that achieve state-of-the-art performance on a range of benchmarks. DINOv2 features are competitive with those of weakly-supervised models across tasks including image classification, semantic segmentation, and video classification. The findings support self-supervised pretraining over text-guided approaches while addressing scalability concerns in both data and model size.
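As a usage note, the released distilled checkpoints are available through PyTorch Hub. Below is a minimal feature-extraction sketch using the public facebookresearch/dinov2 entry point; the `dinov2_vits14` model name and its 384-dimensional output follow the official repository, but treat the exact interface as an assumption of this sketch rather than a guarantee.

```python
import torch

# Load one of the distilled DINOv2 backbones from the official hub entry
# (facebookresearch/dinov2); dinov2_vits14 is the smallest released variant.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# DINOv2 uses a patch size of 14, so spatial dimensions should be multiples
# of 14 (224 = 16 * 14). Real inputs should be normalized with the usual
# ImageNet mean/std; a random tensor stands in for an image here.
dummy = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    features = model(dummy)  # CLS-token embedding, shape (1, 384) for ViT-S/14

print(features.shape)
```

The larger variants (`dinov2_vitb14`, `dinov2_vitl14`, `dinov2_vitg14`) load the same way and differ only in embedding dimension and compute cost.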
This paper employs the following methods: discriminative self-distillation for self-supervised pretraining (building on the DINO and iBOT objectives; a loss sketch follows below), an automatic data curation pipeline based on embedding deduplication and retrieval, large-scale ViT pretraining at roughly 1 billion parameters, and knowledge distillation of the large model into smaller variants.
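The self-distillation objective pairs a student network with a teacher whose weights are an exponential moving average of the student's. The following is a minimal, simplified sketch of that loss and the teacher update; the temperature, centering, and momentum values are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.04, center=0.0):
    # Teacher targets: centered and sharpened softmax, with no gradient flow.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # Student prediction: softened log-probabilities.
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    # Cross-entropy between the teacher and student distributions.
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher weights track an exponential moving average of the student's;
    # assumes the two modules share an identical architecture.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```

In practice the loss is applied across multiple image crops (and, via the iBOT objective, to masked patch tokens), which this single-pair sketch omits for brevity.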
The following datasets were used in this research: LVD-142M, a curated collection of 142 million images assembled automatically from uncurated web data, with curated sources such as ImageNet-22k serving as retrieval seeds; standard benchmarks including ImageNet-1k and ADE20K are used for evaluation.
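The curation pipeline retrieves uncurated images that lie close, in embedding space, to images from the curated seed sets (the paper performs this at scale with Faiss-based nearest-neighbor search). Below is a minimal sketch of the retrieval step; the similarity threshold is hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_similar(uncurated_emb, curated_emb, threshold=0.5):
    # L2-normalize so the dot product equals cosine similarity.
    u = F.normalize(uncurated_emb, dim=-1)
    c = F.normalize(curated_emb, dim=-1)
    # For each uncurated image, find its best match among the curated seeds.
    best_sim, _ = (u @ c.T).max(dim=-1)
    # Keep uncurated images that resemble the curated distribution.
    return torch.nonzero(best_sim >= threshold).squeeze(-1)
```

Keeping only retrieved neighbors of curated seeds is what lets the pipeline inherit the diversity of web-scale data while filtering out low-quality or off-distribution images.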
The authors identified the following limitations: