Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby (2020)
This paper presents the Vision Transformer (ViT), a model that applies the Transformer architecture directly to image classification without relying on convolutional neural networks (CNNs). By treating image patches as a sequence of tokens, akin to words in text processing, ViT achieves competitive results on various benchmarks, particularly when pre-trained on large datasets such as ImageNet-21k and JFT-300M. The authors show that while Transformers underperform on smaller datasets because they lack the inductive biases built into CNNs, they excel when pre-trained at larger scale. Experimentally, ViT reaches 88.55% accuracy on ImageNet and matches or exceeds state-of-the-art CNNs while requiring fewer computational resources to pre-train. The paper highlights the potential of Transformers to reshape image recognition, especially at scale, while pointing to open challenges in adapting the model to other vision tasks and in improving self-supervised pre-training.
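To make the patch-as-token idea concrete, here is a minimal sketch in PyTorch (an assumption; the paper's reference implementation uses JAX/Flax). The layer sizes (patch_size=16, dim=192, depth=4, etc.) and the class name MiniViT are illustrative choices, not the paper's exact configuration: an image is split into non-overlapping patches, each patch is linearly projected to an embedding, a classification token and position embeddings are added, and a standard Transformer encoder processes the resulting sequence.

```python
# Minimal ViT-style classifier sketch (illustrative hyperparameters, not the
# paper's reference configuration).
import torch
import torch.nn as nn


class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution is equivalent to slicing the
        # image into non-overlapping patches and projecting each one linearly.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable classification token and position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.patch_embed(images)                # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)            # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                         # plain Transformer encoder
        return self.head(self.norm(x[:, 0]))        # classify from the class token


if __name__ == "__main__":
    logits = MiniViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```

In this sketch the only image-specific components are the patch projection and the position embeddings; everything downstream is a standard Transformer encoder, which is the point the paper makes about minimal vision-specific inductive bias.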
This paper employs the following methods:
- Vision Transformer (ViT): a standard Transformer encoder applied to sequences of image patches
- Splitting images into fixed-size patches, linearly embedding them, and adding position embeddings and a classification token
- Large-scale supervised pre-training followed by fine-tuning on downstream image classification benchmarks
The following datasets were used in this research:
- ImageNet
- ImageNet-21k
- JFT-300M
The authors identified the following limitations:
- Transformers underperform comparable CNNs when trained on smaller datasets, since they lack CNNs' built-in inductive biases
- Strong results depend on large-scale pre-training (e.g., on ImageNet-21k or JFT-300M)
- Adapting the model to other vision tasks and improving self-supervised pre-training remain open challenges