Venue: International Conference on Machine Learning
Domain: Computer Vision, Natural Language Processing
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
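The pre-training task described in the abstract, predicting which caption goes with which image over a batch of (image, text) pairs, corresponds to a symmetric contrastive objective. The sketch below is a minimal PyTorch rendering of that objective, based on the pseudocode given in the paper; the function name is illustrative and random tensors stand in for the image- and text-encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    # L2-normalize the joint-embedding projections of both modalities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities, scaled by the learned temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image and i-th text in the batch form the positive pair.
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2

# Toy usage: random features stand in for the encoder outputs.
batch_size, embed_dim = 8, 512
image_features = torch.randn(batch_size, embed_dim)
text_features = torch.randn(batch_size, embed_dim)
logit_scale = torch.tensor(1 / 0.07)  # temperature initialized to 0.07, as in the paper
loss = clip_contrastive_loss(image_features, text_features, logit_scale)
print(loss.item())
```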
The paper presents CLIP (Contrastive Language-Image Pre-training), a model that learns visual representations from images paired with natural language captions. It addresses the limitation of traditional computer vision models, which rely on a fixed set of labeled categories, by leveraging a dataset of 400 million (image, text) pairs collected from the internet. The authors demonstrate that CLIP achieves competitive zero-shot transfer performance across a wide range of downstream tasks without task-specific fine-tuning. The paper discusses the implications of using natural language as a source of supervision for visual models, highlighting efficiency, scalability, and broad transfer capabilities compared to conventional supervised models. The authors benchmark performance on over 30 datasets, reporting improvements over previous systems, while also acknowledging limitations such as weaker generalization to more complex or abstract tasks, remaining robustness gaps under natural distribution shifts, and potential biases in the model's predictions. Overall, CLIP offers a promising framework for flexible, less resource-intensive image classification driven by natural language supervision.
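Since zero-shot transfer is central to the summary above, the following is a minimal sketch of building a zero-shot classifier from class names with the `clip` package released at the repository linked in the abstract. The model name, image path, class names, and prompt template here are illustrative, and the call pattern follows the usage documented in that repository.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Any label set can be described in natural language; these are examples.
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# Placeholder image path for illustration.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity between the image and each class prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

predicted = class_names[probs.argmax().item()]
print(f"Predicted class: {predicted}")
```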
This paper employs the following methods:
- CLIP
- ResNet-50
- Vision Transformer
The following datasets were used in this research:
- WIT
- ImageNet
- YFCC100M
- MS-COCO
- Visual Genome
- MNIST
- SVHN
- UCF101
- Kinetics700
- Country211
The following evaluation metrics were used:
- Accuracy
- Top-1 Accuracy
- R@1
- R@5
- R@10
- ROC AUC
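R@1, R@5, and R@10 above denote retrieval Recall@K, as used for image-text retrieval benchmarks such as MS-COCO. For reference, the helper below is a minimal sketch of computing Recall@K from a query-by-candidate similarity matrix; the function name and the assumption that query i's correct match sits at candidate index i are illustrative.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Fraction of queries whose correct match ranks in the top-k.

    similarity: [num_queries, num_candidates]; the correct candidate for
    query i is assumed to sit at index i (one ground-truth match per query).
    """
    topk = similarity.topk(k, dim=-1).indices                  # [num_queries, k]
    targets = torch.arange(similarity.shape[0]).unsqueeze(1)   # [num_queries, 1]
    hits = (topk == targets).any(dim=-1).float()
    return hits.mean().item()

# Toy example: random similarities between 5 images and 5 captions.
sims = torch.randn(5, 5)
print(recall_at_k(sims, k=1), recall_at_k(sims, k=5))
```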
The paper reports the following key results:
- Zero-shot CLIP matches the accuracy of the original ResNet-50 on ImageNet without using any of its 1.28 million training examples.
- CLIP transfers non-trivially to most of the 30+ evaluated datasets, including fine-grained classification tasks, and is often competitive with fully supervised baselines without dataset-specific training.
The authors identified the following limitations:
- Weaker generalization to more complex or abstract tasks
- Remaining robustness gaps under natural distribution shifts
- Potential biases in the model's predictions
The following compute resources were used:
- Number of GPUs: 592
- GPU Type: V100
This paper covers the following topics:
- Transfer learning
- Natural language supervision
- Contrastive learning
- Multimodal models
- Zero-shot transfer