Venue: International Conference on Machine Learning
Domain: Computer Vision, Natural Language Processing
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
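The pre-training task described in the abstract, predicting which caption goes with which image over a batch of (image, text) pairs, corresponds to a symmetric contrastive objective. The sketch below is a minimal PyTorch rendering of that objective, based on the pseudocode given in the paper; the function name is illustrative and random tensors stand in for the image- and text-encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    # L2-normalize the joint-embedding projections of both modalities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities, scaled by the learned temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image and i-th text in the batch form the positive pair.
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2

# Toy usage: random features stand in for the encoder outputs.
batch_size, embed_dim = 8, 512
image_features = torch.randn(batch_size, embed_dim)
text_features = torch.randn(batch_size, embed_dim)
logit_scale = torch.tensor(1 / 0.07)  # temperature initialized to 0.07, as in the paper
loss = clip_contrastive_loss(image_features, text_features, logit_scale)
print(loss.item())
```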
The paper presents CLIP (Contrastive Language-Image Pre-training), a model that learns visual representations from images paired with natural language captions. It addresses the limitation of traditional computer vision models, which rely on a fixed set of labeled categories, by leveraging a dataset of 400 million (image, text) pairs collected from the internet. The authors demonstrate that CLIP achieves competitive zero-shot transfer performance across a wide range of downstream tasks without task-specific fine-tuning. The paper discusses the implications of using natural language as a source of supervision for visual models, highlighting efficiency, scalability, and broad transfer capabilities compared to conventional supervised models. The authors benchmark performance on over 30 datasets, reporting improvements over previous systems, while also acknowledging limitations such as weaker generalization to more complex or abstract tasks, remaining robustness gaps under natural distribution shifts, and potential biases in the model's predictions. Overall, CLIP offers a promising framework for flexible, less resource-intensive image classification driven by natural language supervision.
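Since zero-shot transfer is central to the summary above, the following is a minimal sketch of building a zero-shot classifier from class names with the `clip` package released at the repository linked in the abstract. The model name, image path, class names, and prompt template here are illustrative, and the call pattern follows the usage documented in that repository.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Any label set can be described in natural language; these are examples.
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# Placeholder image path for illustration.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity between the image and each class prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

predicted = class_names[probs.argmax().item()]
print(f"Predicted class: {predicted}")
```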
This paper employs the following methods:
- CLIP
- ResNet-50
- Vision Transformer
The following datasets were used in this research:
- WIT
- ImageNet
- YFCC100M
- MS-COCO
- Visual Genome
- MNIST
- SVHN
- UCF101
- Kinetics700
- Country211
The following evaluation metrics were used:
- Accuracy
- Top-1 Accuracy
- R@1
- R@5
- R@10
- ROC AUC
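R@1, R@5, and R@10 above denote retrieval Recall@K, as used for image-text retrieval benchmarks such as MS-COCO. For reference, the helper below is a minimal sketch of computing Recall@K from a query-by-candidate similarity matrix; the function name and the assumption that query i's correct match sits at candidate index i are illustrative.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Fraction of queries whose correct match ranks in the top-k.

    similarity: [num_queries, num_candidates]; the correct candidate for
    query i is assumed to sit at index i (one ground-truth match per query).
    """
    topk = similarity.topk(k, dim=-1).indices                  # [num_queries, k]
    targets = torch.arange(similarity.shape[0]).unsqueeze(1)   # [num_queries, 1]
    hits = (topk == targets).any(dim=-1).float()
    return hits.mean().item()

# Toy example: random similarities between 5 images and 5 captions.
sims = torch.randn(5, 5)
print(recall_at_k(sims, k=1), recall_at_k(sims, k=5))
```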
The paper reports the following key results:
- Zero-shot CLIP matches the accuracy of the original ResNet-50 on ImageNet without using any of its 1.28 million training examples.
- CLIP transfers non-trivially to most of the 30+ evaluated datasets, including fine-grained classification tasks, and is often competitive with fully supervised baselines without dataset-specific training.
The authors identified the following limitations:
- Weaker generalization to more complex or abstract tasks
- Remaining robustness gaps under natural distribution shifts
- Potential biases in the model's predictions
The following compute resources were used:
- Number of GPUs: 592
- GPU Type: V100
This paper covers the following topics:
- Transfer learning
- Natural language supervision
- Contrastive learning
- Multimodal models
- Zero-shot transfer