
Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (2021)

Paper Information
arXiv ID
2103.00020
Venue
International Conference on Machine Learning
Domain
Computer Vision, Natural Language Processing
SOTA Claim
Yes
Code
https://github.com/OpenAI/CLIP
Reproducibility
8/10

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
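The pre-training task described above, predicting which caption goes with which image, is typically implemented as a symmetric contrastive loss over a batch of (image, text) embedding pairs. The sketch below is a minimal PyTorch illustration, not the paper's exact implementation: the `image_features`/`text_features` inputs are assumed to come from hypothetical image and text encoders, and the fixed `temperature` stands in for the learned temperature parameter CLIP actually uses.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired (image, text) embeddings.

    image_features, text_features: (N, D) tensors from hypothetical encoders,
    where the i-th image and i-th text form a positive pair.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities between every image and every text in the batch
    logits = image_features @ text_features.t() / temperature

    # Correct matches lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```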

Summary

The paper presents CLIP (Contrastive Language-Image Pre-training), a model that learns visual representations from images and their corresponding natural language captions. It addresses the limitation of traditional computer vision models that rely on a fixed set of labeled categories by leveraging a large dataset of 400 million (image, text) pairs sourced from the internet. The authors demonstrate that CLIP achieves competitive performance in zero-shot transfer across various downstream tasks without task-specific fine-tuning. The paper discusses the implications of using natural language as a source of supervision for visual models, highlighting efficiency, scalability, and broad transfer capabilities compared to conventional supervised models. The authors benchmark performance on over 30 datasets, revealing improvements over previous systems, but also acknowledge limitations such as weaker generalization to more complex or abstract tasks, imperfect robustness under natural distribution shifts, and potential social biases in the model's predictions. Overall, CLIP offers a promising framework for flexible and less resource-intensive image classification using natural language supervision.
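Zero-shot transfer, as described above, works by embedding candidate class names with the text encoder (via a prompt such as "a photo of a {label}") and picking the class whose text embedding is most similar to the image embedding. The sketch below uses the released `clip` package from the repository linked in the abstract; the prompt template, label set, and image path are illustrative assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pre-trained CLIP model and its matching image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes are expressed in natural language via a prompt template
labels = ["dog", "cat", "airplane"]  # illustrative label set
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

# Encode the image and the class prompts into the shared embedding space
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the image and each class prompt -> class probabilities
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print({label: p.item() for label, p in zip(labels, probs[0])})
```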

Methods

This paper employs the following methods:

  • Contrastive Learning

Models Used

  • CLIP
  • ResNet-50
  • Vision Transformer

Datasets

The following datasets were used in this research:

  • WIT
  • ImageNet
  • YFCC100M
  • MS-COCO
  • Visual Genome
  • MNIST
  • SVHN
  • UCF101
  • Kinetics700
  • Country211

Evaluation Metrics

  • Accuracy
  • Top-1 Accuracy
  • R@1
  • R@5
  • R@10
  • ROC AUC

Results

  • Zero-shot CLIP matches the accuracy of the original ResNet-50 on ImageNet without using any of its 1.28 million training examples.
  • CLIP transfers non-trivially to most of the 30+ benchmarked datasets, spanning fine-grained classification, OCR, action recognition, and geo-localization, and is often competitive with fully supervised baselines.

Limitations

The authors identified the following limitations:

  • Weaker generalization to more complex or abstract tasks
  • Imperfect robustness under natural distribution shifts
  • Potential social biases in the model's predictions

Technical Requirements

  • Number of GPUs: 592
  • GPU Type: V100

Keywords

  • Transfer learning
  • Natural language supervision
  • Contrastive learning
  • Multimodal models
  • Zero-shot transfer

Papers Using Similar Methods

External Resources

  • Code and pre-trained model weights: https://github.com/OpenAI/CLIP