ImageNet

Zero-Shot Transfer Image Classification Benchmark

Performance Over Time

📊 Showing 20 results | 📏 Metric: Top-1 accuracy (%)

Top Performing Models

Rank	Model	Paper	Top-1 Accuracy (%)	Date	Code
1	M2-Encoder 📚	M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining	88.50	2024-01-29	📦 alipay/Ant-Multi-Modal-Framework
2	CoCa 📚	CoCa: Contrastive Captioners are Image-Text Foundation Models	86.30	2022-05-04	📦 mlfoundations/open_clip 📦 facebookresearch/multimodal 📦 lucidrains/CoCa-pytorch
3	LiT-22B	Scaling Vision Transformers to 22 Billion Parameters	85.90	2023-02-10	📦 lucidrains/flash-cosine-sim-attention
4	BASIC 📚	Combined Scaling for Zero-shot Transfer Learning	85.70	2021-11-19	-
5	LiT ViT-e	PaLI: A Jointly-Scaled Multilingual Language-Image Model	85.40	2022-09-14	📦 google-research/big_vision
6	LiT-tuning	LiT: Zero-Shot Transfer with Locked-image text Tuning	84.50	2021-11-15	📦 mlfoundations/open_clip 📦 google-research/vision_transformer 📦 google-research/big_vision 📦 laion-ai/clip_benchmark 📦 eify/clip_benchmark
7	IMP-MoE-L	Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception	83.90	2023-05-10	-
8	EVA-CLIP-18B	EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters	83.80	2024-02-06	📦 baaivision/EVA 📦 baaivision/eva
9	InternVL-C	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	83.20	2023-12-21	📦 opengvlab/internvl 📦 opengvlab/internvl-mmdetseg
10	MAWS (ViT-2B)	The effectiveness of MAE pre-pretraining for billion-scale pretraining	82.10	2023-03-23	📦 facebookresearch/maws

All Papers (20)

M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

2024

M2-Encoder

alipay/Ant-Multi-Modal-Framework

CoCa: Contrastive Captioners are Image-Text Foundation Models

2022

CoCa

mlfoundations/open_clip facebookresearch/multimodal

Scaling Vision Transformers to 22 Billion Parameters

2023

LiT-22B

lucidrains/flash-cosine-sim-attention

Combined Scaling for Zero-shot Transfer Learning

2021

BASIC

PaLI: A Jointly-Scaled Multilingual Language-Image Model

2022

LiT ViT-e

google-research/big_vision

LiT: Zero-Shot Transfer with Locked-image text Tuning

2021

LiT-tuning

mlfoundations/open_clip google-research/vision_transformer

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

2023

IMP-MoE-L

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

2024

EVA-CLIP-18B

baaivision/EVA baaivision/eva

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2023

InternVL-C

opengvlab/internvl opengvlab/internvl-mmdetseg

The effectiveness of MAE pre-pretraining for billion-scale pretraining

2023

MAWS (ViT-2B)

facebookresearch/maws

EVA-CLIP: Improved Training Techniques for CLIP at Scale

2023

EVA-CLIP-E/14+

baaivision/eva PaddlePaddle/PaddleMIX

The effectiveness of MAE pre-pretraining for billion-scale pretraining

2023

MAWS (ViT-H)

facebookresearch/maws

Learning Customized Visual Models with Retrieval-Augmented Knowledge

2023

REACT

microsoft/react

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

2021

ALIGN

facebookresearch/metaclip kakaobrain/coyo-dataset

Learning Transferable Visual Models From Natural Language Supervision

2021

CLIP (ViT-L/14-336px)

openai/CLIP mlfoundations/open_clip

AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

2022

AltCLIP

flagai-open/flagai pwc-1/Paper-8

PaLI: A Jointly-Scaled Multilingual Language-Image Model

2022

PaLI

google-research/big_vision

Your Diffusion Model is Secretly a Zero-Shot Classifier

2023

Diffusion Classifier (zero-shot)

diffusion-classifier/diffusion-classifier SamsungSAILMontreal/ForestDiffusion

Learning Transferable Visual Models From Natural Language Supervision

2021

CLIP (ResNet50)

openai/CLIP mlfoundations/open_clip

Learning Transferable Visual Models From Natural Language Supervision

2021

CLIP

openai/CLIP mlfoundations/open_clip

Model	Paper	Top-1 Accuracy (%)	Date
M2-Encoder	M2-Encoder: Advancing Bilingual Image-Text Unders…	88.50	2024-01-29
CoCa	CoCa: Contrastive Captioners are Image-Text Found…	86.30	2022-05-04
LiT-22B	Scaling Vision Transformers to 22 Billion Paramet…	85.90	2023-02-10
BASIC	Combined Scaling for Zero-shot Transfer Learning	85.70	2021-11-19
LiT ViT-e	PaLI: A Jointly-Scaled Multilingual Language-Imag…	85.40	2022-09-14
LiT-tuning	LiT: Zero-Shot Transfer with Locked-image text Tu…	84.50	2021-11-15
IMP-MoE-L	Alternating Gradient Descent and Mixture-of-Exper…	83.90	2023-05-10
EVA-CLIP-18B	EVA-CLIP-18B: Scaling CLIP to 18 Billion Paramete…	83.80	2024-02-06
InternVL-C	InternVL: Scaling up Vision Foundation Models and…	83.20	2023-12-21
MAWS (ViT-2B)	The effectiveness of MAE pre-pretraining for bill…	82.10	2023-03-23
EVA-CLIP-E/14+	EVA-CLIP: Improved Training Techniques for CLIP a…	82.00	2023-03-27
MAWS (ViT-H)	The effectiveness of MAE pre-pretraining for bill…	81.10	2023-03-23
REACT	Learning Customized Visual Models with Retrieval-…	78.50	2023-01-17
ALIGN	Scaling Up Visual and Vision-Language Representat…	76.40	2021-02-11
CLIP (ViT-L/14-336px)	Learning Transferable Visual Models From Natural …	76.20	2021-02-26
AltCLIP	AltCLIP: Altering the Language Encoder in CLIP fo…	74.50	2022-11-12
PaLI	PaLI: A Jointly-Scaled Multilingual Language-Imag…	72.11	2022-09-14
Diffusion Classifier (zero-shot)	Your Diffusion Model is Secretly a Zero-Shot Clas…	61.40	2023-03-28
CLIP (ResNet50)	Learning Transferable Visual Models From Natural …	59.60	2021-02-26
CLIP	Learning Transferable Visual Models From Natural …	31.30	2021-02-26