UCF101

Zero-Shot Action Recognition Benchmark

📊 Showing 27 results | 📏 Metric: Top-1 Accuracy

Top Performing Models

Rank	Model	Paper	Top-1 Accuracy (%)	Date	Code
1	OTI(ViT-L/14)	Orthogonal Temporal Interpolation for Zero-Shot Video Recognition	92.80	2023-08-14	📦 sweetorangezhuyan/mm2023_oti
2	IMP-MoE-L 📚	Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception	91.50	2023-05-10	-
3	MOV (ViT-L/14)	Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models	87.10	2022-07-15	-
4	VideoCoCa 📚	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	86.60	2022-12-09	-
5	BIKE	Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models	86.60	2022-12-31	📦 whwu95/Cap4Video 📦 whwu95/text4vis 📦 whwu95/GPT4Vis 📦 whwu95/BIKE 📦 whwu95/ATM
6	Text4Vis	Revisiting Classifier: Transferring Vision-Language Models for Video Recognition	85.80	2022-07-04	📦 whwu95/Cap4Video 📦 whwu95/text4vis 📦 whwu95/GPT4Vis 📦 whwu95/BIKE 📦 whwu95/ATM
7	TC-CLIP	Leveraging Temporal Contextualization for Video Action Recognition	85.40	2024-04-15	📦 naver-ai/tc-clip 📦 naver-ai/dawin
8	EVA-CLIP-E/14+ 📚	EVA-CLIP: Improved Training Techniques for CLIP at Scale	83.10	2023-03-27	📦 baaivision/eva 📦 PaddlePaddle/PaddleMIX 📦 Yui010206/CREMA 📦 jaehong31/raccoon
9	MOV (ViT-B/16)	Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models	82.60	2022-07-15	-
10	OST	OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition	79.70	2023-11-30	📦 tomchen-ctj/OST

All Papers (27)

Paper	Year	Model	Code
Orthogonal Temporal Interpolation for Zero-Shot Video Recognition	2023	OTI(ViT-L/14)	sweetorangezhuyan/mm2023_oti
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception	2023	IMP-MoE-L	-
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models	2022	MOV (ViT-L/14)	-
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	2022	VideoCoCa	-
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models	2022	BIKE	whwu95/Cap4Video whwu95/text4vis
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition	2022	Text4Vis	whwu95/Cap4Video whwu95/text4vis
Leveraging Temporal Contextualization for Video Action Recognition	2024	TC-CLIP	naver-ai/tc-clip naver-ai/dawin
EVA-CLIP: Improved Training Techniques for CLIP at Scale	2023	EVA-CLIP-E/14+	baaivision/eva PaddlePaddle/PaddleMIX
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models	2022	MOV (ViT-B/16)	-
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition	2023	OST	tomchen-ctj/OST
EZ-CLIP: Efficient Zeroshot Video Action Recognition	2023	EZ-CLIP	shahzadnit/ez-clip
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge	2023	MAXI	wlin-at/maxi
VicTR: Video-conditioned Text Representations for Activity Recognition	2023	VicTR (ViT-B/16)	-
Expanding Language-Image Pretrained Models for General Video Recognition	2022	X-CLIP	microsoft/videox microsoft/VideoX
Cross-modal Representation Learning for Zero-shot Action Recognition	2022	ResT	-
Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification	2022	AURL	ShipuLoveMili/CVPR2022-AURL
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition	2021	CLASTER	-
Elaborative Rehearsal for Zero-shot Action Recognition	2021	ER-ZSAR	DeLightCMU/ElaborativeRehearsal
Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications	2020	E2E	bbrattoli/ZeroShotVideoClassification
Synthetic Sample Selection for Generalized Zero-Shot Learning	2023	SPOT	-
Objects2action: Classifying and localizing actions without any video example	2015	O2A	-
Alternative Semantic Representations for Zero-Shot Human Action Recognition	2017	ASR	-
Towards Universal Representation for Unseen Action Recognition	2018	UR	-
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation	2016	MTE	-
Evaluation of Output Embeddings for Fine-Grained Image Classification	2014	SJE(Attribute)	mvp18/Popular-ZSL-Algorithms inars/developing_mc_for_zsl
Semantic Embedding Space for Zero-Shot Action Recognition	2015	SVE	-
Evaluation of Output Embeddings for Fine-Grained Image Classification	2014	SJE(Word Embedding)	mvp18/Popular-ZSL-Algorithms inars/developing_mc_for_zsl

Model	Paper	Top-1 Accuracy (%)	Date
OTI(ViT-L/14)	Orthogonal Temporal Interpolation for Zero-Shot V…	92.80	2023-08-14
IMP-MoE-L	Alternating Gradient Descent and Mixture-of-Exper…	91.50	2023-05-10
MOV (ViT-L/14)	Multimodal Open-Vocabulary Video Classification v…	87.10	2022-07-15
VideoCoCa	VideoCoCa: Video-Text Modeling with Zero-Shot Tra…	86.60	2022-12-09
BIKE	Bidirectional Cross-Modal Knowledge Exploration f…	86.60	2022-12-31
Text4Vis	Revisiting Classifier: Transferring Vision-Langua…	85.80	2022-07-04
TC-CLIP	Leveraging Temporal Contextualization for Video A…	85.40	2024-04-15
EVA-CLIP-E/14+	EVA-CLIP: Improved Training Techniques for CLIP a…	83.10	2023-03-27
MOV (ViT-B/16)	Multimodal Open-Vocabulary Video Classification v…	82.60	2022-07-15
OST	OST: Refining Text Knowledge with Optimal Spatio-…	79.70	2023-11-30
EZ-CLIP	EZ-CLIP: Efficient Zeroshot Video Action Recognit…	79.10	2023-12-13
MAXI	MAtch, eXpand and Improve: Unsupervised Finetunin…	78.20	2023-03-15
VicTR (ViT-B/16)	VicTR: Video-conditioned Text Representations for…	72.40	2023-04-05
X-CLIP	Expanding Language-Image Pretrained Models for Ge…	72.00	2022-08-04
ResT	Cross-modal Representation Learning for Zero-shot…	58.70	2022-05-03
AURL	Alignment-Uniformity aware Representation Learnin…	58.00	2022-03-29
CLASTER	CLASTER: Clustering with Reinforcement Learning f…	53.90	2021-01-18
ER-ZSAR	Elaborative Rehearsal for Zero-shot Action Recogn…	51.80	2021-08-05
E2E	Rethinking Zero-shot Video Classification: End-to…	48.00	2020-03-03
SPOT	Synthetic Sample Selection for Generalized Zero-S…	40.90	2023-04-06
O2A	Objects2action: Classifying and localizing action…	30.30	2015-10-23
ASR	Alternative Semantic Representations for Zero-Sho…	24.40	2017-06-28
UR	Towards Universal Representation for Unseen Actio…	17.50	2018-03-22
MTE	Multi-Task Zero-Shot Action Recognition with Prio…	15.80	2016-11-26
SJE(Attribute)	Evaluation of Output Embeddings for Fine-Grained …	12.00	2014-09-30
SVE	Semantic Embedding Space for Zero-Shot Action Rec…	10.90	2015-02-05
SJE(Word Embedding)	Evaluation of Output Embeddings for Fine-Grained …	9.90	2014-09-30
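
The metric reported above, Top-1 accuracy, counts a prediction as correct when the highest-scoring class matches the ground-truth label. The following is a minimal sketch of that computation, not the evaluation code of any listed paper. The function name `top1_accuracy`, the score matrix, and the labels are all hypothetical, and the sketch says nothing about which class splits or protocols the individual papers used.

```python
def top1_accuracy(scores, labels):
    """Return the percentage of samples whose top-scoring class equals the label.

    scores: list of per-sample score lists (one score per candidate class).
    labels: list of ground-truth class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # argmax over class scores; ties resolve to the lowest class index
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Hypothetical video-to-class similarity scores: 4 videos x 3 unseen classes
scores = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.30, 0.30, 0.40],
    [0.60, 0.30, 0.10],
]
labels = [0, 1, 2, 1]

accuracy = top1_accuracy(scores, labels)
print(f"Top-1 accuracy: {accuracy:.2f}")
```

In this toy example three of the four videos are classified correctly, so the function returns 75.0, on the same percentage scale as the table.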
