Kinetics

Zero-Shot Action Recognition Benchmark

Performance Over Time

Showing 16 results | Metric: Top-1 Accuracy
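
For readers unfamiliar with the metric, the following is a minimal sketch of how Top-1 accuracy is commonly computed: a prediction counts as correct when the highest-scoring class matches the ground-truth label, and the result is reported as a percentage. The function name `top1_accuracy`, the score layout (one row of per-class scores per video), and the toy numbers are assumptions made for illustration. They are not taken from any paper listed on this page, and each paper's evaluation protocol (class splits, number of clips, prompts) may differ.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return Top-1 accuracy as a percentage.

    scores[i][c] is a hypothetical score for video i and class c, for example
    a video-text similarity in a zero-shot setup (an assumption here).
    labels[i] is the index of the ground-truth class of video i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest-scoring class for this video.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data (made up): 4 videos, 3 classes.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
]
toy_labels = [0, 1, 2, 2]

print(top1_accuracy(toy_scores, toy_labels))
```

On the toy data, three of the four videos are assigned their correct class, so the sketch reports 75.0 on the same 0 to 100 scale used in the tables on this page.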

Top Performing Models

Rank	Model	Paper	Top-1 Accuracy	Date	Code
1	TC-CLIP	Leveraging Temporal Contextualization for Video Action Recognition	78.10	2024-04-15	naver-ai/tc-clip, naver-ai/dawin
2	IMP-MoE-L 📚	Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception	76.80	2023-05-10	-
3	OST	OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition	75.10	2023-11-30	tomchen-ctj/OST
4	MAXI	MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge	71.60	2023-03-15	wlin-at/maxi
5	OTI (ViT-L/14)	Orthogonal Temporal Interpolation for Zero-Shot Video Recognition	70.60	2023-08-14	sweetorangezhuyan/mm2023_oti
6	VideoCoCa 📚	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	70.10	2022-12-09	-
7	Text4Vis	Revisiting Classifier: Transferring Vision-Language Models for Video Recognition	68.90	2022-07-04	whwu95/Cap4Video, whwu95/text4vis, whwu95/GPT4Vis, whwu95/BIKE, whwu95/ATM
8	BIKE	Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models	68.50	2022-12-31	whwu95/Cap4Video, whwu95/text4vis, whwu95/GPT4Vis, whwu95/BIKE, whwu95/ATM
9	X-CLIP	Expanding Language-Image Pretrained Models for General Video Recognition	65.20	2022-08-04	microsoft/videox, microsoft/VideoX
10	LanguageBind 📚	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	64.10	2023-10-03	PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind

All Papers (16)

Leveraging Temporal Contextualization for Video Action Recognition

2024

TC-CLIP

naver-ai/tc-clip naver-ai/dawin

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

2023

IMP-MoE-L

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

2023

OST

tomchen-ctj/OST

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

2023

MAXI

wlin-at/maxi

Orthogonal Temporal Interpolation for Zero-Shot Video Recognition

2023

OTI (ViT-L/14)

sweetorangezhuyan/mm2023_oti

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

2022

VideoCoCa

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

2022

Text4Vis

whwu95/Cap4Video whwu95/text4vis

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

2022

BIKE

whwu95/Cap4Video whwu95/text4vis

Expanding Language-Image Pretrained Models for General Video Recognition

2022

X-CLIP

microsoft/videox microsoft/VideoX

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

2023

LanguageBind

PKU-YuanGroup/Video-LLaVA PKU-YuanGroup/MoE-LLaVA

Elaborative Rehearsal for Zero-shot Action Recognition

2021

ER-ZSAR (ST+Obj)

DeLightCMU/ElaborativeRehearsal

Elaborative Rehearsal for Zero-shot Action Recognition

2021

ER-ZSAR (ST)

DeLightCMU/ElaborativeRehearsal

Learning a Deep Embedding Model for Zero-Shot Learning

2016

DEM

lzrobots/DeepEmbeddingModel_ZSL CristianoPatricio/zsl-methods

Label-Embedding for Image Classification

2015

ALE

mvp18/Popular-ZSL-Algorithms inars/developing_mc_for_zsl

All About Knowledge Graphs for Actions

2020

GCN

Evaluation of Output Embeddings for Fine-Grained Image Classification

2014

SJE(Word Embedding)

mvp18/Popular-ZSL-Algorithms inars/developing_mc_for_zsl

Model	Paper	Top-1 Accuracy	Date
TC-CLIP	Leveraging Temporal Contextualization for Video A…	78.10	2024-04-15
IMP-MoE-L	Alternating Gradient Descent and Mixture-of-Exper…	76.80	2023-05-10
OST	OST: Refining Text Knowledge with Optimal Spatio-…	75.10	2023-11-30
MAXI	MAtch, eXpand and Improve: Unsupervised Finetunin…	71.60	2023-03-15
OTI (ViT-L/14)	Orthogonal Temporal Interpolation for Zero-Shot V…	70.60	2023-08-14
VideoCoCa	VideoCoCa: Video-Text Modeling with Zero-Shot Tra…	70.10	2022-12-09
Text4Vis	Revisiting Classifier: Transferring Vision-Langua…	68.90	2022-07-04
BIKE	Bidirectional Cross-Modal Knowledge Exploration f…	68.50	2022-12-31
X-CLIP	Expanding Language-Image Pretrained Models for Ge…	65.20	2022-08-04
LanguageBind	LanguageBind: Extending Video-Language Pretrainin…	64.10	2023-10-03
ER-ZSAR (ST+Obj)	Elaborative Rehearsal for Zero-shot Action Recogn…	42.10	2021-08-05
ER-ZSAR (ST)	Elaborative Rehearsal for Zero-shot Action Recogn…	37.10	2021-08-05
DEM	Learning a Deep Embedding Model for Zero-Shot Lea…	23.60	2016-11-15
ALE	Label-Embedding for Image Classification	23.40	2015-03-30
GCN	All About Knowledge Graphs for Actions	22.30	2020-08-28
SJE(Word Embedding)	Evaluation of Output Embeddings for Fine-Grained …	22.30	2014-09-30