| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MViTv2-B (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | 93.40 | 2021-12-02 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Relational Self-Attention: What's Missing in Atte… | 91.10 | 2021-11-02 |
| MVD (Kinetics400 pretrain, ViT-H, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 77.30 | 2022-12-08 |
| InternVideo | InternVideo: General Video Foundation Models via … | 77.20 | 2022-12-06 |
| InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multi… | 77.10 | 2024-03-22 |
| VideoMAE V2-g | VideoMAE V2: Scaling Video Masked Autoencoders wi… | 77.00 | 2023-03-29 |
| MVD (Kinetics400 pretrain, ViT-L, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 76.70 | 2022-12-08 |
| Hiera-L (no extra data) | Hiera: A Hierarchical Vision Transformer without … | 76.50 | 2023-06-01 |
| TubeViT-L | Rethinking Video ViTs: Sparse Video Tubes for Joi… | 76.10 | 2022-12-06 |
| VideoMAE (no extra data, ViT-L, 32x2) | VideoMAE: Masked Autoencoders are Data-Efficient … | 75.40 | 2022-03-23 |
| Side4Video (EVA ViT-E/14) | Side4Video: Spatial-Temporal Side Network for Mem… | 75.20 | 2023-11-27 |
| MaskFeat (Kinetics600 pretrain, MViT-L) | Masked Feature Prediction for Self-Supervised Vis… | 75.00 | 2021-12-16 |
| MAR (50% mask, ViT-L, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 74.70 | 2022-07-24 |
| ATM | What Can Simple Arithmetic Operations Do for Temp… | 74.60 | 2023-07-18 |
| MAWS (ViT-L) | The effectiveness of MAE pre-pretraining for bill… | 74.40 | 2023-03-23 |
| VideoMAE (no extra data, ViT-L, 16 frames) | VideoMAE: Masked Autoencoders are Data-Efficient … | 74.30 | 2022-03-23 |
| MAR (75% mask, ViT-L, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 73.80 | 2022-07-24 |
| MVD (Kinetics400 pretrain, ViT-B, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 73.70 | 2022-12-08 |
| ViC-MAE (ViT-L) | ViC-MAE: Self-Supervised Representation Learning … | 73.70 | 2023-03-21 |
| TAdaFormer-L/14 | Temporally-Adaptive Models for Efficient Video Un… | 73.60 | 2023-08-10 |
| TDS-CLIP-ViT-L/14 (8 frames) | TDS-CLIP: Temporal Difference Side Network for Im… | 73.40 | 2024-08-20 |
| MViTv2-L (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | 73.30 | 2021-12-02 |
| AMD (ViT-B/16) | Asymmetric Masked Distillation for Pre-Training S… | 73.30 | 2023-11-06 |
| UniFormerV2-L | UniFormerV2: Spatiotemporal Learning by Arming Im… | 73.00 | 2022-09-22 |
| ST-Adapter (ViT-L, CLIP) | ST-Adapter: Parameter-Efficient Image-to-Video Tr… | 72.30 | 2022-06-27 |
| ZeroI2V ViT-L/14 | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tran… | 72.20 | 2023-10-02 |
| MViT-B (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | 72.10 | 2021-12-02 |
| CAST (ViT-B/16) | CAST: Cross-Attention in Space and Time for Video… | 71.60 | 2023-11-30 |
| StructVit-B-4-1 | Learning Correlation Structures for Vision Transf… | 71.50 | 2024-04-05 |
| OMNIVORE (Swin-B, IN-21K + Kinetics400 pretrain) | Omnivore: A Single Model for Many Visual Modaliti… | 71.40 | 2022-01-20 |
| BEVT (IN-1K + Kinetics400 pretrain) | BEVT: BERT Pretraining of Video Transformers | 71.40 | 2021-12-02 |
| TAdaConvNeXtV2-B | Temporally-Adaptive Models for Efficient Video Un… | 71.10 | 2023-08-10 |
| MAR (50% mask, ViT-B, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 71.00 | 2022-07-24 |
| MVD (Kinetics400 pretrain, ViT-S, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 70.90 | 2022-12-08 |
| CoVeR (JFT-3B) | Co-training Transformer with Videos and Images Im… | 70.90 | 2021-12-14 |
| VideoMAE (no extra data, ViT-B, 16 frames) | VideoMAE: Masked Autoencoders are Data-Efficient … | 70.80 | 2022-03-23 |
| AMD (ViT-S/16) | Asymmetric Masked Distillation for Pre-Training S… | 70.20 | 2023-11-06 |
| ILA (ViT-L/14) | Implicit Temporal Modeling with Learnable Alignme… | 70.20 | 2023-04-20 |
| MorphMLP-B (IN-1K) | MorphMLP: An Efficient MLP-Like Backbone for Spat… | 70.10 | 2021-11-24 |
| SIFA | Stand-Alone Inter-Frame Attention in Video Models | 69.80 | 2022-06-14 |
| CoVeR (JFT-300M) | Co-training Transformer with Videos and Images Im… | 69.80 | 2021-12-14 |
| TPS | Spatiotemporal Self-attention Modeling with Tempo… | 69.80 | 2022-07-27 |
| Swin-B (IN-21K + Kinetics400 pretrain) | Video Swin Transformer | 69.60 | 2021-06-24 |
| TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | TDN: Temporal Difference Networks for Efficient A… | 69.60 | 2020-12-18 |
| MAR (75% mask, ViT-B, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 69.50 | 2022-07-24 |
| ORViT Mformer-L (ORViT blocks) | Object-Region Video Transformers | 69.50 | 2021-10-13 |
| MML (ensemble) | Mutual Modality Learning for Video Action Classif… | 69.02 | 2020-11-04 |
| MViT-B-24, 32x3 | Multiscale Vision Transformers | 68.70 | 2021-04-22 |
| MTV-B | Multiview Transformers for Video Recognition | 68.50 | 2022-01-12 |
| MLP-3D | MLP-3D: A MLP-like 3D Architecture with Grouped T… | 68.50 | 2022-06-13 |
| TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | TDN: Temporal Difference Networks for Efficient A… | 68.20 | 2020-12-18 |
| Mformer-L | Keeping Your Eye on the Ball: Trajectory Attentio… | 68.10 | 2021-06-09 |
| VIMPAC | VIMPAC: Video Pre-Training via Masked Token Predi… | 68.10 | 2021-06-21 |
| ORViT Mformer (ORViT blocks) | Object-Region Video Transformers | 67.90 | 2021-10-13 |
| MViT-B, 32x3 (Kinetics600 pretrain) | Multiscale Vision Transformers | 67.80 | 2021-04-22 |
| GC-TDN Ensemble (R50, 8+16) | Group Contextualization for Video Recognition | 67.80 | 2022-03-18 |
| CT-Net Ensemble (R50, 8+12+16+24) | CT-Net: Channel Tensorization Network for Video C… | 67.80 | 2021-06-03 |
| TCM (Ensemble) | Motion-driven Visual Tempo Learning for Video-bas… | 67.80 | 2022-02-24 |
| SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | Learning Self-Similarity in Space and Time as Gen… | 67.70 | 2021-02-14 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Relational Self-Attention: What's Missing in Atte… | 67.70 | 2021-11-02 |
| SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | Learning Self-Similarity in Space and Time as Gen… | 67.40 | 2021-02-14 |
| VoV3D-L (32 frames, Kinetics pretrained, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 67.35 | 2020-12-01 |
| PLAR | SCP: Soft Conditional Prompt Learning for Aerial … | 67.30 | 2023-05-21 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 67.30 | 2021-11-02 |
| X-Vit (x16) | Space-time Mixing Attention for Video Transformer | 67.20 | 2021-06-10 |
| TAda2D-En (ResNet-50, 8+16 frames) | TAda! Temporally-Adaptive Convolutions for Video … | 67.20 | 2021-10-12 |
| Mformer-HR | Keeping Your Eye on the Ball: Trajectory Attentio… | 67.10 | 2021-06-09 |
| TAdaConvNeXt-T | TAda! Temporally-Adaptive Convolutions for Video … | 67.10 | 2021-10-12 |
| MML (single) | Mutual Modality Learning for Video Action Classif… | 66.83 | 2020-11-04 |
| ILA (ViT-B/16) | Implicit Temporal Modeling with Learnable Alignme… | 66.80 | 2023-04-20 |
| TSM (RGB + Flow) | TSM: Temporal Shift Module for Efficient Video Un… | 66.60 | 2018-11-20 |
| MSNet-R50En (8+16 ensemble, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 66.60 | 2020-07-20 |
| PAN ResNet101 (RGB only, no Flow) | PAN: Towards Fast Action Recognition via Learning… | 66.50 | 2020-08-08 |
| TSM+W3 (16 frames, RGB ResNet-50) | Knowing What, Where and When to Look: Efficient V… | 66.50 | 2020-04-02 |
| Mformer | Keeping Your Eye on the Ball: Trajectory Attentio… | 66.50 | 2021-06-09 |
| MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | MVFNet: Multi-View Fusion Network for Efficient V… | 66.30 | 2020-12-13 |
| MViT-B, 16x4 | Multiscale Vision Transformers | 66.20 | 2021-04-22 |
| RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 66.00 | 2021-11-02 |
| VoV3D-L (32 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 65.80 | 2020-12-01 |
| E3D-L | Maximizing Spatio-Temporal Entropy of Deep 3D CNN… | 65.70 | 2023-03-05 |
| SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | Learning Self-Similarity in Space and Time as Gen… | 65.70 | 2021-02-14 |
| TAda2D (ResNet-50, 16 frames) | TAda! Temporally-Adaptive Convolutions for Video … | 65.60 | 2021-10-12 |
| ViViT-L/16x2 Fact. encoder | ViViT: A Video Vision Transformer | 65.40 | 2021-03-29 |
| VoV3D-M (32 frames, Kinetics pretrained, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 65.24 | 2020-12-01 |
| bLVNet | More Is Less: Learning Efficient Video Representa… | 65.20 | 2019-12-02 |
| DirecFormer | DirecFormer: A Directed Attention in Transformer … | 64.94 | 2022-03-19 |
| RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 64.80 | 2021-11-02 |
| MSNet-R50 (16 frames, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 64.70 | 2020-07-20 |
| AK-Net | Action Keypoint Network for Efficient Video Recog… | 64.30 | 2022-01-17 |
| VoV3D-M (32 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 64.20 | 2020-12-01 |
| VoV3D-L (16 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 64.10 | 2020-12-01 |
| TAda2D (ResNet-50, 8 frames) | TAda! Temporally-Adaptive Convolutions for Video … | 64.00 | 2021-10-12 |
| MoViNet-A2 | MoViNets: Mobile Video Networks for Efficient Vid… | 63.50 | 2021-03-21 |
| VoV3D-M (16 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 63.20 | 2020-12-01 |
| MSNet-R50 (8 frames, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 63.00 | 2020-07-20 |
| MoViNet-A1 | MoViNets: Mobile Video Networks for Efficient Vid… | 62.70 | 2021-03-21 |
| OmniVL | OmniVL: One Foundation Model for Image-Language an… | 62.50 | 2022-09-15 |
| TimeSformer-HR | Is Space-Time Attention All You Need for Video Un… | 62.50 | 2021-02-09 |
| TimeSformer-L | Is Space-Time Attention All You Need for Video Un… | 62.30 | 2021-02-09 |
| TRG (ResNet-50) | Temporal Reasoning Graph for Activity Recognition | 62.20 | 2019-08-27 |
| TPN (TSM-50) | Temporal Pyramid Network for Action Recognition | 62.00 | 2020-04-07 |
| Multigrid | A Multigrid Method for Efficiently Training Video… | 61.70 | 2019-12-02 |
| SlowFast | SlowFast Networks for Video Recognition | 61.70 | 2018-12-10 |
| TRG (Inception-V3) | Temporal Reasoning Graph for Activity Recognition | 61.30 | 2019-08-27 |
| MoViNet-A0 | MoViNets: Mobile Video Networks for Efficient Vid… | 61.30 | 2021-03-21 |
| CCS + two-stream + TRN | Cooperative Cross-Stream Network for Discriminati… | 61.20 | 2019-08-27 |
| VidTr-L | VidTr: Video Transformer Without Convolutions | 60.20 | 2021-04-23 |
| TimeSformer | Is Space-Time Attention All You Need for Video Un… | 59.50 | 2021-02-09 |
| SVT | Self-supervised Video Transformer | 59.20 | 2021-12-02 |
| TAM (5-shot) | Few-Shot Video Classification via Temporal Alignm… | 52.30 | 2019-06-27 |
| model3D_1 with left-right augmentation and fps jitter | The "something something" video database for lear… | 51.33 | 2017-06-13 |
| Prob-Distill | Attention Distillation for Learning Video Represe… | 49.90 | 2019-04-05 |
| STM + TRNMultiscale | Comparative Analysis of CNN-based Spatiotemporal … | 47.73 | 2019-09-11 |
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 1.00 | 2024-03-22 |
| MoViNet-A3 | MoViNets: Mobile Video Networks for Efficient Vid… | | 2021-03-21 |
| MViT-L (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | | 2021-12-02 |