Something-Something V1

Action Recognition Benchmark

Performance Over Time

Showing 66 results | Metric: Top 1 Accuracy
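For readers unfamiliar with the metric, the following is a minimal sketch of how top-1 accuracy is commonly computed: a prediction counts as correct when the highest-scoring class equals the ground-truth label, and the result is reported as a percentage. The function name `top1_accuracy` and the toy scores and labels are illustrative assumptions, not taken from any listed paper or from the benchmark's own evaluation code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return top-1 accuracy in percent.

    scores: one row of per-class scores per video (hypothetical layout).
    labels: ground-truth class index for each video.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the model's top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data (assumed for illustration): 4 videos, 3 classes.
example_scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.6, 0.3, 0.1],  # predicts class 0
    [0.2, 0.2, 0.6],  # predicts class 2
    [0.5, 0.4, 0.1],  # predicts class 0
]
example_labels = [1, 0, 2, 1]

print(top1_accuracy(example_scores, example_labels))  # 3 of 4 correct -> 75.0
```

Individual entries may differ in how many clips and crops are averaged before taking the top-scoring class, so the sketch covers only the final scoring step.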

Top Performing Models

Rank	Model	Paper	Top 1 Accuracy	Date	Code
1	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	70.00	2022-12-06	opengvlab/internvideo, yingsen1/unimd
2	VideoMAE V2-g	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	68.70	2023-03-29	OpenGVLab/VideoMAEv2
3	Side4Video (EVA ViT-E/14)	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	67.30	2023-11-27	whwu95/ATM, HJYao00/Side4Video
4	ATM	What Can Simple Arithmetic Operations Do for Temporal Modeling?	65.60	2023-07-18	whwu95/ATM, HJYao00/Side4Video
5	TAdaFormer-L/14	Temporally-Adaptive Models for Efficient Video Understanding	63.70	2023-08-10	alibaba-mmai-research/TAdaConv
6	TDS-CLIP-ViT-L/14(8frames)	TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning	63.00	2024-08-20	BBYL9413/TDS-CLIP
7	UniFormerV2-L	UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer	62.70	2022-09-22	OpenGVLab/UniFormerV2, innat/UniFormerV2
8	StructVit-B-4-1	Learning Correlation Structures for Vision Transformers	61.30	2024-04-05	-
9	TAdaConvNeXtV2-B	Temporally-Adaptive Models for Efficient Video Understanding	60.70	2023-08-10	alibaba-mmai-research/TAdaConv
10	TPS	Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition	58.30	2022-07-27	martinxm/tps

All Papers (66)

Model	Paper	Top 1 Accuracy	Date	Code
InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	70.00	2022-12-06	opengvlab/internvideo, yingsen1/unimd
VideoMAE V2-g	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking	68.70	2023-03-29	OpenGVLab/VideoMAEv2
Side4Video (EVA ViT-E/14)	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	67.30	2023-11-27	whwu95/ATM, HJYao00/Side4Video
ATM	What Can Simple Arithmetic Operations Do for Temporal Modeling?	65.60	2023-07-18	whwu95/ATM, HJYao00/Side4Video
TAdaFormer-L/14	Temporally-Adaptive Models for Efficient Video Understanding	63.70	2023-08-10	alibaba-mmai-research/TAdaConv
TDS-CLIP-ViT-L/14(8frames)	TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning	63.00	2024-08-20	BBYL9413/TDS-CLIP
UniFormerV2-L	UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer	62.70	2022-09-22	OpenGVLab/UniFormerV2, innat/UniFormerV2
StructVit-B-4-1	Learning Correlation Structures for Vision Transformers	61.30	2024-04-05	-
TAdaConvNeXtV2-B	Temporally-Adaptive Models for Efficient Video Understanding	60.70	2023-08-10	alibaba-mmai-research/TAdaConv
TPS	Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition	58.30	2022-07-27	martinxm/tps
SIFA	Stand-Alone Inter-Frame Attention in Video Models	57.30	2022-06-14	fuchenustc/sifa
EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer)	EAN: Event Adaptive Network for Enhanced Action Recognition	57.20	2021-07-22	tianyuan168326/EAN-Pytorch
TCM (Ensemble)	Motion-driven Visual Tempo Learning for Video-based Action Recognition	57.20	2022-02-24	yzfly/tcm, zphyix/tcm
BQNEn (ImageNet + K400 pretrained)	Busy-Quiet Video Disentangling for Video Classification	57.10	2021-03-29	guoxih/busy-quiet-net, guoxih/Busy-Quiet-Video-Disentangling-for-Video-Classification
TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	TDN: Temporal Difference Networks for Efficient Action Recognition	56.80	2020-12-18	MCG-NJU/TDN
SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips)	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	56.60	2021-02-14	arunos728/SELFY
CT-Net Ensemble (R50, 8+12+16+24)	CT-Net: Channel Tensorization Network for Video Classification	56.60	2021-06-03	Andy1621/CT-Net
MLP-3D	MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing	56.50	2022-06-13	-
RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips)	Relational Self-Attention: What's Missing in Attention for Video Understanding	56.10	2021-11-02	KimManjin/RSA
SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip)	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	55.80	2021-02-14	arunos728/SELFY
RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip)	Relational Self-Attention: What's Missing in Attention for Video Understanding	55.50	2021-11-02	KimManjin/RSA
PAN ResNet101 (RGB only, no Flow)	PAN: Towards Fast Action Recognition via Learning Persistence of Appearance	55.30	2020-08-08	zhang-can/PAN-PyTorch, tianyuan168326/EAN-Pytorch
GSM Ensemble InceptionV3 (ImageNet pretrained)	Gate-Shift Networks for Video Action Recognition	55.16	2019-12-01	swathikirans/GSM, Parth27/ActionRecognitionVideos
MSNet-R50En (ensemble)	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	55.10	2020-07-20	arunos728/MotionSqueeze, arunos728/arunos728.github.io
VoV3D-L (32frames, Kinetics pretrained, single)	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	54.59	2020-12-01	youngwanLEE/VoV3D
MSNet-R50En (8+16 ensemble, ImageNet pretrained)	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	54.40	2020-07-20	arunos728/MotionSqueeze, arunos728/arunos728.github.io
SELFYNet-TSM-R50 (16 frames, ImageNet pretrained)	Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition	54.30	2021-02-14	arunos728/SELFY
RNL+TSM Ensemble(R50+R101, ImageNet pretrained)	Region-based Non-local Operation for Video Classification	54.10	2020-07-17	guoxih/region-based-non-local-network
RSANet-R50 (16 frames, ImageNet pretrained, a single clip)	Relational Self-Attention: What's Missing in Attention for Video Understanding	54.00	2021-11-02	KimManjin/RSA
MVFNet-R50EN	MVFNet: Multi-View Fusion Network for Efficient Video Recognition	54.00	2020-12-13	whwu95/MVFNet, whwu95/DSANet, txyugood/PaddleMVF
GB + DF + LB (ResNet152, ImageNet pretrained)	Action recognition with spatial-temporal discriminative filter banks	53.40	2019-08-20	-
ip-CSN-152 (IG-65M pretraining)	Video Classification with Channel-Separated Convolutional Networks	53.30	2019-04-04	open-mmlab/mmaction2, facebookresearch/R2Plus1D
RNL+TSM Ensemble(ResNet50, ImageNet pretrained)	Region-based Non-local Operation for Video Classification	52.70	2020-07-17	guoxih/region-based-non-local-network
VoV3D-M (32frames, Kinetics pretrained, single)	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	52.68	2020-12-01	youngwanLEE/VoV3D
TSM+W3 (16 frames, ResNet50)	Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention	52.60	2020-04-02	-
AK-Net	Action Keypoint Network for Efficient Video Recognition	52.50	2022-01-17	-
MSNet-R50 (16 frames, ImageNet pretrained)	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	52.10	2020-07-20	arunos728/MotionSqueeze, arunos728/arunos728.github.io
ir-CSN-152 (IG-65M pretraining)	Video Classification with Channel-Separated Convolutional Networks	52.10	2019-04-04	open-mmlab/mmaction2, facebookresearch/R2Plus1D
RSANet-R50 (8 frames, ImageNet pretrained, a single clip)	Relational Self-Attention: What's Missing in Attention for Video Understanding	51.90	2021-11-02	KimManjin/RSA
GSM InceptionV3 (16 frames, ImageNet pretrained)	Gate-Shift Networks for Video Action Recognition	51.68	2019-12-01	swathikirans/GSM, Parth27/ActionRecognitionVideos
R(2+1)D-152 (IG-65M pretraining)	Video Classification with Channel-Separated Convolutional Networks	51.60	2019-04-04	open-mmlab/mmaction2, facebookresearch/R2Plus1D
MSNet-R50 (8 frames, ImageNet pretrained)	MotionSqueeze: Neural Motion Feature Learning for Video Understanding	50.90	2020-07-20	arunos728/MotionSqueeze, arunos728/arunos728.github.io
TSM (RGB + Flow)	TSM: Temporal Shift Module for Efficient Video Understanding	50.70	2018-11-20	open-mmlab/mmaction2, towhee-io/towhee
VoV3D-L (32frames, from scratch, single)	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	50.60	2020-12-01	youngwanLEE/VoV3D
ResNet50 I3D (Moments pretrained)	Moments in Time Dataset: one million videos for event understanding	50.00	2018-01-09	zhoubolei/moments_models, metalbubble/moments_models
VoV3D-M (32frames, from scratch, single)	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	49.80	2020-12-01	youngwanLEE/VoV3D
TSMEn	TSM: Temporal Shift Module for Efficient Video Understanding	49.70	2018-11-20	open-mmlab/mmaction2, towhee-io/towhee
TRG (Inception-V3)	Temporal Reasoning Graph for Activity Recognition	49.70	2019-08-27	-
TRG (ResNet-50)	Temporal Reasoning Graph for Activity Recognition	49.50	2019-08-27	-
VoV3D-L (16frames, from scratch, single)	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	49.50	2020-12-01	youngwanLEE/VoV3D
ir-CSN-152	Video Classification with Channel-Separated Convolutional Networks	49.30	2019-04-04	open-mmlab/mmaction2, facebookresearch/R2Plus1D
RSTG (Kinetics pretrained)	Recurrent Space-time Graph Neural Networks	49.20	2019-04-11	IuliaDuta/RSTG
ResNet50 I3D (Kinetics pretrained)	Moments in Time Dataset: one million videos for event understanding	48.60	2018-01-09	zhoubolei/moments_models, metalbubble/moments_models
ir-CSN-101	Video Classification with Channel-Separated Convolutional Networks	48.40	2019-04-04	open-mmlab/mmaction2, facebookresearch/R2Plus1D
S3D-G (ImageNet pretrained)	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	48.20	2017-12-13	kylemin/S3D, 3dperceptionlab/visual-wetlandbirds
VoV3D-M (16frames, from scratch, single)	Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification	48.10	2020-12-01	youngwanLEE/VoV3D
S3D	Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification	47.30	2017-12-13	kylemin/S3D, 3dperceptionlab/visual-wetlandbirds
TSM	TSM: Temporal Shift Module for Efficient Video Understanding	47.20	2018-11-20	open-mmlab/mmaction2, towhee-io/towhee
ECO-Net (ImageNet pretrained)	ECO: Efficient Convolutional Network for Online Video Understanding	46.40	2018-04-24	mzolfaghari/ECO-efficient-video-understanding, mindspore-ai/models
ECO-Net	ECO: Efficient Convolutional Network for Online Video Understanding	46.40	2018-04-24	mzolfaghari/ECO-efficient-video-understanding, mindspore-ai/models
NL I3D + GCN	Videos as Space-Time Region Graphs	46.10	2018-06-05	-
NL I3D	Non-local Neural Networks	44.40	2017-11-21	facebookresearch/detectron, facebookresearch/SlowFast
Motion Feature Net	Motion Feature Network: Fixed Motion Filter for Action Recognition	43.90	2018-07-26	-
2-Stream TRN	Temporal Relational Reasoning in Videos	42.01	2017-11-22	metalbubble/TRN-pytorch, zhoubolei/TRN-pytorch
HF-TSN (ImageNet pretraining)	Hierarchical Feature Aggregation Networks for Video Action Recognition	41.97	2019-05-29	-
M-TRN	Temporal Relational Reasoning in Videos	34.40	2017-11-22	metalbubble/TRN-pytorch, zhoubolei/TRN-pytorch
