| InternVideo | InternVideo: General Video Foundation Models via … | 70.00 | 2022-12-06 |
| VideoMAE V2-g | VideoMAE V2: Scaling Video Masked Autoencoders wi… | 68.70 | 2023-03-29 |
| Side4Video (EVA ViT-E/14) | Side4Video: Spatial-Temporal Side Network for Mem… | 67.30 | 2023-11-27 |
| ATM | What Can Simple Arithmetic Operations Do for Temp… | 65.60 | 2023-07-18 |
| TAdaFormer-L/14 | Temporally-Adaptive Models for Efficient Video Un… | 63.70 | 2023-08-10 |
| TDS-CLIP-ViT-L/14(8frames) | TDS-CLIP: Temporal Difference Side Network for Im… | 63.00 | 2024-08-20 |
| UniFormerV2-L | UniFormerV2: Spatiotemporal Learning by Arming Im… | 62.70 | 2022-09-22 |
| StructVit-B-4-1 | Learning Correlation Structures for Vision Transf… | 61.30 | 2024-04-05 |
| TAdaConvNeXtV2-B | Temporally-Adaptive Models for Efficient Video Un… | 60.70 | 2023-08-10 |
| TPS | Spatiotemporal Self-attention Modeling with Tempo… | 58.30 | 2022-07-27 |
| SIFA | Stand-Alone Inter-Frame Attention in Video Models | 57.30 | 2022-06-14 |
| EAN ResNet50 (single clip, center crop, 8+16 ensemble, with sparse Transformer) | EAN: Event Adaptive Network for Enhanced Action R… | 57.20 | 2021-07-22 |
| TCM (Ensemble) | Motion-driven Visual Tempo Learning for Video-bas… | 57.20 | 2022-02-24 |
| BQNEn (ImageNet + K400 pretrained) | Busy-Quiet Video Disentangling for Video Classifi… | 57.10 | 2021-03-29 |
| TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | TDN: Temporal Difference Networks for Efficient A… | 56.80 | 2020-12-18 |
| SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | Learning Self-Similarity in Space and Time as Gen… | 56.60 | 2021-02-14 |
| CT-Net Ensemble (R50, 8+12+16+24) | CT-Net: Channel Tensorization Network for Video C… | 56.60 | 2021-06-03 |
| MLP-3D | MLP-3D: A MLP-like 3D Architecture with Grouped T… | 56.50 | 2022-06-13 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Relational Self-Attention: What's Missing in Atte… | 56.10 | 2021-11-02 |
| SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | Learning Self-Similarity in Space and Time as Gen… | 55.80 | 2021-02-14 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 55.50 | 2021-11-02 |
| PAN ResNet101 (RGB only, no Flow) | PAN: Towards Fast Action Recognition via Learning… | 55.30 | 2020-08-08 |
| GSM Ensemble InceptionV3 (ImageNet pretrained) | Gate-Shift Networks for Video Action Recognition | 55.16 | 2019-12-01 |
| MSNet-R50En (ensemble) | MotionSqueeze: Neural Motion Feature Learning for… | 55.10 | 2020-07-20 |
| VoV3D-L (32frames, Kinetics pretrained, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 54.59 | 2020-12-01 |
| MSNet-R50En (8+16 ensemble, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 54.40 | 2020-07-20 |
| SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | Learning Self-Similarity in Space and Time as Gen… | 54.30 | 2021-02-14 |
| RNL+TSM Ensemble(R50+R101, ImageNet pretrained) | Region-based Non-local Operation for Video Classi… | 54.10 | 2020-07-17 |
| RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 54.00 | 2021-11-02 |
| MVFNet-R50EN | MVFNet: Multi-View Fusion Network for Efficient V… | 54.00 | 2020-12-13 |
| GB + DF + LB (ResNet152, ImageNet pretrained) | Action recognition with spatial-temporal discrimi… | 53.40 | 2019-08-20 |
| ip-CSN-152 (IG-65M pretraining) | Video Classification with Channel-Separated Convo… | 53.30 | 2019-04-04 |
| RNL+TSM Ensemble(ResNet50, ImageNet pretrained) | Region-based Non-local Operation for Video Classi… | 52.70 | 2020-07-17 |
| VoV3D-M (32frames, Kinetics pretrained, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 52.68 | 2020-12-01 |
| TSM+W3 (16 frames, ResNet50) | Knowing What, Where and When to Look: Efficient V… | 52.60 | 2020-04-02 |
| AK-Net | Action Keypoint Network for Efficient Video Recog… | 52.50 | 2022-01-17 |
| MSNet-R50 (16 frames, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 52.10 | 2020-07-20 |
| ir-CSN-152 (IG-65M pretraining) | Video Classification with Channel-Separated Convo… | 52.10 | 2019-04-04 |
| RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 51.90 | 2021-11-02 |
| GSM InceptionV3 (16 frames, ImageNet pretrained) | Gate-Shift Networks for Video Action Recognition | 51.68 | 2019-12-01 |
| R(2+1)D-152 (IG-65M pretraining) | Video Classification with Channel-Separated Convo… | 51.60 | 2019-04-04 |
| MSNet-R50 (8 frames, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 50.90 | 2020-07-20 |
| TSM (RGB + Flow) | TSM: Temporal Shift Module for Efficient Video Un… | 50.70 | 2018-11-20 |
| VoV3D-L (32frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 50.60 | 2020-12-01 |
| ResNet50 I3D (Moments pretrained) | Moments in Time Dataset: one million videos for e… | 50.00 | 2018-01-09 |
| VoV3D-M (32frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 49.80 | 2020-12-01 |
| TSMEn | TSM: Temporal Shift Module for Efficient Video Un… | 49.70 | 2018-11-20 |
| TRG (Inception-V3) | Temporal Reasoning Graph for Activity Recognition | 49.70 | 2019-08-27 |
| TRG (ResNet-50) | Temporal Reasoning Graph for Activity Recognition | 49.50 | 2019-08-27 |
| VoV3D-L (16frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 49.50 | 2020-12-01 |
| ir-CSN-152 | Video Classification with Channel-Separated Convo… | 49.30 | 2019-04-04 |
| RSTG (Kinetics pretrained) | Recurrent Space-time Graph Neural Networks | 49.20 | 2019-04-11 |
| ResNet50 I3D (Kinetics pretrained) | Moments in Time Dataset: one million videos for e… | 48.60 | 2018-01-09 |
| ir-CSN-101 | Video Classification with Channel-Separated Convo… | 48.40 | 2019-04-04 |
| S3D-G (ImageNet pretrained) | Rethinking Spatiotemporal Feature Learning: Speed… | 48.20 | 2017-12-13 |
| VoV3D-M (16frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 48.10 | 2020-12-01 |
| S3D | Rethinking Spatiotemporal Feature Learning: Speed… | 47.30 | 2017-12-13 |
| TSM | TSM: Temporal Shift Module for Efficient Video Un… | 47.20 | 2018-11-20 |
| ECO-Net (ImageNet pretrained) | ECO: Efficient Convolutional Network for Online V… | 46.40 | 2018-04-24 |
| ECO-Net | ECO: Efficient Convolutional Network for Online V… | 46.40 | 2018-04-24 |
| NL I3D + GCN | Videos as Space-Time Region Graphs | 46.10 | 2018-06-05 |
| NL I3D | Non-local Neural Networks | 44.40 | 2017-11-21 |
| Motion Feature Net | Motion Feature Network: Fixed Motion Filter for A… | 43.90 | 2018-07-26 |
| 2-Stream TRN | Temporal Relational Reasoning in Videos | 42.01 | 2017-11-22 |
| HF-TSN (ImageNet pretraining) | Hierarchical Feature Aggregation Networks for Vid… | 41.97 | 2019-05-29 |
| M-TRN | Temporal Relational Reasoning in Videos | 34.40 | 2017-11-22 |