| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MViTv2-B (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | 93.40 | 2021-12-02 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Relational Self-Attention: What's Missing in Atte… | 91.10 | 2021-11-02 |
| MVD (Kinetics400 pretrain, ViT-H, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 77.30 | 2022-12-08 |
| InternVideo | InternVideo: General Video Foundation Models via … | 77.20 | 2022-12-06 |
| InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multi… | 77.10 | 2024-03-22 |
| VideoMAE V2-g | VideoMAE V2: Scaling Video Masked Autoencoders wi… | 77.00 | 2023-03-29 |
| MVD (Kinetics400 pretrain, ViT-L, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 76.70 | 2022-12-08 |
| Hiera-L (no extra data) | Hiera: A Hierarchical Vision Transformer without … | 76.50 | 2023-06-01 |
| TubeViT-L | Rethinking Video ViTs: Sparse Video Tubes for Joi… | 76.10 | 2022-12-06 |
| VideoMAE (no extra data, ViT-L, 32x2) | VideoMAE: Masked Autoencoders are Data-Efficient … | 75.40 | 2022-03-23 |
| Side4Video (EVA ViT-E/14) | Side4Video: Spatial-Temporal Side Network for Mem… | 75.20 | 2023-11-27 |
| MaskFeat (Kinetics600 pretrain, MViT-L) | Masked Feature Prediction for Self-Supervised Vis… | 75.00 | 2021-12-16 |
| MAR (50% mask, ViT-L, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 74.70 | 2022-07-24 |
| ATM | What Can Simple Arithmetic Operations Do for Temp… | 74.60 | 2023-07-18 |
| MAWS (ViT-L) | The effectiveness of MAE pre-pretraining for bill… | 74.40 | 2023-03-23 |
| VideoMAE (no extra data, ViT-L, 16 frames) | VideoMAE: Masked Autoencoders are Data-Efficient … | 74.30 | 2022-03-23 |
| MAR (75% mask, ViT-L, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 73.80 | 2022-07-24 |
| MVD (Kinetics400 pretrain, ViT-B, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 73.70 | 2022-12-08 |
| ViC-MAE (ViT-L) | ViC-MAE: Self-Supervised Representation Learning … | 73.70 | 2023-03-21 |
| TAdaFormer-L/14 | Temporally-Adaptive Models for Efficient Video Un… | 73.60 | 2023-08-10 |
| TDS-CLIP-ViT-L/14 (8 frames) | TDS-CLIP: Temporal Difference Side Network for Im… | 73.40 | 2024-08-20 |
| MViTv2-L (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | 73.30 | 2021-12-02 |
| AMD (ViT-B/16) | Asymmetric Masked Distillation for Pre-Training S… | 73.30 | 2023-11-06 |
| UniFormerV2-L | UniFormerV2: Spatiotemporal Learning by Arming Im… | 73.00 | 2022-09-22 |
| ST-Adapter (ViT-L, CLIP) | ST-Adapter: Parameter-Efficient Image-to-Video Tr… | 72.30 | 2022-06-27 |
| ZeroI2V ViT-L/14 | ZeroI2V: Zero-Cost Adaptation of Pre-trained Tran… | 72.20 | 2023-10-02 |
| MViT-B (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | 72.10 | 2021-12-02 |
| CAST (ViT-B/16) | CAST: Cross-Attention in Space and Time for Video… | 71.60 | 2023-11-30 |
| StructVit-B-4-1 | Learning Correlation Structures for Vision Transf… | 71.50 | 2024-04-05 |
| OMNIVORE (Swin-B, IN-21K + Kinetics400 pretrain) | Omnivore: A Single Model for Many Visual Modaliti… | 71.40 | 2022-01-20 |
| BEVT (IN-1K + Kinetics400 pretrain) | BEVT: BERT Pretraining of Video Transformers | 71.40 | 2021-12-02 |
| TAdaConvNeXtV2-B | Temporally-Adaptive Models for Efficient Video Un… | 71.10 | 2023-08-10 |
| MAR (50% mask, ViT-B, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 71.00 | 2022-07-24 |
| MVD (Kinetics400 pretrain, ViT-S, 16 frames) | Masked Video Distillation: Rethinking Masked Feat… | 70.90 | 2022-12-08 |
| CoVeR (JFT-3B) | Co-training Transformer with Videos and Images Im… | 70.90 | 2021-12-14 |
| VideoMAE (no extra data, ViT-B, 16 frames) | VideoMAE: Masked Autoencoders are Data-Efficient … | 70.80 | 2022-03-23 |
| AMD (ViT-S/16) | Asymmetric Masked Distillation for Pre-Training S… | 70.20 | 2023-11-06 |
| ILA (ViT-L/14) | Implicit Temporal Modeling with Learnable Alignme… | 70.20 | 2023-04-20 |
| MorphMLP-B (IN-1K) | MorphMLP: An Efficient MLP-Like Backbone for Spat… | 70.10 | 2021-11-24 |
| SIFA | Stand-Alone Inter-Frame Attention in Video Models | 69.80 | 2022-06-14 |
| CoVeR (JFT-300M) | Co-training Transformer with Videos and Images Im… | 69.80 | 2021-12-14 |
| TPS | Spatiotemporal Self-attention Modeling with Tempo… | 69.80 | 2022-07-27 |
| Swin-B (IN-21K + Kinetics400 pretrain) | Video Swin Transformer | 69.60 | 2021-06-24 |
| TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only) | TDN: Temporal Difference Networks for Efficient A… | 69.60 | 2020-12-18 |
| MAR (75% mask, ViT-B, 16x4) | MAR: Masked Autoencoders for Efficient Action Rec… | 69.50 | 2022-07-24 |
| ORViT Mformer-L (ORViT blocks) | Object-Region Video Transformers | 69.50 | 2021-10-13 |
| MML (ensemble) | Mutual Modality Learning for Video Action Classif… | 69.02 | 2020-11-04 |
| MViT-B-24, 32x3 | Multiscale Vision Transformers | 68.70 | 2021-04-22 |
| MTV-B | Multiview Transformers for Video Recognition | 68.50 | 2022-01-12 |
| MLP-3D | MLP-3D: A MLP-like 3D Architecture with Grouped T… | 68.50 | 2022-06-13 |
| TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | TDN: Temporal Difference Networks for Efficient A… | 68.20 | 2020-12-18 |
| Mformer-L | Keeping Your Eye on the Ball: Trajectory Attentio… | 68.10 | 2021-06-09 |
| VIMPAC | VIMPAC: Video Pre-Training via Masked Token Predi… | 68.10 | 2021-06-21 |
| ORViT Mformer (ORViT blocks) | Object-Region Video Transformers | 67.90 | 2021-10-13 |
| MViT-B, 32x3 (Kinetics600 pretrain) | Multiscale Vision Transformers | 67.80 | 2021-04-22 |
| GC-TDN Ensemble (R50, 8+16) | Group Contextualization for Video Recognition | 67.80 | 2022-03-18 |
| CT-Net Ensemble (R50, 8+12+16+24) | CT-Net: Channel Tensorization Network for Video C… | 67.80 | 2021-06-03 |
| TCM (Ensemble) | Motion-driven Visual Tempo Learning for Video-bas… | 67.80 | 2022-02-24 |
| SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | Learning Self-Similarity in Space and Time as Gen… | 67.70 | 2021-02-14 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Relational Self-Attention: What's Missing in Atte… | 67.70 | 2021-11-02 |
| SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | Learning Self-Similarity in Space and Time as Gen… | 67.40 | 2021-02-14 |
| VoV3D-L (32 frames, Kinetics pretrained, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 67.35 | 2020-12-01 |
| PLAR | SCP: Soft Conditional Prompt Learning for Aerial … | 67.30 | 2023-05-21 |
| RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 67.30 | 2021-11-02 |
| X-Vit (x16) | Space-time Mixing Attention for Video Transformer | 67.20 | 2021-06-10 |
| TAda2D-En (ResNet-50, 8+16 frames) | TAda! Temporally-Adaptive Convolutions for Video … | 67.20 | 2021-10-12 |
| Mformer-HR | Keeping Your Eye on the Ball: Trajectory Attentio… | 67.10 | 2021-06-09 |
| TAdaConvNeXt-T | TAda! Temporally-Adaptive Convolutions for Video … | 67.10 | 2021-10-12 |
| MML (single) | Mutual Modality Learning for Video Action Classif… | 66.83 | 2020-11-04 |
| ILA (ViT-B/16) | Implicit Temporal Modeling with Learnable Alignme… | 66.80 | 2023-04-20 |
| TSM (RGB + Flow) | TSM: Temporal Shift Module for Efficient Video Un… | 66.60 | 2018-11-20 |
| MSNet-R50En (8+16 ensemble, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 66.60 | 2020-07-20 |
| PAN ResNet101 (RGB only, no Flow) | PAN: Towards Fast Action Recognition via Learning… | 66.50 | 2020-08-08 |
| TSM+W3 (16 frames, RGB ResNet-50) | Knowing What, Where and When to Look: Efficient V… | 66.50 | 2020-04-02 |
| Mformer | Keeping Your Eye on the Ball: Trajectory Attentio… | 66.50 | 2021-06-09 |
| MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only) | MVFNet: Multi-View Fusion Network for Efficient V… | 66.30 | 2020-12-13 |
| MViT-B, 16x4 | Multiscale Vision Transformers | 66.20 | 2021-04-22 |
| RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 66.00 | 2021-11-02 |
| VoV3D-L (32 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 65.80 | 2020-12-01 |
| E3D-L | Maximizing Spatio-Temporal Entropy of Deep 3D CNN… | 65.70 | 2023-03-05 |
| SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | Learning Self-Similarity in Space and Time as Gen… | 65.70 | 2021-02-14 |
| TAda2D (ResNet-50, 16 frames) | TAda! Temporally-Adaptive Convolutions for Video … | 65.60 | 2021-10-12 |
| ViViT-L/16x2 Fact. encoder | ViViT: A Video Vision Transformer | 65.40 | 2021-03-29 |
| VoV3D-M (32 frames, Kinetics pretrained, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 65.24 | 2020-12-01 |
| bLVNet | More Is Less: Learning Efficient Video Representa… | 65.20 | 2019-12-02 |
| DirecFormer | DirecFormer: A Directed Attention in Transformer … | 64.94 | 2022-03-19 |
| RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Relational Self-Attention: What's Missing in Atte… | 64.80 | 2021-11-02 |
| MSNet-R50 (16 frames, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 64.70 | 2020-07-20 |
| AK-Net | Action Keypoint Network for Efficient Video Recog… | 64.30 | 2022-01-17 |
| VoV3D-M (32 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 64.20 | 2020-12-01 |
| VoV3D-L (16 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 64.10 | 2020-12-01 |
| TAda2D (ResNet-50, 8 frames) | TAda! Temporally-Adaptive Convolutions for Video … | 64.00 | 2021-10-12 |
| MoViNet-A2 | MoViNets: Mobile Video Networks for Efficient Vid… | 63.50 | 2021-03-21 |
| VoV3D-M (16 frames, from scratch, single) | Diverse Temporal Aggregation and Depthwise Spatio… | 63.20 | 2020-12-01 |
| MSNet-R50 (8 frames, ImageNet pretrained) | MotionSqueeze: Neural Motion Feature Learning for… | 63.00 | 2020-07-20 |
| MoViNet-A1 | MoViNets: Mobile Video Networks for Efficient Vid… | 62.70 | 2021-03-21 |
| OmniVL | OmniVL: One Foundation Model for Image-Language an… | 62.50 | 2022-09-15 |
| TimeSformer-HR | Is Space-Time Attention All You Need for Video Un… | 62.50 | 2021-02-09 |
| TimeSformer-L | Is Space-Time Attention All You Need for Video Un… | 62.30 | 2021-02-09 |
| TRG (ResNet-50) | Temporal Reasoning Graph for Activity Recognition | 62.20 | 2019-08-27 |
| TPN (TSM-50) | Temporal Pyramid Network for Action Recognition | 62.00 | 2020-04-07 |
| Multigrid | A Multigrid Method for Efficiently Training Video… | 61.70 | 2019-12-02 |
| SlowFast | SlowFast Networks for Video Recognition | 61.70 | 2018-12-10 |
| TRG (Inception-V3) | Temporal Reasoning Graph for Activity Recognition | 61.30 | 2019-08-27 |
| MoViNet-A0 | MoViNets: Mobile Video Networks for Efficient Vid… | 61.30 | 2021-03-21 |
| CCS + two-stream + TRN | Cooperative Cross-Stream Network for Discriminati… | 61.20 | 2019-08-27 |
| VidTr-L | VidTr: Video Transformer Without Convolutions | 60.20 | 2021-04-23 |
| TimeSformer | Is Space-Time Attention All You Need for Video Un… | 59.50 | 2021-02-09 |
| SVT | Self-supervised Video Transformer | 59.20 | 2021-12-02 |
| TAM (5-shot) | Few-Shot Video Classification via Temporal Alignm… | 52.30 | 2019-06-27 |
| model3D_1 with left-right augmentation and fps jitter | The "something something" video database for lear… | 51.33 | 2017-06-13 |
| Prob-Distill | Attention Distillation for Learning Video Represe… | 49.90 | 2019-04-05 |
| STM + TRNMultiscale | Comparative Analysis of CNN-based Spatiotemporal … | 47.73 | 2019-09-11 |
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 1.00 | 2024-03-22 |
| MoViNet-A3 | MoViNets: Mobile Video Networks for Efficient Vid… | | 2021-03-21 |
| MViT-L (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers f… | | 2021-12-02 |