
Something-Something V2

Action Recognition Benchmark

Performance Over Time

Showing 116 results | Metric: Top-1 Accuracy

Top Performing Models

| Rank | Model | Paper | Top-1 Accuracy | Date | Code |
|------|-------|-------|----------------|------|------|
| 1 | MViTv2-B (IN-21K + Kinetics400 pretrain) | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | 93.40 | 2021-12-02 | rwightman/pytorch-image-models, facebookresearch/detectron2, facebookresearch/SlowFast |
| 2 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Relational Self-Attention: What's Missing in Attention for Video Understanding | 91.10 | 2021-11-02 | KimManjin/RSA |
| 3 | MVD (Kinetics400 pretrain, ViT-H, 16 frame) | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 77.30 | 2022-12-08 | ruiwang2021/mvd, 2023-MindSpore-4/Code-5, Mind23-2/MindCode-3, Mind23-2/MindCode-101 |
| 4 | InternVideo | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 77.20 | 2022-12-06 | opengvlab/internvideo, yingsen1/unimd |
| 5 | InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 77.10 | 2024-03-22 | opengvlab/internvideo, opengvlab/internvideo2 |
| 6 | VideoMAE V2-g | VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | 77.00 | 2023-03-29 | OpenGVLab/VideoMAEv2 |
| 7 | MVD (Kinetics400 pretrain, ViT-L, 16 frame) | Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning | 76.70 | 2022-12-08 | ruiwang2021/mvd, 2023-MindSpore-4/Code-5, Mind23-2/MindCode-3, Mind23-2/MindCode-101 |
| 8 | Hiera-L (no extra data) | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles | 76.50 | 2023-06-01 | huggingface/pytorch-image-models, facebookresearch/hiera, leondgarse/keras_cv_attention_models, birder/birder |
| 9 | TubeViT-L | Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | 76.10 | 2022-12-06 | daniel-code/TubeViT |
| 10 | VideoMAE (no extra data, ViT-L, 32x2) | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | 75.40 | 2022-03-23 | huggingface/transformers, MCG-NJU/VideoMAE, MCG-NJU/VideoMAE-Action-Detection |
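The ranking metric above, Top-1 Accuracy, counts a video as correct only when the model's single highest-scoring class matches the ground-truth label. A minimal sketch of that computation (function and variable names here are illustrative, not taken from any of the listed codebases):

```python
def top1_accuracy(logits, labels):
    """Percent of samples whose argmax class equals the ground-truth label.

    logits: list of per-class score lists, one per video clip.
    labels: list of integer ground-truth class ids.
    """
    correct = 0
    for scores, label in zip(logits, labels):
        # argmax over the class scores for this clip
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)

# Example: 2 of 3 clips classified correctly
scores = [
    [0.1, 0.8, 0.1],  # predicts class 1
    [0.7, 0.2, 0.1],  # predicts class 0
    [0.3, 0.3, 0.4],  # predicts class 2
]
print(round(top1_accuracy(scores, [1, 0, 1]), 2))  # 66.67
```

Entries such as "2 clips" or "32x2" in the model column describe the inference-time sampling scheme (number of clips and crops averaged per video) rather than the metric itself.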

All Papers (116)

TDN: Temporal Difference Networks for Efficient Action Recognition (2020)
TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only)

Object-Region Video Transformers (2021)
ORViT Mformer-L (ORViT blocks)

TDN: Temporal Difference Networks for Efficient Action Recognition (2020)
TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)

Object-Region Video Transformers (2021)
ORViT Mformer (ORViT blocks)

Relational Self-Attention: What's Missing in Attention for Video Understanding (2021)
RSANet-R50 (8+16 frames, ImageNet pretrained, single clip)

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (2020)
MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only)