UCF101-24

Action Detection Benchmark

Performance Over Time

Showing 15 results | Metric: Frame-mAP 0.5

Top Performing Models

Rank	Model	Paper	Frame-mAP 0.5	Date	Code
1	STAR/L	End-to-End Spatio-Temporal Action Localisation with Video Transformers	90.30	2023-04-24	-
2	SiA	Scaling Open-Vocabulary Action Detection	88.50	2025-04-04	siatheindochinese/sia_act_placeholder
3	YOWO + LFB	You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization	87.30	2019-11-15	wei-tim/YOWO, zwtu/YOWO-Paddle, BoChenUIUC/YOWO, nuschandra/Tennis-Stroke-Detection, Stepphonwol/my_yowo
4	HIT	Holistic Interaction Transformer Network for Action Detection	84.80	2022-10-23	joslefaure/hit
5	YOWO	You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization	80.40	2019-11-15	wei-tim/YOWO, zwtu/YOWO-Paddle, BoChenUIUC/YOWO, nuschandra/Tennis-Stroke-Detection, Stepphonwol/my_yowo
6	Two-in-one Two Stream	Dance with Flow: Two-in-One Stream Action Detection	78.48	2019-04-01	jiaozizhao/Two-in-One-ActionDetection
7	MOC	Actions as Moving Points	77.80	2020-01-14	MCG-NJU/MOC-Detector, NEUdeep/MOC-Detector-Pytorch1.4
8	Faster-RCNN + two-stream I3D conv	AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions	76.30	2017-05-23	tensorflow/models, open-mmlab/mmaction2, Whiffe/Custom-ava-dataset_Custom-Spatio-Temporally-Action-Video-Dataset
9	Two-in-one	Dance with Flow: Two-in-One Stream Action Detection	75.48	2019-04-01	jiaozizhao/Two-in-One-ActionDetection
10	STEP	STEP: Spatio-Temporal Progressive Learning for Video Action Detection	75.00	2019-04-19	NVlabs/STEP

All Papers (15)

Paper	Year	Model	Code
End-to-End Spatio-Temporal Action Localisation with Video Transformers	2023	STAR/L	-
Scaling Open-Vocabulary Action Detection	2025	SiA	siatheindochinese/sia_act_placeholder
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization	2019	YOWO + LFB	wei-tim/YOWO, zwtu/YOWO-Paddle
Holistic Interaction Transformer Network for Action Detection	2022	HIT	joslefaure/hit
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization	2019	YOWO	wei-tim/YOWO, zwtu/YOWO-Paddle
Dance with Flow: Two-in-One Stream Action Detection	2019	Two-in-one Two Stream	jiaozizhao/Two-in-One-ActionDetection
Actions as Moving Points	2020	MOC	MCG-NJU/MOC-Detector, NEUdeep/MOC-Detector-Pytorch1.4
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions	2017	Faster-RCNN + two-stream I3D conv	tensorflow/models, open-mmlab/mmaction2
Dance with Flow: Two-in-One Stream Action Detection	2019	Two-in-one	jiaozizhao/Two-in-One-ActionDetection
STEP: Spatio-Temporal Progressive Learning for Video Action Detection	2019	STEP	NVlabs/STEP
Stable Mean Teacher for Semi-supervised Video Action Detection	2024	Stable Mean Teacher (I3D)	akash2907/stable_mean_teacher, AKASH2907/stable-mean-teacher
TACNet: Transition-Aware Context Network for Spatio-Temporal Action Detection	2019	TACNet	-
End-to-End Semi-Supervised Learning for Video Action Detection	2022	E2E-SSL (I3D)	AKASH2907/End-to-End-Semi-Supervised-Learning-for-Video-Action-Detection
Finding Action Tubes with a Sparse-to-Dense Framework	2020	DTS	-
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos	2017	T-CNN	cyberpunk317/Action_detection

Model	Paper	Frame-mAP 0.5	Date
STAR/L	End-to-End Spatio-Temporal Action Localisation wi…	90.30	2023-04-24
SiA	Scaling Open-Vocabulary Action Detection	88.50	2025-04-04
YOWO + LFB	You Only Watch Once: A Unified CNN Architecture f…	87.30	2019-11-15
HIT	Holistic Interaction Transformer Network for Acti…	84.80	2022-10-23
YOWO	You Only Watch Once: A Unified CNN Architecture f…	80.40	2019-11-15
Two-in-one Two Stream	Dance with Flow: Two-in-One Stream Action Detecti…	78.48	2019-04-01
MOC	Actions as Moving Points	77.80	2020-01-14
Faster-RCNN + two-stream I3D conv	AVA: A Video Dataset of Spatio-temporally Localiz…	76.30	2017-05-23
Two-in-one	Dance with Flow: Two-in-One Stream Action Detecti…	75.48	2019-04-01
STEP	STEP: Spatio-Temporal Progressive Learning for Vi…	75.00	2019-04-19
Stable Mean Teacher (I3D)	Stable Mean Teacher for Semi-supervised Video Act…	73.90	2024-12-10
TACNet	TACNet: Transition-Aware Context Network for Spat…	72.10	2019-05-31
E2E-SSL (I3D)	End-to-End Semi-Supervised Learning for Video Act…	69.90	2022-03-08
DTS	Finding Action Tubes with a Sparse-to-Dense Frame…	54.00	2020-08-30
T-CNN	Tube Convolutional Neural Network (T-CNN) for Act…	41.37	2017-03-30
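
The page does not define its metric, so the following is only a rough sketch of how a frame-level mAP at an IoU threshold of 0.5 is commonly computed: within each class, detections from all frames are ranked by score, a detection counts as a true positive when it overlaps a not-yet-matched ground-truth box in the same frame with IoU of at least 0.5, and the per-class average precisions are averaged. The function names, the data layout, the greedy matching rule, and the uninterpolated precision-recall sum are all assumptions made for illustration; the evaluation scripts behind the listed papers may differ (for example in interpolation), and the leaderboard values appear to be this quantity expressed as a percentage.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(dets, gts, thr=0.5):
    """AP for one class (hypothetical layout, not taken from any paper).

    dets: list of (frame_id, score, box)
    gts:  dict mapping frame_id -> list of ground-truth boxes
    """
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = defaultdict(set)  # frame_id -> ground-truth indices already claimed
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    # Visit detections from highest to lowest confidence.
    for frame_id, _score, box in sorted(dets, key=lambda d: -d[1]):
        candidates = [
            (iou(box, gt_box), i)
            for i, gt_box in enumerate(gts.get(frame_id, []))
            if i not in matched[frame_id]
        ]
        best_iou, best_i = max(candidates, default=(0.0, -1))
        if best_iou >= thr:
            matched[frame_id].add(best_i)
            tp += 1
        else:
            fp += 1
        recall = tp / n_gt
        precision = tp / (tp + fp)
        # Uninterpolated area under the precision-recall curve.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def frame_map(dets_by_class, gts_by_class, thr=0.5):
    """Mean of per-class APs over the classes that have ground truth."""
    classes = list(gts_by_class)
    aps = [average_precision(dets_by_class.get(c, []), gts_by_class[c], thr)
           for c in classes]
    return sum(aps) / len(aps)
```

Calling `frame_map(dets_by_class, gts_by_class)` returns a value between 0 and 1; multiplying by 100 gives a number on the same scale as the Frame-mAP 0.5 column.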