Charades-STA

Moment Retrieval Benchmark

Showing 25 results | Metric: R@1 IoU=0.5
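For readers unfamiliar with the metric, R@1 IoU=0.5 is usually read as the percentage of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal intersection-over-union of at least 0.5. The following is a minimal sketch under that reading, not the evaluation script of any listed paper. The names `temporal_iou` and `recall_at_1` are hypothetical, moments are assumed to be `(start, end)` pairs in seconds, and the inclusive `>=` comparison is an assumption, since some evaluation code may use a strict `>`.

```python
from typing import Sequence, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Temporal intersection-over-union of two 1-D segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: Sequence[Span],
    gts: Sequence[Span],
    iou_threshold: float = 0.5,
) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground-truth moment")
    if not gts:
        return 0.0
    # Inclusive comparison is an assumption; see the note above.
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Toy data, not taken from Charades-STA.
    preds = [(0.0, 10.0), (2.0, 10.0), (0.0, 4.0)]
    gts = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
    print(f"R@1 IoU=0.5: {recall_at_1(preds, gts):.2f}")
```

Under this reading, a score of 71.10 would mean that about 71 of every 100 test queries have a top-1 prediction meeting the overlap threshold.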

Top Performing Models

Rank	Model	Paper	R@1 IoU=0.5	Date	Code
1	SG-DETR (w/ PT) 📚	Saliency-Guided DETR for Moment Retrieval and Highlight Detection	71.10	2024-10-02	ai-forever/sg-detr
2	LLaVA-MR	LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval	70.65	2024-11-21	swordlidev/LLaVA-MR
3	FlashVTG	FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding	70.32	2024-12-18	zhuo-cao/flashvtg
4	SG-DETR	Saliency-Guided DETR for Moment Retrieval and Highlight Detection	70.20	2024-10-02	ai-forever/sg-detr
5	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	70.03	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
6	InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	68.36	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
7	VideoChat-T (FT) 📚	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	67.10	2024-10-25	OpenGVLab/TimeSuite
8	UniMD+Sync.	UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection	63.98	2024-04-07	yingsen1/unimd
9	UnLoc-L	UnLoc: A Unified Framework for Video Localization Tasks	60.80	2023-08-21	google-research/scenic
10	BAM-DETR	BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos	59.95	2023-11-30	Pilhyeon/BAM-DETR

All Papers (25)

Model	Paper	Year	Code
SG-DETR (w/ PT)	Saliency-Guided DETR for Moment Retrieval and Highlight Detection	2024	ai-forever/sg-detr
LLaVA-MR	LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval	2024	swordlidev/LLaVA-MR
FlashVTG	FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding	2024	zhuo-cao/flashvtg
SG-DETR	Saliency-Guided DETR for Moment Retrieval and Highlight Detection	2024	ai-forever/sg-detr
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	opengvlab/internvideo, opengvlab/internvideo2
InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	opengvlab/internvideo, opengvlab/internvideo2
VideoChat-T (FT)	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	2024	OpenGVLab/TimeSuite
UniMD+Sync.	UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection	2024	yingsen1/unimd
UnLoc-L	UnLoc: A Unified Framework for Video Localization Tasks	2023	google-research/scenic
BAM-DETR	BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos	2023	Pilhyeon/BAM-DETR
BM-DETR	Background-aware Moment Detection for Video Moment Retrieval	2023	minjoong507/bm-detr
UVCOM	Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection	2023	easonxiao-888/uvcom
CG-DETR	Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding	2023	wjun0830/qd-detr, wjun0830/cgdetr
LLMEPET	Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval	2024	fletcherjiang/llmepet
UnLoc-B	UnLoc: A Unified Framework for Video Localization Tasks	2023	google-research/scenic
QD-DETR (Only Video)	Query-Dependent Video Representation for Moment Retrieval and Highlight Detection	2023	wjun0830/qd-detr
video-mamba-suite	Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding	2024	opengvlab/video-mamba-suite
Moment-DETR w/ PT (on 10K HowTo100M videos)	QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries	2021	jayleicn/moment_detr, tencentarc/umt
Moment-DETR	QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries	2021	jayleicn/moment_detr, tencentarc/umt
LD-DETR	LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection	2025	qingchen239/ld-detr
VideoLights-B-pt	VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval	2024	dpaul06/VideoLights
UMT (VO)	UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection	2022	tencentarc/umt, MindCode-4/code-5, MS-P3/code7
UMT (VA)	UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection	2022	tencentarc/umt, MindCode-4/code-5, MS-P3/code7
VideoChat-T (ZS)	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	2024	OpenGVLab/TimeSuite
SimVTP	SimVTP: Simple Video Text Pre-training with Masked Autoencoders	2022	

Model	Paper	R@1 IoU=0.5	Date
SG-DETR (w/ PT)	Saliency-Guided DETR for Moment Retrieval and Hig…	71.10	2024-10-02
LLaVA-MR	LLaVA-MR: Large Language-and-Vision Assistant for…	70.65	2024-11-21
FlashVTG	FlashVTG: Feature Layering and Adaptive Score Han…	70.32	2024-12-18
SG-DETR	Saliency-Guided DETR for Moment Retrieval and Hig…	70.20	2024-10-02
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	70.03	2024-03-22
InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multi…	68.36	2024-03-22
VideoChat-T (FT)	TimeSuite: Improving MLLMs for Long Video Underst…	67.10	2024-10-25
UniMD+Sync.	UniMD: Towards Unifying Moment Retrieval and Temp…	63.98	2024-04-07
UnLoc-L	UnLoc: A Unified Framework for Video Localization…	60.80	2023-08-21
BAM-DETR	BAM-DETR: Boundary-Aligned Moment Detection Trans…	59.95	2023-11-30
BM-DETR	Background-aware Moment Detection for Video Momen…	59.48	2023-06-05
UVCOM	Bridging the Gap: A Unified Video Comprehension F…	59.25	2023-11-28
CG-DETR	Correlation-Guided Query-Dependency Calibration f…	58.44	2023-11-15
LLMEPET	Prior Knowledge Integration via LLM Encoding and …	58.31	2024-07-21
UnLoc-B	UnLoc: A Unified Framework for Video Localization…	58.10	2023-08-21
QD-DETR (Only Video)	Query-Dependent Video Representation for Moment R…	57.31	2023-03-24
video-mamba-suite	Video Mamba Suite: State Space Model as a Versati…	57.18	2024-03-14
Moment-DETR w/ PT (on 10K HowTo100M videos)	QVHighlights: Detecting Moments and Highlights in…	55.65	2021-07-20
Moment-DETR	QVHighlights: Detecting Moments and Highlights in…	53.63	2021-07-20
LD-DETR	LD-DETR: Loop Decoder DEtection TRansformer for V…	53.44	2025-01-18
VideoLights-B-pt	VideoLights: Feature Refinement and Cross-Task Al…	52.94	2024-12-02
UMT (VO)	UMT: Unified Multi-modal Transformers for Joint V…	49.35	2022-03-23
UMT (VA)	UMT: Unified Multi-modal Transformers for Joint V…	48.31	2022-03-23
VideoChat-T (ZS)	TimeSuite: Improving MLLMs for Long Video Underst…	45.43	2024-10-25
SimVTP	SimVTP: Simple Video Text Pre-training with Maske…	44.70	2022-12-07
