
ActivityNet

Video Retrieval Benchmark

Performance Over Time

Showing 31 results | Metric: text-to-video R@1

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	74.10	2024-03-22	opengvlab/internvideo opengvlab/internvideo2
2	VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	70.50	2023-05-29	TXH-mercury/VALOR txh-mercury/vast
3	VALOR	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	70.10	2023-04-17	TXH-mercury/VALOR
4	GRAM	Gramian Multimodal Representation Learning and Alignment	69.90	2024-12-16	ispamm/GRAM luigisigillo/gwit
5	COSA	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	67.30	2023-06-15	txh-mercury/cosa
6	UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	66.80	2023-03-28	opengvlab/unmasked_teacher
7	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	66.70	2024-03-20	mlvlab/vid-tldr
8	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	62.20	2022-12-06	opengvlab/internvideo yingsen1/unimd
9	CLIP-ViP	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	61.40	2022-09-14	microsoft/xpretrain
10	HunYuan_tvr	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	57.30	2022-04-07	-

All Papers (31)

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

2024

InternVideo2-6B

opengvlab/internvideo opengvlab/internvideo2

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

2023

VAST

TXH-mercury/VALOR txh-mercury/vast

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

2023

VALOR

TXH-mercury/VALOR

Gramian Multimodal Representation Learning and Alignment

2024

GRAM

ispamm/GRAM luigisigillo/gwit

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

2023

COSA

txh-mercury/cosa

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

2023

UMT-L (ViT-L/16)

opengvlab/unmasked_teacher

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

2024

vid-TLDR (UMT-L)

mlvlab/vid-tldr

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

2022

InternVideo

opengvlab/internvideo yingsen1/unimd

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

2022

CLIP-ViP

microsoft/xpretrain

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

2022

HunYuan_tvr

VindLU: A Recipe for Effective Video-and-Language Pretraining

2022

VindLU

klauscc/vindlu

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

2023

TESTA (ViT-B/16)

renshuhuai-andy/testa

RTQ: Rethinking Video-language Understanding Based on Image-text Model

2023

RTQ

SCZwangxiao/RTQ-MM2023 sczwangxiao/tsgvs-mm2023

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

2023

DMAE (ViT-B/32)

alipay/Ant-Multi-Modal-Framework

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

2021

CAMoE

starmemda/camow starmemda/CAMoE

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

2022

EMCL-Net++

jpthu17/emcl jpthu17/diffusionret

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

2022

HiTeA

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

2023

DiffusionRet+QB-Norm

jpthu17/emcl jpthu17/diffusionret

Revealing Single Frame Bias for Video-and-Language Learning

2022

Singularity

jayleicn/ClipBERT jayleicn/singularity

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

2022

CenterCLIP (ViT-B/16)

mzhaoshuai/CenterCLIP

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

2022

X-CLIP

xuguohai/X-CLIP MindCode-4/code-5 MindSpore-scientific/code-7

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

2023

DiffusionRet

jpthu17/emcl jpthu17/diffusionret

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

2023

HBI

jpthu17/emcl jpthu17/diffusionret

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

2022

EMCL-Net

jpthu17/emcl jpthu17/diffusionret

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

2021

CLIP4Clip

towhee-io/towhee ArrowLuo/CLIP4Clip

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

2021

TACo

Multi-modal Transformer for Video Retrieval

2020

MMT-Pretrained

gabeur/mmt

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

2021

HD-VILA

microsoft/xpretrain

Video and Text Matching with Conditioned Embeddings

2021

Ours

ameenali/videomatch

Multi-modal Transformer for Video Retrieval

2020

MMT

gabeur/mmt

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

2019

Collaborative Experts

albanie/collaborative-experts caijincen712/CE hbzhang/cvpr2020

Model	Paper	text-to-video R@1	Date
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	74.10	2024-03-22
VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality …	70.50	2023-05-29
VALOR	VALOR: Vision-Audio-Language Omni-Perception Pret…	70.10	2023-04-17
GRAM	Gramian Multimodal Representation Learning and Al…	69.90	2024-12-16
COSA	COSA: Concatenated Sample Pretrained Vision-Langu…	67.30	2023-06-15
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Vide…	66.80	2023-03-28
vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-w…	66.70	2024-03-20
InternVideo	InternVideo: General Video Foundation Models via …	62.20	2022-12-06
CLIP-ViP	CLIP-ViP: Adapting Pre-trained Image-Text Model t…	61.40	2022-09-14
HunYuan_tvr	Tencent Text-Video Retrieval: Hierarchical Cross-…	57.30	2022-04-07
VindLU	VindLU: A Recipe for Effective Video-and-Language…	55.00	2022-12-09
TESTA (ViT-B/16)	TESTA: Temporal-Spatial Token Aggregation for Lon…	54.80	2023-10-29
RTQ	RTQ: Rethinking Video-language Understanding Base…	53.50	2023-12-01
DMAE (ViT-B/32)	Dual-Modal Attention-Enhanced Text-Video Retrieva…	53.40	2023-09-20
CAMoE	Improving Video-Text Retrieval by Multi-Stream Co…	51.00	2021-09-09
EMCL-Net++	Expectation-Maximization Contrastive Learning for…	50.60	2022-11-21
HiTeA	HiTeA: Hierarchical Temporal-Aware Video-Language…	49.70	2022-12-30
DiffusionRet+QB-Norm	DiffusionRet: Generative Text-Video Retrieval wit…	48.10	2023-03-17
Singularity	Revealing Single Frame Bias for Video-and-Languag…	47.10	2022-06-07
CenterCLIP (ViT-B/16)	CenterCLIP: Token Clustering for Efficient Text-V…	46.20	2022-05-02
X-CLIP	X-CLIP: End-to-End Multi-grained Contrastive Lear…	46.20	2022-07-15
DiffusionRet	DiffusionRet: Generative Text-Video Retrieval wit…	45.80	2023-03-17
HBI	Video-Text as Game Players: Hierarchical Banzhaf …	42.20	2023-03-25
EMCL-Net	Expectation-Maximization Contrastive Learning for…	41.20	2022-11-21
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to …	40.50	2021-04-18
TACo	TACo: Token-aware Cascade Contrastive Learning fo…	30.40	2021-08-23
MMT-Pretrained	Multi-modal Transformer for Video Retrieval	28.70	2020-07-21
HD-VILA	Advancing High-Resolution Video-Language Represen…	28.50	2021-11-19
Ours	Video and Text Matching with Conditioned Embeddin…	25.40	2021-10-21
MMT	Multi-modal Transformer for Video Retrieval	22.70	2020-07-21
Collaborative Experts	Use What You Have: Video Retrieval Using Represen…	20.50	2019-07-31
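
The text-to-video R@1 column above reports recall at rank 1: the percentage of text queries whose correct video is ranked first among all candidate videos. A minimal sketch of how such a score is commonly computed from a text-video similarity matrix follows. The function name `recall_at_k`, the toy matrix, and the assumption that query i matches video i (one ground-truth video per query) are illustrative and not taken from any listed paper; individual papers may differ in how they build queries and candidate sets.

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Text-to-video Recall@K, in percent.

    sim[i, j] is the similarity between text query i and video j.
    Assumption (hypothetical layout): the correct video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the ground-truth video for each query, shaped for broadcasting.
    gt_scores = np.diag(sim)[:, None]
    # Zero-based rank = number of videos scoring strictly higher than the match.
    ranks = (sim > gt_scores).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))

# Toy 3x3 example: queries 0 and 2 rank their own video first, query 1 does not.
toy_sim = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.1, 0.7],
])
r1 = recall_at_k(toy_sim, k=1)
r2 = recall_at_k(toy_sim, k=2)
```

In this toy case two of the three queries retrieve the correct video at rank 1, so `r1` is roughly 66.7, while every query finds its match within the top two, so `r2` is 100.0.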
