MSVD

Video Retrieval Benchmark

Performance Over Time

Showing 24 results | Metric: text-to-video R@1
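For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video R@1 (Recall@1) is commonly computed. It assumes a precomputed text-by-video similarity matrix in which the correct video for query i sits at column i; that layout, the function name `recall_at_k`, and the toy scores are illustrative assumptions, not taken from any paper listed here. Individual papers may differ in how they handle multiple captions per video or tied scores.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumes similarity[i][j] scores text query i against video j, and that
    the correct video for query i is at index i (a hypothetical layout).
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = number of videos scored strictly higher than the correct one
        # (ties are resolved in favour of the correct video).
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for 3 queries and 3 videos (made-up numbers, not benchmark data).
toy = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.1],  # query 1: correct video ranked second
    [0.2, 0.1, 0.7],  # query 2: correct video ranked first
]
print(recall_at_k(toy, k=1))  # two of three queries hit at rank 1
print(recall_at_k(toy, k=2))  # all three queries hit within the top 2
```

The result is scaled to a percentage so that it is on the same scale the leaderboard values appear to use.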

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	61.40	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
2	HunYuan_tvr (huge)	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	59.00	2022-04-07	-
3	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	58.40	2022-12-06	opengvlab/internvideo, yingsen1/unimd
4	HunYuan_tvr	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	58.20	2022-04-07	-
5	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	57.90	2024-03-20	mlvlab/vid-tldr
6	VLAB	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	57.50	2023-05-22	-
7	MDMMT-2	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	56.80	2022-03-14	-
8	Side4Video	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	56.10	2023-11-27	whwu95/ATM, HJYao00/Side4Video
9	CAMoE	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	51.80	2021-09-09	starmemda/camow, starmemda/CAMoE
10	Cap4Video	Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	51.80	2022-12-31	whwu95/Cap4Video, whwu95/text4vis, whwu95/GPT4Vis, whwu95/BIKE

All Papers (24)

Paper	Year	Model	Code
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	InternVideo2-6B	opengvlab/internvideo, opengvlab/internvideo2
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022	HunYuan_tvr (huge)	-
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022	InternVideo	opengvlab/internvideo, yingsen1/unimd
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022	HunYuan_tvr	-
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	2024	vid-TLDR (UMT-L)	mlvlab/vid-tldr
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	2023	VLAB	-
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	2022	MDMMT-2	-
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning	2023	Side4Video	whwu95/ATM, HJYao00/Side4Video
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021	CAMoE	starmemda/camow, starmemda/CAMoE
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022	Cap4Video	whwu95/Cap4Video, whwu95/text4vis
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	2022	CenterCLIP (ViT-B/16)	mzhaoshuai/CenterCLIP
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022	X-CLIP	xuguohai/X-CLIP, MindCode-4/code-5, MindSpore-scientific/code-7
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	2023	DMAE (ViT-B/32)	alipay/Ant-Multi-Modal-Framework
Cross Modal Retrieval with Querybank Normalisation	2021	QB-Norm+CLIP2Video	ioanacroi/qb-norm
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023	DiffusionRet+QB-Norm	jpthu17/emcl, jpthu17/diffusionret
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	2023	PAU	leolee99/pau
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	2022	X-Pool	layer6ai-labs/xpool
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023	DiffusionRet	jpthu17/emcl, jpthu17/diffusionret
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021	CLIP4Clip	towhee-io/towhee, ArrowLuo/CLIP4Clip
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	2021	LAFF	ruc-aimc-lab/laff
A Straightforward Framework For Video Retrieval Using CLIP	2021	CLIP	Deferf/CLIP_Video_Representation
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021	FROZEN	towhee-io/towhee, m-bain/webvid
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	2020	SSML	elad-amrani/ssml
Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019	Collaborative Experts	albanie/collaborative-experts, caijincen712/CE, hbzhang/cvpr2020

Model	Paper	text-to-video R@1	Date
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	61.40	2024-03-22
HunYuan_tvr (huge)	Tencent Text-Video Retrieval: Hierarchical Cross-…	59.00	2022-04-07
InternVideo	InternVideo: General Video Foundation Models via …	58.40	2022-12-06
HunYuan_tvr	Tencent Text-Video Retrieval: Hierarchical Cross-…	58.20	2022-04-07
vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-w…	57.90	2024-03-20
VLAB	VLAB: Enhancing Video Language Pre-training by Fe…	57.50	2023-05-22
MDMMT-2	MDMMT-2: Multidomain Multimodal Transformer for V…	56.80	2022-03-14
Side4Video	Side4Video: Spatial-Temporal Side Network for Mem…	56.10	2023-11-27
CAMoE	Improving Video-Text Retrieval by Multi-Stream Co…	51.80	2021-09-09
Cap4Video	Cap4Video: What Can Auxiliary Captions Do for Tex…	51.80	2022-12-31
CenterCLIP (ViT-B/16)	CenterCLIP: Token Clustering for Efficient Text-V…	50.60	2022-05-02
X-CLIP	X-CLIP: End-to-End Multi-grained Contrastive Lear…	50.40	2022-07-15
DMAE (ViT-B/32)	Dual-Modal Attention-Enhanced Text-Video Retrieva…	48.70	2023-09-20
QB-Norm+CLIP2Video	Cross Modal Retrieval with Querybank Normalisation	48.00	2021-12-23
DiffusionRet+QB-Norm	DiffusionRet: Generative Text-Video Retrieval wit…	47.90	2023-03-17
PAU	Prototype-based Aleatoric Uncertainty Quantificat…	47.30	2023-09-29
X-Pool	X-Pool: Cross-Modal Language-Video Attention for …	47.20	2022-03-28
DiffusionRet	DiffusionRet: Generative Text-Video Retrieval wit…	46.60	2023-03-17
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to …	46.20	2021-04-18
LAFF	Lightweight Attentional Feature Fusion: A New Bas…	45.40	2021-12-03
CLIP	A Straightforward Framework For Video Retrieval U…	37.00	2021-02-24
FROZEN	Frozen in Time: A Joint Video and Image Encoder f…	33.70	2021-04-01
SSML	Noise Estimation Using Density Estimation for Sel…	20.30	2020-03-06
Collaborative Experts	Use What You Have: Video Retrieval Using Represen…	19.80	2019-07-31
