MSR-VTT

Zero-Shot Video Retrieval Benchmark

41 results. Metric: text-to-video R@1

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	55.90	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
2	GRAM	Gramian Multimodal Representation Learning and Alignment	54.80	2024-12-16	ispamm/GRAM, luigisigillo/gwit
3	InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	51.90	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
4	VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	50.00	2023-10-07	ninatu/howtocaption
5	FluxViT-B	Make Your Training Flexible: Towards Deployment-Efficient Video Models	49.90	2025-03-18	opengvlab/fluxvit
6	VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	49.30	2023-05-29	TXH-mercury/VALOR, txh-mercury/vast
7	mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	47.10	2023-02-01	modelscope/modelscope, x-plug/mplug-owl, alibaba/AliceMind, X-PLUG/mPLUG-2
8	FluxViT-S	Make Your Training Flexible: Towards Deployment-Efficient Video Models	45.00	2025-03-18	opengvlab/fluxvit
9	LanguageBind(ViT-H/14)	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	44.80	2023-10-03	PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind
10	LanguageBind(ViT-L/14)	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	42.80	2023-10-03	PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind

All Papers (41)

Model	Paper	Year	Code
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	opengvlab/internvideo, opengvlab/internvideo2
GRAM	Gramian Multimodal Representation Learning and Alignment	2024	ispamm/GRAM, luigisigillo/gwit
InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	opengvlab/internvideo, opengvlab/internvideo2
VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023	ninatu/howtocaption
FluxViT-B	Make Your Training Flexible: Towards Deployment-Efficient Video Models	2025	opengvlab/fluxvit
VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023	TXH-mercury/VALOR, txh-mercury/vast
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023	modelscope/modelscope, x-plug/mplug-owl
FluxViT-S	Make Your Training Flexible: Towards Deployment-Efficient Video Models	2025	opengvlab/fluxvit
LanguageBind(ViT-H/14)	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023	PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA
LanguageBind(ViT-L/14)	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023	PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023	opengvlab/unmasked_teacher
vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	2024	mlvlab/vid-tldr
BT-Adapter	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023	farewellthree/BT-Adapter
InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022	opengvlab/internvideo, yingsen1/unimd
Florence	Florence: A New Foundation Model for Computer Vision	2021	microsoft/unicl, MindCode-4/code-3
HowToCaption	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023	ninatu/howtocaption
ImageBind	ImageBind: One Embedding Space To Bind Them All	2023	facebookresearch/imagebind, klemens-floege/oneprot, ginihumer/amumo
OmniVL	OmniVL:One Foundation Model for Image-Language and Video-Language Tasks	2022	-
HiTeA-17M	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022	-
Singularity-17M	Revealing Single Frame Bias for Video-and-Language Learning	2022	jayleicn/ClipBERT, jayleicn/singularity
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021	towhee-io/towhee, ArrowLuo/CLIP4Clip
Yatai Ji et al.	Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning	2022	iigroup/scl
HiTeA-5M	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022	-
VATT-MBS	VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text	2021	google-research/google-research, akashe/ProgrammingInterview
Singularity-5M	Revealing Single Frame Bias for Video-and-Language Learning	2022	jayleicn/ClipBERT, jayleicn/singularity
Clover	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022	leeyn-43/clover
MILES	MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval	2022	tencentarc/mcq
Y. Ge et al.	Bridging Video-text Retrieval with Multiple Choice Questions	2022	towhee-io/towhee, tencentarc/mcq
VIOLET	VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling	2021	tsujuifu/pytorch_violet
FROZEN	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021	towhee-io/towhee, m-bain/webvid
ALPRO	Align and Prompt: Video-and-Language Pre-training with Entity Prompts	2021	salesforce/alpro
OA-Trans	Object-aware Video-language Pre-training for Retrieval	2021	FingerRec/OA-Transformer
LaT	LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval	2022	-
A. Nagrani et al.	Learning Audio-Video Modalities from Image Captions	2022	-
HD-VILA	Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	2021	microsoft/xpretrain
MMT	Multi-modal Transformer for Video Retrieval	2020	gabeur/mmt
Norton	Multi-granularity Correspondence Learning from Long-term Noisy Videos	2024	XLearning-SCU/2024-ICLR-Norton
VideoCLIP	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021	facebookresearch/fairseq, pytorch/fairseq
MIL-NCE	End-to-End Learning of Visual Representations from Uncurated Instructional Videos	2019	antoine77340/MIL-NCE_HowTo100M, antoine77340/milnce_howto100m
TACo	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021	-
SSML	Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	2020	elad-amrani/ssml

Model	Paper	text-to-video R@1	Date
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	55.90	2024-03-22
GRAM	Gramian Multimodal Representation Learning and Al…	54.80	2024-12-16
InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multi…	51.90	2024-03-22
VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video A…	50.00	2023-10-07
FluxViT-B	Make Your Training Flexible: Towards Deployment-E…	49.90	2025-03-18
VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality …	49.30	2023-05-29
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Mod…	47.10	2023-02-01
FluxViT-S	Make Your Training Flexible: Towards Deployment-E…	45.00	2025-03-18
LanguageBind(ViT-H/14)	LanguageBind: Extending Video-Language Pretrainin…	44.80	2023-10-03
LanguageBind(ViT-L/14)	LanguageBind: Extending Video-Language Pretrainin…	42.80	2023-10-03
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Vide…	42.60	2023-03-28
vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-w…	42.10	2024-03-20
BT-Adapter	BT-Adapter: Video Conversation is Feasible Withou…	40.90	2023-09-27
InternVideo	InternVideo: General Video Foundation Models via …	40.70	2022-12-06
Florence	Florence: A New Foundation Model for Computer Vis…	37.60	2021-11-22
HowToCaption	HowToCaption: Prompting LLMs to Transform Video A…	37.60	2023-10-07
ImageBind	ImageBind: One Embedding Space To Bind Them All	36.80	2023-05-09
OmniVL	OmniVL:One Foundation Model for Image-Language an…	34.60	2022-09-15
HiTeA-17M	HiTeA: Hierarchical Temporal-Aware Video-Language…	34.40	2022-12-30
Singularity-17M	Revealing Single Frame Bias for Video-and-Languag…	34.00	2022-06-07
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to …	32.00	2021-04-18
Yatai Ji et al.	Seeing What You Miss: Vision-Language Pre-trainin…	30.90	2022-11-24
HiTeA-5M	HiTeA: Hierarchical Temporal-Aware Video-Language…	29.90	2022-12-30
VATT-MBS	VATT: Transformers for Multimodal Self-Supervised…	29.70	2021-04-22
Singularity-5M	Revealing Single Frame Bias for Video-and-Languag…	28.40	2022-06-07
Clover	Clover: Towards A Unified Video-Language Alignmen…	26.40	2022-07-16
MILES	MILES: Visual BERT Pre-training with Injected Lan…	26.10	2022-04-26
Y. Ge et al.	Bridging Video-text Retrieval with Multiple Choic…	26.00	2022-01-13
VIOLET	VIOLET : End-to-End Video-Language Transformers w…	25.90	2021-11-24
FROZEN	Frozen in Time: A Joint Video and Image Encoder f…	24.70	2021-04-01
ALPRO	Align and Prompt: Video-and-Language Pre-training…	24.10	2021-12-17
OA-Trans	Object-aware Video-language Pre-training for Retr…	23.40	2021-12-01
LaT	LaT: Latent Translation with Cycle-Consistency fo…	23.40	2022-07-11
A. Nagrani et al.	Learning Audio-Video Modalities from Image Captio…	19.40	2022-04-01
HD-VILA	Advancing High-Resolution Video-Language Represen…	14.60	2021-11-19
MMT	Multi-modal Transformer for Video Retrieval	14.40	2020-07-21
Norton	Multi-granularity Correspondence Learning from Lo…	10.70	2024-01-30
VideoCLIP	VideoCLIP: Contrastive Pre-training for Zero-shot…	10.40	2021-09-28
MIL-NCE	End-to-End Learning of Visual Representations fro…	9.90	2019-12-13
TACo	TACo: Token-aware Cascade Contrastive Learning fo…	9.80	2021-08-23
SSML	Noise Estimation Using Density Estimation for Sel…	8.00	2020-03-06
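
For readers unfamiliar with the metric column, text-to-video R@1 is commonly computed as the percentage of text queries whose correct video is ranked first by similarity. The following is a minimal sketch of that computation, not the evaluation code of any listed paper. The function name `recall_at_k`, the assumption that query i is paired with video i, the optimistic handling of ties, and the toy scores are all illustrative; the page does not state which test split or protocol each entry used.

```python
def recall_at_k(similarity, k=1):
    """Text-to-video Recall@K, in percent.

    similarity[i][j] is the score between text query i and video j.
    Assumption (for illustration): the correct video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank of the correct video: how many videos score
        # strictly higher. Ties are resolved in favor of the correct video.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not from any model).
toy = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.1, 0.5, 0.8],  # query 1: correct video ranked second
    [0.1, 0.4, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(toy, k=1)
r2 = recall_at_k(toy, k=2)
print(f"R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

In the zero-shot setting the similarity scores come from a model that has not been fine-tuned on MSR-VTT; only the ranking step above is shown here.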
