
MSR-VTT

Video Retrieval Benchmark

Performance Over Time

Showing 38 results | Metric: text-to-video R@1
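The metric is not defined on this page. As a rough sketch, assuming the usual definition of Recall@K (the percentage of text queries whose correct video is ranked in the top K) and assuming exactly one ground-truth video per query, it could be computed from a text-by-video similarity matrix as follows. The function name recall_at_k and the toy similarity values are illustrative only and are not taken from any paper listed here.

```python
def recall_at_k(similarity, k=1):
    """Text-to-video Recall@K, returned as a percentage.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score
        # strictly higher. Ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries against 3 videos.
# Queries 0 and 2 rank their own video first; query 1 ranks it second.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

Individual papers may differ in evaluation details (test split, tie handling, post-processing of the similarity matrix), so scores in the tables are not necessarily computed in exactly this way.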

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	GRAM	Gramian Multimodal Representation Learning and Alignment	64.00	2024-12-16	ispamm/GRAM, luigisigillo/gwit
2	VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	63.90	2023-05-29	TXH-mercury/VALOR, txh-mercury/vast
3	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	62.80	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
4	VALOR	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	59.90	2023-04-17	TXH-mercury/VALOR
5	UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	58.80	2023-03-28	opengvlab/unmasked_teacher
6	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	58.10	2024-03-20	mlvlab/vid-tldr
7	COSA	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	57.90	2023-06-15	txh-mercury/cosa
8	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	55.20	2022-12-06	opengvlab/internvideo, yingsen1/unimd
9	VLAB	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	55.10	2023-05-22	-
10	TEFAL	Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment	52.00	2023-07-24	-

All Papers (38)

Rank	Model	Paper	text-to-video R@1	Date	Code
1	GRAM	Gramian Multimodal Representation Learning and Alignment	64.00	2024-12-16	ispamm/GRAM, luigisigillo/gwit
2	VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	63.90	2023-05-29	TXH-mercury/VALOR, txh-mercury/vast
3	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	62.80	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
4	VALOR	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	59.90	2023-04-17	TXH-mercury/VALOR
5	UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	58.80	2023-03-28	opengvlab/unmasked_teacher
6	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	58.10	2024-03-20	mlvlab/vid-tldr
7	COSA	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	57.90	2023-06-15	txh-mercury/cosa
8	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	55.20	2022-12-06	opengvlab/internvideo, yingsen1/unimd
9	VLAB	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	55.10	2023-05-22	-
10	TEFAL	Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment	52.00	2023-07-24	-
11	UCoFiA	Unified Coarse-to-Fine Alignment for Video-Text Retrieval	49.40	2023-09-18	ziyang412/ucofia
12	OmniVL	OmniVL: One Foundation Model for Image-Language and Video-Language Tasks	47.80	2022-09-15	-
13	CLIP4Clip-seqTransf	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	44.50	2021-04-18	towhee-io/towhee, ArrowLuo/CLIP4Clip
14	All-in-one + MELTR	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	38.60	2023-03-23	mlvlab/MELTR
15	VIOLETv2	An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	37.20	2022-09-04	tsujuifu/pytorch_empirical-mvm
16	HD-VILA	Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	35.60	2021-11-19	microsoft/xpretrain
17	VideoCoCa (zero-shot)	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	34.30	2022-12-09	-
18	MDMMT-2	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	33.70	2022-03-14	-
19	VIOLET + MELTR	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	33.60	2023-03-23	mlvlab/MELTR
20	CLIP2TV	CLIP2TV: Align, Match and Distill for Video-Text Retrieval	33.10	2021-11-10	-
21	CAMoE	Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	32.90	2021-09-09	starmemda/camow, starmemda/CAMoE
22	FROZEN	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	32.50	2021-04-01	towhee-io/towhee, m-bain/webvid
23	COTS	COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval	32.10	2022-04-15	-
24	CoCa (zero-shot)	CoCa: Contrastive Captioners are Image-Text Foundation Models	30.00	2022-05-04	mlfoundations/open_clip, facebookresearch/multimodal
25	CLIP2Video	CLIP2Video: Mastering Video-Text Retrieval via Image CLIP	29.80	2021-06-21	CryhanFang/CLIP2Video
26	LAFF	Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval	29.10	2021-12-03	ruc-aimc-lab/laff
27	UniVL + MELTR	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	28.50	2023-03-23	mlvlab/MELTR
28	Ours	Video and Text Matching with Conditioned Embeddings	26.00	2021-10-21	ameenali/videomatch
29	TACo	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	24.80	2021-08-23	-
30	MDMMT	MDMMT: Multidomain Multimodal Transformer for Video Retrieval	23.10	2021-03-19	towhee-io/towhee, papermsucode/mdmmt, willard-yuan/video-text-retrieval-papers
31	CLIP	A Straightforward Framework For Video Retrieval Using CLIP	21.40	2021-02-24	Deferf/CLIP_Video_Representation
32	UniVL	UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation	21.20	2020-02-15	microsoft/UniVL, wqliu657/UniVL
33	Text-Video Embedding	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	14.90	2019-06-07	antoine77340/MIL-NCE_HowTo100M, antoine77340/milnce_howto100m
34	RoME	RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval	10.70	2022-06-26	buraksatar/RoME_video_retrieval
35	JSFusion	A Joint Sequence Fusion Model for Video Question Answering and Retrieval	10.20	2018-08-07	antoine77340/howto100m, ruc-aimc-lab/nt2vr
36	Collaborative Experts	Use What You Have: Video Retrieval Using Representations From Collaborative Experts	10.00	2019-07-31	albanie/collaborative-experts, caijincen712/CE, hbzhang/cvpr2020
37	Kaufman	Temporal Tessellation: A Unified Approach for Video Analysis	4.70	2016-12-21	dot27/temporal-tessellation
38	C+LSTM+SA+FC7	Learning Language-Visual Embedding for Movie Understanding with Natural-Language	4.20	2016-09-26	-
