LSMDC

Video Retrieval Benchmark

Performance Over Time

Showing 38 results | Metric: text-to-video R@1
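For readers unfamiliar with the metric, text-to-video R@1 (Recall@1) is commonly computed as the percentage of text queries whose ground-truth video is ranked first among all candidate videos. The snippet below is a minimal sketch of that computation, not the evaluation code of any listed paper. The function name `recall_at_k`, the layout of the similarity matrix (row i holds the scores of query i, with the true video at index i), and the toy numbers are all assumptions made for illustration.

```python
def recall_at_k(sim, k=1):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    Assumes sim[i][j] is the similarity of text query i to video j, and that
    the ground-truth video for query i sits at index i (a hypothetical layout).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true video = number of videos scored strictly higher.
        # Ties are therefore resolved in favour of the true video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy 3x3 similarity matrix (made-up numbers, not benchmark data).
sim = [
    [0.9, 0.1, 0.3],  # query 0: true video 0 ranks first
    [0.2, 0.4, 0.8],  # query 1: true video 1 ranks second
    [0.1, 0.2, 0.7],  # query 2: true video 2 ranks first
]
r1 = recall_at_k(sim, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(sim, k=2)  # all 3 queries hit
```

Individual papers may differ in test split, tie handling, or post-processing of the similarity scores, so scores in the tables are only comparable to the extent that the underlying protocols match.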

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	53.70	2022-11-21	jpthu17/emcl jpthu17/diffusionret jpthu17/HBI jpthu17/dicosa
2	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	46.40	2024-03-22	opengvlab/internvideo opengvlab/internvideo2
3	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	43.10	2024-03-20	mlvlab/vid-tldr
4	UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	43.00	2023-03-28	opengvlab/unmasked_teacher
5	HunYuan_tvr (huge)	Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	40.40	2022-04-07	-
6	COSA	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	39.40	2023-06-15	txh-mercury/cosa
7	mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	34.40	2023-02-01	modelscope/modelscope x-plug/mplug-owl alibaba/AliceMind X-PLUG/mPLUG-2
8	VALOR	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	34.20	2023-04-17	TXH-mercury/VALOR
9	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	34.00	2022-12-06	opengvlab/internvideo yingsen1/unimd
10	CLIP-ViP	CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	30.70	2022-09-14	microsoft/xpretrain

All Papers (38)

Paper	Year	Model	Code
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	jpthu17/emcl jpthu17/diffusionret
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	InternVideo2-6B	opengvlab/internvideo opengvlab/internvideo2
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	2024	vid-TLDR (UMT-L)	mlvlab/vid-tldr
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023	UMT-L (ViT-L/16)	opengvlab/unmasked_teacher
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022	HunYuan_tvr (huge)	-
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	2023	COSA	txh-mercury/cosa
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023	mPLUG-2	modelscope/modelscope x-plug/mplug-owl
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023	VALOR	TXH-mercury/VALOR
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022	InternVideo	opengvlab/internvideo yingsen1/unimd
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	2022	CLIP-ViP	microsoft/xpretrain
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022	HunYuan_tvr	-
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	2023	STAN	farewellthree/stan
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022	HiTeA	-
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	2022	MDMMT-2	-
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022	X-CLIP	xuguohai/X-CLIP MindCode-4/code-5 MindSpore-scientific/code-7
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022	EMCL-Net++	jpthu17/emcl jpthu17/diffusionret
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021	CAMoE	starmemda/camow starmemda/CAMoE
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval	2022	X-Pool	layer6ai-labs/xpool
Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022	Clover	leeyn-43/clover
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023	DiffusionRet	jpthu17/emcl jpthu17/diffusionret
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval	2022	CenterCLIP (ViT-B/16)	mzhaoshuai/CenterCLIP
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	2022	VIOLETv2	tsujuifu/pytorch_empirical-mvm
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022	EMCL-Net	jpthu17/emcl jpthu17/diffusionret
Cross Modal Retrieval with Querybank Normalisation	2021	QB-Norm+CLIP4Clip	ioanacroi/qb-norm
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021	CLIP4Clip	towhee-io/towhee ArrowLuo/CLIP4Clip
MDMMT: Multidomain Multimodal Transformer for Video Retrieval	2021	MDMMT	towhee-io/towhee papermsucode/mdmmt willard-yuan/video-text-retrieval-papers
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	2021	HD-VILA	microsoft/xpretrain
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021	FROZEN	towhee-io/towhee m-bain/webvid
Video and Text Matching with Conditioned Embeddings	2021	Ours	ameenali/videomatch
Multi-modal Transformer for Video Retrieval	2020	MMT-Pretrained	gabeur/mmt
Multi-modal Transformer for Video Retrieval	2020	MMT	gabeur/mmt
A Straightforward Framework For Video Retrieval Using CLIP	2021	CLIP	Deferf/CLIP_Video_Representation
Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019	Collaborative Experts	albanie/collaborative-experts caijincen712/CE hbzhang/cvpr2020
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data	2018	MoEE	jayleicn/TVRetrieval antoine77340/Mixture-of-Embedding-Experts
A Joint Sequence Fusion Model for Video Question Answering and Retrieval	2018	JSFusion	antoine77340/howto100m ruc-aimc-lab/nt2vr
Learning from Video and Text via Large-Scale Discriminative Clustering	2017	Large-Scale Discriminative Clustering	jpeyre/unrel antoine77340/iccv17learning
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019	Text-Video Embedding	antoine77340/MIL-NCE_HowTo100M antoine77340/milnce_howto100m
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering	2016	CT-SAN	-

Model	Paper	text-to-video R@1	Date
EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	Expectation-Maximization Contrastive Learning for…	53.70	2022-11-21
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	46.40	2024-03-22
vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-w…	43.10	2024-03-20
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Vide…	43.00	2023-03-28
HunYuan_tvr (huge)	Tencent Text-Video Retrieval: Hierarchical Cross-…	40.40	2022-04-07
COSA	COSA: Concatenated Sample Pretrained Vision-Langu…	39.40	2023-06-15
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Mod…	34.40	2023-02-01
VALOR	VALOR: Vision-Audio-Language Omni-Perception Pret…	34.20	2023-04-17
InternVideo	InternVideo: General Video Foundation Models via …	34.00	2022-12-06
CLIP-ViP	CLIP-ViP: Adapting Pre-trained Image-Text Model t…	30.70	2022-09-14
HunYuan_tvr	Tencent Text-Video Retrieval: Hierarchical Cross-…	29.70	2022-04-07
STAN	Revisiting Temporal Modeling for CLIP-based Image…	29.20	2023-01-26
HiTeA	HiTeA: Hierarchical Temporal-Aware Video-Language…	28.70	2022-12-30
MDMMT-2	MDMMT-2: Multidomain Multimodal Transformer for V…	26.90	2022-03-14
X-CLIP	X-CLIP: End-to-End Multi-grained Contrastive Lear…	26.10	2022-07-15
EMCL-Net++	Expectation-Maximization Contrastive Learning for…	25.90	2022-11-21
CAMoE	Improving Video-Text Retrieval by Multi-Stream Co…	25.90	2021-09-09
X-Pool	X-Pool: Cross-Modal Language-Video Attention for …	25.20	2022-03-28
Clover	Clover: Towards A Unified Video-Language Alignmen…	24.80	2022-07-16
DiffusionRet	DiffusionRet: Generative Text-Video Retrieval wit…	24.40	2023-03-17
CenterCLIP (ViT-B/16)	CenterCLIP: Token Clustering for Efficient Text-V…	24.20	2022-05-02
VIOLETv2	An Empirical Study of End-to-End Video-Language T…	24.00	2022-09-04
EMCL-Net	Expectation-Maximization Contrastive Learning for…	23.90	2022-11-21
QB-Norm+CLIP4Clip	Cross Modal Retrieval with Querybank Normalisation	22.40	2021-12-23
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to …	21.60	2021-04-18
MDMMT	MDMMT: Multidomain Multimodal Transformer for Vid…	18.80	2021-03-19
HD-VILA	Advancing High-Resolution Video-Language Represen…	17.40	2021-11-19
FROZEN	Frozen in Time: A Joint Video and Image Encoder f…	15.00	2021-04-01
Ours	Video and Text Matching with Conditioned Embeddin…	14.90	2021-10-21
MMT-Pretrained	Multi-modal Transformer for Video Retrieval	13.50	2020-07-21
MMT	Multi-modal Transformer for Video Retrieval	13.20	2020-07-21
CLIP	A Straightforward Framework For Video Retrieval U…	11.30	2021-02-24
Collaborative Experts	Use What You Have: Video Retrieval Using Represen…	11.20	2019-07-31
MoEE	Learning a Text-Video Embedding from Incomplete a…	10.10	2018-04-07
JSFusion	A Joint Sequence Fusion Model for Video Question …	9.10	2018-08-07
Large-Scale Discriminative Clustering	Learning from Video and Text via Large-Scale Disc…	7.30	2017-07-27
Text-Video Embedding	HowTo100M: Learning a Text-Video Embedding by Wat…	7.20	2019-06-07
CT-SAN	End-to-end Concept Word Detection for Video Capti…	5.10	2016-10-10
