LSMDC

Zero-Shot Video Retrieval Benchmark

Performance Over Time

Showing 16 results. Metric: text-to-video R@1

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	33.80	2024-03-22	opengvlab/internvideo opengvlab/internvideo2
2	InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	32.00	2024-03-22	opengvlab/internvideo opengvlab/internvideo2
3	VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	27.70	2023-10-07	ninatu/howtocaption
4	UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	25.20	2023-03-28	opengvlab/unmasked_teacher
5	mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	24.10	2023-02-01	modelscope/modelscope x-plug/mplug-owl alibaba/AliceMind X-PLUG/mPLUG-2
6	BT-Adapter	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	19.50	2023-09-27	farewellthree/BT-Adapter
7	HiTeA-17M	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	18.30	2022-12-30	-
8	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	17.60	2022-12-06	opengvlab/internvideo yingsen1/unimd
9	HowToCaption	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	17.30	2023-10-07	ninatu/howtocaption
10	Yatai Ji et al.	Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning	17.20	2022-11-24	iigroup/scl

All Papers (16)

Model	Paper	Year	Code
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	opengvlab/internvideo opengvlab/internvideo2
InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	opengvlab/internvideo opengvlab/internvideo2
VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023	ninatu/howtocaption
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023	opengvlab/unmasked_teacher
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023	modelscope/modelscope x-plug/mplug-owl
BT-Adapter	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023	farewellthree/BT-Adapter
HiTeA-17M	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022	-
InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022	opengvlab/internvideo yingsen1/unimd
HowToCaption	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	2023	ninatu/howtocaption
Yatai Ji et al.	Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning	2022	iigroup/scl
HiTeA-5M	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022	-
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021	towhee-io/towhee ArrowLuo/CLIP4Clip
Clover	Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022	leeyn-43/clover
Y. Ge et al.	Bridging Video-text Retrieval with Multiple Choice Questions	2022	towhee-io/towhee tencentarc/mcq
MILES	MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval	2022	tencentarc/mcq
SSML	Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning	2020	elad-amrani/ssml

Model	Paper	text-to-video R@1	Date
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	33.80	2024-03-22
InternVideo2-1B	InternVideo2: Scaling Foundation Models for Multi…	32.00	2024-03-22
VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video A…	27.70	2023-10-07
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Vide…	25.20	2023-03-28
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Mod…	24.10	2023-02-01
BT-Adapter	BT-Adapter: Video Conversation is Feasible Withou…	19.50	2023-09-27
HiTeA-17M	HiTeA: Hierarchical Temporal-Aware Video-Language…	18.30	2022-12-30
InternVideo	InternVideo: General Video Foundation Models via …	17.60	2022-12-06
HowToCaption	HowToCaption: Prompting LLMs to Transform Video A…	17.30	2023-10-07
Yatai Ji et al.	Seeing What You Miss: Vision-Language Pre-trainin…	17.20	2022-11-24
HiTeA-5M	HiTeA: Hierarchical Temporal-Aware Video-Language…	15.50	2022-12-30
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to …	15.10	2021-04-18
Clover	Clover: Towards A Unified Video-Language Alignmen…	14.70	2022-07-16
Y. Ge et al.	Bridging Video-text Retrieval with Multiple Choic…	12.20	2022-01-13
MILES	MILES: Visual BERT Pre-training with Injected Lan…	11.10	2022-04-26
SSML	Noise Estimation Using Density Estimation for Sel…	4.20	2020-03-06
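
For readers unfamiliar with the metric in the tables above, text-to-video R@1 is usually read as the percentage of text queries whose ground-truth clip is ranked first among all candidate clips. The sketch below shows one common way to compute Recall@K from a query-by-video similarity matrix. It is an illustration only: the function name recall_at_k, the assumption that query i matches video i, and the tie-handling rule are not taken from this page, and the listed papers may differ in their exact evaluation protocols.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Text-to-video Recall@K, returned as a percentage.

    similarity[i][j] is the score between text query i and video j.
    Assumption (hypothetical setup): the correct video for query i is at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: how many videos score strictly higher than the
        # correct one. Ties are resolved in favor of the correct video.
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with 3 captions and 3 clips (scores are made up).
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct clip ranked first
    [0.2, 0.4, 0.8],  # query 1: correct clip ranked second
    [0.1, 0.2, 0.7],  # query 2: correct clip ranked first
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
print(f"R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

In the toy matrix, two of the three queries rank their own clip first, so R@1 is about 66.67, while every correct clip falls within the top two, so R@2 is 100.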