
MSR-VTT

Zero-Shot Video Retrieval Benchmark

Performance Over Time

41 results reported on this benchmark. Metric: text-to-video R@1.
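For reference, text-to-video R@1 is the percentage of text queries whose ground-truth video is ranked first among all candidate videos by the model's text-video similarity scores. Below is a minimal sketch of the computation, assuming a one-to-one text-video correspondence (ground-truth pairs on the diagonal of a precomputed similarity matrix); the function name `recall_at_k` and the toy scores are illustrative, not from any listed implementation:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int = 1) -> float:
    """Text-to-video Recall@K from a [num_texts, num_videos] score matrix.

    Assumes similarity[i, i] scores text i against its paired video,
    i.e. ground-truth matches lie on the diagonal.
    """
    # Rank candidate videos for each text query by descending similarity.
    ranks = np.argsort(-similarity, axis=1)
    # A query is a hit if its ground-truth video appears in the top K.
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean()) * 100.0

# Toy example: 3 text queries vs. 3 videos (diagonal = ground truth).
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],   # query 1 ranks the wrong video first
                [0.1, 0.2, 0.7]])
print(recall_at_k(sim, k=1))  # -> 66.67 (2 of 3 queries correct at rank 1)
```

"Zero-shot" here means the models are evaluated on MSR-VTT without fine-tuning on its training split, so the scores reflect transfer from pretraining alone.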

Top Performing Models

| Rank | Model | Paper | text-to-video R@1 (%) | Date | Code |
|------|-------|-------|-----------------------|------|------|
| 1 | InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 55.90 | 2024-03-22 | opengvlab/internvideo, opengvlab/internvideo2 |
| 2 | GRAM | Gramian Multimodal Representation Learning and Alignment | 54.80 | 2024-12-16 | ispamm/GRAM, luigisigillo/gwit |
| 3 | InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 51.90 | 2024-03-22 | opengvlab/internvideo, opengvlab/internvideo2 |
| 4 | VAST, HowToCaption-finetuned | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | 50.00 | 2023-10-07 | ninatu/howtocaption |
| 5 | FluxViT-B | Make Your Training Flexible: Towards Deployment-Efficient Video Models | 49.90 | 2025-03-18 | opengvlab/fluxvit |
| 6 | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 49.30 | 2023-05-29 | TXH-mercury/VALOR, txh-mercury/vast |
| 7 | mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 47.10 | 2023-02-01 | modelscope/modelscope, x-plug/mplug-owl, alibaba/AliceMind, X-PLUG/mPLUG-2 |
| 8 | FluxViT-S | Make Your Training Flexible: Towards Deployment-Efficient Video Models | 45.00 | 2025-03-18 | opengvlab/fluxvit |
| 9 | LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | 44.80 | 2023-10-03 | PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind |
| 10 | LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | 42.80 | 2023-10-03 | PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind |

All Papers (41)