
DiDeMo

Zero-Shot Video Retrieval Benchmark

Performance Over Time

[Chart: text-to-video R@1 on DiDeMo over time, 26 results]

Top Performing Models

| Rank | Model | Paper | text-to-video R@1 | Date | Code |
|------|-------|-------|-------------------|------|------|
| 1 | InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 57.90 | 2024-03-22 | opengvlab/internvideo, opengvlab/internvideo2 |
| 2 | InternVideo2-1B | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 57.00 | 2024-03-22 | opengvlab/internvideo, opengvlab/internvideo2 |
| 3 | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 55.50 | 2023-05-29 | TXH-mercury/VALOR, txh-mercury/vast |
| 4 | GRAM | Gramian Multimodal Representation Learning and Alignment | 54.20 | 2024-12-16 | ispamm/GRAM, luigisigillo/gwit |
| 5 | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | 52.00 | 2024-03-20 | mlvlab/vid-tldr |
| 6 | UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 48.60 | 2023-03-28 | opengvlab/unmasked_teacher |
| 7 | mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 45.70 | 2023-02-01 | modelscope/modelscope, x-plug/mplug-owl, alibaba/AliceMind, X-PLUG/mPLUG-2 |
| 8 | HiTeA-17M | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 43.20 | 2022-12-30 | - |
| 9 | LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | 39.90 | 2023-10-03 | PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind |
| 10 | LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | 39.70 | 2023-10-03 | PKU-YuanGroup/Video-LLaVA, PKU-YuanGroup/MoE-LLaVA, pku-yuangroup/languagebind |
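
The leaderboard metric, text-to-video R@1, is the percentage of text queries for which the ground-truth video is ranked first among all candidates by similarity. The sketch below is a minimal illustration of how this is typically computed, assuming one paired video per query (as in DiDeMo's standard paragraph-to-video protocol) and precomputed embeddings; `text_emb` and `video_emb` are hypothetical placeholders, not taken from any of the listed codebases.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a similarity matrix.

    sim[i, j] is the similarity between text query i and video j;
    the ground-truth video for query i is assumed to sit at index i
    (one paired video per query -- an assumption, not universal).
    """
    gt_scores = np.diag(sim)
    # Rank of the ground-truth video for each query: the number of
    # videos scored strictly higher than the paired one (0 = top rank).
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}

# Toy usage with random embeddings standing in for model outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 512))   # hypothetical text embeddings
video_emb = rng.normal(size=(100, 512))  # hypothetical video embeddings
# Cosine similarity via L2-normalized dot products.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)
print(recall_at_k(text_emb @ video_emb.T))
```

Note that this sketch counts ties at the top rank as hits; published numbers can differ slightly depending on tie-breaking and on how multiple captions per video are pooled.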

All Papers (26)