DiDeMo

Video Retrieval Benchmark

Performance Over Time

Showing 39 results | Metric: text-to-video R@1
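For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-video Recall@K is commonly computed from a text-video similarity matrix. It is illustrative only: the function name `recall_at_k`, the toy matrix, and the assumption that query i matches video i are hypothetical, and the papers listed here may differ in evaluation details such as tie handling or how queries are formed.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Text-to-video Recall@K, in percent.

    sim[i, j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is at index i.
    """
    # Similarity of each query to its own ground-truth video.
    gt_scores = np.diag(sim)
    # Zero-based rank: how many videos score strictly higher than the
    # ground truth (ties are resolved in favour of the ground truth).
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    # A query counts as a hit if its ground truth is within the top k.
    return float((ranks < k).mean() * 100.0)


# Toy example: 3 text queries against 3 videos.
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct video ranked first
    [0.3, 0.5, 0.8],   # query 1: correct video ranked second
    [0.1, 0.4, 0.7],   # query 2: correct video ranked first
])
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

With this toy matrix, two of the three queries rank their ground-truth video first, so R@1 is two thirds (in percent), and all three are within the top two.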

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	74.20	2024-03-22	opengvlab/internvideo, opengvlab/internvideo2
2	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-weight Video Transformer	72.30	2024-03-20	mlvlab/vid-tldr
3	VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	72.00	2023-05-29	TXH-mercury/VALOR, txh-mercury/vast
4	COSA	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	70.50	2023-06-15	txh-mercury/cosa
5	UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	70.40	2023-03-28	opengvlab/unmasked_teacher
6	GRAM	Gramian Multimodal Representation Learning and Alignment	67.30	2024-12-16	ispamm/GRAM, luigisigillo/gwit
7	VALOR	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	61.50	2023-04-17	TXH-mercury/VALOR
8	TESTA (ViT-B/16)	TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding	61.20	2023-10-29	renshuhuai-andy/testa
9	VindLU	VindLU: A Recipe for Effective Video-and-Language Pretraining	61.20	2022-12-09	klauscc/vindlu
10	InternVideo	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	57.90	2022-12-06	opengvlab/internvideo, yingsen1/unimd

All Papers (39)

Paper	Year	Model	Code
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024	InternVideo2-6B	opengvlab/internvideo, opengvlab/internvideo2
vid-TLDR: Training Free Token merging for Light-weight Video Transformer	2024	vid-TLDR (UMT-L)	mlvlab/vid-tldr
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023	VAST	TXH-mercury/VALOR, txh-mercury/vast
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	2023	COSA	txh-mercury/cosa
Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023	UMT-L (ViT-L/16)	opengvlab/unmasked_teacher
Gramian Multimodal Representation Learning and Alignment	2024	GRAM	ispamm/GRAM, luigisigillo/gwit
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023	VALOR	TXH-mercury/VALOR
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding	2023	TESTA (ViT-B/16)	renshuhuai-andy/testa
VindLU: A Recipe for Effective Video-and-Language Pretraining	2022	VindLU	klauscc/vindlu
InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022	InternVideo	opengvlab/internvideo, yingsen1/unimd
RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023	RTQ	SCZwangxiao/RTQ-MM2023, sczwangxiao/tsgvs-mm2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	2023	VLAB	-
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022	HiTeA	-
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling	2023	MuLTI	-
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023	mPLUG-2	modelscope/modelscope, x-plug/mplug-owl
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment	2022	CLIP-ViP	microsoft/xpretrain
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring	2023	STAN	farewellthree/stan
Revealing Single Frame Bias for Video-and-Language Learning	2022	Singularity	jayleicn/ClipBERT, jayleicn/singularity
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning	2023	DMAE (ViT-B/32)	alipay/Ant-Multi-Modal-Framework
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022	HunYuan_tvr (huge)	-
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks	2022	OmniVL	-
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations	2022	HunYuan_tvr	-
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?	2022	Cap4Video	whwu95/Cap4Video, whwu95/text4vis
Clover: Towards A Unified Video-Language Alignment and Fusion Model	2022	Clover	leeyn-43/clover
Disentangled Representation Learning for Text-Video Retrieval	2022	DRL	towhee-io/towhee, foolwood/DRL
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023	DiffusionRet+QB-Norm	jpthu17/emcl, jpthu17/diffusionret
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	2023	PAU	leolee99/pau
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	2022	VIOLETv2	tsujuifu/pytorch_empirical-mvm
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval	2022	X-CLIP	xuguohai/X-CLIP, MindCode-4/code-5, MindSpore-scientific/code-7
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	2023	HBI	jpthu17/emcl, jpthu17/diffusionret
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model	2023	DiffusionRet	jpthu17/emcl, jpthu17/diffusionret
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	2021	CAMoE	starmemda/camow, starmemda/CAMoE
Cross Modal Retrieval with Querybank Normalisation	2021	QB-Norm+CLIP4Clip	ioanacroi/qb-norm
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval	2021	CLIP4Clip	towhee-io/towhee, ArrowLuo/CLIP4Clip
Align and Prompt: Video-and-Language Pre-training with Entity Prompts	2021	ALPRO	salesforce/alpro
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	2021	FROZEN	towhee-io/towhee, m-bain/webvid
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	2021	HD-VILA	microsoft/xpretrain
Rudder: A Cross Lingual Video and Text Retrieval Dataset	2021	PO Loss	nshubham655/RUDDER
Use What You Have: Video Retrieval Using Representations From Collaborative Experts	2019	Collaborative Experts	albanie/collaborative-experts, caijincen712/CE, hbzhang/cvpr2020

Model	Paper	text-to-video R@1	Date
InternVideo2-6B	InternVideo2: Scaling Foundation Models for Multi…	74.20	2024-03-22
vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging for Light-w…	72.30	2024-03-20
VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality …	72.00	2023-05-29
COSA	COSA: Concatenated Sample Pretrained Vision-Langu…	70.50	2023-06-15
UMT-L (ViT-L/16)	Unmasked Teacher: Towards Training-Efficient Vide…	70.40	2023-03-28
GRAM	Gramian Multimodal Representation Learning and Al…	67.30	2024-12-16
VALOR	VALOR: Vision-Audio-Language Omni-Perception Pret…	61.50	2023-04-17
TESTA (ViT-B/16)	TESTA: Temporal-Spatial Token Aggregation for Lon…	61.20	2023-10-29
VindLU	VindLU: A Recipe for Effective Video-and-Language…	61.20	2022-12-09
InternVideo	InternVideo: General Video Foundation Models via …	57.90	2022-12-06
RTQ	RTQ: Rethinking Video-language Understanding Base…	57.60	2023-12-01
VLAB	VLAB: Enhancing Video Language Pre-training by Fe…	56.80	2023-05-22
HiTeA	HiTeA: Hierarchical Temporal-Aware Video-Language…	56.50	2022-12-30
MuLTI	MuLTI: Efficient Video-and-Language Understanding…	56.50	2023-03-10
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Mod…	56.40	2023-02-01
CLIP-ViP	CLIP-ViP: Adapting Pre-trained Image-Text Model t…	55.30	2022-09-14
STAN	Revisiting Temporal Modeling for CLIP-based Image…	54.60	2023-01-26
Singularity	Revealing Single Frame Bias for Video-and-Languag…	53.90	2022-06-07
DMAE (ViT-B/32)	Dual-Modal Attention-Enhanced Text-Video Retrieva…	52.70	2023-09-20
HunYuan_tvr (huge)	Tencent Text-Video Retrieval: Hierarchical Cross-…	52.70	2022-04-07
OmniVL	OmniVL:One Foundation Model for Image-Language an…	52.40	2022-09-15
HunYuan_tvr	Tencent Text-Video Retrieval: Hierarchical Cross-…	52.10	2022-04-07
Cap4Video	Cap4Video: What Can Auxiliary Captions Do for Tex…	52.00	2022-12-31
Clover	Clover: Towards A Unified Video-Language Alignmen…	50.10	2022-07-16
DRL	Disentangled Representation Learning for Text-Vid…	49.00	2022-03-14
DiffusionRet+QB-Norm	DiffusionRet: Generative Text-Video Retrieval wit…	48.90	2023-03-17
PAU	Prototype-based Aleatoric Uncertainty Quantificat…	48.60	2023-09-29
VIOLETv2	An Empirical Study of End-to-End Video-Language T…	47.90	2022-09-04
X-CLIP	X-CLIP: End-to-End Multi-grained Contrastive Lear…	47.80	2022-07-15
HBI	Video-Text as Game Players: Hierarchical Banzhaf …	46.90	2023-03-25
DiffusionRet	DiffusionRet: Generative Text-Video Retrieval wit…	46.70	2023-03-17
CAMoE	Improving Video-Text Retrieval by Multi-Stream Co…	43.80	2021-09-09
QB-Norm+CLIP4Clip	Cross Modal Retrieval with Querybank Normalisation	43.50	2021-12-23
CLIP4Clip	CLIP4Clip: An Empirical Study of CLIP for End to …	43.40	2021-04-18
ALPRO	Align and Prompt: Video-and-Language Pre-training…	35.90	2021-12-17
FROZEN	Frozen in Time: A Joint Video and Image Encoder f…	31.00	2021-04-01
HD-VILA	Advancing High-Resolution Video-Language Represen…	28.80	2021-11-19
PO Loss	Rudder: A Cross Lingual Video and Text Retrieval …	16.30	2021-03-09
Collaborative Experts	Use What You Have: Video Retrieval Using Represen…	16.10	2019-07-31
