| EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015) | Expectation-Maximization Contrastive Learning for… | 53.70 | 2022-11-21 |
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 46.40 | 2024-03-22 |
| vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging for Light-w… | 43.10 | 2024-03-20 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 43.00 | 2023-03-28 |
| HunYuan_tvr (huge) | Tencent Text-Video Retrieval: Hierarchical Cross-… | 40.40 | 2022-04-07 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 39.40 | 2023-06-15 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 34.40 | 2023-02-01 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 34.20 | 2023-04-17 |
| InternVideo | InternVideo: General Video Foundation Models via … | 34.00 | 2022-12-06 |
| CLIP-ViP | CLIP-ViP: Adapting Pre-trained Image-Text Model t… | 30.70 | 2022-09-14 |
| HunYuan_tvr | Tencent Text-Video Retrieval: Hierarchical Cross-… | 29.70 | 2022-04-07 |
| STAN | Revisiting Temporal Modeling for CLIP-based Image… | 29.20 | 2023-01-26 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 28.70 | 2022-12-30 |
| MDMMT-2 | MDMMT-2: Multidomain Multimodal Transformer for V… | 26.90 | 2022-03-14 |
| X-CLIP | X-CLIP: End-to-End Multi-grained Contrastive Lear… | 26.10 | 2022-07-15 |
| EMCL-Net++ | Expectation-Maximization Contrastive Learning for… | 25.90 | 2022-11-21 |
| CAMoE | Improving Video-Text Retrieval by Multi-Stream Co… | 25.90 | 2021-09-09 |
| X-Pool | X-Pool: Cross-Modal Language-Video Attention for … | 25.20 | 2022-03-28 |
| Clover | Clover: Towards A Unified Video-Language Alignmen… | 24.80 | 2022-07-16 |
| DiffusionRet | DiffusionRet: Generative Text-Video Retrieval wit… | 24.40 | 2023-03-17 |
| CenterCLIP (ViT-B/16) | CenterCLIP: Token Clustering for Efficient Text-V… | 24.20 | 2022-05-02 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 24.00 | 2022-09-04 |
| EMCL-Net | Expectation-Maximization Contrastive Learning for… | 23.90 | 2022-11-21 |
| QB-Norm+CLIP4Clip | Cross Modal Retrieval with Querybank Normalisation | 22.40 | 2021-12-23 |
| CLIP4Clip | CLIP4Clip: An Empirical Study of CLIP for End to … | 21.60 | 2021-04-18 |
| MDMMT | MDMMT: Multidomain Multimodal Transformer for Vid… | 18.80 | 2021-03-19 |
| HD-VILA | Advancing High-Resolution Video-Language Represen… | 17.40 | 2021-11-19 |
| FROZEN | Frozen in Time: A Joint Video and Image Encoder f… | 15.00 | 2021-04-01 |
| Ours | Video and Text Matching with Conditioned Embeddin… | 14.90 | 2021-10-21 |
| MMT-Pretrained | Multi-modal Transformer for Video Retrieval | 13.50 | 2020-07-21 |
| MMT | Multi-modal Transformer for Video Retrieval | 13.20 | 2020-07-21 |
| CLIP | A Straightforward Framework For Video Retrieval U… | 11.30 | 2021-02-24 |
| Collaborative Experts | Use What You Have: Video Retrieval Using Represen… | 11.20 | 2019-07-31 |
| MoEE | Learning a Text-Video Embedding from Incomplete a… | 10.10 | 2018-04-07 |
| JSFusion | A Joint Sequence Fusion Model for Video Question … | 9.10 | 2018-08-07 |
| Large-Scale Discriminative Clustering | Learning from Video and Text via Large-Scale Disc… | 7.30 | 2017-07-27 |
| Text-Video Embedding | HowTo100M: Learning a Text-Video Embedding by Wat… | 7.20 | 2019-06-07 |
| CT-SAN | End-to-end Concept Word Detection for Video Capti… | 5.10 | 2016-10-10 |