| GRAM | Gramian Multimodal Representation Learning and Al… | 64.00 | 2024-12-16 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 63.90 | 2023-05-29 |
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 62.80 | 2024-03-22 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 59.90 | 2023-04-17 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 58.80 | 2023-03-28 |
| vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging for Light-w… | 58.10 | 2024-03-20 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 57.90 | 2023-06-15 |
| InternVideo | InternVideo: General Video Foundation Models via … | 55.20 | 2022-12-06 |
| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 55.10 | 2023-05-22 |
| TEFAL | Audio-Enhanced Text-to-Video Retrieval using Text… | 52.00 | 2023-07-24 |
| UCoFiA | Unified Coarse-to-Fine Alignment for Video-Text R… | 49.40 | 2023-09-18 |
| OmniVL | OmniVL:One Foundation Model for Image-Language an… | 47.80 | 2022-09-15 |
| CLIP4Clip-seqTransf | CLIP4Clip: An Empirical Study of CLIP for End to … | 44.50 | 2021-04-18 |
| All-in-one + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 38.60 | 2023-03-23 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 37.20 | 2022-09-04 |
| HD-VILA | Advancing High-Resolution Video-Language Represen… | 35.60 | 2021-11-19 |
| VideoCoCa (zero-shot) | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 34.30 | 2022-12-09 |
| MDMMT-2 | MDMMT-2: Multidomain Multimodal Transformer for V… | 33.70 | 2022-03-14 |
| VIOLET + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 33.60 | 2023-03-23 |
| CLIP2TV | CLIP2TV: Align, Match and Distill for Video-Text … | 33.10 | 2021-11-10 |
| CAMoE | Improving Video-Text Retrieval by Multi-Stream Co… | 32.90 | 2021-09-09 |
| FROZEN | Frozen in Time: A Joint Video and Image Encoder f… | 32.50 | 2021-04-01 |
| COTS | COTS: Collaborative Two-Stream Vision-Language Pr… | 32.10 | 2022-04-15 |
| CoCa (zero-shot) | CoCa: Contrastive Captioners are Image-Text Found… | 30.00 | 2022-05-04 |
| CLIP2Video | CLIP2Video: Mastering Video-Text Retrieval via Im… | 29.80 | 2021-06-21 |
| LAFF | Lightweight Attentional Feature Fusion: A New Bas… | 29.10 | 2021-12-03 |
| UniVL + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 28.50 | 2023-03-23 |
| Ours | Video and Text Matching with Conditioned Embeddin… | 26.00 | 2021-10-21 |
| TACo | TACo: Token-aware Cascade Contrastive Learning fo… | 24.80 | 2021-08-23 |
| MDMMT | MDMMT: Multidomain Multimodal Transformer for Vid… | 23.10 | 2021-03-19 |
| CLIP | A Straightforward Framework For Video Retrieval U… | 21.40 | 2021-02-24 |
| UniVL | UniVL: A Unified Video and Language Pre-Training … | 21.20 | 2020-02-15 |
| Text-Video Embedding | HowTo100M: Learning a Text-Video Embedding by Wat… | 14.90 | 2019-06-07 |
| RoME | RoME: Role-aware Mixture-of-Expert Transformer fo… | 10.70 | 2022-06-26 |
| JSFusion | A Joint Sequence Fusion Model for Video Question … | 10.20 | 2018-08-07 |
| Collaborative Experts | Use What You Have: Video Retrieval Using Represen… | 10.00 | 2019-07-31 |
| Kaufman | Temporal Tessellation: A Unified Approach for Vid… | 4.70 | 2016-12-21 |
| C+LSTM+SA+FC7 | Learning Language-Visual Embedding for Movie Unde… | 4.20 | 2016-09-26 |