| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 80.80 | 2023-05-29 |
| VideoCLIP | VideoCLIP: Contrastive Pre-training for Zero-shot… | 75.00 | 2021-09-28 |
| UniVL + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 74.80 | 2023-03-23 |
| MDMMT-2 | MDMMT-2: Multidomain Multimodal Transformer for V… | 74.80 | 2022-03-14 |
| TACo | TACo: Token-aware Cascade Contrastive Learning fo… | 72.70 | 2021-08-23 |
| OmniVec | OmniVec: Learning robust representations with cro… | 70.80 | 2023-11-07 |
| UniVL | UniVL: A Unified Video and Language Pre-Training … | 70.00 | 2020-02-15 |
| VLM | VLM: Task-agnostic Video-Language Model Pre-train… | 69.38 | 2021-05-20 |
| OmniVec (pretrained) | OmniVec: Learning robust representations with cro… | 64.20 | 2023-11-07 |
| VideoCLIP (zero-shot) | VideoCLIP: Contrastive Pre-training for Zero-shot… | 63.10 | 2021-09-28 |
| VideoCoCa (zero-shot) | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 55.20 | 2022-12-09 |
| COOT | COOT: Cooperative Hierarchical Transformer for Vi… | 52.30 | 2020-11-01 |
| Text-Video Embedding | HowTo100M: Learning a Text-Video Embedding by Wat… | 35.30 | 2019-06-07 |
| RoME | RoME: Role-aware Mixture-of-Expert Transformer fo… | 25.20 | 2022-06-26 |
| Satar et al. | Semantic Role Aware Correlation Transformer for T… | 20.80 | 2022-06-26 |