| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 74.20 | 2024-03-22 |
| vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging for Light-w… | 72.30 | 2024-03-20 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 72.00 | 2023-05-29 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 70.50 | 2023-06-15 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 70.40 | 2023-03-28 |
| GRAM | Gramian Multimodal Representation Learning and Al… | 67.30 | 2024-12-16 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 61.50 | 2023-04-17 |
| TESTA (ViT-B/16) | TESTA: Temporal-Spatial Token Aggregation for Lon… | 61.20 | 2023-10-29 |
| VindLU | VindLU: A Recipe for Effective Video-and-Language… | 61.20 | 2022-12-09 |
| InternVideo | InternVideo: General Video Foundation Models via … | 57.90 | 2022-12-06 |
| RTQ | RTQ: Rethinking Video-language Understanding Base… | 57.60 | 2023-12-01 |
| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 56.80 | 2023-05-22 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 56.50 | 2022-12-30 |
| MuLTI | MuLTI: Efficient Video-and-Language Understanding… | 56.50 | 2023-03-10 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 56.40 | 2023-02-01 |
| CLIP-ViP | CLIP-ViP: Adapting Pre-trained Image-Text Model t… | 55.30 | 2022-09-14 |
| STAN | Revisiting Temporal Modeling for CLIP-based Image… | 54.60 | 2023-01-26 |
| Singularity | Revealing Single Frame Bias for Video-and-Languag… | 53.90 | 2022-06-07 |
| DMAE (ViT-B/32) | Dual-Modal Attention-Enhanced Text-Video Retrieva… | 52.70 | 2023-09-20 |
| HunYuan_tvr (huge) | Tencent Text-Video Retrieval: Hierarchical Cross-… | 52.70 | 2022-04-07 |
| OmniVL | OmniVL:One Foundation Model for Image-Language an… | 52.40 | 2022-09-15 |
| HunYuan_tvr | Tencent Text-Video Retrieval: Hierarchical Cross-… | 52.10 | 2022-04-07 |
| Cap4Video | Cap4Video: What Can Auxiliary Captions Do for Tex… | 52.00 | 2022-12-31 |
| Clover | Clover: Towards A Unified Video-Language Alignmen… | 50.10 | 2022-07-16 |
| DRL | Disentangled Representation Learning for Text-Vid… | 49.00 | 2022-03-14 |
| DiffusionRet+QB-Norm | DiffusionRet: Generative Text-Video Retrieval wit… | 48.90 | 2023-03-17 |
| PAU | Prototype-based Aleatoric Uncertainty Quantificat… | 48.60 | 2023-09-29 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 47.90 | 2022-09-04 |
| X-CLIP | X-CLIP: End-to-End Multi-grained Contrastive Lear… | 47.80 | 2022-07-15 |
| HBI | Video-Text as Game Players: Hierarchical Banzhaf … | 46.90 | 2023-03-25 |
| DiffusionRet | DiffusionRet: Generative Text-Video Retrieval wit… | 46.70 | 2023-03-17 |
| CAMoE | Improving Video-Text Retrieval by Multi-Stream Co… | 43.80 | 2021-09-09 |
| QB-Norm+CLIP4Clip | Cross Modal Retrieval with Querybank Normalisation | 43.50 | 2021-12-23 |
| CLIP4Clip | CLIP4Clip: An Empirical Study of CLIP for End to … | 43.40 | 2021-04-18 |
| ALPRO | Align and Prompt: Video-and-Language Pre-training… | 35.90 | 2021-12-17 |
| FROZEN | Frozen in Time: A Joint Video and Image Encoder f… | 31.00 | 2021-04-01 |
| HD-VILA | Advancing High-Resolution Video-Language Represen… | 28.80 | 2021-11-19 |
| PO Loss | Rudder: A Cross Lingual Video and Text Retrieval … | 16.30 | 2021-03-09 |
| Collaborative Experts | Use What You Have: Video Retrieval Using Represen… | 16.10 | 2019-07-31 |