| GRAM | Gramian Multimodal Representation Learning and Al… | 87.70 | 2024-12-16 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 83.00 | 2023-05-29 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 78.50 | 2023-04-17 |
| InternVideo2-6B | InternVideo2: Scaling Foundation Models for Multi… | 75.50 | 2024-03-22 |
| Unmasked Teacher | Unmasked Teacher: Towards Training-Efficient Vide… | 72.00 | 2023-03-28 |
| InternVideo | InternVideo: General Video Foundation Models via … | 71.10 | 2022-12-06 |
| Side4Video | Side4Video: Spatial-Temporal Side Network for Mem… | 68.80 | 2023-11-27 |
| Cap4Video | Cap4Video: What Can Auxiliary Captions Do for Tex… | 66.60 | 2022-12-31 |
| TS2-Net | TS2-Net: Token Shift and Selection Transformer fo… | 59.10 | 2022-07-16 |
| LAFF | Lightweight Attentional Feature Fusion: A New Bas… | 59.10 | 2021-12-03 |
| QB-Norm+CLIP2Video | Cross Modal Retrieval with Querybank Normalisation | 58.80 | 2021-12-23 |
| CLIP2Video | CLIP2Video: Mastering Video-Text Retrieval via Im… | 57.30 | 2021-06-21 |