| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MaMMUT (ours) | MaMMUT: A Simple Architecture for Joint Learning … | 73.60 | 2023-03-29 |
| Vid2Seq | Vid2Seq: Large-Scale Pretraining of a Visual Lang… | 64.60 | 2023-02-27 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 58.00 | 2022-09-04 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 57.80 | 2023-02-01 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 56.70 | 2023-05-29 |
| GIT2 | GIT: A Generative Image-to-text Transformer for V… | 54.80 | 2022-05-27 |
| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 54.60 | 2023-05-22 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 54.40 | 2023-04-17 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 53.80 | 2022-12-09 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 53.70 | 2023-06-15 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 49.80 | 2023-10-07 |
| RTQ | RTQ: Rethinking Video-language Understanding Base… | 49.60 | 2023-12-01 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 49.20 | 2022-12-30 |
| MV-GPT | End-to-end Generative Pretraining for Multimodal … | 48.90 | 2022-01-20 |
| CLIP-DCD | CLIP Meets Video Captioning: Concept-Aware Repres… | 48.20 | 2021-11-30 |
| TextKG | Text with Knowledge Graph Augmented Transformer f… | 46.60 | 2023-03-22 |
| EMCL-Net | Expectation-Maximization Contrastive Learning for… | 45.30 | 2022-11-21 |
| SEM-POS | SEM-POS: Grammatically and Semantically Correct V… | 45.20 | 2023-03-26 |
| CoCap (ViT/L14) | Accurate and Fast Compressed Video Captioning | 44.40 | 2023-09-22 |
| VASTA (Vatex-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 44.21 | 2022-08-19 |
| UniVL + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 44.17 | 2023-03-23 |
| VASTA (Kinetics-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 43.40 | 2022-08-19 |