| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MaMMUT | MaMMUT: A Simple Architecture for Joint Learning … | 195.60 | 2023-03-29 |
| Vid2Seq | Vid2Seq: Large-Scale Pretraining of a Visual Lang… | 146.20 | 2023-02-27 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 139.20 | 2022-09-04 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 80.70 | 2023-04-17 |
| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 79.30 | 2023-05-22 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 76.50 | 2023-06-15 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 71.00 | 2022-12-30 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 70.50 | 2023-02-01 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 70.40 | 2023-10-07 |
| RTQ | RTQ: Rethinking Video-language Understanding Base… | 66.90 | 2023-12-01 |
| CoCap (ViT/L14) | Accurate and Fast Compressed Video Captioning | 60.10 | 2023-09-22 |
| SEM-POS | SEM-POS: Grammatically and Semantically Correct V… | 60.10 | 2023-03-26 |
| VASTA (Vatex-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 59.20 | 2022-08-19 |
| VASTA (Kinetics-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 56.10 | 2022-08-19 |