| Model | Paper | Score | Date |
|---|---|---|---|
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 18.20 | 2023-05-29 |
| UniVL + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 17.92 | 2023-03-23 |
| UniVL | UniVL: A Unified Video and Language Pre-Training … | 17.35 | 2020-02-15 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 14.20 | 2022-12-09 |
| VLM | VLM: Task-agnostic Video-Language Model Pre-train… | 12.27 | 2021-05-20 |
| E2vidD6-MASSvid-BiD | Multimodal Pretraining for Dense Video Captioning | 12.04 | 2020-11-10 |
| TextKG | Text with Knowledge Graph Augmented Transformer f… | 11.70 | 2023-03-22 |
| COOT | COOT: Cooperative Hierarchical Transformer for Vi… | 11.30 | 2020-11-01 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 10.10 | 2023-06-15 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 8.80 | 2023-10-07 |
| OmniVL | OmniVL: One Foundation Model for Image-Language an… | 8.72 | 2022-09-15 |
| Zhou | End-to-End Dense Video Captioning with Masked Tra… | 4.38 | 2018-04-03 |
| VideoBERT + S3D | VideoBERT: A Joint Model for Video and Language R… | 4.33 | 2019-04-03 |
| MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model f… | 1.31 | 2024-04-08 |