| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 0.61 | 2023-05-22 |
| MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model f… | 0.61 | 2024-04-08 |
| MaMMUT (ours) | MaMMUT: A Simple Architecture for Joint Learning … | 0.60 | 2023-03-29 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 0.60 | 2023-04-17 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 0.60 | 2023-05-29 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 0.60 | 2023-06-15 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 0.58 | 2023-02-01 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 0.57 | 2022-12-09 |
| GIT | GIT: A Generative Image-to-text Transformer for V… | 0.57 | 2022-05-27 |
| FrozenBiLM+ | Open-vocabulary Video Question Answering: A New B… | 0.56 | 2023-08-18 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 0.56 | 2022-12-30 |
| InternVideo | InternVideo: General Video Foundation Models via … | 0.56 | 2022-12-06 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 0.55 | 2023-03-28 |
| vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging for Light-w… | 0.55 | 2024-03-20 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 0.55 | 2022-09-04 |
| MuLTI | MuLTI: Efficient Video-and-Language Understanding… | 0.55 | 2023-03-10 |
| X2-VLM (large) | X$^2$-VLM: All-In-One Pre-trained Model For Visio… | 0.55 | 2022-11-22 |
| X2-VLM (base) | X$^2$-VLM: All-In-One Pre-trained Model For Visio… | 0.53 | 2022-11-22 |
| Clover | Clover: Towards A Unified Video-Language Alignmen… | 0.52 | 2022-07-16 |
| VIOLET + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 0.52 | 2023-03-23 |
| OmniVL | OmniVL:One Foundation Model for Image-Language an… | 0.51 | 2022-09-15 |
| VIOLET+ | Open-vocabulary Video Question Answering: A New B… | 0.50 | 2023-08-18 |
| Co-Tokenization | Video Question Answering with Iterative Video-Tex… | 0.49 | 2022-08-01 |
| All-in-one-B | All in One: Exploring Unified Video-Language Pre-… | 0.48 | 2022-03-14 |
| JustAsk+ | Open-vocabulary Video Question Answering: A New B… | 0.48 | 2023-08-18 |
| GIT+MDF | Self-Adaptive Sampling for Efficient Video Questi… | 0.47 | 2023-07-09 |
| AIO+MIF | Self-Adaptive Sampling for Efficient Video Questi… | 0.47 | 2023-07-09 |
| ALPRO | Align and Prompt: Video-and-Language Pre-training… | 0.46 | 2021-12-17 |
| All-in-one+ | Open-vocabulary Video Question Answering: A New B… | 0.44 | 2023-08-18 |
| DualVGR | DualVGR: A Dual-Visual Graph Reasoning Unit for V… | 0.39 | 2021-07-10 |
| HCRN | Hierarchical Conditional Relation Networks for Vi… | 0.36 | 2020-02-25 |
| SSML | Noise Estimation Using Density Estimation for Sel… | 0.35 | 2020-03-06 |
| HMEMA | Heterogeneous Memory Enhanced Multimodal Attentio… | 0.34 | 2019-04-08 |
| Co-Mem | Motion-Appearance Co-Memory Networks for Video Qu… | 0.32 | 2018-03-29 |
| ST-VQA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visu… | 0.31 | 2017-04-14 |