| Model | Paper | Accuracy (%) | Date |
|---|---|---|---|
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Composing Ensembles of Pre-trained Models via Ite… | 61.20 | 2022-10-20 |
| GPT-2 + CLIP-32 (Zero-Shot) | Composing Ensembles of Pre-trained Models via Ite… | 58.40 | 2022-10-20 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 56.10 | 2022-12-09 |
| Mirasol3B | Mirasol3B: A Multimodal Autoregressive model for … | 51.13 | 2023-11-09 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 50.40 | 2023-05-29 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 49.90 | 2023-06-15 |
| MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model f… | 49.80 | 2024-04-08 |
| VideoChat2 | MVBench: A Comprehensive Multi-modal Video Unders… | 49.10 | 2023-11-28 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 48.60 | 2023-04-17 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 47.90 | 2023-03-28 |
| LLaMA-VID-13B (2 Token) | LLaMA-VID: An Image is Worth 2 Tokens in Large La… | 47.50 | 2023-11-28 |
| LLaMA-VID-7B (2 Token) | LLaMA-VID: An Image is Worth 2 Tokens in Large La… | 47.40 | 2023-11-28 |
| Chat-UniVi-13B | Chat-UniVi: Unified Visual Representation Empower… | 46.40 | 2023-11-14 |
| BT-Adapter (zero-shot) | BT-Adapter: Video Conversation is Feasible Withou… | 46.10 | 2023-09-27 |
| MovieChat | MovieChat: From Dense Token to Sparse Memory for … | 45.70 | 2023-07-31 |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representatio… | 45.30 | 2023-11-16 |
| TESTA (ViT-B/16) | TESTA: Temporal-Spatial Token Aggregation for Lon… | 45.00 | 2023-10-29 |
| FrozenBiLM+ | Open-vocabulary Video Question Answering: A New B… | 44.80 | 2023-08-18 |
| VindLU | VindLU: A Recipe for Effective Video-and-Language… | 44.70 | 2022-12-09 |
| Singularity-temporal | Revealing Single Frame Bias for Video-and-Languag… | 44.10 | 2022-06-07 |
| FrozenBiLM | Zero-Shot Video Question Answering via Frozen Bid… | 43.20 | 2022-06-16 |
| Singularity | Revealing Single Frame Bias for Video-and-Languag… | 43.10 | 2022-06-07 |
| Text + Text (no Multimodal Pretext Training) | Towards Fast Adaptation of Pretrained Contrastive… | 41.40 | 2022-06-05 |
| All-in-one+ | Open-vocabulary Video Question Answering: A New B… | 40.00 | 2023-08-18 |
| VIOLET+ | Open-vocabulary Video Question Answering: A New B… | 39.70 | 2023-08-18 |
| Just Ask (fine-tune) | Just Ask: Learning to Answer Questions from Milli… | 38.90 | 2020-12-01 |
| LocVLM-Vid-B+ | Learning to Localize Objects Improves Spatial Rea… | 38.20 | 2024-04-11 |
| LocVLM-Vid-B | Learning to Localize Objects Improves Spatial Rea… | 37.40 | 2024-04-11 |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understandi… | 35.20 | 2023-06-08 |
| LLaMA Adapter V2 | LLaMA-Adapter V2: Parameter-Efficient Visual Inst… | 34.20 | 2023-04-28 |
| E-SA | ActivityNet-QA: A Dataset for Understanding Compl… | 31.80 | 2019-06-06 |
| E-MN | ActivityNet-QA: A Dataset for Understanding Compl… | 27.10 | 2019-06-06 |
| Video Chat | VideoChat: Chat-Centric Video Understanding | 26.50 | 2023-05-10 |
| FrozenBiLM (0-shot) | Zero-Shot Video Question Answering via Frozen Bid… | 25.90 | 2022-06-16 |
| E-VQA | ActivityNet-QA: A Dataset for Understanding Compl… | 25.10 | 2019-06-06 |
| Just Ask (0-shot) | Just Ask: Learning to Answer Questions from Milli… | 12.20 | 2020-12-01 |