| VLAP (4 frames) | ViLA: Efficient Video-Language Alignment for Vide… | 67.10 | 2023-12-13 | |
| LLaMA-VQA | Large Language Models are Temporal and Causal Rea… | 65.40 | 2023-10-24 | |
| SeViLA | Self-Chained Image-Language Model for Video Local… | 64.90 | 2023-05-11 | |
| InternVideo | InternVideo: General Video Foundation Models via … | 58.70 | 2022-12-06 | |
| GF(sup) | Glance and Focus: Memory Prompting for Multi-Even… | 53.94 | 2024-01-03 | |
| GF(uns) | Glance and Focus: Memory Prompting for Multi-Even… | 53.86 | 2024-01-03 | |
| MIST | MIST: Multi-modal Iterative Spatial-Temporal Tran… | 51.13 | 2022-12-19 | |
| Temp[ATP] | Revisiting the "Video" in Video-Language Understa… | 48.37 | 2022-06-03 | |
| AnyMAL-70B (0-shot) | AnyMAL: An Efficient and Scalable Any-Modality Au… | 48.20 | 2023-09-27 | |
| All-in-one | All in One: Exploring Unified Video-Language Pre-… | 47.50 | 2022-03-14 | |
| TraveLER (0-shot) | TraveLER: A Modular Multi-LMM Agent Framework for… | 44.90 | 2024-04-01 | |
| SeViLA (0-shot) | Self-Chained Image-Language Model for Video Local… | 44.60 | 2023-05-11 | |
| Flamingo-9B (4-shot) | Flamingo: a Visual Language Model for Few-Shot Le… | 42.80 | 2022-04-29 | |
| Flamingo-80B (4-shot) | Flamingo: a Visual Language Model for Few-Shot Le… | 42.40 | 2022-04-29 | |
| Flamingo-9B (0-shot) | Flamingo: a Visual Language Model for Few-Shot Le… | 41.80 | 2022-04-29 | |
| Flamingo-80B (0-shot) | Flamingo: a Visual Language Model for Few-Shot Le… | 39.70 | 2022-04-29 | |
| SHG-VQA (trained from scratch) | Learning Situation Hyper-Graphs for Video Questio… | 39.47 | 2023-04-18 | |