LinVT-Qwen2-VL (7B) | LinVT: Empower Your Image-level Large Language Mo… | 69.30 | 2024-12-06 |
| Tarsier (34B) | Tarsier: Recipes for Training and Evaluating Larg… | 67.60 | 2024-06-30 |
| InternVideo2 | InternVideo2: Scaling Foundation Models for Multi… | 67.20 | 2024-03-22 |
| LongVU (7B) | LongVU: Spatiotemporal Adaptive Compression for L… | 66.90 | 2024-10-22 |
| Oryx (34B) | Oryx MLLM: On-Demand Spatial-Temporal Understandi… | 64.70 | 2024-09-19 |
| VideoLLaMA2 (72B) | VideoLLaMA 2: Advancing Spatial-Temporal Modeling… | 62.00 | 2024-06-11 |
| VideoChat-T (7B) | TimeSuite: Improving MLLMs for Long Video Underst… | 59.90 | 2024-10-25 |
| mPLUG-Owl3 (7B) | mPLUG-Owl3: Towards Long Image-Sequence Understan… | 59.50 | 2024-08-09 |
| PPLLaVA (7B) | PPLLaVA: Varied Video Sequence Understanding With… | 59.20 | 2024-11-04 |
| VideoGPT+ | VideoGPT+: Integrating Image and Video Encoders f… | 58.70 | 2024-06-13 |
| PLLaVA | PLLaVA: Parameter-free LLaVA Extension from Imag… | 58.10 | 2024-04-25 |
| ST-LLM | ST-LLM: Large Language Models Are Effective Tempo… | 54.90 | 2024-03-30 |
| VideoChat2 | MVBench: A Comprehensive Multi-modal Video Unders… | 51.90 | 2023-11-28 |
| HawkEye | HawkEye: Training Video-Text LLMs for Grounding T… | 47.55 | 2024-03-15 |
| SPHINX-Plus | SPHINX-X: Scaling Data and Parameters for a Famil… | 39.70 | 2024-02-08 |
| TimeChat | TimeChat: A Time-sensitive Multimodal Large Langu… | 38.50 | 2023-12-04 |
| LLaVA | Visual Instruction Tuning | 36.00 | 2023-04-17 |
| VideoChat | VideoChat: Chat-Centric Video Understanding | 35.50 | 2023-05-10 |
| VideoLLaMA | Video-LLaMA: An Instruction-tuned Audio-Visual La… | 34.10 | 2023-06-05 |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understandi… | 32.70 | 2023-06-08 |
| InstructBLIP | InstructBLIP: Towards General-purpose Vision-Lang… | 32.50 | 2023-05-11 |
| MiniGPT4 | MiniGPT-4: Enhancing Vision-Language Understandin… | 18.80 | 2023-04-20 |
|