| LinVT-Qwen2-VL (7B) | LinVT: Empower Your Image-level Large Language Mo… | 85.50 | 2024-12-06 |
| InternVL-2.5(8B) | Expanding Performance Boundaries of Open-Source M… | 85.50 | 2024-12-06 |
| VideoLLaMA3(7B) | VideoLLaMA 3: Frontier Multimodal Foundation Mode… | 84.50 | 2025-01-22 |
| PLM-8B | PerceptionLM: Open-Access Data and Models for Det… | 84.10 | 2025-04-17 |
| BIMBA-LLaVA-Qwen2-7B | BIMBA: Selective-Scan Compression for Long-Range … | 83.73 | 2025-03-12 |
| PLM-3B | PerceptionLM: Open-Access Data and Models for Det… | 83.40 | 2025-04-17 |
| LLaVA-Video | Video Instruction Tuning With Synthetic Data | 83.20 | 2024-10-03 |
| NVILA(8B) | NVILA: Efficient Frontier Visual Language Models | 82.20 | 2024-12-05 |
| Oryx-1.5(7B) | Oryx MLLM: On-Demand Spatial-Temporal Understandi… | 81.80 | 2024-09-19 |
| Qwen2-VL(7B) | Qwen2-VL: Enhancing Vision-Language Model's Perce… | 81.20 | 2024-09-18 |
| LongVILA(7B) | LongVILA: Scaling Long-Context Visual Language Mo… | 80.70 | 2024-08-19 |
| PLM-1B | PerceptionLM: Open-Access Data and Models for Det… | 80.30 | 2025-04-17 |
| LLaVA-OV(72B) | LLaVA-OneVision: Easy Visual Task Transfer | 80.20 | 2024-08-06 |
| VideoChat2_HD_mistral | MVBench: A Comprehensive Multi-modal Video Unders… | 79.50 | 2023-11-28 |
| LLaVA-OV(7B) | LLaVA-OneVision: Easy Visual Task Transfer | 79.40 | 2024-08-06 |
| LLaVA-NeXT-Interleave(14B) | LLaVA-NeXT-Interleave: Tackling Multi-image, Vide… | 79.10 | 2024-07-10 |
| VideoChat2_mistral | MVBench: A Comprehensive Multi-modal Video Unders… | 78.60 | 2023-11-28 |
| mPLUG-Owl3(8B) | mPLUG-Owl3: Towards Long Image-Sequence Understan… | 78.60 | 2024-08-09 |
| LLaVA-NeXT-Interleave(7B) | LLaVA-NeXT-Interleave: Tackling Multi-image, Vide… | 78.20 | 2024-07-10 |
| LLaVA-NeXT-Interleave(DPO) | LLaVA-NeXT-Interleave: Tackling Multi-image, Vide… | 77.90 | 2024-07-10 |
| Vamos | Vamos: Versatile Action Models for Video Understa… | 77.30 | 2023-11-22 |
| ViLA (3B) | ViLA: Efficient Video-Language Alignment for Vide… | 75.60 | 2023-12-13 |
| VideoLLaMA2.1(7B) | VideoLLaMA 2: Advancing Spatial-Temporal Modeling… | 75.60 | 2024-06-11 |
| LLaMA-VQA (33B) | Large Language Models are Temporal and Causal Rea… | 75.50 | 2023-10-24 |
| ViLA (3B, 4 frames) | ViLA: Efficient Video-Language Alignment for Vide… | 74.40 | 2023-12-13 |
| CREMA | CREMA: Generalizable and Efficient Video-Language… | 73.90 | 2024-02-08 |
| SeViLA | Self-Chained Image-Language Model for Video Local… | 73.80 | 2023-05-11 |
| TCR | Text-Conditioned Resampler For Long Form Video Un… | 73.50 | 2023-12-19 |
| LSTP | Efficient Temporal Extrapolation of Multimodal La… | 72.10 | 2024-02-25 |
| Mirasol3B | Mirasol3B: A Multimodal Autoregressive model for … | 72.00 | 2023-11-09 |
| VideoChat2 | MVBench: A Comprehensive Multi-modal Video Unders… | 68.60 | 2023-11-28 |
| RTQ | RTQ: Rethinking Video-language Understanding Base… | 63.20 | 2023-12-01 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 63.10 | 2022-12-30 |
| CoVGT(PT) | Contrastive Video Question Answering via Video Gr… | 60.70 | 2023-02-27 |
| SeViT | Semi-Parametric Video-Grounded Text Generation | 60.60 | 2023-01-27 |
| ViperGPT(0-shot) | ViperGPT: Visual Inference via Python Execution f… | 60.00 | 2023-03-14 |
| CoVGT | Contrastive Video Question Answering via Video Gr… | 60.00 | 2023-02-27 |
| GF | Glance and Focus: Memory Prompting for Multi-Even… | 58.83 | 2024-01-03 |
| VFC | Verbs in Action: Improving verb understanding in … | 58.60 | 2023-04-13 |
| ATM | ATM: Action Temporality Modeling for Video Questi… | 58.30 | 2023-09-05 |
| MIST | MIST: Multi-modal Iterative Spatial-Temporal Tran… | 57.20 | 2022-12-19 |
| VGT(PT) | Video Graph Transformer for Video Question Answer… | 56.90 | 2022-07-12 |
| PAXION | Paxion: Patching Action Knowledge in Video-Langua… | 56.90 | 2023-05-18 |
| VGT | Video Graph Transformer for Video Question Answer… | 55.00 | 2022-07-12 |
| ATP | Revisiting the "Video" in Video-Language Understa… | 54.30 | 2022-06-03 |
| P3D-G | (2.5+1)D Spatio-Temporal Scene Graphs for Video Q… | 53.40 | 2022-02-18 |
| HQGA | Video as Conditional Graph Hierarchy for Multi-Gr… | 51.40 | 2021-12-12 |