| Qwen2-VL-72B | Qwen2-VL: Enhancing Vision-Language Model's Perce… | 50.40 | 2024-09-18 |
| LLaVA-OneVision-Qwen2-72B | LLaVA-OneVision: Easy Visual Task Transfer | 48.40 | 2024-08-06 |
| LLaVA-OneVision-Qwen2-7B | LLaVA-OneVision: Easy Visual Task Transfer | 41.60 | 2024-08-06 |
| Qwen2-VL-7B | Qwen2-VL: Enhancing Vision-Language Model's Perce… | 40.20 | 2024-09-18 |
| Gemini-1.5-Pro (CoT) | Gemini 1.5: Unlocking multimodal understanding ac… | 37.00 | 2024-03-08 |
| VideoLLaMA2-72B | VideoLLaMA 2: Advancing Spatial-Temporal Modeling… | 36.20 | 2024-06-11 |
| Gemini-1.5-Pro | Gemini 1.5: Unlocking multimodal understanding ac… | 35.80 | 2024-03-08 |
| MiniCPM-2.6 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone | 32.60 | 2024-08-03 |
| InternLM-XC-2.5 (CoT) | InternLM-XComposer-2.5: A Versatile Large Vision … | 30.80 | 2024-07-03 |
| InternLM-XC-2.5 | InternLM-XComposer-2.5: A Versatile Large Vision … | 28.80 | 2024-07-03 |
| Video-LLaVA-7B | Video-LLaVA: Learning United Visual Representatio… | 24.80 | 2023-11-16 |
| MA-LMM-Vicuna-7B | MA-LMM: Memory-Augmented Large Multimodal Model f… | 23.80 | 2024-04-08 |
| VTimeLLM | VTimeLLM: Empower LLM to Grasp Video Moments | 19.40 | 2023-11-30 |
| VideoCLIP | VideoCLIP: Contrastive Pre-training for Zero-shot… | 17.00 | 2021-09-28 |
| LanguageBind | LanguageBind: Extending Video-Language Pretrainin… | 10.60 | 2023-10-03 |
| ImageBind | ImageBind: One Embedding Space To Bind Them All | 9.40 | 2023-05-09 |