Vinoground

Temporal Relation Extraction Benchmark

Performance Over Time

📊 Showing 16 results | 📏 Metric: Text Score

Top Performing Models

Rank	Model	Paper	Text Score	Date	Code
1	Qwen2-VL-72B	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	50.40	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl 📦 juruobenruo/DexVLA
2	LLaVA-OneVision-Qwen2-72B	LLaVA-OneVision: Easy Visual Task Transfer	48.40	2024-08-06	📦 evolvinglmms-lab/lmms-eval 📦 MindSpore-scientific-2/code-14
3	LLaVA-OneVision-Qwen2-7B	LLaVA-OneVision: Easy Visual Task Transfer	41.60	2024-08-06	📦 evolvinglmms-lab/lmms-eval 📦 MindSpore-scientific-2/code-14
4	Qwen2-VL-7B	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	40.20	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl 📦 juruobenruo/DexVLA
5	Gemini-1.5-Pro (CoT)	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	37.00	2024-03-08	📦 dlvuldet/primevul
6	VideoLLaMA2-72B	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	36.20	2024-06-11	📦 damo-nlp-sg/videollama2 📦 damo-nlp-sg/videollama3 📦 damo-nlp-sg/inf-clip
7	Gemini-1.5-Pro	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	35.80	2024-03-08	📦 dlvuldet/primevul
8	MiniCPM-2.6	MiniCPM-V: A GPT-4V Level MLLM on Your Phone	32.60	2024-08-03	📦 openbmb/minicpm-v 📦 OpenBMB/MiniCPM-o
9	InternLM-XC-2.5 (CoT)	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	30.80	2024-07-03	📦 internlm/internlm-xcomposer
10	InternLM-XC-2.5	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	28.80	2024-07-03	📦 internlm/internlm-xcomposer

All Papers (16)

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

2024

Qwen2-VL-72B

qwenlm/qwen2-vl qwenlm/qwen2.5-vl

LLaVA-OneVision: Easy Visual Task Transfer

2024

LLaVA-OneVision-Qwen2-72B

evolvinglmms-lab/lmms-eval MindSpore-scientific-2/code-14

LLaVA-OneVision: Easy Visual Task Transfer

2024

LLaVA-OneVision-Qwen2-7B

evolvinglmms-lab/lmms-eval MindSpore-scientific-2/code-14

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

2024

Qwen2-VL-7B

qwenlm/qwen2-vl qwenlm/qwen2.5-vl

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

2024

Gemini-1.5-Pro (CoT)

dlvuldet/primevul

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

2024

VideoLLaMA2-72B

damo-nlp-sg/videollama2 damo-nlp-sg/videollama3 damo-nlp-sg/inf-clip

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

2024

Gemini-1.5-Pro

dlvuldet/primevul

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

2024

MiniCPM-2.6

openbmb/minicpm-v OpenBMB/MiniCPM-o

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

2024

InternLM-XC-2.5 (CoT)

internlm/internlm-xcomposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

2024

InternLM-XC-2.5

internlm/internlm-xcomposer

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

2023

Video-LLaVA-7B

PKU-YuanGroup/Video-LLaVA PKU-YuanGroup/MoE-LLaVA

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

2024

MA-LMM-Vicuna-7B

boheumd/MA-LMM

VTimeLLM: Empower LLM to Grasp Video Moments

2023

VTimeLLM

huangb23/vtimellm

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2021

VideoCLIP

facebookresearch/fairseq pytorch/fairseq

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

2023

LanguageBind

PKU-YuanGroup/Video-LLaVA PKU-YuanGroup/MoE-LLaVA

ImageBind: One Embedding Space To Bind Them All

2023

ImageBind

facebookresearch/imagebind klemens-floege/oneprot ginihumer/amumo

Model	Paper	Text Score	Date
Qwen2-VL-72B	Qwen2-VL: Enhancing Vision-Language Model's Perce…	50.40	2024-09-18
LLaVA-OneVision-Qwen2-72B	LLaVA-OneVision: Easy Visual Task Transfer	48.40	2024-08-06
LLaVA-OneVision-Qwen2-7B	LLaVA-OneVision: Easy Visual Task Transfer	41.60	2024-08-06
Qwen2-VL-7B	Qwen2-VL: Enhancing Vision-Language Model's Perce…	40.20	2024-09-18
Gemini-1.5-Pro (CoT)	Gemini 1.5: Unlocking multimodal understanding ac…	37.00	2024-03-08
VideoLLaMA2-72B	VideoLLaMA 2: Advancing Spatial-Temporal Modeling…	36.20	2024-06-11
Gemini-1.5-Pro	Gemini 1.5: Unlocking multimodal understanding ac…	35.80	2024-03-08
MiniCPM-2.6	MiniCPM-V: A GPT-4V Level MLLM on Your Phone	32.60	2024-08-03
InternLM-XC-2.5 (CoT)	InternLM-XComposer-2.5: A Versatile Large Vision …	30.80	2024-07-03
InternLM-XC-2.5	InternLM-XComposer-2.5: A Versatile Large Vision …	28.80	2024-07-03
Video-LLaVA-7B	Video-LLaVA: Learning United Visual Representatio…	24.80	2023-11-16
MA-LMM-Vicuna-7B	MA-LMM: Memory-Augmented Large Multimodal Model f…	23.80	2024-04-08
VTimeLLM	VTimeLLM: Empower LLM to Grasp Video Moments	19.40	2023-11-30
VideoCLIP	VideoCLIP: Contrastive Pre-training for Zero-shot…	17.00	2021-09-28
LanguageBind	LanguageBind: Extending Video-Language Pretrainin…	10.60	2023-10-03
ImageBind	ImageBind: One Embedding Space To Bind Them All	9.40	2023-05-09
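Two models in the table appear both with and without a "(CoT)" suffix. The page does not define the suffix; the sketch below assumes it marks a chain-of-thought prompting variant of the same model and computes the difference in Text Score between the two listings. The scores are copied from the table above; the function name `cot_gains` and the dictionary layout are illustrative, not part of any benchmark tooling.

```python
# Text Scores copied from the table above for the models listed with and without "(CoT)".
SCORES = {
    "Gemini-1.5-Pro (CoT)": 37.00,
    "Gemini-1.5-Pro": 35.80,
    "InternLM-XC-2.5 (CoT)": 30.80,
    "InternLM-XC-2.5": 28.80,
}

# Assumption: this suffix marks a prompting variant of the same base model.
COT_SUFFIX = " (CoT)"


def cot_gains(scores):
    """Return {base model: CoT score minus plain score} for models listed both ways."""
    gains = {}
    for name, score in scores.items():
        if name.endswith(COT_SUFFIX):
            base = name[: -len(COT_SUFFIX)]
            if base in scores:
                # Round to two decimals to avoid floating-point noise in the difference.
                gains[base] = round(score - scores[base], 2)
    return gains


print(cot_gains(SCORES))
# {'Gemini-1.5-Pro': 1.2, 'InternLM-XC-2.5': 2.0}
```

On these numbers the "(CoT)" rows score 1.2 and 2.0 points higher than their plain counterparts, which is small relative to the gap between model families in the table.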