VLM²-Bench
VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark comprises 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily, such as identifying the same person across different photos without prior knowledge of their identity.
Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking. Even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights critical areas for improvement:
1. Enhancing core visual understanding with reduced reliance on prior knowledge.
2. Better integration of language reasoning within visual tasks.
3. Developing training approaches that improve independent visual relationship inference.
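As a concrete illustration of the task format, the sketch below scores a yes/no visual-linking subtask of the kind described above. The JSON field names (`images`, `question`, `answer`) and the `query_vlm` stub are illustrative assumptions, not the benchmark's actual interface; see the code repository below for the official evaluation pipeline.

```python
# Minimal sketch of scoring a VLM2-Bench-style matching subtask.
# The file layout, JSON field names, and the query_vlm() stub are
# illustrative assumptions, not the benchmark's real API.
import json
from pathlib import Path

def query_vlm(image_paths: list[str], question: str) -> str:
    """Placeholder for a call to the VLM under evaluation (hypothetical)."""
    raise NotImplementedError("plug in your model's multi-image API here")

def evaluate(cases_file: str) -> float:
    """Return accuracy over yes/no visual-linking test cases."""
    cases = json.loads(Path(cases_file).read_text())
    correct = 0
    for case in cases:
        # Each case pairs a multi-image sequence with a question such as
        # "Does the same person appear in both images?"
        prediction = query_vlm(case["images"], case["question"])
        correct += prediction.strip().lower() == case["answer"].lower()
    return correct / len(cases)
```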
📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
📂 Code Repository: GitHub - vlm2-bench/VLM2-Bench
📚 Citation:
@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues},
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}
}
This dataset is used in 1 benchmark:
| Task | Model | Paper | Date |
|---|---|---|---|
| Visual Question Answering (VQA) | Qwen2.5-VL-7B | Qwen2.5-VL Technical Report | 2025-02-19 |
| Visual Question Answering (VQA) | InternVL2.5-8B | Expanding Performance Boundaries of Open-Source … | 2024-12-06 |
| Visual Question Answering (VQA) | InternVL2.5-26B | Expanding Performance Boundaries of Open-Source … | 2024-12-06 |
| Visual Question Answering (VQA) | GPT-4o | GPT-4o System Card | 2024-10-25 |
| Visual Question Answering (VQA) | LLaVA-Video-7B | Video Instruction Tuning With Synthetic … | 2024-10-03 |
| Visual Question Answering (VQA) | Qwen2-VL-7B | Qwen2-VL: Enhancing Vision-Language Model's Perception … | 2024-09-18 |
| Visual Question Answering (VQA) | mPLUG-Owl3-7B | mPLUG-Owl3: Towards Long Image-Sequence Understanding … | 2024-08-09 |
| Visual Question Answering (VQA) | LLaVA-OneVision-7B | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 |
| Visual Question Answering (VQA) | LongVA-7B | Long Context Transfer from Language … | 2024-06-24 |