VLM²-Bench
VLM²-Bench is the first comprehensive benchmark designed to evaluate the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark comprises 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily, such as identifying the same person across different photos without prior knowledge of their identity.
Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking. Even the best-performing model, GPT-4o, falls 34.80% below human-level performance. Our analysis highlights critical areas for improvement:
1. Enhancing core visual understanding with reduced reliance on prior knowledge.
2. Better integration of language reasoning within visual tasks.
3. Developing training approaches that improve independent visual relationship inference.
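As a concrete illustration of the task format, the sketch below scores a yes/no visual-linking subtask of the kind described above. The JSON field names (`images`, `question`, `answer`) and the `query_vlm` stub are illustrative assumptions, not the benchmark's actual interface; see the code repository below for the official evaluation pipeline.

```python
# Minimal sketch of scoring a VLM2-Bench-style matching subtask.
# The file layout, JSON field names, and the query_vlm() stub are
# illustrative assumptions, not the benchmark's real API.
import json
from pathlib import Path

def query_vlm(image_paths: list[str], question: str) -> str:
    """Placeholder for a call to the VLM under evaluation (hypothetical)."""
    raise NotImplementedError("plug in your model's multi-image API here")

def evaluate(cases_file: str) -> float:
    """Return accuracy over yes/no visual-linking test cases."""
    cases = json.loads(Path(cases_file).read_text())
    correct = 0
    for case in cases:
        # Each case pairs a multi-image sequence with a question such as
        # "Does the same person appear in both images?"
        prediction = query_vlm(case["images"], case["question"])
        correct += prediction.strip().lower() == case["answer"].lower()
    return correct / len(cases)
```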
📄 Paper: VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
📂 Code Repository: GitHub - vlm2-bench/VLM2-Bench
📚 Citation:
@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues},
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}
}
This dataset is used in 1 benchmark:
| Task | Model | Paper | Date |
|---|---|---|---|
| Visual Question Answering (VQA) | Qwen2.5-VL-7B | Qwen2.5-VL Technical Report | 2025-02-19 |
| Visual Question Answering (VQA) | InternVL2.5-8B | Expanding Performance Boundaries of Open-Source … | 2024-12-06 |
| Visual Question Answering (VQA) | InternVL2.5-26B | Expanding Performance Boundaries of Open-Source … | 2024-12-06 |
| Visual Question Answering (VQA) | GPT-4o | GPT-4o System Card | 2024-10-25 |
| Visual Question Answering (VQA) | LLaVA-Video-7B | Video Instruction Tuning With Synthetic … | 2024-10-03 |
| Visual Question Answering (VQA) | Qwen2-VL-7B | Qwen2-VL: Enhancing Vision-Language Model's Perception … | 2024-09-18 |
| Visual Question Answering (VQA) | mPLUG-Owl3-7B | mPLUG-Owl3: Towards Long Image-Sequence Understanding … | 2024-08-09 |
| Visual Question Answering (VQA) | LLaVA-OneVision-7B | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 |
| Visual Question Answering (VQA) | LongVA-7B | Long Context Transfer from Language … | 2024-06-24 |