BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, which is crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) an LMM that outperforms another model on the common style is not guaranteed to outperform it on other styles; 3) LMMs' reasoning capability can be enhanced by prompting them to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) an intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs.
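The third finding, prompting the model to predict the style before answering, can be illustrated with a minimal sketch. It is not the authors' implementation: the `ask` callable, the function name `style_first_answer`, and the prompt wording are all hypothetical and stand in for whatever LMM client and prompts are actually used.

```python
from typing import Callable

# Hypothetical interface: (image_path, text_prompt) -> model reply.
# Any LMM client can be wrapped to match this signature.
AskFn = Callable[[str, str], str]


def style_first_answer(ask: AskFn, image_path: str, question: str) -> str:
    """Two-step, training-free prompting: predict the style, then answer."""
    # Step 1: ask the model to name the image style (prompt wording is an assumption).
    style = ask(image_path, "What is the style of this image? Answer in a few words.").strip()
    # Step 2: prepend the predicted style as context for the actual question.
    prompt = f"This image is in the following style: {style}. {question}"
    return ask(image_path, prompt)
```

Because the model is passed in as a plain callable, the same wrapper can be tried with any LMM without retraining it.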
Variants: BenchLMM
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Visual Question Answering | Sphinx-V2-1K | SPHINX: The Joint Mixing of … | 2023-11-13 |
Visual Question Answering | MiniGPTv2-7B | MiniGPT-v2: large language model as … | 2023-10-14 |
Visual Question Answering | LLaVA-1.5-13B | Improved Baselines with Visual Instruction … | 2023-10-05 |
Visual Question Answering | InstructBLIP-7B | InstructBLIP: Towards General-purpose Vision-Language Models … | 2023-05-11 |
Visual Question Answering | InstructBLIP-13B | InstructBLIP: Towards General-purpose Vision-Language Models … | 2023-05-11 |
Visual Question Answering | Otter-7B | Otter: A Multi-Modal Model with … | 2023-05-05 |
Visual Question Answering | MiniGPT4-13B | MiniGPT-4: Enhancing Vision-Language Understanding with … | 2023-04-20 |
Visual Question Answering | LLaVA-1-13B | Visual Instruction Tuning | 2023-04-17 |
Visual Question Answering | LLaVA-1.5-7B | Visual Instruction Tuning | 2023-04-17 |
Visual Question Answering | GPT-4V | GPT-4 Technical Report | 2023-03-15 |