
Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods), Yifan Du (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods), Kun Zhou (School of Information, Renmin University of China), Jinpeng Wang (Meituan Group), Wayne Xin Zhao (School of Information, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods), Ji-Rong Wen (Gaoling School of Artificial Intelligence and School of Information, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods) (2023)

Paper Information
arXiv ID: 2305.10355
Venue: Conference on Empirical Methods in Natural Language Processing (EMNLP)
Domain: Computer Vision and Natural Language Processing
Reproducibility: 7/10

Abstract

Inspired by the superior language abilities of large language models (LLMs), large vision-language models (LVLMs) have recently been proposed, integrating powerful LLMs to improve performance on complex multimodal tasks. Despite the promising progress of LVLMs, we find that they suffer from object hallucination, i.e., they tend to generate objects in their descriptions that are inconsistent with the target images. To investigate this, this work presents the first systematic study on object hallucination of LVLMs. We conduct evaluation experiments on several representative LVLMs and show that they mostly suffer from severe object hallucination issues. We further discuss how the visual instructions may influence hallucination and find that objects that frequently appear in the visual instructions, or that co-occur with the objects in the image, are clearly more prone to be hallucinated by LVLMs. In addition, we design a polling-based query method called POPE for better evaluation of object hallucination. Experimental results show that POPE can evaluate object hallucination in a more stable and flexible way.

Summary

This paper investigates object hallucination in large vision-language models (LVLMs), i.e., the tendency of these models to describe objects that do not actually appear in the input image. The authors conduct a systematic study, evaluating several representative LVLMs and presenting experimental results that reveal severe hallucination issues, with LVLMs often hallucinating more than smaller vision-language pre-training models. They highlight that visual instructions influence hallucination: objects that appear frequently in the instruction data, or that frequently co-occur with objects present in the image, are more prone to be erroneously generated. A new evaluation method called Polling-based Object Probing Evaluation (POPE) is proposed and shown to be more stable and flexible than existing metrics such as the Caption Hallucination Assessment with Image Relevance (CHAIR). The authors conclude with discussions of the implications and limitations of their findings, noting that further investigation on more LVLMs is needed.
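
To make the polling idea concrete, below is a minimal sketch of how POPE-style yes/no probes might be assembled from an image's annotated objects, mirroring the paper's three negative-sampling settings (random, popular, adversarial). The function name `build_pope_probes`, the argument layout, and the popularity/co-occurrence statistics are illustrative assumptions rather than the authors' released implementation; the prompt wording follows the form used in the paper.

```python
import random

# Assumed prompt template; POPE polls the model with yes/no questions of this form.
PROMPT = "Is there a {obj} in the image?"

def build_pope_probes(gt_objects, candidate_pool, k=3, strategy="random",
                      popularity=None, cooccurrence=None, seed=0):
    """Build yes/no probes for one image (illustrative sketch, not the official code).

    gt_objects     : objects annotated as present in the image  -> "yes" probes
    candidate_pool : object names that may be sampled as negatives
    strategy       : 'random'      - sample absent objects uniformly
                     'popular'     - take the globally most frequent absent objects
                     'adversarial' - take absent objects that co-occur most often
                                     with the ground-truth objects
    """
    rng = random.Random(seed)
    absent = [o for o in candidate_pool if o not in gt_objects]

    if strategy == "random":
        negatives = rng.sample(absent, k)
    elif strategy == "popular":
        negatives = sorted(absent, key=lambda o: popularity.get(o, 0), reverse=True)[:k]
    elif strategy == "adversarial":
        negatives = sorted(
            absent,
            key=lambda o: sum(cooccurrence.get((g, o), 0) for g in gt_objects),
            reverse=True,
        )[:k]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    probes = [(PROMPT.format(obj=o), "yes") for o in sorted(gt_objects)[:k]]
    probes += [(PROMPT.format(obj=o), "no") for o in negatives]
    return probes

# Toy usage with made-up object statistics
pool = ["person", "dog", "chair", "car", "bottle", "dining table"]
for question, answer in build_pope_probes(
        {"person", "dog"}, pool, k=2, strategy="popular",
        popularity={"chair": 900, "car": 700, "bottle": 300}):
    print(question, "->", answer)
```

Balancing the number of "yes" and "no" probes per image, as the paper does, keeps the resulting accuracy and F1 scores directly comparable across the three sampling strategies.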

Methods

This paper employs the following methods:

  • Polling-based Object Probing Evaluation (POPE)
  • Caption Hallucination Assessment with Image Relevance (CHAIR); see the formula sketch after this list
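
As a quick reference for CHAIR, the standard definitions it builds on score hallucination at the instance level (CHAIR_I) and at the sentence level (CHAIR_S); the notation below is a restatement of those definitions rather than a quotation from the paper.

```latex
\mathrm{CHAIR}_I = \frac{\lvert \{\text{hallucinated objects}\} \rvert}{\lvert \{\text{all objects mentioned}\} \rvert}
\qquad
\mathrm{CHAIR}_S = \frac{\lvert \{\text{captions with hallucinated objects}\} \rvert}{\lvert \{\text{all captions}\} \rvert}
```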

Models Used

  • mPLUG-Owl
  • LLaVA
  • Multimodal-GPT
  • MiniGPT-4
  • InstructBLIP

Datasets

The following datasets were used in this research:

  • MSCOCO

Evaluation Metrics

  • CHAIR
  • Accuracy
  • Precision
  • Recall
  • F1 score (a computation sketch follows this list)
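
As a rough illustration of how the metrics listed above apply to POPE's yes/no answers, the snippet below treats "yes" as the positive class. It is a minimal sketch under that assumption, not the paper's evaluation script.

```python
def pope_scores(predictions, labels):
    """Accuracy/precision/recall/F1 over yes-no answers, with 'yes' as the positive class."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))

    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy usage
print(pope_scores(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"]))
```

The paper additionally reports the ratio of "yes" answers as a diagnostic: a model that over-affirms object presence shows a high "yes" ratio together with low precision.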

Results

  • Most LVLMs exhibit severe object hallucination, often more severe than smaller vision-language pre-training models.
  • InstructBLIP performed better than other LVLMs in terms of hallucination rates.
  • The POPE method evaluates hallucination more stably and flexibly than caption-based evaluation with CHAIR.

Limitations

The authors identified the following limitations:

  • Only a limited set of LVLMs was evaluated; the authors note that more LVLMs should be investigated in future work.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Object hallucination, LVLM, Evaluation methods, POPE, MSCOCO
