
Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods), Yifan Du (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods), Kun Zhou (School of Information, Renmin University of China), Jinpeng Wang (Meituan Group), Wayne Xin Zhao (School of Information, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods), Ji-Rong Wen (Gaoling School of Artificial Intelligence and School of Information, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods) (2023)

Paper Information
arXiv ID: 2305.10355
Venue: Conference on Empirical Methods in Natural Language Processing (EMNLP)
Domain: Computer Vision and Natural Language Processing
Reproducibility: 7/10

Abstract

Inspired by the superior language abilities of large language models (LLMs), large vision-language models (LVLMs) have recently been proposed, integrating powerful LLMs to improve performance on complex multimodal tasks. Despite the promising progress of LVLMs, we find that they suffer from object hallucination, i.e., they tend to generate objects in their descriptions that are inconsistent with the target images. To investigate this, this work presents the first systematic study on object hallucination of LVLMs. We conduct evaluation experiments on several representative LVLMs and show that they mostly suffer from severe object hallucination issues. We further discuss how the visual instructions may influence hallucination and find that objects that frequently appear in the visual instructions, or that co-occur with the objects in the image, are clearly more prone to be hallucinated by LVLMs. In addition, we design a polling-based query method called POPE for better evaluation of object hallucination. Experimental results show that POPE can evaluate object hallucination in a more stable and flexible way.

Summary

This paper investigates object hallucination in large vision-language models (LVLMs), i.e., the tendency of these models to describe objects that do not actually appear in the input image. The authors conduct a systematic study, evaluating several representative LVLMs and presenting experimental results that reveal severe hallucination issues, with LVLMs often hallucinating more than smaller vision-language pre-training models. They highlight that visual instructions influence hallucination: objects that appear frequently in the instruction data, or that frequently co-occur with objects present in the image, are more prone to be erroneously generated. A new evaluation method called Polling-based Object Probing Evaluation (POPE) is proposed and shown to be more stable and flexible than existing metrics such as the Caption Hallucination Assessment with Image Relevance (CHAIR). The authors conclude with discussions of the implications and limitations of their findings, noting that further investigation on more LVLMs is needed.
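
To make the polling idea concrete, below is a minimal sketch of how POPE-style yes/no probes might be assembled from an image's annotated objects, mirroring the paper's three negative-sampling settings (random, popular, adversarial). The function name `build_pope_probes`, the argument layout, and the popularity/co-occurrence statistics are illustrative assumptions rather than the authors' released implementation; the prompt wording follows the form used in the paper.

```python
import random

# Assumed prompt template; POPE polls the model with yes/no questions of this form.
PROMPT = "Is there a {obj} in the image?"

def build_pope_probes(gt_objects, candidate_pool, k=3, strategy="random",
                      popularity=None, cooccurrence=None, seed=0):
    """Build yes/no probes for one image (illustrative sketch, not the official code).

    gt_objects     : objects annotated as present in the image  -> "yes" probes
    candidate_pool : object names that may be sampled as negatives
    strategy       : 'random'      - sample absent objects uniformly
                     'popular'     - take the globally most frequent absent objects
                     'adversarial' - take absent objects that co-occur most often
                                     with the ground-truth objects
    """
    rng = random.Random(seed)
    absent = [o for o in candidate_pool if o not in gt_objects]

    if strategy == "random":
        negatives = rng.sample(absent, k)
    elif strategy == "popular":
        negatives = sorted(absent, key=lambda o: popularity.get(o, 0), reverse=True)[:k]
    elif strategy == "adversarial":
        negatives = sorted(
            absent,
            key=lambda o: sum(cooccurrence.get((g, o), 0) for g in gt_objects),
            reverse=True,
        )[:k]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    probes = [(PROMPT.format(obj=o), "yes") for o in sorted(gt_objects)[:k]]
    probes += [(PROMPT.format(obj=o), "no") for o in negatives]
    return probes

# Toy usage with made-up object statistics
pool = ["person", "dog", "chair", "car", "bottle", "dining table"]
for question, answer in build_pope_probes(
        {"person", "dog"}, pool, k=2, strategy="popular",
        popularity={"chair": 900, "car": 700, "bottle": 300}):
    print(question, "->", answer)
```

Balancing the number of "yes" and "no" probes per image, as the paper does, keeps the resulting accuracy and F1 scores directly comparable across the three sampling strategies.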

Methods

This paper employs the following methods:

  • Polling-based Object Probing Evaluation (POPE)
  • Caption Hallucination Assessment with Image Relevance (CHAIR); see the formula sketch after this list
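
As a quick reference for CHAIR, the standard definitions it builds on score hallucination at the instance level (CHAIR_I) and at the sentence level (CHAIR_S); the notation below is a restatement of those definitions rather than a quotation from the paper.

```latex
\mathrm{CHAIR}_I = \frac{\lvert \{\text{hallucinated objects}\} \rvert}{\lvert \{\text{all objects mentioned}\} \rvert}
\qquad
\mathrm{CHAIR}_S = \frac{\lvert \{\text{captions with hallucinated objects}\} \rvert}{\lvert \{\text{all captions}\} \rvert}
```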

Models Used

  • mPLUG-Owl
  • LLaVA
  • Multimodal-GPT
  • MiniGPT-4
  • InstructBLIP

Datasets

The following datasets were used in this research:

  • MSCOCO

Evaluation Metrics

  • CHAIR
  • Accuracy
  • Precision
  • Recall
  • F1 score (a computation sketch follows this list)
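
As a rough illustration of how the metrics listed above apply to POPE's yes/no answers, the snippet below treats "yes" as the positive class. It is a minimal sketch under that assumption, not the paper's evaluation script.

```python
def pope_scores(predictions, labels):
    """Accuracy/precision/recall/F1 over yes-no answers, with 'yes' as the positive class."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))

    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy usage
print(pope_scores(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"]))
```

The paper additionally reports the ratio of "yes" answers as a diagnostic: a model that over-affirms object presence shows a high "yes" ratio together with low precision.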

Results

  • Most LVLMs exhibit severe object hallucination, often more severe than smaller vision-language pre-training models.
  • InstructBLIP performed better than other LVLMs in terms of hallucination rates.
  • The POPE method evaluates hallucination more stably and flexibly than caption-based evaluation with CHAIR.

Limitations

The authors identified the following limitations:

  • Only a limited set of LVLMs was evaluated; the authors note that more LVLMs should be investigated in future work.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Object hallucination, LVLM, Evaluation methods, POPE, MSCOCO
