Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts

(2025)

Paper Information
arXiv ID: 2505.17374
Venue: IEEE Pacific Visualization Symposium

Abstract

The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples, lacking sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated the capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.

Summary

This paper presents the Chart-to-Experience benchmark dataset designed to evaluate the experiential impact of charts using Multimodal Large Language Models (MLLMs). The dataset comprises 36 charts covering three subjects: COVID-19, House Prices, and Global Warming, rated by 216 crowdsourced workers based on seven experiential factors. The study reveals that while MLLMs struggle with direct score predictions, they perform better in pairwise comparisons of charts, particularly when the differences in human ratings are substantial. The paper also discusses limitations and future research directions, emphasizing the potential of benchmarks in AI model evaluation and innovation.

Methods

This paper employs the following methods:

  • MLLMs
  • crowdsourced evaluation
  • pairwise comparison (see the prompt sketch after this list)
  • Likert scale scoring
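
As a concrete illustration of the pairwise-comparison setup, the sketch below shows how a forced-choice prompt could be constructed and the model's answer parsed. The factor labels and prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
import re

# Hypothetical factor labels: the paper rates seven experiential
# factors, but their exact names are assumptions here.
FACTORS = ["interest", "enjoyment", "trustworthiness"]

def pairwise_prompt(factor: str) -> str:
    """Build a forced-choice prompt for two chart images, A and B."""
    return (
        f"You will see two charts, A and B. Which chart evokes a "
        f"stronger sense of {factor} in a viewer? "
        f"Answer with a single letter: A or B."
    )

def parse_choice(answer: str) -> str | None:
    """Extract the model's choice; return None if unparseable."""
    match = re.search(r"\b([AB])\b", answer.strip().upper())
    return match.group(1) if match else None
```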

Models Used

  • GPT-4o
  • Claude 3.5 Sonnet
  • Llama-3.2-11B-Vision-Instruct
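
For the direct-prediction task, each model is shown a single chart and asked for a Likert-scale rating. Below is a minimal sketch of such a query against GPT-4o via the OpenAI Python client; the prompt wording and factor name are illustrative assumptions, not the authors' exact prompt.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_chart(image_path: str, factor: str) -> str:
    """Ask GPT-4o to rate one chart on one factor (1-7 Likert scale)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a 1-7 Likert scale, how strongly does "
                          f"this chart evoke {factor} in a viewer? "
                          f"Reply with a number only.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```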

Datasets

The following datasets were used in this research:

  • Chart-to-Experience

Evaluation Metrics

  • Likert scale scoring
  • Kendall's τ (see the example below)
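
Kendall's τ measures rank agreement between two orderings, here between human ratings and model predictions over the same charts. A toy example of computing it with SciPy, using made-up scores:

```python
from scipy.stats import kendalltau

# Illustrative (made-up) mean crowdsourced ratings and model
# predictions for the same five charts.
human_scores = [4.2, 5.1, 2.8, 6.0, 3.5]
model_scores = [4.0, 4.8, 3.1, 5.5, 3.9]

tau, p_value = kendalltau(human_scores, model_scores)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```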

Results

  • MLLMs are less sensitive than human evaluators for individual chart assessments.
  • MLLMs show higher accuracy in pairwise comparisons of charts with larger score differences.

Limitations

The authors identified the following limitations:

  • MLLMs lack sensitivity in direct score prediction tasks
  • Variability and biases in MLLM evaluations
  • Limited exploration of advanced prompt engineering techniques

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
  • Compute Requirements: None specified
