REBUS

Name: REBUS
Published: 2024-01-09
License: Unknown

A Robust Evaluation Benchmark of Understanding Symbols

Dataset Information

Modalities

Images, Texts

Languages

English

Introduced

2024

License

Unknown

Homepage

Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

Recent advances in large language models have led to the development of multimodal LLMs
(MLLMs), which take both image data and text as input. Virtually all of these models
have been announced within the past year, leading to a significant need for benchmarks
evaluating the abilities of these models to reason truthfully and accurately on a diverse set
of tasks. When Google announced Gemini (Gemini Team et al., 2023), they showcased its
ability to solve rebuses, wordplay puzzles that involve creatively adding and subtracting
letters from words derived from text and images. The diversity of rebuses allows for a
broad evaluation of multimodal reasoning capabilities, including image recognition,
multi-step reasoning, and understanding the human creator’s intent.

We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse
categories, including hand-drawn and digital images created by nine contributors. Samples are
presented in Table 1. Notably, GPT-4V, the most powerful model we evaluated, answered
only 24% of puzzles correctly, highlighting the poor capabilities of MLLMs in new and
unexpected domains to which human reasoning generalizes with comparative ease. Open-source
models perform even worse, with a median accuracy below 1%. We notice that models
often give unfaithful explanations, fail to change their minds after an initial approach doesn’t
work, and remain poorly calibrated about their own abilities.

Variants: REBUS

Associated Benchmarks

This dataset is used in 1 benchmark:

Multimodal Reasoning - Metrics: Accuracy
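The page lists accuracy as the metric but does not say how an answer is judged correct. The following is a minimal sketch, assuming exact match after lowercasing and stripping non-alphanumeric characters; the scoring rule, the `normalize` and `accuracy` helpers, and the toy answers are all hypothetical and are not taken from the REBUS paper or its evaluation code.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", answer.lower())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Made-up answers purely for illustration (not REBUS puzzles or model outputs).
refs = ["Star Wars", "New York", "ice cream", "football"]
preds = ["star wars", "Newark", "Ice-Cream", "basketball"]

score = accuracy(preds, refs)
print(f"accuracy = {score:.2%}")
```

Under this assumed rule, the 24% reported for GPT-4V in the Overview would correspond to roughly 80 of the 333 puzzles.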

Recent Benchmark Submissions

Task	Model	Paper	Date
Multimodal Reasoning	GPT-4V	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	Gemini Pro	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	LLaVa-1.5-13B	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	LLaVa-1.5-7B	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	BLIP2-FLAN-T5-XXL	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	CogVLM	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	QWEN	REBUS: A Robust Evaluation Benchmark …	2024-01-11
Multimodal Reasoning	InstructBLIP	REBUS: A Robust Evaluation Benchmark …	2024-01-11

Research Papers

Recent papers with results on this dataset:

REBUS: A Robust Evaluation Benchmark of Understanding Symbols (2024)
