
STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu (MIT-IBM Watson AI Lab), Shoubin Yu (Shanghai Jiao Tong University), Zhenfang Chen (MIT-IBM Watson AI Lab), Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL), Chuang Gan (MIT-IBM Watson AI Lab) (2024)

Paper Information
arXiv ID
2405.09711
Venue
NeurIPS Datasets and Benchmarks
Domain
Artificial Intelligence, Computer Vision, Natural Language Processing
Reproducibility
4/10

Abstract

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR). This benchmark is built upon real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated. The answering logic of each question is represented by a functional program based on a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning to understand the challenges of this benchmark. Published at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Datasets and Benchmarks Track.
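
The situation hyper-graph is the benchmark's core representation. The following is a minimal sketch, assuming hypothetical class and field names (none taken from the released STAR code), of how per-frame entities and relations plus action hyper-edges might be organized:

```python
from dataclasses import dataclass, field

# Illustrative situation hyper-graph: each frame contributes entity nodes
# and relationship edges; actions are hyper-edges spanning frame intervals.

@dataclass
class Relationship:
    subject: str    # e.g. "person"
    predicate: str  # e.g. "holding"
    obj: str        # e.g. "cup"

@dataclass
class FrameGraph:
    frame_id: int
    entities: set = field(default_factory=set)     # persons and objects
    relations: list = field(default_factory=list)  # Relationship edges

@dataclass
class Action:
    verb: str   # e.g. "pick_up"
    obj: str    # e.g. "cup"
    start: int  # first frame of the action's span
    end: int    # last frame of the action's span

@dataclass
class SituationHypergraph:
    frames: list   # one FrameGraph per sampled frame
    actions: list  # Action hyper-edges over frame sub-graphs

    def actions_after(self, t: int) -> list:
        """Actions whose span begins after frame t (useful for
        sequence- and prediction-style questions)."""
        return [a for a in self.actions if a.start > t]
```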

Summary

This paper presents STAR, a benchmark for assessing situated reasoning in real-world videos, focusing on the ability to understand dynamic situations through human actions and interactions. It introduces a dataset featuring diverse questions based on video clips, requiring systems to analyze interactions, sequences, predictions, and feasibility within a logical framework. The benchmark is built on real action videos and employs situation hypergraphs to encapsulate entities and their interrelations. The study evaluates various existing reasoning models, highlighting their limitations, and proposes a neuro-symbolic model that aims to disentangle the processes of visual perception, situation abstraction, and logical reasoning. Results indicate that existing models struggle with situated reasoning tasks, underscoring the benchmark’s challenges and the need for improved methods in this domain.
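
The paper describes each question's answering logic as a functional program executed over the situation hyper-graph. As a hedged illustration (the step names and interpreter below are assumptions, not the benchmark's actual program vocabulary), a sequence-type question could be evaluated like this:

```python
# Illustrative interpreter for a "sequence"-type question such as
# "What did the person do after putting down the cup?", run against
# the SituationHypergraph sketched in the Abstract section.

def run_program(graph, program):
    """Evaluate a list of (op, arg) steps; returns the final answer."""
    state = graph.actions
    for op, arg in program:
        if op == "filter_action":   # keep actions matching (verb, object)
            state = [a for a in state if (a.verb, a.obj) == arg]
        elif op == "after":         # actions starting after the match ends
            state = graph.actions_after(max(a.end for a in state))
        elif op == "earliest":      # first of those in temporal order
            state = [min(state, key=lambda a: a.start)]
        elif op == "query_verb":    # answer with the action's verb
            return state[0].verb
    return state

program = [
    ("filter_action", ("put_down", "cup")),
    ("after", None),
    ("earliest", None),
    ("query_verb", None),
]
```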

Methods

This paper employs the following methods:

  • Neuro-Symbolic Situated Reasoning (NS-SR), sketched below
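
NS-SR is described as disentangling visual perception, situation abstraction, language understanding, and functional reasoning. A minimal sketch of how such a staged pipeline could be wired together (the function names and interfaces are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical staging of a neuro-symbolic pipeline in the spirit of NS-SR.
# Each stage is treated as a black box; only the data flow is illustrated.

def answer(video, question, perceive, abstract, parse, execute):
    detections = perceive(video)    # 1. visual perception: per-frame entities/relations
    graph = abstract(detections)    # 2. situation abstraction: build the hyper-graph
    program = parse(question)       # 3. language understanding: question -> program
    return execute(graph, program)  # 4. functional reasoning: run program on graph
```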

Models Used

  • None specified

Datasets

The following datasets were used in this research:

  • STAR

Evaluation Metrics

  • Accuracy (reported per question type; a tallying sketch follows below)
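
Since STAR's questions fall into four types (interaction, sequence, prediction, feasibility), accuracy is most informative when tallied per type. A minimal sketch, assuming predictions arrive as (question_type, predicted, gold) triples:

```python
from collections import Counter

def accuracy_by_type(predictions):
    """Tally per-question-type accuracy from (qtype, pred, gold) triples."""
    correct, total = Counter(), Counter()
    for qtype, pred, gold in predictions:
        total[qtype] += 1
        correct[qtype] += int(pred == gold)
    return {t: correct[t] / total[t] for t in total}
```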

Results

  • Existing QA models struggle with situated reasoning tasks.
  • STAR benchmark reveals significant performance gaps in current models.

Limitations

The authors identified the following limitations:

  • Current state-of-the-art methods struggle with situated reasoning tasks.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

benchmark, situated reasoning, real-world videos, hyper-graphs, visual question answering, logical reasoning, neuro-symbolic model

External Resources

  • None specified