
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

(2024)

Paper Information

arXiv ID: 2411.10440
Venue: arXiv.org

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

Summary

LLaVA-CoT is a novel vision-language model designed to perform structured, autonomous reasoning in multiple stages. The model integrates a multistage reasoning process that comprises summarization, interpretation, logical reasoning, and conclusion generation, leading to significant improvements in reasoning-intensive tasks. It utilizes the LLaVA-CoT-100k dataset, compiled from various visual question-answering sources, and employs a unique inference-time stage-level beam search method for scaling. In experiments, LLaVA-CoT demonstrates superior performance compared to existing models, including both larger and closed-source alternatives.
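The structured, four-stage output can be made concrete with a small sketch. The paper names the stages (summarization, visual interpretation, logical reasoning, conclusion generation); the specific tag names <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> and the parser below are illustrative assumptions about how such a staged response might be delimited and checked, not the authors' released code.

```python
import re

# Assumed stage tags; the paper describes four sequential stages
# (summarization, visual interpretation, logical reasoning, conclusion),
# and these tag names are illustrative placeholders.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Split a staged LLaVA-CoT-style response into its four parts."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        parsed[stage.lower()] = match.group(1).strip() if match else None
    return parsed

def is_well_formed(text: str) -> bool:
    """A response counts as well formed only if every stage appears, in order."""
    positions = [text.find(f"<{s}>") for s in STAGES]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```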

Methods

This paper employs the following methods:

  • LLaVA-CoT
  • Stage-level beam search at inference time (see the sketch after this list)
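The sketch below illustrates the control flow of stage-level beam search, assuming a generate_stage callback that produces one candidate for a given stage and a score_candidates callback that selects the best candidate (the paper has the model itself compare candidates). Both callbacks and the default stage names are placeholders, so this is a sketch of the idea rather than the authors' implementation.

```python
from typing import Callable, Sequence

def stage_level_beam_search(
    generate_stage: Callable[[str, str], str],     # (context, stage) -> one candidate for that stage
    score_candidates: Callable[[str, list], int],  # (context, candidates) -> index of the best candidate
    stages: Sequence[str] = ("summary", "caption", "reasoning", "conclusion"),
    beam_width: int = 2,
) -> str:
    """Sample several candidates for each reasoning stage, keep only the best one,
    and condition the next stage on the retained text (stage-level beam search)."""
    context = ""
    for stage in stages:
        # Sample `beam_width` independent candidates for the current stage.
        candidates = [generate_stage(context, stage) for _ in range(beam_width)]
        # Retain the highest-scoring candidate and append it to the running context.
        context += candidates[score_candidates(context, candidates)]
    return context
```

Increasing beam_width trades additional inference-time compute for better per-stage selections, which is the inference-time scaling axis described in the abstract.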

Models Used

  • Llama-3.2-11B-Vision-Instruct
  • Gemini-1.5-pro
  • GPT-4o-mini
  • Llama-3.2-90B-Vision-Instruct

Datasets

The following datasets were used in this research:

  • LLaVA-CoT-100k
  • ShareGPT4V
  • ChartQA
  • A-OKVQA
  • AI2D
  • GeoQA+
  • ScienceQA
  • DocVQA
  • PISC
  • CLEVR
  • CLEVR-Math

Evaluation Metrics

  • Accuracy
  • Average Score

Results

  • Outperforms its base model, Llama-3.2-11B-Vision-Instruct, by 7.4% on average across multimodal reasoning benchmarks
  • Surpasses performance of Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on multimodal reasoning benchmarks.

Limitations

The authors identified the following limitations:

  • Current VLMs exhibit difficulty in structured reasoning, often leading to errors during the reasoning process.

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: H100
  • Compute Requirements: training on a single node with 8 H100 GPUs

External Resources

  • Code, dataset, and pre-trained weights: https://github.com/PKU-YuanGroup/LLaVA-CoT