
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

(2024)

Paper Information

arXiv ID: 2411.10440
Venue: arXiv.org

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

Summary

LLaVA-CoT is a novel vision-language model designed to perform structured, autonomous reasoning in multiple stages. The model integrates a multistage reasoning process that comprises summarization, interpretation, logical reasoning, and conclusion generation, leading to significant improvements in reasoning-intensive tasks. It utilizes the LLaVA-CoT-100k dataset, compiled from various visual question-answering sources, and employs a unique inference-time stage-level beam search method for scaling. In experiments, LLaVA-CoT demonstrates superior performance compared to existing models, including both larger and closed-source alternatives.
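The structured, four-stage output can be made concrete with a small sketch. The paper names the stages (summarization, visual interpretation, logical reasoning, conclusion generation); the specific tag names <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> and the parser below are illustrative assumptions about how such a staged response might be delimited and checked, not the authors' released code.

```python
import re

# Assumed stage tags; the paper describes four sequential stages
# (summarization, visual interpretation, logical reasoning, conclusion),
# and these tag names are illustrative placeholders.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Split a staged LLaVA-CoT-style response into its four parts."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        parsed[stage.lower()] = match.group(1).strip() if match else None
    return parsed

def is_well_formed(text: str) -> bool:
    """A response counts as well formed only if every stage appears, in order."""
    positions = [text.find(f"<{s}>") for s in STAGES]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```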

Methods

This paper employs the following methods:

  • LLaVA-CoT
  • Stage-level beam search at inference time (see the sketch after this list)
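The sketch below illustrates the control flow of stage-level beam search, assuming a generate_stage callback that produces one candidate for a given stage and a score_candidates callback that selects the best candidate (the paper has the model itself compare candidates). Both callbacks and the default stage names are placeholders, so this is a sketch of the idea rather than the authors' implementation.

```python
from typing import Callable, Sequence

def stage_level_beam_search(
    generate_stage: Callable[[str, str], str],     # (context, stage) -> one candidate for that stage
    score_candidates: Callable[[str, list], int],  # (context, candidates) -> index of the best candidate
    stages: Sequence[str] = ("summary", "caption", "reasoning", "conclusion"),
    beam_width: int = 2,
) -> str:
    """Sample several candidates for each reasoning stage, keep only the best one,
    and condition the next stage on the retained text (stage-level beam search)."""
    context = ""
    for stage in stages:
        # Sample `beam_width` independent candidates for the current stage.
        candidates = [generate_stage(context, stage) for _ in range(beam_width)]
        # Retain the highest-scoring candidate and append it to the running context.
        context += candidates[score_candidates(context, candidates)]
    return context
```

Increasing beam_width trades additional inference-time compute for better per-stage selections, which is the inference-time scaling axis described in the abstract.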

Models Used

  • Llama-3.2-11B-Vision-Instruct
  • Gemini-1.5-pro
  • GPT-4o-mini
  • Llama-3.2-90B-Vision-Instruct

Datasets

The following datasets were used in this research:

  • LLaVA-CoT-100k
  • ShareGPT4V
  • ChartQA
  • A-OKVQA
  • AI2D
  • GeoQA+
  • ScienceQA
  • DocVQA
  • PISC
  • CLEVR
  • CLEVR-Math

Evaluation Metrics

  • Accuracy
  • Average Score

Results

  • Outperforms its base model, Llama-3.2-11B-Vision-Instruct, by 7.4% on average across multimodal reasoning benchmarks
  • Surpasses performance of Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on multimodal reasoning benchmarks.

Limitations

The authors identified the following limitations:

  • Current VLMs exhibit difficulty in structured reasoning, often leading to errors during the reasoning process.

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: H100
  • Compute Requirements: training on a single node with 8 H100 GPUs

External Resources

  • Code, dataset, and pre-trained weights: https://github.com/PKU-YuanGroup/LLaVA-CoT