Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.
LLaVA-CoT is a novel vision-language model designed to perform structured, autonomous reasoning in multiple stages. The model integrates a multistage reasoning process that comprises summarization, interpretation, logical reasoning, and conclusion generation, leading to significant improvements in reasoning-intensive tasks. It utilizes the LLaVA-CoT-100k dataset, compiled from various visual question-answering sources, and employs a unique inference-time stage-level beam search method for scaling. In experiments, LLaVA-CoT demonstrates superior performance compared to existing models, including both larger and closed-source alternatives.
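To make the staged reasoning concrete, the snippet below sketches how a response divided into the four stages might be delimited and parsed. This is a minimal illustration: the stage tag names and the parsing helper are assumptions for exposition, not a guaranteed match to the released model's exact output markup.

```python
import re

# Illustrative stage tags; the released model's markup may differ.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Split a staged model response into its four reasoning stages.

    Expects each stage to be wrapped as <STAGE>...</STAGE>; missing
    stages are returned as empty strings.
    """
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, flags=re.DOTALL)
        parsed[stage.lower()] = match.group(1).strip() if match else ""
    return parsed

example = (
    "<SUMMARY>Identify the tallest bar in the chart.</SUMMARY>"
    "<CAPTION>The chart shows revenue for four quarters.</CAPTION>"
    "<REASONING>Q3 reaches 40, higher than every other quarter.</REASONING>"
    "<CONCLUSION>Q3</CONCLUSION>"
)
print(parse_staged_response(example)["conclusion"])  # -> Q3
```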
This paper employs the following methods and models:
- LLaVA-CoT (the proposed vision-language model)
- Stage-level beam search for inference-time scaling; a sketch of this procedure follows the list
- Llama-3.2-11B-Vision-Instruct (base model)
- Gemini-1.5-pro (comparison model)
- GPT-4o-mini (comparison model)
- Llama-3.2-90B-Vision-Instruct (comparison model)
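As referenced in the list above, stage-level beam search scales inference by searching over candidates at the granularity of reasoning stages rather than whole responses. The sketch below is a minimal illustration under stated assumptions: `generate_stage` (samples one candidate continuation for the next stage from the VLM) and `score_candidate` (ranks candidates, e.g., with a judge model) are hypothetical helpers, and keeping a single best candidate per stage is a simplification of whatever selection scheme the paper actually uses.

```python
def stage_level_beam_search(question, image, generate_stage, score_candidate,
                            num_stages=4, candidates_per_stage=4):
    """Sketch of inference-time stage-level beam search.

    For each reasoning stage (summary, caption, reasoning, conclusion),
    sample several candidate continuations, keep only the best-scoring
    one, and append it to the running response before moving on.
    """
    response = ""
    for stage_idx in range(num_stages):
        # Sample several candidate continuations for the current stage.
        candidates = [
            generate_stage(question, image, response, stage_idx)
            for _ in range(candidates_per_stage)
        ]
        # Retain the highest-scoring candidate for this stage (best-of-N).
        best = max(candidates, key=lambda c: score_candidate(question, response, c))
        response += best
    return response
```

Searching one stage at a time, instead of over full responses, is what lets additional compute at inference time be spent where the reasoning actually branches.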
The following datasets were used in this research (an illustrative unified record format for the compiled samples follows the list):
- LLaVA-CoT-100k
- ShareGPT4V
- ChartQA
- A-OKVQA
- AI2D
- GeoQA+
- ScienceQA
- DocVQA
- PISC
- CLEVR
- CLEVR-Math
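Because LLaVA-CoT-100k integrates samples from these heterogeneous VQA sources and attaches structured reasoning annotations, a unified per-sample schema is a natural way to picture the compiled data. The sketch below is an assumption for illustration only: the class name, field names, and tag-based serialization are hypothetical and are not the dataset's published format.

```python
from dataclasses import dataclass

@dataclass
class LLaVACoTSample:
    """Hypothetical unified record for one LLaVA-CoT-100k training sample."""
    image_path: str   # image taken from the originating VQA dataset
    question: str     # original question text
    source: str       # e.g. "ChartQA", "A-OKVQA", "CLEVR-Math"
    summary: str      # stage 1: brief restatement of the task
    caption: str      # stage 2: description of the relevant visual content
    reasoning: str    # stage 3: step-by-step logical derivation
    conclusion: str   # stage 4: final answer

    def to_target_text(self) -> str:
        """Serialize the four annotated stages into one tagged training target."""
        return (
            f"<SUMMARY>{self.summary}</SUMMARY>"
            f"<CAPTION>{self.caption}</CAPTION>"
            f"<REASONING>{self.reasoning}</REASONING>"
            f"<CONCLUSION>{self.conclusion}</CONCLUSION>"
        )
```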
The paper reports the following results:
- LLaVA-CoT outperforms its base model, Llama-3.2-11B-Vision-Instruct, by 7.4% on a wide range of multimodal reasoning benchmarks.
- It surpasses Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on these benchmarks.
The authors identified the following limitations:
- Current VLMs exhibit difficulty in structured reasoning, often leading to errors during the reasoning process.
Compute requirements:
- Training was performed on a single node with 8 NVIDIA H100 GPUs.