Domain
artificial intelligence; computer vision; natural language processing
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
LLaVA-OneVision is an open family of large multimodal models designed to perform well across single-image, multi-image, and video scenarios. It consolidates insights from the earlier LLaVA-NeXT models and relies on strong transfer learning across modalities and scenarios, which yields improved video understanding and new cross-scenario capabilities. The paper describes an architecture built on Qwen-2 for language processing and SigLIP for visual encoding, and it emphasizes the importance of high-quality data for training. Key contributions include a flexible visual representation framework and strong performance across diverse benchmarks compared with state-of-the-art models. The paper also showcases emerging capabilities arising from task transfer and highlights the model's open-source release for community accessibility.
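For orientation, the snippet below is a minimal, hypothetical sketch of this kind of architecture, not the authors' implementation: a SigLIP-style vision encoder yields patch features, a small MLP projector maps them into the language model's embedding space, and the projected visual tokens are prepended to the text embeddings before a Qwen-2-style LLM decodes. The module names, dimensions, and two-layer projector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LlavaStyleLMM(nn.Module):
    """Hypothetical LLaVA-style wiring: vision encoder -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.vision_encoder = vision_encoder  # stands in for SigLIP
        # Two-layer MLP projector that maps visual patch features into the
        # LLM's token-embedding space (a common LLaVA design choice).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model  # stands in for Qwen-2

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # images: (batch, 3, H, W) -> patch features: (batch, n_patches, vision_dim)
        patch_features = self.vision_encoder(images)
        visual_tokens = self.projector(patch_features)
        # Prepend visual tokens to the text embeddings and decode as usual.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs)
```

Because single images, multi-image sets, and sampled video frames all reduce to sequences of visual tokens at this interface, one model can be trained and evaluated across all three scenarios.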
This paper employs the following methods (a staged-training sketch follows the model list below):
- Transfer Learning
- Language-Image Alignment
- High-Quality Knowledge Learning
- Visual Instruction Tuning
This paper discusses the following models:
- LLaVA-OneVision
- Qwen-2
- SigLIP
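The three training methods listed above correspond to a staged curriculum: alignment trains only the projector on caption pairs, while the later stages fine-tune the full model on progressively richer instruction data. The configuration below is a rough, hypothetical sketch of such a schedule; the trainable-module splits and data labels are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical staged-training schedule; dataset labels and frozen/trainable
# splits are illustrative assumptions, not the paper's exact recipe.
TRAINING_STAGES = [
    {
        "name": "language_image_alignment",
        "trainable": ("projector",),      # vision encoder and LLM stay frozen
        "data": ("caption_pairs",),       # e.g. BLIP558K-style alignment data
    },
    {
        "name": "high_quality_knowledge_learning",
        "trainable": ("projector", "vision_encoder", "language_model"),
        "data": ("detailed_captions", "document_ocr"),
    },
    {
        "name": "visual_instruction_tuning",
        "trainable": ("projector", "vision_encoder", "language_model"),
        "data": ("single_image_instructions", "multi_image_and_video_instructions"),
    },
]

def apply_stage(model, stage):
    """Freeze everything except the modules named for the current stage."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(stage["trainable"])
```

A training loop would call `apply_stage(model, stage)` and then fine-tune on that stage's data mixture before advancing to the next stage.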
The following datasets were used in this research:
- COCO118K
- BLIP558K
- CC3M
- UReader
- Evo-Instruct
The paper reports the following results:
- State-of-the-art performance on single-image, multi-image, and video benchmarks
- Emerging capabilities through task transfer
- Competitive performance against proprietary models such as GPT-4V and GPT-4o
The authors identified the following limitations:
- Performance gap in complex visual chat scenarios
- Need for more training data and better preference learning for certain tasks
The paper specifies the following computing resources:
- Number of GPUs: None specified
- GPU Type: None specified
The paper lists the following keywords:
- large multimodal models
- visual transfer learning
- multi-image understanding
- video comprehension