Domain
artificial intelligence, machine learning
We present DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:
- Data Construction: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios, including web screenshots, PDFs, OCR, charts, and knowledge-based content (expert knowledge, textbooks), aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. Fine-tuning with this dataset substantially improves the model's user experience in practical applications.
- Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) within a fixed token budget while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.
- Training Strategy: We adopt a pretraining strategy that preserves the model's language capabilities while progressively developing its multimodal abilities.
DeepSeek-VL is an open-source Vision-Language Model aimed at real-world vision and language applications. It emphasizes three key areas: comprehensive data construction covering diverse real-world scenarios, a hybrid model architecture for efficient high-resolution image processing, and an instruction-tuning dataset built from real user scenarios to improve user experience. Through extensive pretraining and careful data curation, DeepSeek-VL addresses the difficulty open-source models face in matching the performance of proprietary models. It also introduces a training strategy that maintains language capabilities while developing new multimodal abilities. The model shows significant performance advantages in practical applications across a range of tasks.
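To make the hybrid-encoder idea above concrete, the sketch below combines a low-resolution branch for global semantics with a high-resolution branch for fine detail and reduces their outputs to a fixed visual-token budget. This is a minimal illustration in PyTorch: the encoder modules, dimensions, token counts, and pooling-based fusion are assumptions, not the paper's exact design.

```python
# Minimal sketch of a hybrid vision encoder (illustrative only; the encoder
# modules, dimensions, and token counts are assumptions, not DeepSeek-VL's
# exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    def __init__(self, low_res_encoder: nn.Module, high_res_encoder: nn.Module,
                 embed_dim: int = 1024, num_tokens: int = 576):
        super().__init__()
        self.low_res_encoder = low_res_encoder    # semantic branch, fed a 384x384 resize
        self.high_res_encoder = high_res_encoder  # detail branch, fed the full 1024x1024 image
        self.num_tokens = num_tokens              # fixed visual-token budget per image
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, image_1024: torch.Tensor) -> torch.Tensor:
        # Low-resolution branch: downsample and extract global semantic features.
        image_384 = F.interpolate(image_1024, size=(384, 384), mode="bilinear",
                                  align_corners=False)
        semantic_tokens = self.low_res_encoder(image_384)   # (B, N1, D), assumed shape
        # High-resolution branch: preserve fine detail (text, charts, small objects).
        detail_tokens = self.high_res_encoder(image_1024)   # (B, N2, D), assumed shape
        # Fuse both streams, then pool to a fixed number of tokens so LLM context
        # usage stays constant regardless of input resolution.
        tokens = torch.cat([semantic_tokens, detail_tokens], dim=1)
        tokens = F.adaptive_avg_pool1d(tokens.transpose(1, 2),
                                       self.num_tokens).transpose(1, 2)
        return self.proj(tokens)
```

The key point is that both resolutions are handled within a constant token count, which keeps the language model's context cost per image fixed.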
This paper employs the following methods:
- Hybrid Vision Encoder
- Instruction Tuning
- Modality Warm-Up
- Vision-Language Adapter (a simplified sketch of the adapter and the modality warm-up follows this list)
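For two of the methods above, a simplified sketch may help: a vision-language adapter that projects visual tokens into the language model's embedding space, and a modality warm-up schedule that gradually raises the share of multimodal samples during training. The two-layer MLP, GELU activation, and linear ramp are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Maps vision-encoder tokens into the LLM embedding space.
    A two-layer MLP is assumed here purely for illustration."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (B, num_tokens, vision_dim) -> (B, num_tokens, llm_dim)
        return self.mlp(vision_tokens)


def multimodal_ratio(step: int, total_steps: int,
                     start: float = 0.1, end: float = 0.7) -> float:
    """Modality warm-up (illustrative): linearly increase the fraction of
    multimodal samples per batch so that the language ability of the base
    LLM is not degraded early in vision-language training."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac
```

At each step, `multimodal_ratio` would determine how many examples in a batch are image-text pairs versus text-only, one simple way to manage the competition between modalities that the limitations below allude to.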
The following datasets were used in this research:
- MMC4
- Wiki
- WikiHow
- CapsFusion
- TaiSu
- Chart2Text
- Geo170K
- UReader
- ScienceQA
- MathVista
- DeepSeek-LLM
- ShareGPT4V
- LAION-GPTV
- LVIS-Instruct4V
The key results reported are:
- DeepSeek-VL delivers a superior user experience in real-world applications.
- Achieves state-of-the-art performance on vision-language benchmarks at the same model size.
- Outperforms most open-source models on benchmarks such as MMBench and SEED-Bench.
The authors identified the following limitations:
- Challenges in preserving language capabilities during multimodal training.
- Performance limitations observed with smaller model scales.
The following compute resources were used:
- Number of GPUs: 8
- GPU Type: NVIDIA A100
Keywords
vision-language models, multimodal understanding, high-resolution images, instruction tuning