Wenliang Dai (Salesforce Research; Hong Kong University of Science and Technology), Junnan Li (Salesforce Research), Dongxu Li (Salesforce Research), Anthony Meng Huat Tiong (Salesforce Research; Nanyang Technological University, Singapore), Junqi Zhao (Salesforce Research; Nanyang Technological University, Singapore), Weisheng Wang (Nanyang Technological University, Singapore), Boyang Li (Nanyang Technological University, Singapore), Pascale Fung (Hong Kong University of Science and Technology), Steven Hoi (Salesforce Research). 2023.
This paper introduces InstructBLIP, a framework that applies instruction tuning to pretrained vision-language models to build general-purpose capabilities. It outlines the challenges of constructing effective vision-language models, in particular the rich input distributions and task diversity introduced by the additional visual modality. The paper presents a systematic study of vision-language instruction tuning, built on an instruction-aware Query Transformer that extracts visual features tailored to the given instruction, together with 26 publicly available datasets transformed into instruction-tuning format. InstructBLIP achieves state-of-the-art zero-shot performance on all 13 held-out datasets, substantially outperforming prior models such as BLIP-2 and Flamingo. Qualitative evaluations further illustrate its capacity for visual reasoning and its ability to follow diverse visual and textual instructions. The study demonstrates robust generalization to unseen tasks, discusses the methodological choices that drive instruction-tuning performance, and shows that InstructBLIP serves as a strong initialization for downstream task finetuning, with consistent improvements across tasks and datasets.
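The instruction-aware Query Transformer is the architectural core: a set of learnable query embeddings interacts with the tokenized instruction through self-attention and with frozen image-encoder features through cross-attention, so the visual features handed to the language model are tailored to the instruction at hand. Below is a minimal PyTorch sketch of this idea; the class name, dimensions, and the use of a generic `nn.TransformerDecoder` in place of the paper's BERT-based Q-Former are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an instruction-aware Query Transformer (Q-Former).
# Dimensions and module choices are assumptions for illustration only.
import torch
import torch.nn as nn


class InstructionAwareQFormer(nn.Module):
    def __init__(self, num_queries=32, qformer_dim=768, image_dim=1024,
                 llm_dim=4096, vocab_size=30522, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable query embeddings that extract visual features.
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim) * 0.02)
        # Embedding table for the tokenized instruction text.
        self.instr_embed = nn.Embedding(vocab_size, qformer_dim)
        # Project frozen image-encoder features into the Q-Former width.
        self.image_proj = nn.Linear(image_dim, qformer_dim)
        # Each layer applies self-attention over [queries; instruction tokens]
        # and cross-attention to the image features.
        layer = nn.TransformerDecoderLayer(
            d_model=qformer_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Map the query outputs into the frozen LLM's embedding space.
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_ids):
        """image_feats: (B, N_patches, image_dim) from a frozen image encoder.
        instruction_ids: (B, L) token ids of the textual instruction."""
        b = image_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        instr = self.instr_embed(instruction_ids)
        # Instruction tokens interact with the queries via self-attention,
        # making the extracted visual features instruction-aware.
        tgt = torch.cat([queries, instr], dim=1)
        memory = self.image_proj(image_feats)
        out = self.blocks(tgt=tgt, memory=memory)
        # Only the query positions are passed on to the frozen LLM.
        visual_tokens = out[:, :self.num_queries]
        return self.llm_proj(visual_tokens)


if __name__ == "__main__":
    model = InstructionAwareQFormer()
    img = torch.randn(2, 257, 1024)           # e.g. ViT patch features
    instr = torch.randint(0, 30522, (2, 12))  # tokenized instruction
    print(model(img, instr).shape)            # torch.Size([2, 32, 4096])
```

In the paper's setup, this query module and its projection into the language model's embedding space are the main trainable components, while the image encoder and language model remain frozen.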
This paper employs the following methods:

- Vision-language instruction tuning built on the pretrained BLIP-2 architecture
- An instruction-aware Query Transformer (Q-Former) that extracts visual features conditioned on the textual instruction (sketched above)
- Transformation of 26 publicly available datasets into instruction-tuning format
- Zero-shot evaluation on held-out datasets and tasks, plus finetuning on individual downstream tasks
The following datasets were used in this research:

- A collection of 26 publicly available vision-language datasets, transformed into instruction-tuning format as illustrated by the sketch below; 13 datasets are held in for training and 13 are held out for zero-shot evaluation
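As a concrete illustration of the instruction-tuning format, the snippet below shows how a raw question-answering record could be wrapped in a natural-language instruction template. The template strings and field names are made-up examples, not the authors' actual templates; the paper crafts multiple templates per task and samples among them during training.

```python
# Hypothetical sketch of converting a raw VQA-style record into
# instruction-tuning format. Templates and field names are illustrative.
import random

# Several phrasings of the same task help the model generalize to
# differently worded instructions at test time.
VQA_TEMPLATES = [
    "<Image> Question: {question} Short answer:",
    "<Image> Given the image, answer the following question. {question}",
    "<Image> {question} Answer the question using a single word or phrase.",
]

def to_instruction_example(sample):
    """Wrap a raw {image, question, answer} record in a sampled instruction."""
    template = random.choice(VQA_TEMPLATES)
    return {
        "image": sample["image"],
        "instruction": template.format(question=sample["question"]),
        "target": sample["answer"],
    }

raw = {"image": "coco_000123.jpg", "question": "What color is the bus?", "answer": "red"}
print(to_instruction_example(raw)["instruction"])
```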
The authors identified the following limitations: