Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing (DAMO Academy, Alibaba Group, 2024)
In this paper, the authors introduce VideoLLaMA 2, a family of Video Large Language Models designed to improve spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. The model features a purpose-built Spatial-Temporal Convolution (STC) connector that captures the spatial and temporal dynamics of video data, and adds an Audio Branch integrated through joint training. The architecture couples separate visual and audio branches, each connected independently to the language model, which keeps the modalities decoupled while allowing efficient joint optimization. Training draws on a range of datasets for pre-training and fine-tuning. Evaluations on multi-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) benchmarks show that VideoLLaMA 2 outperforms many open-source models and approaches proprietary models, establishing its effectiveness in multimedia analysis tasks.
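To make the dual-branch design concrete, below is a minimal sketch of the idea described above: per-frame visual features pass through a spatial-temporal convolution connector before reaching the language model, while audio features use a separate projector. This is not the official implementation; all module names, dimensions, kernel sizes, and the exact connector layout are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of a dual-branch video/audio connector setup.
# All dimensions and module structures are assumptions for demonstration.
import torch
import torch.nn as nn


class STCConnector(nn.Module):
    """Downsamples frame features jointly in time and space with a 3D conv,
    then projects them to the language model's hidden size."""

    def __init__(self, vis_dim=1024, llm_dim=4096, stride=(2, 2, 2)):
        super().__init__()
        self.conv3d = nn.Conv3d(vis_dim, vis_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, x):                 # x: (B, T, H, W, C) per-frame patch features
        x = x.permute(0, 4, 1, 2, 3)      # -> (B, C, T, H, W) for Conv3d
        x = self.conv3d(x)                # joint spatial-temporal downsampling
        x = x.flatten(2).transpose(1, 2)  # -> (B, num_tokens, C)
        return self.proj(x)               # -> (B, num_tokens, llm_dim)


class AudioProjector(nn.Module):
    """Maps audio-encoder features into the same LLM embedding space."""

    def __init__(self, aud_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(aud_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, a):                 # a: (B, audio_frames, aud_dim)
        return self.proj(a)


if __name__ == "__main__":
    vis = torch.randn(1, 8, 24, 24, 1024)    # 8 frames of 24x24 patch features
    aud = torch.randn(1, 50, 768)            # 50 audio feature frames
    vis_tokens = STCConnector()(vis)         # (1, 4*12*12, 4096)
    aud_tokens = AudioProjector()(aud)       # (1, 50, 4096)
    # Both token streams would be concatenated with text embeddings and fed
    # to the language model during joint training.
    print(vis_tokens.shape, aud_tokens.shape)
```

The key design choice reflected here is that each modality keeps its own lightweight connector into a shared embedding space, so the branches can be trained jointly without entangling their encoders.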
This paper employs the following methods:
- Spatial-Temporal Convolution (STC) connector for modeling the spatial and temporal dynamics of video features
- Dual-branch architecture with independent visual and audio branches connected to the language model
- Joint training to integrate the Audio Branch with the visual pipeline
- Pre-training followed by task-specific fine-tuning
The following datasets were used in this research:
The authors identified the following limitations: