
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing (DAMO Academy, Alibaba Group, 2024)

Paper Information

arXiv ID: 2406.07476
Venue: arXiv.org
Domain: artificial intelligence, computer vision, natural language processing, multimodal learning
SOTA Claim: Yes
Code: Available
Reproducibility: 8/10

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question answering (AQA & OE-AVQA) benchmarks. All models are publicly available to facilitate further research.

Summary

In this paper, the authors introduce VideoLLaMA 2, a family of Video Large Language Models designed to enhance spatial-temporal modeling and audio understanding for video- and audio-oriented tasks. The model features a specialized Spatial-Temporal Convolution (STC) connector that captures the spatial and temporal dynamics of video data, and adds an Audio Branch trained jointly with the rest of the model. Evaluations demonstrate competitive performance on benchmarks for multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC), with improvements over existing models. The architecture uses dual branches, one for visual data and one for audio, each connected independently to the language model, which keeps the modalities separate while still allowing joint optimization (a schematic sketch follows below). Training draws on a broad range of datasets for pre-training and fine-tuning. The results indicate that VideoLLaMA 2 outperforms many open-source models and approaches proprietary models on several benchmarks, establishing its effectiveness in multimedia analysis tasks.
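The dual-branch design described above can be pictured with a short PyTorch sketch. This is a minimal illustration under assumed module sizes, not the authors' implementation: in the paper the vision branch uses the STC connector (see Methods) and the audio branch has its own projector, whereas here both connectors are reduced to single linear layers and all names are hypothetical.

```python
import torch
import torch.nn as nn

class DualBranchVideoLLM(nn.Module):
    """Schematic dual-branch Video-LLM: independent vision and audio
    connectors feed a shared language model (all sizes hypothetical)."""

    def __init__(self, llm, vision_dim=1024, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.llm = llm  # any causal LM that accepts inputs_embeds
        self.vision_connector = nn.Linear(vision_dim, llm_dim)  # stand-in for the STC connector
        self.audio_connector = nn.Linear(audio_dim, llm_dim)    # stand-in for the audio projector

    def forward(self, vision_feats, audio_feats, text_embeds):
        # vision_feats: (B, Nv, vision_dim) from a frozen visual encoder
        # audio_feats:  (B, Na, audio_dim) from a frozen audio encoder
        # text_embeds:  (B, Nt, llm_dim) embedded prompt tokens
        v = self.vision_connector(vision_feats)
        a = self.audio_connector(audio_feats)
        # Modality tokens are concatenated with the text prompt, and the
        # LLM attends over the fused sequence.
        fused = torch.cat([v, a, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)
```

Keeping the two connectors separate is what allows the branches to be trained independently and then combined through joint training, as the summary notes.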

Methods

This paper employs the following methods:

  • Spatial-Temporal Convolution (STC) connector (see the sketch after this list)
  • Joint training
  • Dual-branch framework
  • Cross-modal interactions
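
A minimal PyTorch sketch of the STC idea follows: a strided 3D convolution downsamples per-frame patch features jointly in space and time before an MLP projects them to the LLM width. Kernel size, stride, and dimensions here are assumptions for illustration; the connector in the paper is more elaborate (it surrounds the 3D convolution with additional convolutional blocks), so treat this as the core operation only.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Illustrative spatial-temporal convolution connector: a strided
    Conv3d shrinks the video token count jointly across time and space,
    then an MLP projects to the LLM embedding width."""

    def __init__(self, vision_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # kernel_size == stride gives non-overlapping (t, h, w) pooling
        self.conv3d = nn.Conv3d(vision_dim, vision_dim,
                                kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (B, T, H, W, D) patch features from a per-frame vision encoder
        x = x.permute(0, 4, 1, 2, 3)   # -> (B, D, T, H, W) for Conv3d
        x = self.conv3d(x)             # joint spatial-temporal downsampling
        x = x.permute(0, 2, 3, 4, 1)   # -> (B, T', H', W', D)
        x = x.flatten(1, 3)            # -> (B, T'*H'*W', D) token sequence
        return self.proj(x)            # -> (B, tokens, llm_dim)

# 8 frames of 24x24 patch features -> 4 * 12 * 12 = 576 video tokens
feats = torch.randn(1, 8, 24, 24, 1024)
print(STCConnectorSketch()(feats).shape)  # torch.Size([1, 576, 4096])
```

Compared with naive frame-by-frame pooling, the 3D convolution mixes information across neighboring frames, which is what gives the connector its temporal modeling ability.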

Models Used

  • VideoLLaMA 2
  • Mistral-Instruct
  • Mixtral-Instruct
  • Qwen2-Instruct

Datasets

The following datasets were used in this research:

  • Panda-70M
  • VIDAL-10M
  • WebVid-10M
  • InternVid-10M
  • CC-3M
  • DCI
  • Kinetics-710
  • Something-Something v2 (SthSthv2)
  • NExT-QA
  • CLEVRER
  • EgoQA
  • TGIF
  • WebVidQA
  • RealWorldQA
  • HM3D
  • WavCaps
  • ClothoAQA
  • AudioCaps
  • Clotho
  • MusicCaps
  • VGGSound
  • UrbanSound8K
  • ESC50
  • TUT2017
  • TUT2016
  • VocalSound
  • AVQA
  • AVQA-music
  • AVSD
  • Evol-Instruct

Evaluation Metrics

  • Accuracy (see the snippet after this list)
  • Correctness
  • Detailedness
  • Cross-entropy loss
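
For reference, the MC-VQA accuracy above is plain exact match over predicted option letters; the open-ended metrics (correctness, detailedness) are typically judged by an LLM assistant and are not reproduced here. A tiny sketch with hypothetical inputs:

```python
def mc_vqa_accuracy(predictions, answers):
    """Exact-match accuracy for multiple-choice VQA over option letters."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical example: 2 of 3 predictions match the references
print(mc_vqa_accuracy(["A", "B", "C"], ["A", "D", "C"]))  # 0.666...
```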

Results

  • VideoLLaMA 2 achieves competitive results on MC-VQA and OE-VQA tasks, approaching the performance of some proprietary models.
  • It outperforms many open-source models across multiple benchmarks.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

VideoLLaMA 2, spatial-temporal convolution, audio understanding, multimodal AI, video question answering
