
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing (DAMO Academy, Alibaba Group, 2024)

Paper Information

arXiv ID: 2406.07476
Venue: arXiv.org
Domain: artificial intelligence, computer vision, natural language processing, multimodal learning
SOTA Claim: Yes
Code: Available
Reproducibility: 8/10

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question answering (AQA & OE-AVQA) benchmarks. All models are publicly available to facilitate further research.

Summary

In this paper, the authors introduce VideoLLaMA 2, a family of Video Large Language Models designed to enhance spatial-temporal modeling and audio understanding for video- and audio-oriented tasks. The model features a specialized Spatial-Temporal Convolution (STC) connector that captures the spatial and temporal dynamics of video data, and adds an Audio Branch trained jointly with the rest of the model. Evaluations demonstrate competitive performance on benchmarks for multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC), with improvements over existing models. The architecture uses dual branches, one for visual data and one for audio, each connected independently to the language model, which keeps the modalities separate while still allowing joint optimization (a schematic sketch follows below). Training draws on a broad range of datasets for pre-training and fine-tuning. The results indicate that VideoLLaMA 2 outperforms many open-source models and approaches proprietary models on several benchmarks, establishing its effectiveness in multimedia analysis tasks.
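The dual-branch design described above can be pictured with a short PyTorch sketch. This is a minimal illustration under assumed module sizes, not the authors' implementation: in the paper the vision branch uses the STC connector (see Methods) and the audio branch has its own projector, whereas here both connectors are reduced to single linear layers and all names are hypothetical.

```python
import torch
import torch.nn as nn

class DualBranchVideoLLM(nn.Module):
    """Schematic dual-branch Video-LLM: independent vision and audio
    connectors feed a shared language model (all sizes hypothetical)."""

    def __init__(self, llm, vision_dim=1024, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.llm = llm  # any causal LM that accepts inputs_embeds
        self.vision_connector = nn.Linear(vision_dim, llm_dim)  # stand-in for the STC connector
        self.audio_connector = nn.Linear(audio_dim, llm_dim)    # stand-in for the audio projector

    def forward(self, vision_feats, audio_feats, text_embeds):
        # vision_feats: (B, Nv, vision_dim) from a frozen visual encoder
        # audio_feats:  (B, Na, audio_dim) from a frozen audio encoder
        # text_embeds:  (B, Nt, llm_dim) embedded prompt tokens
        v = self.vision_connector(vision_feats)
        a = self.audio_connector(audio_feats)
        # Modality tokens are concatenated with the text prompt, and the
        # LLM attends over the fused sequence.
        fused = torch.cat([v, a, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)
```

Keeping the two connectors separate is what allows the branches to be trained independently and then combined through joint training, as the summary notes.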

Methods

This paper employs the following methods:

  • Spatial-Temporal Convolution (STC) connector (see the sketch after this list)
  • Joint training
  • Dual-branch framework
  • Cross-modal interactions
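
A minimal PyTorch sketch of the STC idea follows: a strided 3D convolution downsamples per-frame patch features jointly in space and time before an MLP projects them to the LLM width. Kernel size, stride, and dimensions here are assumptions for illustration; the connector in the paper is more elaborate (it surrounds the 3D convolution with additional convolutional blocks), so treat this as the core operation only.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Illustrative spatial-temporal convolution connector: a strided
    Conv3d shrinks the video token count jointly across time and space,
    then an MLP projects to the LLM embedding width."""

    def __init__(self, vision_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # kernel_size == stride gives non-overlapping (t, h, w) pooling
        self.conv3d = nn.Conv3d(vision_dim, vision_dim,
                                kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (B, T, H, W, D) patch features from a per-frame vision encoder
        x = x.permute(0, 4, 1, 2, 3)   # -> (B, D, T, H, W) for Conv3d
        x = self.conv3d(x)             # joint spatial-temporal downsampling
        x = x.permute(0, 2, 3, 4, 1)   # -> (B, T', H', W', D)
        x = x.flatten(1, 3)            # -> (B, T'*H'*W', D) token sequence
        return self.proj(x)            # -> (B, tokens, llm_dim)

# 8 frames of 24x24 patch features -> 4 * 12 * 12 = 576 video tokens
feats = torch.randn(1, 8, 24, 24, 1024)
print(STCConnectorSketch()(feats).shape)  # torch.Size([1, 576, 4096])
```

Compared with naive frame-by-frame pooling, the 3D convolution mixes information across neighboring frames, which is what gives the connector its temporal modeling ability.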

Models Used

  • VideoLLaMA 2
  • Mistral-Instruct
  • Mixtral-Instruct
  • Qwen2-Instruct

Datasets

The following datasets were used in this research:

  • Panda-70M
  • VIDAL-10M
  • WebVid-10M
  • InternVid-10M
  • CC-3M
  • DCI
  • Kinetics-710
  • Something-Something v2 (SthSthv2)
  • NExT-QA
  • CLEVRER
  • EgoQA
  • TGIF
  • WebVidQA
  • RealWorldQA
  • HM3D
  • WavCaps
  • ClothoAQA
  • AudioCaps
  • Clotho
  • MusicCaps
  • VGGSound
  • UrbanSound8K
  • ESC50
  • TUT2017
  • TUT2016
  • VocalSound
  • AVQA
  • AVQA-music
  • AVSD
  • Evol-Instruct

Evaluation Metrics

  • Accuracy (see the snippet after this list)
  • Correctness
  • Detailedness
  • Cross-entropy loss
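
For reference, the MC-VQA accuracy above is plain exact match over predicted option letters; the open-ended metrics (correctness, detailedness) are typically judged by an LLM assistant and are not reproduced here. A tiny sketch with hypothetical inputs:

```python
def mc_vqa_accuracy(predictions, answers):
    """Exact-match accuracy for multiple-choice VQA over option letters."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical example: 2 of 3 predictions match the references
print(mc_vqa_accuracy(["A", "B", "C"], ["A", "D", "C"]))  # 0.666...
```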

Results

  • VideoLLaMA 2 achieves competitive results on MC-VQA and OE-VQA tasks, approaching the performance of some proprietary models.
  • It outperforms many open-source models across multiple benchmarks.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

VideoLLaMA 2, spatial-temporal convolution, audio understanding, multimodal AI, video question answering
