Venue
Conference on Empirical Methods in Natural Language Processing
Domain
Artificial intelligence
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual & audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only (Zhu et al., 2023; Liu et al., 2023; Huang et al., 2023a), Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind (Girdhar et al., 2023), a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both visual & audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate amount but higher quality. We found that Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.
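To make the video branch described in the abstract concrete, the following is a minimal PyTorch-style sketch, not the official implementation: the class and parameter names (VideoQFormerSketch, feat_dim, num_queries, llm_dim, max_frames) are hypothetical, the dimensions are illustrative assumptions, and a generic transformer decoder stands in for the BLIP-2 style Q-former.

```python
# Hypothetical sketch of the video branch: per-frame features from a frozen image
# encoder receive temporal position embeddings, learnable queries cross-attend to
# them in a small Q-former, and a linear layer projects the result into the LLM's
# embedding space as a soft video prompt.
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    def __init__(self, feat_dim=1408, num_queries=32, llm_dim=4096, max_frames=32):
        super().__init__()
        self.frame_pos = nn.Embedding(max_frames, feat_dim)               # temporal position embeddings
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))   # learnable video query tokens
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)         # stand-in for the BLIP-2 style Q-former
        self.proj = nn.Linear(feat_dim, llm_dim)                          # maps queries into the LLM embedding space

    def forward(self, frame_feats):                                       # (B, T, N_patches, feat_dim)
        B, T, N, D = frame_feats.shape
        pos = self.frame_pos(torch.arange(T, device=frame_feats.device))
        feats = (frame_feats + pos[None, :, None, :]).reshape(B, T * N, D)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        video_queries = self.qformer(tgt=q, memory=feats)                 # queries cross-attend to frame features
        return self.proj(video_queries)                                   # (B, num_queries, llm_dim) soft video prompt

# Example: 8 frames, each encoded into 257 patch tokens by a frozen image encoder.
frames = torch.randn(1, 8, 257, 1408)
video_prompt = VideoQFormerSketch()(frames)
print(video_prompt.shape)  # torch.Size([1, 32, 4096])
```

In the actual model, each frame is first summarized by an image-level Q-former before the Video Q-former aggregates across frames; the sketch collapses these into one stage for brevity.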
This paper presents Video-LLaMA, a multi-modal framework that adds audio-visual understanding capabilities to Large Language Models (LLMs). The model captures temporal changes in visual scenes and integrates audio-visual signals to enable video understanding. To address these challenges, Video-LLaMA employs a Video Q-former and an Audio Q-former alongside a frozen pre-trained visual encoder and ImageBind, a universal embedding model used as the audio encoder. The model is pre-trained on large-scale video/image-caption pairs and then fine-tuned with visual-instruction datasets to enhance instruction-following capabilities. The authors demonstrate Video-LLaMA's ability to interpret video content and generate contextually relevant responses based on both visual and auditory inputs. They further highlight the model's performance in multi-modal instruction following, showcasing its potential for developing audio-visual AI assistants. The framework and model weights are open-sourced for further research and development.
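The training recipe keeps the perception backbones and the LLM frozen while updating only the Q-formers and the linear projections into the LLM embedding space. Below is a hypothetical sketch of that parameter split; function and argument names such as build_trainable_params, video_proj, and audio_proj are illustrative and not taken from the released code.

```python
# Sketch of the frozen/trainable split used in both training stages
# (caption pre-training and visual-instruction tuning).
import torch.nn as nn

def freeze(module: nn.Module):
    """Disable gradients for a frozen backbone (image/audio encoder or LLM)."""
    for p in module.parameters():
        p.requires_grad = False

def build_trainable_params(image_encoder, audio_encoder, llm,
                           video_qformer, audio_qformer,
                           video_proj, audio_proj):
    # Frozen: the pre-trained perception backbones and the language model.
    for backbone in (image_encoder, audio_encoder, llm):
        freeze(backbone)
    # Trainable: only the Q-formers and the linear projections are updated.
    trainable = []
    for m in (video_qformer, audio_qformer, video_proj, audio_proj):
        trainable += [p for p in m.parameters() if p.requires_grad]
    return trainable

# Toy usage with stand-in modules:
mods = [nn.Linear(4, 4) for _ in range(7)]
params = build_trainable_params(*mods)
print(len(params))  # 8: only the last four modules contribute parameters
```

Restricting gradients to these lightweight components is what makes bootstrapping from large frozen backbones practical.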
This paper employs the following methods:
- Q-former
- Video Q-former
- Audio Q-former
- Video-LLaMA
- BLIP-2
- MiniGPT-4
- LLaVA
- mPLUG-Owl
- AudioGPT
- Video-ChatGPT
The following datasets were used in this research:
- WebVid-2M
- CC595k
The authors report the following findings:
- Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses based on visual and auditory information.
- Video-LLaMA exhibits remarkable abilities in following instructions and understanding images and videos.
The authors identified the following limitations:
- Limited perception capacities due to dataset quality and scale.
- Limited ability to handle long videos.
- Inherits the hallucination problem from frozen LLMs.
- Number of GPUs: None specified
- GPU Type: None specified
Video understanding
Multimodal learning
Language models
Audio-visual analysis