Venue
Computer Vision and Pattern Recognition
HDVILA-100M"He thought he was gonna get shows terrible communication on the teams part.""We're gonna cook this all together stirring it constantly for just a minute until it smells nice and fragrant.""It is a close-up shot of a brown and white english bulldog with wrinkles on its face, sitting on a person's lap.""It is a red and purple betta fish swimming in a tank with gravel and plants.""A person is adding chicken broth to a pot of quinoa on a stove."* This work was done while interning at Snap. licly available HD-VILA-100M dataset.We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video.Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation.In this way, we get 70M videos paired with high-quality text captions.We dub the dataset as Panda-70M.We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation.The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.
The paper introduces Panda-70M, a dataset of 70 million high-resolution video clips paired with semantically coherent captions that are automatically generated by multiple cross-modality teacher models drawing on different multimedia sources, such as subtitles and video frames. The authors emphasize the challenges of collecting high-quality video-text data, particularly the time-intensive process of manual annotation. To address this, they propose an automated approach that leverages multimodal inputs to curate a large-scale dataset and demonstrate its utility on downstream tasks including video captioning, video and text retrieval, and text-driven video generation. Extensive experiments show significant improvements on these tasks when training with Panda-70M compared to existing datasets.
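The following is a minimal sketch of the captioning-and-selection pipeline described above. The helper names (`split_into_clips`, `TEACHER_CAPTIONERS`, `score_caption`) are hypothetical placeholders, not the paper's actual components; in the real pipeline the splitter is semantics-aware, the captioners are cross-modality teacher models, and the scorer is a fine-grained video-to-text retrieval model finetuned on a small human-annotated subset.

```python
from typing import Callable, List

# Hypothetical stand-ins for the paper's components: a semantics-aware
# splitter, several cross-modality teacher captioners, and a fine-grained
# video-to-text retrieval model used to pick the best caption.
def split_into_clips(video_path: str) -> List[str]:
    """Split a long video into semantically consistent clips."""
    raise NotImplementedError

TEACHER_CAPTIONERS: List[Callable[[str], str]] = []  # one caption per teacher

def score_caption(clip_path: str, caption: str) -> float:
    """Retrieval score between a clip and a caption (higher = better match)."""
    raise NotImplementedError

def annotate_video(video_path: str) -> List[dict]:
    """Return one (clip, best caption) record per semantically coherent clip."""
    records = []
    for clip in split_into_clips(video_path):
        # Every teacher proposes a candidate caption for the clip.
        candidates = [captioner(clip) for captioner in TEACHER_CAPTIONERS]
        # The retrieval model selects the caption that best matches the clip.
        best = max(candidates, key=lambda c: score_caption(clip, c))
        records.append({"clip": clip, "caption": best})
    return records
```

Applied at scale, this per-video routine is what turns the raw HD-VILA-100M source videos into the 70M captioned clips of Panda-70M.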
This paper employs the following methods:
- Automatic captioning pipeline
- Semantics-aware video splitting
- Cross-modality teacher models (a captioning sketch follows this list)
- Fine-grained video-to-text retrieval
- Knowledge distillation for student model
- BLIP-2
- MiniGPT-4
- Video-LLaMA
- VideoChat
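As a concrete example of querying one of the listed teacher models, the sketch below captions a single extracted keyframe with BLIP-2 through the Hugging Face transformers API. The checkpoint name, device placement, and keyframe path are assumptions for illustration; in the paper's pipeline, video-level teachers such as Video-LLaMA and VideoChat are additionally prompted, with some teachers conditioning on multiple frames or subtitles.

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Assumed checkpoint; any BLIP-2 captioning checkpoint follows the same API.
MODEL_ID = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(MODEL_ID)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

def caption_keyframe(frame_path: str) -> str:
    """Caption one keyframe extracted from a video clip."""
    image = Image.open(frame_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

# Hypothetical keyframe path, for illustration only.
print(caption_keyframe("clip_000123_keyframe.jpg"))
```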
The following datasets were used in this research:
- Panda-70M
- HD-VILA-100M
- HowTo100M
- MSR-VTT
- MSVD
The following evaluation metrics were used (a scoring sketch follows this list):
- BLEU-4
- ROUGE-L
- METEOR
- CIDEr
- BERTScore
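The captioning metrics above can be computed with standard reference implementations. The sketch below scores a single toy hypothesis against a reference using the pycocoevalcap package (BLEU-4, ROUGE-L, METEOR, CIDEr) and the bert-score package (BERTScore). The clip id and captions are made-up examples, and a real evaluation would tokenize the corpus and aggregate over all clips.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor   # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from bert_score import score as bert_score

# Toy data: reference captions and model hypotheses, keyed by clip id.
refs = {"clip_0": ["a brown and white bulldog is sitting on a person's lap"]}
hyps = {"clip_0": ["a dog sits on someone's lap"]}

bleu, _ = Bleu(4).compute_score(refs, hyps)      # returns BLEU-1..BLEU-4
rouge_l, _ = Rouge().compute_score(refs, hyps)
meteor, _ = Meteor().compute_score(refs, hyps)
cider, _ = Cider().compute_score(refs, hyps)
_, _, f1 = bert_score(hyps["clip_0"], refs["clip_0"], lang="en")

print(f"BLEU-4={bleu[3]:.3f}  ROUGE-L={rouge_l:.3f}  METEOR={meteor:.3f}  "
      f"CIDEr={cider:.3f}  BERTScore-F1={f1.mean().item():.3f}")
```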
The authors report the following results:
- Models pretrained on Panda-70M achieved better video captioning performance than models trained on existing datasets.
- Pretraining on Panda-70M also benefits video and text retrieval and text-driven video generation.
The authors identified the following limitations:
- The dataset primarily includes vocal-intensive videos, limiting diversity. Future work is needed to include more non-vocal content and longer videos for expanded applications.
The following compute resources were used:
- Number of GPUs: 48
- GPU Type: NVIDIA A100 80GB
The paper lists the following keywords:
- video captioning
- multimodal learning
- dataset creation
- automatic captioning