
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen (Snap Inc., University of California Merced), Aliaksandr Siarohin (Snap Inc.), Willi Menapace (Snap Inc., University of Trento), Ekaterina Deyneka (Snap Inc.), Hsiang-Wei Chao (Snap Inc.), Byung Eun Jeon (Snap Inc.), Yuwei Fang (Snap Inc.), Hsin-Ying Lee (Snap Inc.), Jian Ren (Snap Inc.), Ming-Hsuan Yang (University of California Merced), Sergey Tulyakov (Snap Inc.) (2024)

Paper Information
arXiv ID
2402.19479
Venue
Computer Vision and Pattern Recognition
Domain
computer vision
SOTA Claim
Yes
Reproducibility
7/10

Abstract

To establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model on the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

Summary

The paper introduces Panda-70M, a dataset of 70 million high-resolution video clips paired with semantically coherent captions generated by multiple cross-modality teacher models that draw on multimodal inputs such as subtitles and individual video frames. The authors emphasize the challenges of collecting high-quality video-text data, particularly the time-intensive process of manual annotation. To address this, they propose an automated captioning pipeline that curates a large-scale dataset, and they demonstrate its utility on downstream tasks including video captioning, video and text retrieval, and text-driven video generation. Extensive experiments show significant improvements on these tasks when training on Panda-70M compared to existing datasets.
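The curation pipeline can be summarized in a short sketch. This is a minimal illustration, not the authors' released code: the helper names (`split_into_clips`, `caption_with_teachers`, `select_best_caption`) and their stub behavior are hypothetical placeholders for the three stages described in the paper.

```python
# Minimal sketch of the Panda-70M annotation pipeline (hypothetical helper names,
# not the authors' released code). Stages mirror the paper's description:
#   1) split long videos into semantically consistent clips,
#   2) caption each clip with several cross-modality teacher models,
#   3) pick the best candidate with a fine-tuned video-to-text retrieval model.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Clip:
    video_path: str
    start_sec: float
    end_sec: float


def split_into_clips(video_path: str) -> List[Clip]:
    """Placeholder for semantics-aware splitting (shot detection + semantic merging)."""
    return [Clip(video_path, 0.0, 10.0)]  # dummy single clip


def caption_with_teachers(clip: Clip, teachers: List[Callable[[Clip], str]]) -> List[str]:
    """Run every teacher model on the clip and collect candidate captions."""
    return [teacher(clip) for teacher in teachers]


def select_best_caption(clip: Clip, candidates: List[str]) -> str:
    """Placeholder for the fine-tuned retrieval model; here we just pick the longest caption."""
    return max(candidates, key=len)


def annotate(video_path: str, teachers: List[Callable[[Clip], str]]) -> List[Tuple[Clip, str]]:
    annotations = []
    for clip in split_into_clips(video_path):
        candidates = caption_with_teachers(clip, teachers)
        annotations.append((clip, select_best_caption(clip, candidates)))
    return annotations


if __name__ == "__main__":
    # Dummy "teachers" standing in for BLIP-2, Video-LLaMA, etc.
    dummy_teachers = [lambda c: "a person cooking quinoa on a stove",
                      lambda c: "someone stirs food in a pot"]
    print(annotate("example.mp4", dummy_teachers))
```

In the actual pipeline, the selection step is learned: a retrieval model is finetuned on a manually annotated subset and then scores every candidate caption against its clip.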

Methods

This paper employs the following methods:

  • Automatic captioning pipeline
  • Semantics-aware video splitting (see the sketch after this list)
  • Cross-modality teacher models
  • Fine-grained video-to-text retrieval
  • Knowledge distillation for student model
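
A hedged sketch of the semantics-aware splitting step, referenced above: shot boundaries come from PySceneDetect, and adjacent shots are merged when their keyframe embeddings are semantically close. The paper uses its own boundary-detection and stitching models; CLIP and the 0.85 similarity threshold below are illustrative stand-ins.

```python
# Hedged sketch of semantics-aware splitting: shot boundaries from PySceneDetect,
# then adjacent shots are merged when their keyframe embeddings are semantically close.
import cv2
import torch
from PIL import Image
from scenedetect import detect, ContentDetector
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def middle_frame(video_path: str, start_frame: int, end_frame: int) -> Image.Image:
    """Grab the middle frame of a shot as its keyframe."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start_frame + end_frame) // 2)
    _, frame = cap.read()
    cap.release()
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))


def embed(image: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP image embedding (stand-in for the paper's embedding model)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)


def split_semantically(video_path: str, sim_threshold: float = 0.85):
    shots = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    merged = []
    for start, end in shots:
        emb = embed(middle_frame(video_path, start.get_frames(), end.get_frames()))
        if merged and (merged[-1]["emb"] @ emb.T).item() > sim_threshold:
            merged[-1]["end"] = end   # same scene: extend the previous clip
            merged[-1]["emb"] = emb
        else:
            merged.append({"start": start, "end": end, "emb": emb})
    return [(m["start"], m["end"]) for m in merged]
```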

Models Used

  • BLIP-2 (see the keyframe captioning example after this list)
  • MiniGPT-4
  • Video-LLaMA
  • VideoChat
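
Of the teachers listed, BLIP-2 and MiniGPT-4 caption individual frames, while Video-LLaMA and VideoChat consume video input through their own codebases. Below is a hedged example of using BLIP-2 through Hugging Face Transformers as a keyframe-level teacher; the checkpoint choice and generation settings are assumptions, not the paper's exact configuration.

```python
# Hedged example: BLIP-2 (one of the listed teachers) captioning a single keyframe.
# The checkpoint and generation settings are illustrative; video-level teachers such as
# Video-LLaMA and VideoChat are run from their own repositories.
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)


def blip2_caption(keyframe: Image.Image) -> str:
    """Generate one candidate caption for a clip's keyframe."""
    inputs = processor(images=keyframe, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()


# Usage: pass the middle frame of a clip, extracted with OpenCV or decord.
# caption = blip2_caption(Image.open("keyframe.jpg"))
```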

Datasets

The following datasets were used in this research:

  • Panda-70M
  • HD-VILA-100M
  • HowTo100M
  • MSR-VTT
  • MSVD

Evaluation Metrics

  • BLEU-4
  • ROUGE-L
  • METEOR
  • CIDEr
  • BERTScore
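
A hedged sketch of computing the captioning metrics above with the Hugging Face `evaluate` package follows; CIDEr is typically computed with the pycocoevalcap toolkit and is omitted here, and the example prediction/reference pair is made up.

```python
# Hedged sketch: computing the listed captioning metrics with the `evaluate` package.
# CIDEr is not covered here; it is usually computed with pycocoevalcap.
import evaluate

predictions = ["a person is adding chicken broth to a pot of quinoa"]
references = [["a person adds chicken broth to a pot of quinoa on a stove"]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

flat_refs = [r[0] for r in references]  # single-reference case for rouge/meteor/bertscore

print("BLEU-4:", bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=flat_refs)["rougeL"])
print("METEOR:", meteor.compute(predictions=predictions, references=flat_refs)["meteor"])
print("BERTScore F1:", bertscore.compute(predictions=predictions, references=flat_refs,
                                         lang="en")["f1"][0])
```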

Results

  • Video captioning models pretrained on Panda-70M outperform counterparts trained on existing datasets on benchmarks such as MSR-VTT and MSVD.
  • Pretraining on Panda-70M also improves video and text retrieval and text-driven video generation on the majority of reported metrics.

Limitations

The authors identified the following limitations:

  • The dataset primarily contains vocal-intensive videos, which limits content diversity; future work is needed to cover more non-vocal content and longer videos to broaden the range of applications.

Technical Requirements

  • Number of GPUs: 48
  • GPU Type: NVIDIA A100 80GB

Keywords

video captioning, multimodal learning, dataset creation, automatic captioning

External Resources

  • Project page: https://snap-research.github.io/Panda-70M