
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen (Snap Inc., University of California Merced), Aliaksandr Siarohin (Snap Inc.), Willi Menapace (Snap Inc., University of Trento), Ekaterina Deyneka (Snap Inc.), Hsiang-Wei Chao (Snap Inc.), Byung Eun Jeon (Snap Inc.), Yuwei Fang (Snap Inc.), Hsin-Ying Lee (Snap Inc.), Jian Ren (Snap Inc.), Ming-Hsuan Yang (University of California Merced), Sergey Tulyakov (Snap Inc.) (2024)

Paper Information
arXiv ID
2402.19479
Venue
Computer Vision and Pattern Recognition
Domain
computer vision
SOTA Claim
Yes
Reproducibility
7/10

Abstract

To establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model on the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

Summary

The paper introduces Panda-70M, a dataset of 70 million high-resolution video clips paired with semantically coherent captions generated by multiple cross-modality teacher models that draw on multimodal inputs such as subtitles and individual video frames. The authors emphasize the challenges of collecting high-quality video-text data, particularly the time-intensive process of manual annotation. To address this, they propose an automated captioning pipeline that curates a large-scale dataset, and they demonstrate its utility on downstream tasks including video captioning, video and text retrieval, and text-driven video generation. Extensive experiments show significant improvements on these tasks when training on Panda-70M compared to existing datasets.
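The curation pipeline can be summarized in a short sketch. This is a minimal illustration, not the authors' released code: the helper names (`split_into_clips`, `caption_with_teachers`, `select_best_caption`) and their stub behavior are hypothetical placeholders for the three stages described in the paper.

```python
# Minimal sketch of the Panda-70M annotation pipeline (hypothetical helper names,
# not the authors' released code). Stages mirror the paper's description:
#   1) split long videos into semantically consistent clips,
#   2) caption each clip with several cross-modality teacher models,
#   3) pick the best candidate with a fine-tuned video-to-text retrieval model.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Clip:
    video_path: str
    start_sec: float
    end_sec: float


def split_into_clips(video_path: str) -> List[Clip]:
    """Placeholder for semantics-aware splitting (shot detection + semantic merging)."""
    return [Clip(video_path, 0.0, 10.0)]  # dummy single clip


def caption_with_teachers(clip: Clip, teachers: List[Callable[[Clip], str]]) -> List[str]:
    """Run every teacher model on the clip and collect candidate captions."""
    return [teacher(clip) for teacher in teachers]


def select_best_caption(clip: Clip, candidates: List[str]) -> str:
    """Placeholder for the fine-tuned retrieval model; here we just pick the longest caption."""
    return max(candidates, key=len)


def annotate(video_path: str, teachers: List[Callable[[Clip], str]]) -> List[Tuple[Clip, str]]:
    annotations = []
    for clip in split_into_clips(video_path):
        candidates = caption_with_teachers(clip, teachers)
        annotations.append((clip, select_best_caption(clip, candidates)))
    return annotations


if __name__ == "__main__":
    # Dummy "teachers" standing in for BLIP-2, Video-LLaMA, etc.
    dummy_teachers = [lambda c: "a person cooking quinoa on a stove",
                      lambda c: "someone stirs food in a pot"]
    print(annotate("example.mp4", dummy_teachers))
```

In the actual pipeline, the selection step is learned: a retrieval model is finetuned on a manually annotated subset and then scores every candidate caption against its clip.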

Methods

This paper employs the following methods:

  • Automatic captioning pipeline
  • Semantics-aware video splitting (see the sketch after this list)
  • Cross-modality teacher models
  • Fine-grained video-to-text retrieval
  • Knowledge distillation for student model
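
A hedged sketch of the semantics-aware splitting step, referenced above: shot boundaries come from PySceneDetect, and adjacent shots are merged when their keyframe embeddings are semantically close. The paper uses its own boundary-detection and stitching models; CLIP and the 0.85 similarity threshold below are illustrative stand-ins.

```python
# Hedged sketch of semantics-aware splitting: shot boundaries from PySceneDetect,
# then adjacent shots are merged when their keyframe embeddings are semantically close.
import cv2
import torch
from PIL import Image
from scenedetect import detect, ContentDetector
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def middle_frame(video_path: str, start_frame: int, end_frame: int) -> Image.Image:
    """Grab the middle frame of a shot as its keyframe."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start_frame + end_frame) // 2)
    _, frame = cap.read()
    cap.release()
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))


def embed(image: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP image embedding (stand-in for the paper's embedding model)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)


def split_semantically(video_path: str, sim_threshold: float = 0.85):
    shots = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    merged = []
    for start, end in shots:
        emb = embed(middle_frame(video_path, start.get_frames(), end.get_frames()))
        if merged and (merged[-1]["emb"] @ emb.T).item() > sim_threshold:
            merged[-1]["end"] = end   # same scene: extend the previous clip
            merged[-1]["emb"] = emb
        else:
            merged.append({"start": start, "end": end, "emb": emb})
    return [(m["start"], m["end"]) for m in merged]
```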

Models Used

  • BLIP-2 (see the keyframe captioning example after this list)
  • MiniGPT-4
  • Video-LLaMA
  • VideoChat
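
Of the teachers listed, BLIP-2 and MiniGPT-4 caption individual frames, while Video-LLaMA and VideoChat consume video input through their own codebases. Below is a hedged example of using BLIP-2 through Hugging Face Transformers as a keyframe-level teacher; the checkpoint choice and generation settings are assumptions, not the paper's exact configuration.

```python
# Hedged example: BLIP-2 (one of the listed teachers) captioning a single keyframe.
# The checkpoint and generation settings are illustrative; video-level teachers such as
# Video-LLaMA and VideoChat are run from their own repositories.
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)


def blip2_caption(keyframe: Image.Image) -> str:
    """Generate one candidate caption for a clip's keyframe."""
    inputs = processor(images=keyframe, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()


# Usage: pass the middle frame of a clip, extracted with OpenCV or decord.
# caption = blip2_caption(Image.open("keyframe.jpg"))
```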

Datasets

The following datasets were used in this research:

  • Panda-70M
  • HD-VILA-100M
  • HowTo100M
  • MSR-VTT
  • MSVD

Evaluation Metrics

  • BLEU-4
  • ROUGE-L
  • METEOR
  • CIDEr
  • BERTScore
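
A hedged sketch of computing the captioning metrics above with the Hugging Face `evaluate` package follows; CIDEr is typically computed with the pycocoevalcap toolkit and is omitted here, and the example prediction/reference pair is made up.

```python
# Hedged sketch: computing the listed captioning metrics with the `evaluate` package.
# CIDEr is not covered here; it is usually computed with pycocoevalcap.
import evaluate

predictions = ["a person is adding chicken broth to a pot of quinoa"]
references = [["a person adds chicken broth to a pot of quinoa on a stove"]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

flat_refs = [r[0] for r in references]  # single-reference case for rouge/meteor/bertscore

print("BLEU-4:", bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=flat_refs)["rougeL"])
print("METEOR:", meteor.compute(predictions=predictions, references=flat_refs)["meteor"])
print("BERTScore F1:", bertscore.compute(predictions=predictions, references=flat_refs,
                                         lang="en")["f1"][0])
```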

Results

  • Video captioning models pretrained on Panda-70M outperform counterparts trained on existing datasets on benchmarks such as MSR-VTT and MSVD.
  • Pretraining on Panda-70M also improves video and text retrieval and text-driven video generation on the majority of reported metrics.

Limitations

The authors identified the following limitations:

  • The dataset primarily contains vocal-intensive videos, which limits content diversity; future work is needed to cover more non-vocal content and longer videos to broaden the range of applications.

Technical Requirements

  • Number of GPUs: 48
  • GPU Type: NVIDIA A100 80GB

Keywords

video captioning, multimodal learning, dataset creation, automatic captioning

External Resources

  • Project page: https://snap-research.github.io/Panda-70M