MSVD

Microsoft Research Video Description Corpus

Dataset Information
Modalities
Videos, Texts
Languages
Multilingual (workers annotated in the language of their choice)
Introduced
2011
License
Unknown
Homepage

Overview

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers on Mechanical Turk were paid to watch a short video snippet and then summarize the action in a single sentence. The result is a set of roughly parallel descriptions of more than 2,000 video snippets. Because the workers were urged to complete the task in the language of their choice, both paraphrase and bilingual alternations are captured in the data.
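Because each clip has many roughly parallel sentences in several languages, a common first step is to group the caption sentences by clip and language. The sketch below illustrates that structure on a toy CSV; the column names (`clip_id`, `language`, `description`) and sample rows are illustrative, not the official corpus schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mimicking the corpus layout: one caption per row,
# with a clip identifier, the annotation language, and the sentence.
# (Column names are illustrative, not the official MSVD schema.)
SAMPLE = """\
clip_id,language,description
vid0001,English,A man is slicing an onion.
vid0001,English,Someone chops a vegetable.
vid0001,Spanish,Un hombre corta una cebolla.
vid0002,English,A cat jumps onto a table.
"""

def group_captions(csv_text):
    """Group caption sentences by clip id, keeping per-language lists."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["clip_id"]][row["language"]].append(row["description"])
    return groups

captions = group_captions(SAMPLE)
print(len(captions["vid0001"]["English"]))  # 2 English paraphrases of one clip
```

Grouping this way exposes the paraphrase pairs (same clip, same language) and bilingual pairs (same clip, different languages) that the corpus was designed to capture.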

Source: https://www.microsoft.com/en-us/download/details.aspx?id=52422&from=https%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F38cf15fd-b8df-477e-a4e4-a4680caa75af%2F
Image Source: https://arxiv.org/pdf/1609.06782.pdf

Variants: MSVD

Associated Benchmarks

This dataset is used in 3 benchmarks:

  • Video Captioning
  • Video Retrieval
  • Zero-Shot Video Retrieval

Recent Benchmark Submissions

Task | Model | Paper | Date
Zero-Shot Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Zero-Shot Video Retrieval | InternVideo2-1B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Zero-Shot Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Video Captioning | RTQ | RTQ: Rethinking Video-language Understanding Based … | 2023-12-01
Video Retrieval | Side4Video | Side4Video: Spatial-Temporal Side Network for … | 2023-11-27
Video Captioning | HowToCaption | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | HowToCaption | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | VAST, HowToCaption-finetuned | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03
Zero-Shot Video Retrieval | LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03
Video Retrieval | PAU | Prototype-based Aleatoric Uncertainty Quantification for … | 2023-09-29
Video Captioning | CoCap (ViT/L14) | Accurate and Fast Compressed Video … | 2023-09-22
Video Retrieval | DMAE (ViT-B/32) | Dual-Modal Attention-Enhanced Text-Video Retrieval with … | 2023-09-20
Video Captioning | COSA | COSA: Concatenated Sample Pretrained Vision-Language … | 2023-06-15
Video Retrieval | VLAB | VLAB: Enhancing Video Language Pre-training … | 2023-05-22
Video Captioning | VLAB | VLAB: Enhancing Video Language Pre-training … | 2023-05-22
Video Captioning | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Video Captioning | MaMMUT | MaMMUT: A Simple Architecture for … | 2023-03-29

Research Papers

Recent papers with results on this dataset: