MSVD

Microsoft Research Video Description Corpus

Dataset Information
Modalities
Videos, Texts
Languages
Multilingual (workers annotated in the language of their choice)
Introduced
2011
License
Unknown
Homepage

Overview

The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers on Mechanical Turk were paid to watch a short video snippet and then summarize the action in a single sentence. The result is a set of roughly parallel descriptions of more than 2,000 video snippets. Because the workers were urged to complete the task in the language of their choice, both paraphrase and bilingual alternations are captured in the data.
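Because each clip has many roughly parallel sentences in several languages, a common first step is to group the caption sentences by clip and language. The sketch below illustrates that structure on a toy CSV; the column names (`clip_id`, `language`, `description`) and sample rows are illustrative, not the official corpus schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mimicking the corpus layout: one caption per row,
# with a clip identifier, the annotation language, and the sentence.
# (Column names are illustrative, not the official MSVD schema.)
SAMPLE = """\
clip_id,language,description
vid0001,English,A man is slicing an onion.
vid0001,English,Someone chops a vegetable.
vid0001,Spanish,Un hombre corta una cebolla.
vid0002,English,A cat jumps onto a table.
"""

def group_captions(csv_text):
    """Group caption sentences by clip id, keeping per-language lists."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["clip_id"]][row["language"]].append(row["description"])
    return groups

captions = group_captions(SAMPLE)
print(len(captions["vid0001"]["English"]))  # 2 English paraphrases of one clip
```

Grouping this way exposes the paraphrase pairs (same clip, same language) and bilingual pairs (same clip, different languages) that the corpus was designed to capture.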

Source: https://www.microsoft.com/en-us/download/details.aspx?id=52422&from=https%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F38cf15fd-b8df-477e-a4e4-a4680caa75af%2F
Image Source: https://arxiv.org/pdf/1609.06782.pdf

Variants: MSVD

Associated Benchmarks

This dataset is used in 3 benchmarks:

  • Video Captioning
  • Video Retrieval
  • Zero-Shot Video Retrieval

Recent Benchmark Submissions

Task | Model | Paper | Date
Zero-Shot Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Zero-Shot Video Retrieval | InternVideo2-1B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Zero-Shot Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Video Captioning | RTQ | RTQ: Rethinking Video-language Understanding Based … | 2023-12-01
Video Retrieval | Side4Video | Side4Video: Spatial-Temporal Side Network for … | 2023-11-27
Video Captioning | HowToCaption | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | HowToCaption | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | VAST, HowToCaption-finetuned | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03
Zero-Shot Video Retrieval | LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03
Video Retrieval | PAU | Prototype-based Aleatoric Uncertainty Quantification for … | 2023-09-29
Video Captioning | CoCap (ViT/L14) | Accurate and Fast Compressed Video … | 2023-09-22
Video Retrieval | DMAE (ViT-B/32) | Dual-Modal Attention-Enhanced Text-Video Retrieval with … | 2023-09-20
Video Captioning | COSA | COSA: Concatenated Sample Pretrained Vision-Language … | 2023-06-15
Video Retrieval | VLAB | VLAB: Enhancing Video Language Pre-training … | 2023-05-22
Video Captioning | VLAB | VLAB: Enhancing Video Language Pre-training … | 2023-05-22
Video Captioning | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Video Captioning | MaMMUT | MaMMUT: A Simple Architecture for … | 2023-03-29

Research Papers

Recent papers with results on this dataset: