MSR-VTT

Name: MSR-VTT
Published: 2016-01-01
License: Unknown

Dataset Information

Modalities

Videos, Texts

Languages

English

Introduced

2016

License

Unknown

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences by Amazon Mechanical Turk workers. The captions contain about 29,000 unique words in total. The standard split uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.

Source: Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Variants: MSR-VTT, MSR-VTT-1kA, MSRVTT
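
The split sizes quoted in the overview can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the names `SPLITS`, `CAPTIONS_PER_CLIP`, and `split_summary` are hypothetical and do not come from any official MSR-VTT tooling, and the per-split caption counts assume that every clip carries exactly 20 captions, as the overview states.

```python
# Split sizes and captions per clip as quoted in the overview.
SPLITS = {"train": 6513, "val": 497, "test": 2990}
CAPTIONS_PER_CLIP = 20  # assumption: exactly 20 captions for every clip


def split_summary(splits, captions_per_clip):
    """Return clip count, caption count, and share of the dataset per split."""
    total = sum(splits.values())
    return {
        name: {
            "clips": n,
            "captions": n * captions_per_clip,
            "fraction": n / total,
        }
        for name, n in splits.items()
    }


summary = split_summary(SPLITS, CAPTIONS_PER_CLIP)
total_clips = sum(SPLITS.values())
total_captions = total_clips * CAPTIONS_PER_CLIP

for name, row in summary.items():
    print(f"{name}: {row['clips']} clips, {row['captions']} captions, "
          f"{row['fraction']:.1%} of the dataset")
print(f"total: {total_clips} clips, {total_captions} captions")
```

The three splits add up to the 10,000 clips mentioned in the overview, which at 20 sentences per clip implies 200,000 captions overall.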

Associated Benchmarks

This dataset is used in 7 benchmarks:

Video Question Answering - Metrics: Accuracy
Video Captioning - Metrics: CIDEr, METEOR, ROUGE-L, BLEU-4, GS
Video Generation - Metrics: FVD16, Inception score
Video Retrieval - Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Mean Rank, text-to-video Median Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank, video-to-text Mean Rank
Text to Video Retrieval - Metrics: text-to-video R@1
Text-to-Video Generation - Metrics: FVD, CLIPSIM, CLIP-FID, FID
Zero-Shot Video Retrieval - Metrics: text-to-video R@1, text-to-video R@5, text-to-video R@10, text-to-video Median Rank, text-to-video Mean Rank, video-to-text R@1, video-to-text R@5, video-to-text R@10, video-to-text Median Rank
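
The retrieval metrics listed above (R@K, Median Rank, Mean Rank) are usually derived from a query-by-candidate similarity matrix. The following is a minimal sketch of that computation, not the evaluation code of any particular benchmark or paper. It assumes one ground-truth candidate per query, located on the diagonal (query i matches candidate i), and it resolves ties in favor of the correct candidate. The function names and the toy similarity values are made up for illustration.

```python
from statistics import mean, median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct candidate for each query.

    sim[i][j] is the similarity of query i to candidate j. The correct
    candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (as a percentage), Median Rank, and Mean Rank."""
    ranks = ranks_from_similarity(sim)
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy text-to-video similarities: 4 text queries (rows) x 4 videos (columns).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.5, 0.1, 0.2],
    [0.1, 0.2, 0.7, 0.3],
    [0.6, 0.7, 0.8, 0.4],
]

text_to_video = retrieval_metrics(toy_sim)

# Video-to-text uses the same matrix with the roles of rows and columns swapped.
video_to_text = retrieval_metrics([list(col) for col in zip(*toy_sim)])

print("text-to-video:", text_to_video)
print("video-to-text:", video_to_text)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank. Published numbers may differ in details such as tie handling or how multiple captions per video are treated, so results from this sketch should not be compared directly with leaderboard entries.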

Recent Benchmark Submissions

Task	Model	Paper	Date
Zero-Shot Video Retrieval	FluxViT-S	Make Your Training Flexible: Towards …	2025-03-18
Zero-Shot Video Retrieval	FluxViT-B	Make Your Training Flexible: Towards …	2025-03-18
Zero-Shot Video Retrieval	GRAM	Gramian Multimodal Representation Learning and …	2024-12-16
Video Retrieval	GRAM	Gramian Multimodal Representation Learning and …	2024-12-16
Video Question Answering	LocVLM-Vid-B	Learning to Localize Objects Improves …	2024-04-11
Zero-Shot Video Retrieval	InternVideo2-6B	InternVideo2: Scaling Foundation Models for …	2024-03-22
Video Retrieval	InternVideo2-6B	InternVideo2: Scaling Foundation Models for …	2024-03-22
Zero-Shot Video Retrieval	InternVideo2-1B	InternVideo2: Scaling Foundation Models for …	2024-03-22
Video Retrieval	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging …	2024-03-20
Zero-Shot Video Retrieval	vid-TLDR (UMT-L)	vid-TLDR: Training Free Token merging …	2024-03-20
Text-to-Video Generation	Snap Video (288×288)	Snap Video: Scaled Spatiotemporal Transformers …	2024-02-22
Text-to-Video Generation	Snap Video (512x288)	Snap Video: Scaled Spatiotemporal Transformers …	2024-02-22
Text-to-Video Generation	Video-LaVIT	Video-LaVIT: Unified Video-Language Pre-training with …	2024-02-05
Zero-Shot Video Retrieval	Norton	Multi-granularity Correspondence Learning from Long-term …	2024-01-30
Text-to-Video Generation	TF-T2V	A Recipe for Scaling up …	2023-12-25
Text-to-Video Generation	VideoPoet	VideoPoet: A Large Language Model …	2023-12-21
Text-to-Video Generation	HiGen	Hierarchical Spatio-temporal Decoupling for Text-to-Video …	2023-12-07
Video Captioning	RTQ	RTQ: Rethinking Video-language Understanding Based …	2023-12-01
Video Generation	VideoAssembler (Zero-Shot, 256x256, class-conditional)	MagDiff: Multi-Alignment Diffusion for High-Fidelity …	2023-11-29
Text-to-Video Generation	PixelDance	Make Pixels Dance: High-Dynamic Video …	2023-11-18

Research Papers

Recent papers with results on this dataset:
