MSR-VTT

Dataset Information
Modalities: Videos, Texts
Languages: English
Introduced: 2016
License: Unknown

Overview

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences by Amazon Mechanical Turk workers. The captions contain about 29,000 unique words in total. The standard split uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.

Source: Learning to Discretely Compose Reasoning Module Networks for Video Captioning
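
As a quick sanity check on the figures quoted above, the following minimal Python sketch tallies clips and captions per split. It uses only the numbers stated in the overview; the names `SPLIT_SIZES`, `CAPTIONS_PER_CLIP`, and `split_summary` are illustrative and not part of any official MSR-VTT tooling.

```python
# Split sizes and captions per clip as quoted in the overview.
# All names here are illustrative, not an official MSR-VTT API.
SPLIT_SIZES = {"train": 6513, "val": 497, "test": 2990}
CAPTIONS_PER_CLIP = 20


def split_summary(sizes, captions_per_clip):
    """Return per-split clip counts, caption counts, and clip fractions."""
    total_clips = sum(sizes.values())
    return {
        name: {
            "clips": n_clips,
            "captions": n_clips * captions_per_clip,
            "fraction": n_clips / total_clips,
        }
        for name, n_clips in sizes.items()
    }


summary = split_summary(SPLIT_SIZES, CAPTIONS_PER_CLIP)
total_clips = sum(s["clips"] for s in summary.values())
total_captions = sum(s["captions"] for s in summary.values())

for name, stats in summary.items():
    print(f"{name}: {stats['clips']} clips, "
          f"{stats['captions']} captions, "
          f"{stats['fraction']:.1%} of all clips")
print(f"total: {total_clips} clips, {total_captions} captions")
```

With these figures the three splits add up to the full 10,000 clips, and 20 captions per clip gives 200,000 captions overall.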

Variants: MSR-VTT, MSR-VTT-1kA, MSRVTT

Associated Benchmarks

This dataset is used in 7 benchmarks:

  • Video Question Answering
  • Video Captioning
  • Video Generation
  • Video Retrieval
  • Text to Video Retrieval
  • Text-to-Video Generation
  • Zero-Shot Video Retrieval

Recent Benchmark Submissions

Task | Model | Paper | Date
Zero-Shot Video Retrieval | FluxViT-S | Make Your Training Flexible: Towards … | 2025-03-18
Zero-Shot Video Retrieval | FluxViT-B | Make Your Training Flexible: Towards … | 2025-03-18
Zero-Shot Video Retrieval | GRAM | Gramian Multimodal Representation Learning and … | 2024-12-16
Video Retrieval | GRAM | Gramian Multimodal Representation Learning and … | 2024-12-16
Video Question Answering | LocVLM-Vid-B | Learning to Localize Objects Improves … | 2024-04-11
Zero-Shot Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Zero-Shot Video Retrieval | InternVideo2-1B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Zero-Shot Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Text-to-Video Generation | Snap Video (288×288) | Snap Video: Scaled Spatiotemporal Transformers … | 2024-02-22
Text-to-Video Generation | Snap Video (512x288) | Snap Video: Scaled Spatiotemporal Transformers … | 2024-02-22
Text-to-Video Generation | Video-LaVIT | Video-LaVIT: Unified Video-Language Pre-training with … | 2024-02-05
Zero-Shot Video Retrieval | Norton | Multi-granularity Correspondence Learning from Long-term … | 2024-01-30
Text-to-Video Generation | TF-T2V | A Recipe for Scaling up … | 2023-12-25
Text-to-Video Generation | VideoPoet | VideoPoet: A Large Language Model … | 2023-12-21
Text-to-Video Generation | HiGen | Hierarchical Spatio-temporal Decoupling for Text-to-Video … | 2023-12-07
Video Captioning | RTQ | RTQ: Rethinking Video-language Understanding Based … | 2023-12-01
Video Generation | VideoAssembler (Zero-Shot, 256x256, class-conditional) | MagDiff: Multi-Alignment Diffusion for High-Fidelity … | 2023-11-29
Text-to-Video Generation | PixelDance | Make Pixels Dance: High-Dynamic Video … | 2023-11-18

Research Papers

Recent papers with results on this dataset: