ViTT

Video Timeline Tags

Dataset Information

Modalities: Videos
License: Unknown

Overview

The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the remaining 2,329 have been annotated twice or more. A total of 12,461 sets of annotations are released. The videos are drawn from the YouTube-8M dataset.

An annotation has the following format:

{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}
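
As a rough illustration of how a record in this format might be consumed, the sketch below parses the example and pairs each tag with the span up to the next tag. It assumes that each timestamp marks the start of a segment and that the last segment has no known end; neither is stated above. The timestamp unit is also not stated, so values are left unconverted. The helper name tags_to_segments is hypothetical and not part of any released tooling.

```python
import json

# The example record from above, unchanged.
RAW = """
{
  "id": "FmTp",
  "annotations": [
    {"timestamp": 260, "tag": "Opening"},
    {"timestamp": 16000, "tag": "Displaying technique"},
    {"timestamp": 23990, "tag": "Showing foot positioning"},
    {"timestamp": 55530, "tag": "Demonstrating crossover"},
    {"timestamp": 114100, "tag": "Closing"}
  ]
}
"""


def tags_to_segments(record):
    """Return (start, end, tag) tuples for one annotation record.

    Assumption: a tag's segment runs from its own timestamp to the next
    tag's timestamp. The final segment gets end=None because the record
    does not carry the video duration.
    """
    anns = sorted(record["annotations"], key=lambda a: a["timestamp"])
    segments = []
    for i, ann in enumerate(anns):
        end = anns[i + 1]["timestamp"] if i + 1 < len(anns) else None
        segments.append((ann["timestamp"], end, ann["tag"]))
    return segments


record = json.loads(RAW)
segments = tags_to_segments(record)
for start, end, tag in segments:
    print(record["id"], start, end, tag)
```

Sorting by timestamp first keeps the pairing correct even if a record lists its tags out of order, which the example does not but which is cheap to guard against.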

Source: Video Timeline Tags (ViTT)

Variants: ViTT

Associated Benchmarks

This dataset is used in 1 benchmark, for the Dense Video Captioning task.

Recent Benchmark Submissions

Task | Model | Paper | Date
Dense Video Captioning | HiCM² | HiCM²: Hierarchical Compact Memory Modeling … | 2024-12-19
Dense Video Captioning | Vid2Seq | Vid2Seq: Large-Scale Pretraining of a … | 2023-02-27
Dense Video Captioning | E2ESG | End-to-end Dense Video Captioning as … | 2022-04-18
