ViTT

Name: ViTT
License: Unknown

Video Timeline Tags

Dataset Information

Modalities: Videos
License: Unknown
Homepage: Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

The ViTT dataset consists of human-produced, segment-level annotations for 8,169 videos. Of these, 5,840 videos were annotated once, and the remaining videos were annotated twice or more. In total, 12,461 sets of annotations are released. The videos are drawn from the YouTube-8M dataset.
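The counts above imply how the multiply annotated videos break down. The short sketch below only does the arithmetic on the three figures quoted in the overview; the variable names are illustrative and nothing here is read from the dataset itself.

```python
# Figures quoted in the overview.
total_videos = 8169
annotated_once = 5840
total_annotation_sets = 12461

# Videos annotated twice or more are whatever remains.
multi_annotated_videos = total_videos - annotated_once

# Each singly annotated video contributes exactly one annotation set,
# so the remaining sets belong to the multiply annotated videos.
multi_annotation_sets = total_annotation_sets - annotated_once

# Average number of annotation sets per multiply annotated video.
avg_sets = multi_annotation_sets / multi_annotated_videos

print(f"videos annotated twice or more: {multi_annotated_videos}")
print(f"annotation sets on those videos: {multi_annotation_sets}")
print(f"average sets per such video: {avg_sets:.2f}")
```

So roughly 2,300 videos carry several independent annotations each, which could be useful for measuring agreement between annotators.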

An annotation has the following format:

{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}
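
As a rough illustration of how such a record might be consumed, the sketch below parses the example above and turns the tags into timed segments. It rests on two assumptions that the text here does not state: that `timestamp` is in milliseconds, and that each tag marks the start of a segment that runs until the next tag. The helper name `to_segments` is hypothetical.

```python
import json

# The example record from above, embedded as a string for a self-contained demo.
RAW = """
{
  "id": "FmTp",
  "annotations": [
    {"timestamp": 260, "tag": "Opening"},
    {"timestamp": 16000, "tag": "Displaying technique"},
    {"timestamp": 23990, "tag": "Showing foot positioning"},
    {"timestamp": 55530, "tag": "Demonstrating crossover"},
    {"timestamp": 114100, "tag": "Closing"}
  ]
}
"""


def to_segments(record):
    """Return (tag, start_s, end_s) tuples, ordered by start time.

    Assumes timestamps are milliseconds and that a segment ends where the
    next one starts. The last segment has no known end, so end_s is None.
    """
    anns = sorted(record["annotations"], key=lambda a: a["timestamp"])
    segments = []
    for i, ann in enumerate(anns):
        start_s = ann["timestamp"] / 1000.0
        if i + 1 < len(anns):
            end_s = anns[i + 1]["timestamp"] / 1000.0
        else:
            end_s = None  # the video duration is not part of the record
        segments.append((ann["tag"], start_s, end_s))
    return segments


record = json.loads(RAW)
segments = to_segments(record)
for tag, start_s, end_s in segments:
    end_text = "end of video" if end_s is None else f"{end_s:.2f}s"
    print(f"{start_s:8.2f}s -> {end_text:>12}  {tag}")
```

The end of the final segment is left open because the record carries no video duration; it would have to come from the video metadata.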

Source: Video Timeline Tags (ViTT)

Variants: ViTT

Associated Benchmarks

This dataset is used in 1 benchmark:

Dense Video Captioning - Metrics: SODA, CIDEr, METEOR

Recent Benchmark Submissions

Task	Model	Paper	Date
Dense Video Captioning	HiCM²	HiCM$^2$: Hierarchical Compact Memory Modeling …	2024-12-19
Dense Video Captioning	Vid2Seq	Vid2Seq: Large-Scale Pretraining of a …	2023-02-27
Dense Video Captioning	E2ESG	End-to-end Dense Video Captioning as …	2022-04-18

Research Papers

Recent papers with results on this dataset:
