Video Timeline Tags
The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the remaining 2,329 have been annotated two or more times. A total of 12,461 annotation sets are released. The videos are drawn from the YouTube-8M dataset.
An annotation has the following format:
{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}
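As a minimal sketch of how such a record might be consumed, the Python snippet below parses the example annotation and lists each tag with its start time in seconds. It assumes the timestamps are in milliseconds, which is not stated here and should be checked against the dataset's own documentation. The helper name `to_segments` is hypothetical and not part of any ViTT tooling.

```python
import json

# The example annotation set, copied from the format shown above.
record_json = """
{
  "id": "FmTp",
  "annotations": [
    {"timestamp": 260, "tag": "Opening"},
    {"timestamp": 16000, "tag": "Displaying technique"},
    {"timestamp": 23990, "tag": "Showing foot positioning"},
    {"timestamp": 55530, "tag": "Demonstrating crossover"},
    {"timestamp": 114100, "tag": "Closing"}
  ]
}
"""


def to_segments(record):
    """Return (start_seconds, tag) pairs ordered by start time.

    ASSUMPTION: "timestamp" is in milliseconds. This is an
    illustrative helper, not an official ViTT API.
    """
    ordered = sorted(record["annotations"], key=lambda a: a["timestamp"])
    return [(a["timestamp"] / 1000.0, a["tag"]) for a in ordered]


record = json.loads(record_json)
segments = to_segments(record)

# Print one line per tagged segment start.
for start, tag in segments:
    print(f"{start:8.2f}s  {tag}")
```

Each tag marks only where a segment begins. The record carries no end times or video duration, so the end of the last segment cannot be recovered from the annotation alone.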
Source: Video Timeline Tags (ViTT)
Variants: ViTT
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Dense Video Captioning | HiCM² | HiCM$^2$: Hierarchical Compact Memory Modeling … | 2024-12-19 |
Dense Video Captioning | Vid2Seq | Vid2Seq: Large-Scale Pretraining of a … | 2023-02-27 |
Dense Video Captioning | E2ESG | End-to-end Dense Video Captioning as … | 2022-04-18 |