VidSTG is a spatio-temporal video grounding dataset built on the video relation dataset VidOR. The goal of the Spatio-Temporal Video Grounding (STVG) task is to localize the spatio-temporal section of an untrimmed video that matches a given sentence depicting an object. VidOR contains 7,000, 835, and 2,165 videos for training, validation, and testing, respectively; VidSTG contains 5,563, 618, and 743 videos for the same splits.
Source: https://github.com/Guaranteer/VidSTG-Dataset
Variants: VidSTG
This dataset is used in 1 benchmark:
| Task | Model | Paper | Date |
|---|---|---|---|
| Spatio-Temporal Video Grounding | TA-STVG | Knowing Your Target: Target-Aware Transformer … | 2025-02-16 |
| Spatio-Temporal Video Grounding | CG-STVG | Context-Guided Spatio-Temporal Video Grounding | 2024-01-03 |
| Spatio-Temporal Video Grounding | TubeDETR | TubeDETR: Spatio-Temporal Video Grounding with … | 2022-03-30 |
Recent papers with results on this dataset: