A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate the shots together to understand the story behind them.
In this work, we present a new multi-shot video understanding benchmark, Shot2Story, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions.
Preliminary experiments show the challenges of generating long and comprehensive video summaries. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question-answering, promoting an underexplored setting of video understanding with detailed summaries.
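To make the annotation structure concrete, below is a minimal Python sketch of how one multi-shot record (shot-level visual captions, narration captions, and a whole-video summary) might be represented and loaded. The field names (`video_id`, `start_frame`, `visual_caption`, `narration_caption`, `summary`) are illustrative assumptions for this sketch, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List
import json


@dataclass
class Shot:
    """A single shot within a multi-shot video (hypothetical fields)."""
    start_frame: int
    end_frame: int
    visual_caption: str      # caption of the visual content of this shot
    narration_caption: str   # caption of the human narration in this shot


@dataclass
class VideoEntry:
    """One annotated multi-shot video: its shots plus a video-level summary."""
    video_id: str
    shots: List[Shot] = field(default_factory=list)
    summary: str = ""        # comprehensive summary of the whole video


def load_entries(path: str) -> List[VideoEntry]:
    """Parse a JSON annotation file into VideoEntry objects (assumed schema)."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    entries = []
    for item in raw:
        shots = [
            Shot(
                start_frame=s["start_frame"],
                end_frame=s["end_frame"],
                visual_caption=s["visual_caption"],
                narration_caption=s["narration_caption"],
            )
            for s in item["shots"]
        ]
        entries.append(
            VideoEntry(
                video_id=item["video_id"],
                shots=shots,
                summary=item["summary"],
            )
        )
    return entries
```

Such a representation maps directly onto the tasks above: shot-level captioning consumes individual `Shot` records, while multi-shot summarization targets the `summary` field of a `VideoEntry`.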
Variants: Shot2Story20K
This dataset is used in 3 benchmarks:
| Task | Model | Paper | Date |
|---|---|---|---|
| Video Captioning | Shotluck-Holmes (3.1B) | Shotluck Holmes: A Family of … | 2024-05-31 |
| Video Summarization | Shotluck-Holmes (3.1B) | Shotluck Holmes: A Family of … | 2024-05-31 |
| Video Captioning | Shot2Story | Shot2Story20K: A New Benchmark for … | 2023-12-16 |
| Video Summarization | SUM-shot | Shot2Story20K: A New Benchmark for … | 2023-12-16 |
| Video Narration Captioning | Ours | Shot2Story20K: A New Benchmark for … | 2023-12-16 |
Recent papers with results on this dataset: