The TVQA dataset is a large-scale video question answering dataset built from six popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, and Castle). It contains 152,545 QA pairs drawn from 21,793 TV show clips, split 8:1:1 into training, validation, and test sets. For each clip, TVQA provides the video frames extracted at 3 FPS, the subtitles aligned with the clip, and a query consisting of a question and five answer candidates, exactly one of which is correct.
Source: Two-stream Spatiotemporal Feature for Video QA Task
Image Source: https://arxiv.org/abs/1809.01696
Variants: TVQA
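As a rough illustration of the description above, the sketch below models one multiple-choice sample and the arithmetic implied by the 3 FPS frame rate and the 8:1:1 split. The class name, field names, and helper functions are hypothetical and do not reflect the official TVQA file schema. The split sizes it computes are only what the ratio implies, so the official split files may differ by a few examples.

```python
from dataclasses import dataclass
from typing import List, Tuple

FRAME_RATE_FPS = 3          # frame extraction rate stated in the description
SPLIT_RATIO = (8, 1, 1)     # train : validation : test


@dataclass
class VideoQASample:
    """Hypothetical container for one multiple-choice video QA example."""
    question: str
    candidates: List[str]   # answer candidates; exactly one is correct
    answer_index: int       # position of the correct candidate
    subtitles: List[str]    # subtitle lines aligned with the clip
    num_frames: int         # number of frames extracted from the clip

    def __post_init__(self) -> None:
        # The correct answer must point at one of the candidates.
        if not 0 <= self.answer_index < len(self.candidates):
            raise ValueError("answer_index must refer to a candidate")

    def is_correct(self, predicted_index: int) -> bool:
        """Return True when the predicted candidate is the correct one."""
        return predicted_index == self.answer_index


def expected_frame_count(duration_seconds: float,
                         fps: int = FRAME_RATE_FPS) -> int:
    """Approximate number of frames for a clip sampled at a fixed rate."""
    return int(duration_seconds * fps)


def split_sizes(total: int,
                ratio: Tuple[int, int, int] = SPLIT_RATIO) -> Tuple[int, int, int]:
    """Sizes implied by a train:val:test ratio; the remainder goes to train."""
    unit = sum(ratio)
    val = total * ratio[1] // unit
    test = total * ratio[2] // unit
    train = total - val - test
    return train, val, test
```

Calling `split_sizes(152545)` gives the approximate number of QA pairs per split, and `expected_frame_count` gives a quick estimate of how many frames a clip of a given length contributes.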
This dataset is used in 2 benchmarks:
Task | Model | Paper | Date |
---|---|---|---|
Zero-Shot Learning | VideoChat2 | MVBench: A Comprehensive Multi-modal Video … | 2023-11-28 |
Video Question Answering | LLaMA-VQA | Large Language Models are Temporal … | 2023-10-24 |
Video Question Answering | VindLU | VindLU: A Recipe for Effective … | 2022-12-09 |
Video Question Answering | FrozenBiLM | Zero-Shot Video Question Answering via … | 2022-06-16 |
Video Question Answering | iPerceive (Chadha et al., 2020) | iPerceive: Applying Common-Sense Reasoning to … | 2020-11-16 |
Video Question Answering | Hero w/ pre-training | HERO: Hierarchical Encoder for Video+Language … | 2020-05-01 |
Video Question Answering | STAGE (Lei et al., 2019) | TVQA+: Spatio-Temporal Grounding for Video … | 2019-04-25 |