YouCook2

Dataset Information
Modalities: Videos, Texts
Languages: English
Introduced: 2018

Overview

YouCook2 is the largest task-oriented instructional video dataset in the vision community. It contains 2,000 long, untrimmed videos covering 89 cooking recipes; on average, each recipe has 22 videos. The procedure steps in each video are annotated with temporal boundaries and described by imperative English sentences. The videos were downloaded from YouTube and are all filmed from a third-person viewpoint. They are unconstrained: the cooking may be performed by individuals in their own homes and recorded with unfixed cameras. YouCook2 covers a wide range of recipe types and cooking styles from around the world.
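The annotation scheme described above (ordered procedure steps, each with temporal boundaries and an imperative sentence) can be pictured with a small data structure. The sketch below is illustrative only: the class names, field names, and sample values are hypothetical and do not reproduce the official YouCook2 annotation file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One annotated procedure step (hypothetical schema, not the official one)."""
    start_sec: float  # temporal boundary: where the step begins
    end_sec: float    # temporal boundary: where the step ends
    sentence: str     # imperative English description of the step


@dataclass
class AnnotatedVideo:
    """One untrimmed video with its ordered steps (hypothetical schema)."""
    video_id: str
    recipe: str
    duration_sec: float
    steps: List[Step]


def annotated_fraction(video: AnnotatedVideo) -> float:
    """Fraction of the video covered by steps, assuming steps do not overlap."""
    covered = sum(s.end_sec - s.start_sec for s in video.steps)
    return covered / video.duration_sec


# Made-up sample record, for illustration only.
example = AnnotatedVideo(
    video_id="example_0001",
    recipe="grilled cheese",
    duration_sec=200.0,
    steps=[
        Step(10.0, 40.0, "spread butter on two slices of bread"),
        Step(50.0, 120.0, "place cheese between the slices and grill"),
    ],
)

print(annotated_fraction(example))  # 0.5
```

Because the videos are untrimmed, a sizeable share of each one may fall outside any annotated step; a helper like `annotated_fraction` is one way to measure that for a given record.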

Source: http://youcook2.eecs.umich.edu/
Image Source: https://competitions.codalab.org/competitions/20594

Variants: YouCook2

Associated Benchmarks

This dataset is used in 5 benchmarks.

Recent Benchmark Submissions

Task | Model | Paper | Date
Dense Video Captioning | HiCM² | HiCM²: Hierarchical Compact Memory Modeling … | 2024-12-19
Dense Video Captioning | CM² | Do You Remember? Dense Video … | 2024-04-11
Video Captioning | MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model … | 2024-04-08
Zero-Shot Video Retrieval | Norton | Multi-granularity Correspondence Learning from Long-term … | 2024-01-30
Long Video Retrieval (Background Removed) | Norton | Multi-granularity Correspondence Learning from Long-term … | 2024-01-30
Video Retrieval | OmniVec (pretrained) | OmniVec: Learning robust representations with … | 2023-11-07
Video Retrieval | OmniVec | OmniVec: Learning robust representations with … | 2023-11-07
Zero-Shot Video Retrieval | HowToCaption | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Video Captioning | HowToCaption | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Zero-Shot Video Retrieval | VAST, HowToCaption-finetuned | HowToCaption: Prompting LLMs to Transform … | 2023-10-07
Video Captioning | COSA | COSA: Concatenated Sample Pretrained Vision-Language … | 2023-06-15
Video Retrieval | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Video Captioning | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Video Captioning | UniVL + MELTR | MELTR: Meta Loss Transformer for … | 2023-03-23
Video Retrieval | UniVL + MELTR | MELTR: Meta Loss Transformer for … | 2023-03-23
Video Captioning | TextKG | Text with Knowledge Graph Augmented … | 2023-03-22
Dense Video Captioning | GVL | Learning Grounded Vision-Language Representation for … | 2023-03-11
Dense Video Captioning | Vid2Seq | Vid2Seq: Large-Scale Pretraining of a … | 2023-02-27
Long Video Retrieval (Background Removed) | TempCLR | TempCLR: Temporal Alignment Representation with … | 2022-12-28
Video Retrieval | VideoCoCa (zero-shot) | VideoCoCa: Video-Text Modeling with Zero-Shot … | 2022-12-09

Research Papers

Recent papers with results on this dataset: