DiDeMo

Distinct Describable Moments

Dataset Information
Modalities: Videos, Texts
Introduced: 2017

Overview

The Distinct Describable Moments (DiDeMo) dataset is one of the largest and most diverse datasets for temporal localization of events in videos given natural-language descriptions. The videos are collected from Flickr, and each video is trimmed to a maximum of 30 seconds. To reduce annotation complexity, each video is divided into 5-second segments. The dataset is split into training, validation, and test sets containing 8,395, 1,065, and 1,004 videos, respectively. It contains a total of 26,892 moments, and a single moment may be associated with descriptions from multiple annotators. The descriptions in DiDeMo are detailed and refer to camera movement, temporal transition indicators, and activities. Moreover, the descriptions are verified so that each one refers to a single, distinct moment.

Source: Weakly Supervised Video Moment Retrieval From Text Queries
Image Source: https://www.di.ens.fr/~miech/datasetviz/
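
To make the segment-based annotation layout concrete, below is a minimal sketch of how the 5-second segment indices could be mapped to timestamps when loading the data. The JSON layout and field names used here (`video`, `description`, `times`) are assumptions about the commonly distributed annotation files, not an official loader.

```python
# Minimal sketch: converting DiDeMo segment-index annotations to timestamps.
# Assumes each annotation stores a video ID, a free-form description, and
# per-annotator [start_segment, end_segment] pairs over 5-second segments
# (the field names below are assumptions, not guaranteed).

import json
from collections import Counter

SEGMENT_SECONDS = 5  # each video is divided into 5-second segments


def segments_to_seconds(start_seg: int, end_seg: int) -> tuple[int, int]:
    """Map inclusive segment indices (e.g. [1, 1]) to (start, end) in seconds."""
    return start_seg * SEGMENT_SECONDS, (end_seg + 1) * SEGMENT_SECONDS


def load_moments(path: str):
    """Yield (video, description, (start_s, end_s)), taking the most frequently
    annotated segment pair as the consensus ground truth."""
    with open(path) as f:
        annotations = json.load(f)  # assumed: a list of annotation dicts
    for ann in annotations:
        # "times" is assumed to hold one [start_seg, end_seg] pair per annotator
        consensus = Counter(tuple(t) for t in ann["times"]).most_common(1)[0][0]
        yield ann["video"], ann["description"], segments_to_seconds(*consensus)

# Example: an annotation covering segments [1, 1] corresponds to seconds 5-10
# of the (at most 30-second) video.
```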

Variants: DiDeMo

Associated Benchmarks

This dataset is used in 2 benchmarks, both typically reported with recall-based retrieval metrics (a sketch follows the list):

  • Video Retrieval
  • Zero-Shot Video Retrieval
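
As a rough illustration of how these retrieval benchmarks are scored, the snippet below computes Recall@K from a text-video similarity matrix. This is a minimal sketch of the standard metric, assuming the usual paired setup where query i's ground-truth video is video i; it is not tied to any particular model in the table that follows.

```python
# Minimal sketch: Recall@K for text-to-video retrieval from a similarity matrix.
# Assumes row i of `sim` holds the similarity of text query i to every video,
# and that query i's ground-truth video is video i (the usual paired setup).

import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth video ranks in the top-k."""
    order = np.argsort(-sim, axis=1)  # candidate videos, best match first
    # Position of the correct (diagonal) video in each query's ranking.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return float(np.mean(ranks < k))


# Example with 3 queries and 3 videos (higher score = more similar):
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],   # query 1 ranks video 2 first -> miss at k=1
                [0.1, 0.2, 0.7]])
print(recall_at_k(sim, k=1))  # 2/3 of queries have the correct video at rank 1
```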

Recent Benchmark Submissions

Task | Model | Paper | Date
Video Retrieval | GRAM | Gramian Multimodal Representation Learning and … | 2024-12-16
Zero-Shot Video Retrieval | GRAM | Gramian Multimodal Representation Learning and … | 2024-12-16
Zero-Shot Video Retrieval | InternVideo2-1B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Zero-Shot Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Zero-Shot Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Video Retrieval | RTQ | RTQ: Rethinking Video-language Understanding Based … | 2023-12-01
Video Retrieval | TESTA (ViT-B/16) | TESTA: Temporal-Spatial Token Aggregation for … | 2023-10-29
Zero-Shot Video Retrieval | LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03
Zero-Shot Video Retrieval | LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03
Video Retrieval | PAU | Prototype-based Aleatoric Uncertainty Quantification for … | 2023-09-29
Zero-Shot Video Retrieval | BT-Adapter | BT-Adapter: Video Conversation is Feasible … | 2023-09-27
Video Retrieval | DMAE (ViT-B/32) | Dual-Modal Attention-Enhanced Text-Video Retrieval with … | 2023-09-20
Video Retrieval | COSA | COSA: Concatenated Sample Pretrained Vision-Language … | 2023-06-15
Zero-Shot Video Retrieval | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Video Retrieval | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Video Retrieval | VLAB | VLAB: Enhancing Video Language Pre-training … | 2023-05-22
Video Retrieval | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Video Retrieval | UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Video … | 2023-03-28

Research Papers

Recent papers with results on this dataset: