ActivityNet

Dataset Information

Modalities: Videos
Introduced: 2015
License: Unknown
Homepage:
Overview

The ActivityNet dataset contains 200 activity classes and a total of 849 hours of video collected from YouTube. ActivityNet is the largest benchmark for temporal activity detection to date in terms of both the number of activity categories and the number of videos, making the task particularly challenging. Version 1.3 of the dataset contains 19,994 untrimmed videos, divided into disjoint training, validation, and testing subsets in a 2:1:1 ratio. On average, each activity category has 137 untrimmed videos, and each video contains 1.41 activity instances annotated with temporal boundaries. The ground-truth annotations of the test videos are not public.
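The per-subset statistics above can be computed directly from the dataset's annotation file. The sketch below assumes the commonly distributed ActivityNet v1.3 JSON layout: a top-level "database" mapping video IDs to entries with a "subset" field ("training"/"validation"/"testing") and a list of "annotations", each carrying an activity "label" and a temporal "segment" [start_sec, end_sec]. The field names and the inline sample are assumptions for illustration, not an official schema.

```python
from collections import Counter

# Tiny synthetic stand-in for the (assumed) ActivityNet v1.3
# annotation JSON; the real file holds ~20k video entries.
sample = {
    "database": {
        "vid_001": {
            "subset": "training",
            "duration": 120.0,
            "annotations": [
                {"label": "Surfing", "segment": [4.2, 57.9]},
                {"label": "Surfing", "segment": [63.0, 110.5]},
            ],
        },
        "vid_002": {
            "subset": "validation",
            "duration": 88.0,
            "annotations": [{"label": "Archery", "segment": [10.0, 80.0]}],
        },
        # Test-set entries ship without public ground truth.
        "vid_003": {"subset": "testing", "duration": 45.0, "annotations": []},
    }
}

def subset_stats(db):
    """Count videos and annotated activity instances per subset."""
    videos, instances = Counter(), Counter()
    for entry in db.values():
        videos[entry["subset"]] += 1
        instances[entry["subset"]] += len(entry["annotations"])
    return videos, instances

videos, instances = subset_stats(sample["database"])
print(dict(videos))     # {'training': 1, 'validation': 1, 'testing': 1}
print(dict(instances))  # {'training': 2, 'validation': 1, 'testing': 0}
```

On the full v1.3 file, the same loop would recover the 2:1:1 subset split and the average of 1.41 annotated instances per video quoted above.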

Source: Dynamic Temporal Pyramid Network: A Closer Look at Multi-Scale Modeling for Activity Detection

Variants: ActivityNet, ActivityNet-1.3, ActivityNet-1.2, ActivityNet-GZSL (cls), ActivityNet-GZSL (main)

Associated Benchmarks

This dataset is used in 6 benchmarks.

Recent Benchmark Submissions

| Task | Model | Paper | Date |
| --- | --- | --- | --- |
| Zero-Shot Video Retrieval | GRAM | Gramian Multimodal Representation Learning and … | 2024-12-16 |
| Video Retrieval | GRAM | Gramian Multimodal Representation Learning and … | 2024-12-16 |
| Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22 |
| Zero-Shot Video Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22 |
| Zero-Shot Video Retrieval | InternVideo2-1B | InternVideo2: Scaling Foundation Models for … | 2024-03-22 |
| Action Recognition | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22 |
| Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20 |
| Zero-Shot Video Retrieval | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20 |
| Visual Question Answering (VQA) | BLIP-2 T5 | Open-ended VQA benchmarking of Vision-Language … | 2024-02-11 |
| Video Retrieval | RTQ | RTQ: Rethinking Video-language Understanding Based … | 2023-12-01 |
| Video Retrieval | TESTA (ViT-B/16) | TESTA: Temporal-Spatial Token Aggregation for … | 2023-10-29 |
| Zero-Shot Video Retrieval | LanguageBind (ViT-L/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03 |
| Zero-Shot Video Retrieval | LanguageBind (ViT-H/14) | LanguageBind: Extending Video-Language Pretraining to … | 2023-10-03 |
| Zero-Shot Video Retrieval | BT-Adapter | BT-Adapter: Video Conversation is Feasible … | 2023-09-27 |
| Video Retrieval | DMAE (ViT-B/32) | Dual-Modal Attention-Enhanced Text-Video Retrieval with … | 2023-09-20 |
| Video Retrieval | COSA | COSA: Concatenated Sample Pretrained Vision-Language … | 2023-06-15 |
| Video Retrieval | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29 |
| Video Retrieval | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17 |
| Zero-Shot Video Retrieval | UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Video … | 2023-03-28 |
| Video Retrieval | UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Video … | 2023-03-28 |

Research Papers

Recent papers with results on this dataset: