YouCook2

Zero-Shot Video Retrieval Benchmark

📊 Showing 8 results | 📏 Metric: text-to-video R@1

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	Norton	Multi-granularity Correspondence Learning from Long-term Noisy Videos	64.10	2024-01-30	📦 XLearning-SCU/2024-ICLR-Norton
2	VideoCLIP	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	63.10	2021-09-28	📦 facebookresearch/fairseq 📦 pytorch/fairseq
3	TACo	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	55.70	2021-08-23	-
4	VAST, HowToCaption-finetuned	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	53.90	2023-10-07	📦 ninatu/howtocaption
5	VideoCoCa	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	53.30	2022-12-09	-
6	MIL-NCE	End-to-End Learning of Visual Representations from Uncurated Instructional Videos	51.20	2019-12-13	📦 antoine77340/MIL-NCE_HowTo100M 📦 antoine77340/milnce_howto100m 📦 antoine77340/S3D_HowTo100M 📦 linjieli222/hero_video_feature_extractor
7	VATT-MBS	VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text	45.50	2021-04-22	📦 google-research/google-research 📦 akashe/ProgrammingInterview 📦 pwc-1/Paper-9 📦 MindCode-4/code-9 📦 MindCode-4/code-13
8	HowToCaption	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	44.10	2023-10-07	📦 ninatu/howtocaption
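
For readers unfamiliar with the metric column, the following is a minimal sketch of how text-to-video recall at K is commonly computed. It is not taken from any of the listed papers. It assumes a square similarity matrix in which text query i is paired with video i, that scores are reported as percentages, and that ties are resolved in favor of the correct video. The function name `recall_at_k` and the toy matrix are hypothetical.

```python
def recall_at_k(sim, k=1):
    """Text-to-video recall@k, returned as a percentage.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: the number of videos
        # that score strictly higher (ties count in its favor).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example with 3 text queries and 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
print(round(recall_at_k(sim, k=1), 2))  # 66.67
print(recall_at_k(sim, k=2))            # 100.0
```

Published numbers can differ in how ties, clip segmentation, and the candidate pool are handled, so scores from different papers are not always directly comparable.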

All Papers (8)

Multi-granularity Correspondence Learning from Long-term Noisy Videos

2024

Norton

XLearning-SCU/2024-ICLR-Norton

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2021

VideoCLIP

facebookresearch/fairseq pytorch/fairseq

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

2021

TACo

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

2023

VAST, HowToCaption-finetuned

ninatu/howtocaption

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

2022

VideoCoCa

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

2019

MIL-NCE

antoine77340/MIL-NCE_HowTo100M antoine77340/milnce_howto100m

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

2021

VATT-MBS

google-research/google-research akashe/ProgrammingInterview

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

2023

HowToCaption

ninatu/howtocaption
