YouCook2

Video Retrieval Benchmark

Performance Over Time

Showing 15 results. Metric: text-to-video R@1

Top Performing Models

Rank	Model	Paper	text-to-video R@1	Date	Code
1	VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	80.80	2023-05-29	TXH-mercury/VALOR, txh-mercury/vast
2	VideoCLIP	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	75.00	2021-09-28	facebookresearch/fairseq, pytorch/fairseq
3	UniVL + MELTR	MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	74.80	2023-03-23	mlvlab/MELTR
4	MDMMT-2	MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	74.80	2022-03-14	-
5	TACo	TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	72.70	2021-08-23	-
6	OmniVec	OmniVec: Learning robust representations with cross modal sharing	70.80	2023-11-07	-
7	UniVL	UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation	70.00	2020-02-15	microsoft/UniVL, wqliu657/UniVL
8	VLM	VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	69.38	2021-05-20	pytorch/fairseq
9	OmniVec (pretrained)	OmniVec: Learning robust representations with cross modal sharing	64.20	2023-11-07	-
10	VideoCLIP (zero-shot)	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	63.10	2021-09-28	facebookresearch/fairseq, pytorch/fairseq

All Papers (15)

Paper	Year	Model	Code
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023	VAST	TXH-mercury/VALOR, txh-mercury/vast
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021	VideoCLIP	facebookresearch/fairseq, pytorch/fairseq
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models	2023	UniVL + MELTR	mlvlab/MELTR
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization	2022	MDMMT-2	-
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment	2021	TACo	-
OmniVec: Learning robust representations with cross modal sharing	2023	OmniVec	-
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation	2020	UniVL	microsoft/UniVL, wqliu657/UniVL
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	2021	VLM	pytorch/fairseq
OmniVec: Learning robust representations with cross modal sharing	2023	OmniVec (pretrained)	-
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021	VideoCLIP (zero-shot)	facebookresearch/fairseq, pytorch/fairseq
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	2022	VideoCoCa (zero-shot)	-
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning	2020	COOT	gingsi/coot-videotext
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019	Text-Video Embedding	antoine77340/MIL-NCE_HowTo100M, antoine77340/milnce_howto100m
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval	2022	RoME	buraksatar/RoME_video_retrieval
Semantic Role Aware Correlation Transformer for Text to Video Retrieval	2022	Satar et al.	buraksatar/RoME_video_retrieval

Model	Paper	text-to-video R@1	Date
VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality …	80.80	2023-05-29
VideoCLIP	VideoCLIP: Contrastive Pre-training for Zero-shot…	75.00	2021-09-28
UniVL + MELTR	MELTR: Meta Loss Transformer for Learning to Fine…	74.80	2023-03-23
MDMMT-2	MDMMT-2: Multidomain Multimodal Transformer for V…	74.80	2022-03-14
TACo	TACo: Token-aware Cascade Contrastive Learning fo…	72.70	2021-08-23
OmniVec	OmniVec: Learning robust representations with cro…	70.80	2023-11-07
UniVL	UniVL: A Unified Video and Language Pre-Training …	70.00	2020-02-15
VLM	VLM: Task-agnostic Video-Language Model Pre-train…	69.38	2021-05-20
OmniVec (pretrained)	OmniVec: Learning robust representations with cro…	64.20	2023-11-07
VideoCLIP (zero-shot)	VideoCLIP: Contrastive Pre-training for Zero-shot…	63.10	2021-09-28
VideoCoCa (zero-shot)	VideoCoCa: Video-Text Modeling with Zero-Shot Tra…	55.20	2022-12-09
COOT	COOT: Cooperative Hierarchical Transformer for Vi…	52.30	2020-11-01
Text-Video Embedding	HowTo100M: Learning a Text-Video Embedding by Wat…	35.30	2019-06-07
RoME	RoME: Role-aware Mixture-of-Expert Transformer fo…	25.20	2022-06-26
Satar et al.	Semantic Role Aware Correlation Transformer for T…	20.80	2022-06-26
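
For readers unfamiliar with the metric, the sketch below shows one common way text-to-video Recall@K is computed from a query-by-video similarity matrix. It is an illustration only: the page does not say how each listed paper computed its score, and the function name `recall_at_k`, the one-to-one pairing of query i with video i, the optimistic tie handling, and the toy scores are all assumptions made for this example.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Text-to-video Recall@K, returned as a percentage.

    Assumption: similarity[i][j] scores text query i against video j, and
    the correct video for query i is at index i (one-to-one pairing).
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the ground-truth video: the number of videos
        # that score strictly higher (ties are resolved in its favour).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for three captions against three clips (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct clip is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct clip is ranked second
    [0.1, 0.2, 0.7],  # query 2: correct clip is ranked first
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
print(f"R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

On the toy matrix, two of the three queries retrieve their clip at rank one, so R@1 is about 66.67, while every correct clip falls within the top two, so R@2 is 100.00.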