| Model | Paper | Score | Date |
|---|---|---|---|
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 18.20 | 2023-05-29 |
| UniVL + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 17.92 | 2023-03-23 |
| UniVL | UniVL: A Unified Video and Language Pre-Training … | 17.35 | 2020-02-15 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 14.20 | 2022-12-09 |
| VLM | VLM: Task-agnostic Video-Language Model Pre-train… | 12.27 | 2021-05-20 |
| E2vidD6-MASSvid-BiD | Multimodal Pretraining for Dense Video Captioning | 12.04 | 2020-11-10 |
| TextKG | Text with Knowledge Graph Augmented Transformer f… | 11.70 | 2023-03-22 |
| COOT | COOT: Cooperative Hierarchical Transformer for Vi… | 11.30 | 2020-11-01 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 10.10 | 2023-06-15 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 8.80 | 2023-10-07 |
| OmniVL | OmniVL: One Foundation Model for Image-Language an… | 8.72 | 2022-09-15 |
| Zhou | End-to-End Dense Video Captioning with Masked Tra… | 4.38 | 2018-04-03 |
| VideoBERT + S3D | VideoBERT: A Joint Model for Video and Language R… | 4.33 | 2019-04-03 |
| MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model f… | 1.31 | 2024-04-08 |