| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MaMMUT (ours) | MaMMUT: A Simple Architecture for Joint Learning … | 73.60 | 2023-03-29 |
| Vid2Seq | Vid2Seq: Large-Scale Pretraining of a Visual Lang… | 64.60 | 2023-02-27 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 58.00 | 2022-09-04 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 57.80 | 2023-02-01 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 56.70 | 2023-05-29 |
| GIT2 | GIT: A Generative Image-to-text Transformer for V… | 54.80 | 2022-05-27 |
| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 54.60 | 2023-05-22 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 54.40 | 2023-04-17 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 53.80 | 2022-12-09 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 53.70 | 2023-06-15 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 49.80 | 2023-10-07 |
| RTQ | RTQ: Rethinking Video-language Understanding Base… | 49.60 | 2023-12-01 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 49.20 | 2022-12-30 |
| MV-GPT | End-to-end Generative Pretraining for Multimodal … | 48.90 | 2022-01-20 |
| CLIP-DCD | CLIP Meets Video Captioning: Concept-Aware Repres… | 48.20 | 2021-11-30 |
| TextKG | Text with Knowledge Graph Augmented Transformer f… | 46.60 | 2023-03-22 |
| EMCL-Net | Expectation-Maximization Contrastive Learning for… | 45.30 | 2022-11-21 |
| SEM-POS | SEM-POS: Grammatically and Semantically Correct V… | 45.20 | 2023-03-26 |
| CoCap (ViT/L14) | Accurate and Fast Compressed Video Captioning | 44.40 | 2023-09-22 |
| VASTA (Vatex-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 44.21 | 2022-08-19 |
| UniVL + MELTR | MELTR: Meta Loss Transformer for Learning to Fine… | 44.17 | 2023-03-23 |
| VASTA (Kinetics-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 43.40 | 2022-08-19 |