| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MaMMUT | MaMMUT: A Simple Architecture for Joint Learning … | 195.60 | 2023-03-29 |
| Vid2Seq | Vid2Seq: Large-Scale Pretraining of a Visual Lang… | 146.20 | 2023-02-27 |
| VIOLETv2 | An Empirical Study of End-to-End Video-Language T… | 139.20 | 2022-09-04 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 80.70 | 2023-04-17 |
| VLAB | VLAB: Enhancing Video Language Pre-training by Fe… | 79.30 | 2023-05-22 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 76.50 | 2023-06-15 |
| HiTeA | HiTeA: Hierarchical Temporal-Aware Video-Language… | 71.00 | 2022-12-30 |
| mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Mod… | 70.50 | 2023-02-01 |
| HowToCaption | HowToCaption: Prompting LLMs to Transform Video A… | 70.40 | 2023-10-07 |
| RTQ | RTQ: Rethinking Video-language Understanding Base… | 66.90 | 2023-12-01 |
| CoCap (ViT/L14) | Accurate and Fast Compressed Video Captioning | 60.10 | 2023-09-22 |
| SEM-POS | SEM-POS: Grammatically and Semantically Correct V… | 60.10 | 2023-03-26 |
| VASTA (Vatex-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 59.20 | 2022-08-19 |
| VASTA (Kinetics-backbone) | Diverse Video Captioning by Adaptive Spatio-tempo… | 56.10 | 2022-08-19 |