VideoInstruct

Video-based Generative Performance Benchmarking (Correctness of Information) Benchmark

📊 Showing 18 results | 📏 Metric: gpt-score

Top Performing Models

Rank	Model	Paper	gpt-score	Date	Code
1	PPLLaVA-7B	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	3.85	2024-11-04	📦 farewellthree/ppllava
2	PLLaVA-34B	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	3.60	2024-04-25	📦 magic-research/PLLaVA
3	TS-LLaVA-34B	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	3.55	2024-11-17	📦 tingyu215/ts-llava
4	SlowFast-LLaVA-34B	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	3.48	2024-07-22	📦 apple/ml-slowfast-llava
5	VideoChat2_HD_mistral	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	3.40	2023-11-28	📦 opengvlab/ask-anything 📦 magic-research/PLLaVA 📦 bytedance/tarsier
6	VideoGPT+	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	3.27	2024-06-13	📦 mbzuai-oryx/videogpt-plus
7	ST-LLM	ST-LLM: Large Language Models Are Effective Temporal Learners	3.23	2024-03-30	📦 TencentARC/ST-LLM
8	MiniGPT4-video-7B	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	3.08	2024-04-04	📦 Vision-CAIR/MiniGPT4-video 📦 pwc-1/Paper-9
9	VideoChat2	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	3.02	2023-11-28	📦 opengvlab/ask-anything 📦 magic-research/PLLaVA 📦 bytedance/tarsier
10	Chat-UniVi	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2.89	2023-11-14	📦 pku-yuangroup/chat-univi 📦 skyworkai/moh 📦 skyworkai/moe-plus-plus 📦 pku-yuangroup/video-bench

All Papers (18)

Paper	Year	Model	Code
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	2024	PPLLaVA-7B	farewellthree/ppllava
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024	PLLaVA-34B	magic-research/PLLaVA
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024	TS-LLaVA-34B	tingyu215/ts-llava
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024	SlowFast-LLaVA-34B	apple/ml-slowfast-llava
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023	VideoChat2_HD_mistral	opengvlab/ask-anything magic-research/PLLaVA bytedance/tarsier
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024	VideoGPT+	mbzuai-oryx/videogpt-plus
ST-LLM: Large Language Models Are Effective Temporal Learners	2024	ST-LLM	TencentARC/ST-LLM
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	2024	MiniGPT4-video-7B	Vision-CAIR/MiniGPT4-video pwc-1/Paper-9
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023	VideoChat2	opengvlab/ask-anything magic-research/PLLaVA bytedance/tarsier
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023	Chat-UniVi	pku-yuangroup/chat-univi skyworkai/moh
VTimeLLM: Empower LLM to Grasp Video Moments	2023	VTimeLLM	huangb23/vtimellm
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023	MovieChat	rese1f/MovieChat
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023	BT-Adapter	farewellthree/BT-Adapter
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023	Video-ChatGPT	mbzuai-oryx/video-chatgpt qiujihao19/artemis
VideoChat: Chat-Centric Video Understanding	2023	Video Chat	opengvlab/ask-anything
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023	BT-Adapter (zero-shot)	farewellthree/BT-Adapter
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023	LLaMA Adapter	opengvlab/llama-adapter zrrskywalker/llama-adapter Mind23-2/MindCode-140
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	2023	Video LLaMA	damo-nlp-sg/video-llama damo-nlp-sg/videollama2

Model	Paper	gpt-score	Date
PPLLaVA-7B	PPLLaVA: Varied Video Sequence Understanding With…	3.85	2024-11-04
PLLaVA-34B	PLLaVA : Parameter-free LLaVA Extension from Imag…	3.60	2024-04-25
TS-LLaVA-34B	TS-LLaVA: Constructing Visual Tokens through Thum…	3.55	2024-11-17
SlowFast-LLaVA-34B	SlowFast-LLaVA: A Strong Training-Free Baseline f…	3.48	2024-07-22
VideoChat2_HD_mistral	MVBench: A Comprehensive Multi-modal Video Unders…	3.40	2023-11-28
VideoGPT+	VideoGPT+: Integrating Image and Video Encoders f…	3.27	2024-06-13
ST-LLM	ST-LLM: Large Language Models Are Effective Tempo…	3.23	2024-03-30
MiniGPT4-video-7B	MiniGPT4-Video: Advancing Multimodal LLMs for Vid…	3.08	2024-04-04
VideoChat2	MVBench: A Comprehensive Multi-modal Video Unders…	3.02	2023-11-28
Chat-UniVi	Chat-UniVi: Unified Visual Representation Empower…	2.89	2023-11-14
VTimeLLM	VTimeLLM: Empower LLM to Grasp Video Moments	2.78	2023-11-30
MovieChat	MovieChat: From Dense Token to Sparse Memory for …	2.76	2023-07-31
BT-Adapter	BT-Adapter: Video Conversation is Feasible Withou…	2.68	2023-09-27
Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understandi…	2.40	2023-06-08
Video Chat	VideoChat: Chat-Centric Video Understanding	2.32	2023-05-10
BT-Adapter (zero-shot)	BT-Adapter: Video Conversation is Feasible Withou…	2.16	2023-09-27
LLaMA Adapter	LLaMA-Adapter V2: Parameter-Efficient Visual Inst…	2.03	2023-04-28
Video LLaMA	Video-LLaMA: An Instruction-tuned Audio-Visual La…	1.96	2023-06-05