
COCO Captions

Image Captioning Benchmark

Performance Over Time

Chart: BLEU-4 over publication date | 40 results
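BLEU-4 is the geometric mean of modified 1- through 4-gram precisions between a generated caption and the reference captions (COCO supplies roughly five per image), multiplied by a brevity penalty and reported here on a 0-100 scale. As an illustration only, the sketch below computes corpus-level BLEU-4 with NLTK on toy data; this is an assumed stand-in, not the leaderboard's scoring pipeline, so exact values will differ from published numbers.

```python
# Illustrative BLEU-4 computation with NLTK (assumption: NLTK is a stand-in;
# the official COCO toolkit, pycocoevalcap, is sketched further below).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# COCO provides ~5 human reference captions per image; toy data here.
references = [[
    "a man riding a wave on top of a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
]]
hypotheses = ["a man riding a wave on a surfboard".split()]  # one per image

# weights=(0.25,)*4 averages the 1- through 4-gram log-precisions, i.e. BLEU-4;
# smoothing avoids zero scores when a higher-order n-gram never matches.
score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {100 * score:.2f}")  # leaderboards report the 0-100 scale
```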

Top Performing Models

Rank | Model | Paper | BLEU-4 | Date | Code
1 | VALOR | 📚 VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 152.50 | 2023-04-17 | 📦 TXH-mercury/VALOR
2 | VAST | 📚 VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 149.00 | 2023-05-29 | 📦 TXH-mercury/VALOR 📦 txh-mercury/vast
3 | VirTex (ResNet-101) | 📚 VirTex: Learning Visual Representations from Textual Annotations | 94.00 | 2020-06-11 | 📦 kdexd/virtex 📦 mattdeitke/cvpr-buzz 📦 rahulvigneswaran/longtail-buzz
4 | BLIP-FuseCap | 📚 FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions | 78.50 | 2023-05-28 | 📦 RotsteinNoam/FuseCap
5 | mPLUG | 📚 mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | 46.50 | 2022-05-24 | 📦 modelscope/modelscope 📦 alibaba/AliceMind 📦 x-plug/mplug
6 | OFA | 📚 OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | 44.90 | 2022-02-07 | 📦 modelscope/modelscope 📦 ofa-sys/ofa 📦 JHKim-snu/GVCCI 📦 JHKim-snu/PGA
7 | GIT | 📚 GIT: A Generative Image-to-text Transformer for Vision and Language | 44.10 | 2022-05-27 | 📦 microsoft/GenerativeImage2Text
8 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 📚 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 43.70 | 2023-01-30 | 📦 huggingface/transformers 📦 salesforce/lavis 📦 thudm/visualglm-6b
9 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 📚 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 43.50 | 2023-01-30 | 📦 huggingface/transformers 📦 salesforce/lavis 📦 thudm/visualglm-6b
10 | ExpansionNet v2 (No VL pretraining) | 📚 Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | 42.70 | 2022-08-13 | 📦 jchenghu/expansionnet_v2
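Scores like those in the table above are typically computed with the official COCO caption evaluation toolkit, pycocoevalcap, which reports BLEU-1 through BLEU-4 alongside METEOR, ROUGE-L, and CIDEr. A minimal sketch of that workflow follows; the annotation and result file paths are placeholder assumptions, and the results file must be in the standard COCO caption result format.

```python
# Minimal sketch of COCO caption evaluation with pycocoevalcap.
# File paths below are placeholder assumptions, not the leaderboard's files.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Model predictions in COCO caption result format:
# [{"image_id": 12345, "caption": "a man riding a wave ..."}, ...]
coco = COCO("annotations/captions_val2014.json")                # placeholder path
coco_res = coco.loadRes("results/captions_model_results.json")  # placeholder path

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only captioned images
coco_eval.evaluate()

for metric, value in coco_eval.eval.items():
    print(f"{metric}: {value:.3f}")  # e.g. Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr
```

Multiplying the toolkit's 0-1 outputs by 100 gives the scale used on this leaderboard.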

All Papers (40)

From Captions to Visual Concepts and Back (2014)