
COCO (Common Objects in Context)

Image-to-Text Retrieval Benchmark

Performance Over Time

Chart: Recall@1 of the 9 reported models, plotted by publication date. 9 results reported; metric: Recall@1.

Top Performing Models

| Rank | Model | Paper | Recall@1 (%) | Date | Code |
|------|-------|-------|--------------|------|------|
| 1 | Oscar | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | 99.80 | 2020-04-13 | rmokady/clip_prefix_caption, microsoft/Oscar, milvlg/rosita, ThanThoai/Visual-Question-Answering_Vietnamese |
| 2 | BLIP-2 (ViT-G, fine-tuned) | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 98.50 | 2023-01-30 | huggingface/transformers, salesforce/lavis, thudm/visualglm-6b |
| 3 | ONE-PEACE (ViT-G, w/o ranking) | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | 98.30 | 2023-05-18 | modelscope/modelscope, OFA-Sys/ONE-PEACE |
| 4 | BLIP-2 (ViT-L, fine-tuned) | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 98.00 | 2023-01-30 | huggingface/transformers, salesforce/lavis, thudm/visualglm-6b |
| 5 | Unicoder-VL | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | 97.20 | 2019-08-16 | - |
| 6 | IAIS | Learning Relation Alignment for Calibrated Cross-modal Retrieval | 94.48 | 2021-05-28 | lancopku/IAIS |
| 7 | CLIP (zero-shot) | Learning Transferable Visual Models From Natural Language Supervision | 88.10 | 2021-02-26 | openai/CLIP, mlfoundations/open_clip, towhee-io/towhee |
| 8 | DVSA | Deep Visual-Semantic Alignments for Generating Image Descriptions | 74.80 | 2014-12-07 | VinitSR7/Image-Caption-Generation, Lieberk/Paddle-AoA-Captioning, souvikshanku/digit-captioning, IzabelaKrupinska/PROJBAD |
| 9 | FLAVA (ViT-B, zero-shot) | FLAVA: A Foundational Language And Vision Alignment Model | 42.74 | 2021-12-08 | facebookresearch/multimodal, apsdehal/flava-tutorials, social-ai-studio/matk, 2024-MindSpore-1/Code2 |
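
For context on the metric: in image-to-text retrieval, each image is used as a query against the pool of captions, and Recall@1 is the percentage of images whose top-ranked caption is one of its ground-truth captions (COCO provides roughly five reference captions per image). The following is a minimal, illustrative sketch of how Recall@K is typically computed from an image-text similarity matrix; the function name `recall_at_k` and the toy scores are assumptions for illustration and are not taken from any of the listed codebases.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, gt: list[set[int]], k: int) -> float:
    """Image-to-text Recall@K (as a percentage).

    sim: (num_images, num_texts) similarity matrix, higher = more similar.
    gt:  gt[i] is the set of caption indices that describe image i
         (COCO has ~5 reference captions per image).
    A query image counts as a hit if any of its ground-truth captions
    appears among the K highest-scoring captions.
    """
    hits = 0
    for i, positives in enumerate(gt):
        top_k = np.argsort(-sim[i])[:k]  # indices of the K top-scoring captions
        hits += bool(positives.intersection(top_k))
    return 100.0 * hits / len(gt)

# Toy example: 2 images, 4 captions; captions {0, 1} describe image 0, {2, 3} image 1.
sim = np.array([[0.9, 0.2, 0.1, 0.3],
                [0.1, 0.4, 0.8, 0.7]])
gt = [{0, 1}, {2, 3}]
print(recall_at_k(sim, gt, k=1))  # 100.0 on this toy data
```

Zero-shot entries such as CLIP and FLAVA compute this similarity matrix directly from pretrained embeddings without fine-tuning on COCO, which is why their scores trail the fine-tuned models in the table.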
