Flickr30k

Image-to-Text Retrieval Benchmark

Performance Over Time

Showing 11 results | Metric: Recall@1

Top Performing Models

Rank	Model	Paper	Recall@1	Date	Code
1	InternVL-G-FT (finetuned, w/o ranking)	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	97.90	2023-12-21	📦 opengvlab/internvl 📦 opengvlab/internvl-mmdetseg
2	BLIP-2 ViT-G (zero-shot, 1K test set)	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	97.60	2023-01-30	📦 huggingface/transformers 📦 salesforce/lavis 📦 thudm/visualglm-6b
3	ONE-PEACE (finetuned, w/o ranking)	ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities	97.60	2023-05-18	📦 modelscope/modelscope 📦 OFA-Sys/ONE-PEACE
4	InternVL-C-FT (finetuned, w/o ranking)	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	97.20	2023-12-21	📦 opengvlab/internvl 📦 opengvlab/internvl-mmdetseg
5	BLIP-2 ViT-L (zero-shot, 1K test set)	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	96.90	2023-01-30	📦 huggingface/transformers 📦 salesforce/lavis 📦 thudm/visualglm-6b
6	ERNIE-ViL 2.0	ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training	96.10	2022-09-30	📦 PaddlePaddle/ERNIE
7	ALBEF	Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	95.90	2021-07-16	📦 salesforce/lavis 📦 salesforce/ALBEF 📦 facebookresearch/multimodal
8	ALBEF	HADA: A Graph-based Amalgamation Framework in Image-text Retrieval	92.60	2023-01-11	📦 m2man/hada 📦 m2man/HADA-LAVIS
9	UNITER	HADA: A Graph-based Amalgamation Framework in Image-text Retrieval	87.30	2023-01-11	📦 m2man/hada 📦 m2man/HADA-LAVIS
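10	GSMN	A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval	76.40	2021-06-04	📦 m2man/LGSGM
11	LGSGM	A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval	71.00	2021-06-04	📦 m2man/LGSGM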
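
The Recall@1 column above is, in the usual image-to-text setting, the percentage of query images whose top-ranked caption is one of that image's ground-truth captions. The sketch below shows one way to compute it from a similarity matrix. The function name `image_to_text_recall_at_k`, the `caption_to_image` mapping, and the toy scores are illustrative assumptions, not the evaluation code of any paper listed here, and individual papers may differ in test split and protocol details.

```python
import numpy as np


def image_to_text_recall_at_k(similarity, caption_to_image, k=1):
    """Percentage of images with a ground-truth caption among the top-k retrieved.

    similarity: (num_images, num_captions) scores, higher means a better match.
    caption_to_image: entry j is the index of the image that caption j describes.
    """
    similarity = np.asarray(similarity, dtype=float)
    caption_to_image = np.asarray(caption_to_image)
    # Indices of the k highest-scoring captions for each image.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # Map each retrieved caption back to the image it belongs to.
    retrieved_owner = caption_to_image[top_k]
    image_ids = np.arange(similarity.shape[0])[:, None]
    # An image counts as a hit if any retrieved caption is one of its own.
    hits = (retrieved_owner == image_ids).any(axis=1)
    return 100.0 * hits.mean()


# Toy data (hypothetical): 3 images, 2 captions each.
toy_similarity = np.array([
    [0.9, 0.2, 0.1, 0.3, 0.0, 0.1],  # image 0: best caption is 0 (its own)
    [0.1, 0.8, 0.2, 0.3, 0.1, 0.0],  # image 1: best caption is 1 (belongs to image 0)
    [0.0, 0.1, 0.2, 0.1, 0.7, 0.3],  # image 2: best caption is 4 (its own)
])
toy_caption_to_image = [0, 0, 1, 1, 2, 2]

recall_at_1 = image_to_text_recall_at_k(toy_similarity, toy_caption_to_image, k=1)
recall_at_2 = image_to_text_recall_at_k(toy_similarity, toy_caption_to_image, k=2)
```

On the toy matrix, two of the three images retrieve one of their own captions first, so `recall_at_1` is about 66.7, and `recall_at_2` is 100.0 because image 1 finds its own caption at rank 2.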

All Papers (11)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2023

InternVL-G-FT (finetuned, w/o ranking)

opengvlab/internvl opengvlab/internvl-mmdetseg

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

2023

BLIP-2 ViT-G (zero-shot, 1K test set)

huggingface/transformers salesforce/lavis

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

2023

ONE-PEACE (finetuned, w/o ranking)

modelscope/modelscope OFA-Sys/ONE-PEACE

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2023

InternVL-C-FT (finetuned, w/o ranking)

opengvlab/internvl opengvlab/internvl-mmdetseg

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

2023

BLIP-2 ViT-L (zero-shot, 1K test set)

huggingface/transformers salesforce/lavis

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

2022

ERNIE-ViL 2.0

PaddlePaddle/ERNIE

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

2021

ALBEF

salesforce/lavis salesforce/ALBEF

HADA: A Graph-based Amalgamation Framework in Image-text Retrieval

2023

ALBEF

m2man/hada m2man/HADA-LAVIS

HADA: A Graph-based Amalgamation Framework in Image-text Retrieval

2023

UNITER

m2man/hada m2man/HADA-LAVIS

A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval

2021

GSMN

m2man/LGSGM

A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval

2021

LGSGM

m2man/LGSGM

Model	Paper	Recall@1	Date
InternVL-G-FT (finetuned, w/o ranking)	InternVL: Scaling up Vision Foundation Models and…	97.90	2023-12-21
BLIP-2 ViT-G (zero-shot, 1K test set)	BLIP-2: Bootstrapping Language-Image Pre-training…	97.60	2023-01-30
ONE-PEACE (finetuned, w/o ranking)	ONE-PEACE: Exploring One General Representation M…	97.60	2023-05-18
InternVL-C-FT (finetuned, w/o ranking)	InternVL: Scaling up Vision Foundation Models and…	97.20	2023-12-21
BLIP-2 ViT-L (zero-shot, 1K test set)	BLIP-2: Bootstrapping Language-Image Pre-training…	96.90	2023-01-30
ERNIE-ViL 2.0	ERNIE-ViL 2.0: Multi-view Contrastive Learning fo…	96.10	2022-09-30
ALBEF	Align before Fuse: Vision and Language Representa…	95.90	2021-07-16
ALBEF	HADA: A Graph-based Amalgamation Framework in Ima…	92.60	2023-01-11
UNITER	HADA: A Graph-based Amalgamation Framework in Ima…	87.30	2023-01-11
GSMN	A Deep Local and Global Scene-Graph Matching for …	76.40	2021-06-04
LGSGM	A Deep Local and Global Scene-Graph Matching for …	71.00	2021-06-04
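
As a rough illustration of the "Performance Over Time" view, the sketch below replays the 11 rows of the table above in date order and keeps only the entries that set a new best Recall@1. The scores, dates, and model labels are copied from the table; the function name `running_best` and the tie-breaking rule (a later result that only equals the current best is not counted) are assumptions. It also treats zero-shot and finetuned results as directly comparable, which the table itself does not claim.

```python
from datetime import date

# (model, Recall@1, date) rows copied from the leaderboard table.
RESULTS = [
    ("InternVL-G-FT (finetuned, w/o ranking)", 97.90, "2023-12-21"),
    ("BLIP-2 ViT-G (zero-shot, 1K test set)", 97.60, "2023-01-30"),
    ("ONE-PEACE (finetuned, w/o ranking)", 97.60, "2023-05-18"),
    ("InternVL-C-FT (finetuned, w/o ranking)", 97.20, "2023-12-21"),
    ("BLIP-2 ViT-L (zero-shot, 1K test set)", 96.90, "2023-01-30"),
    ("ERNIE-ViL 2.0", 96.10, "2022-09-30"),
    ("ALBEF", 95.90, "2021-07-16"),
    ("ALBEF", 92.60, "2023-01-11"),  # ALBEF entry listed under the HADA paper
    ("UNITER", 87.30, "2023-01-11"),
    ("GSMN", 76.40, "2021-06-04"),
    ("LGSGM", 71.00, "2021-06-04"),
]


def running_best(results):
    """Return (date, model, score) for each entry that strictly improves the best score."""
    # Sort by date; on the same date, consider the higher score first.
    ordered = sorted(results, key=lambda r: (date.fromisoformat(r[2]), -r[1]))
    best = float("-inf")
    milestones = []
    for model, score, day in ordered:
        if score > best:
            best = score
            milestones.append((day, model, score))
    return milestones


milestones = running_best(RESULTS)
for day, model, score in milestones:
    print(f"{day}  {score:.2f}  {model}")
```

With these rows, the milestones are GSMN (76.40), ALBEF (95.90), ERNIE-ViL 2.0 (96.10), BLIP-2 ViT-G (97.60), and InternVL-G-FT (97.90). ONE-PEACE matches 97.60 at a later date and so is not listed under the strict-improvement rule.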