| Model | Paper | Score | Date |
|---|---|---|---|
| InternVL-G-FT (finetuned, w/o ranking) | InternVL: Scaling up Vision Foundation Models and… | 97.90 | 2023-12-21 |
| BLIP-2 ViT-G (zero-shot, 1K test set) | BLIP-2: Bootstrapping Language-Image Pre-training… | 97.60 | 2023-01-30 |
| ONE-PEACE (finetuned, w/o ranking) | ONE-PEACE: Exploring One General Representation M… | 97.60 | 2023-05-18 |
| InternVL-C-FT (finetuned, w/o ranking) | InternVL: Scaling up Vision Foundation Models and… | 97.20 | 2023-12-21 |
| BLIP-2 ViT-L (zero-shot, 1K test set) | BLIP-2: Bootstrapping Language-Image Pre-training… | 96.90 | 2023-01-30 |
| ERNIE-ViL 2.0 | ERNIE-ViL 2.0: Multi-view Contrastive Learning fo… | 96.10 | 2022-09-30 |
| ALBEF | Align before Fuse: Vision and Language Representa… | 95.90 | 2021-07-16 |
| ALBEF | HADA: A Graph-based Amalgamation Framework in Ima… | 92.60 | 2023-01-11 |
| UNITER | HADA: A Graph-based Amalgamation Framework in Ima… | 87.30 | 2023-01-11 |
| GSMN | A Deep Local and Global Scene-Graph Matching for … | 76.40 | 2021-06-04 |
| LGSGM | A Deep Local and Global Scene-Graph Matching for … | 71.00 | 2021-06-04 |