| GPT-4o + CA | A Cognitive Paradigm Approach to Probe the Percep… | 75.50 | 2025-01-23 |
| GPT-4V (CoT, pick b/w two options) | The Role of Chain-of-Thought in Complex Vision-La… | 75.25 | 2023-11-15 |
| GPT-4V (pick b/w two options) | The Role of Chain-of-Thought in Complex Vision-La… | 69.25 | 2023-11-15 |
| MMICL + CoCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 64.25 | 2024-01-05 |
| GPT-4V + CoCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 58.50 | 2024-01-05 |
| OpenFlamingo + CoCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 58.25 | 2024-01-05 |
| GPT-4V | CoCoT: Contrastive Chain-of-Thought Prompting for… | 54.50 | 2024-01-05 |
| FIBER (EqSim) | Equivariant Similarity for Vision-Language Founda… | 51.50 | 2023-03-25 |
| FIBER (finetuned, Flickr30k) | Equivariant Similarity for Vision-Language Founda… | 51.25 | 2023-03-25 |
| MMICL + CCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 51.00 | 2024-01-05 |
| OpenFlamingo + DDCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 47.50 | 2024-01-05 |
| VQ2 | What You See is What You Read? Improving Text-Ima… | 47.00 | 2023-05-17 |
| MMICL + DDCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 46.75 | 2024-01-05 |
| X-VLM 16M | Measuring Progress in Fine-grained Vision-and-Lan… | 46.70 | 2023-05-12 |
| PaLI (ft SNLI-VE + Synthetic Data) | What You See is What You Read? Improving Text-Ima… | 46.50 | 2023-05-17 |
| FIBER | Equivariant Similarity for Vision-Language Founda… | 46.25 | 2023-03-25 |
| MMICL (FLAN-T5-XXL) | MMICL: Empowering Vision-language Model with Mult… | 45.50 | 2023-09-14 |
| PaLI (ft SNLI-VE) | What You See is What You Read? Improving Text-Ima… | 45.00 | 2023-05-17 |
| Gemini + DDCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 45.00 | 2024-01-05 |
| METER (EqSim) | Equivariant Similarity for Vision-Language Founda… | 45.00 | 2023-03-25 |
| X-VLM 4M | Measuring Progress in Fine-grained Vision-and-Lan… | 44.00 | 2023-05-12 |
| BLIP2 (ft COCO) | What You See is What You Read? Improving Text-Ima… | 44.00 | 2023-05-17 |
| KeyComp* (GPT-4) | Prompting Large Vision-Language Models for Compos… | 43.50 | 2024-01-20 |
| METER (finetuned, Flickr30k) | Equivariant Similarity for Vision-Language Founda… | 43.50 | 2023-03-25 |
| BLIP2 (SGVL) | Incorporating Structured Representations into Pre… | 42.80 | 2023-05-10 |
| BLIP (SGVL) | Incorporating Structured Representations into Pre… | 42.80 | 2023-05-10 |
| KeyComp* (GPT-3.5) | Prompting Large Vision-Language Models for Compos… | 42.70 | 2024-01-20 |
| OpenFlamingo + CCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 42.50 | 2024-01-05 |
| NegBLIP | Incorporating Structured Representations into Pre… | 42.50 | 2023-05-10 |
| LLaVA-1.5-CCoT | Compositional Chain-of-Thought Prompting for Larg… | 42.00 | 2023-11-27 |
| BLIP2 | Incorporating Structured Representations into Pre… | 42.00 | 2023-05-10 |
| NegBLIP2 | Incorporating Structured Representations into Pre… | 41.50 | 2023-05-10 |
| BLIP (+Graph Text, +Graph Neg) | Incorporating Structured Representations into Pre… | 40.50 | 2023-05-10 |
| BLIP (+Graph Text) | Incorporating Structured Representations into Pre… | 40.30 | 2023-05-10 |
| Gemini + CoCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 40.00 | 2024-01-05 |
| METER | Equivariant Similarity for Vision-Language Founda… | 39.25 | 2023-03-25 |
| OpenFlamingo | CoCoT: Contrastive Chain-of-Thought Prompting for… | 39.00 | 2024-01-05 |
| BLIP | Incorporating Structured Representations into Pre… | 39.00 | 2023-05-10 |
| UNITER large | Winoground: Probing Vision and Language Models fo… | 38.00 | 2022-04-07 |
| VinVL | Winoground: Probing Vision and Language Models fo… | 37.75 | 2022-04-07 |
| ViLLA large | Winoground: Probing Vision and Language Models fo… | 37.00 | 2022-04-07 |
| BLIP (VisualGPTScore, α-tuned) | Revisiting the Role of Language Priors in Vision-… | 36.50 | 2023-06-02 |
| BLIP 14M | Measuring Progress in Fine-grained Vision-and-Lan… | 36.50 | 2023-05-12 |
| LLaVA-1.5 | Compositional Chain-of-Thought Prompting for Larg… | 36.00 | 2023-11-27 |
| BLIP (ITM) | Revisiting the Role of Language Priors in Vision-… | 35.80 | 2023-06-02 |
| BLIP 129M | Measuring Progress in Fine-grained Vision-and-Lan… | 35.50 | 2023-05-12 |
| ViLT (ViT-B/32) | Winoground: Probing Vision and Language Models fo… | 34.75 | 2022-04-07 |
| BLIP 129M (CapFilt/L) | Measuring Progress in Fine-grained Vision-and-Lan… | 34.70 | 2023-05-12 |
| BLIP-ViT/L 129M | Measuring Progress in Fine-grained Vision-and-Lan… | 34.70 | 2023-05-12 |
| Diffusion Classifier (zero-shot) | Your Diffusion Model is Secretly a Zero-Shot Clas… | 34.00 | 2023-03-28 |
| PEVL 14M | Measuring Progress in Fine-grained Vision-and-Lan… | 33.20 | 2023-05-12 |
| ALBEF 14M | Measuring Progress in Fine-grained Vision-and-Lan… | 32.50 | 2023-05-12 |
| FLAVA (ITM) | Winoground: Probing Vision and Language Models fo… | 32.25 | 2022-04-07 |
| UNITER base | Winoground: Probing Vision and Language Models fo… | 32.25 | 2022-04-07 |
| CLIP (SGVL) | Incorporating Structured Representations into Pre… | 32.00 | 2023-05-10 |
| Gemini | CoCoT: Contrastive Chain-of-Thought Prompting for… | 30.75 | 2024-01-05 |
| OCLIP (ViT-H/14) | SelfEval: Leveraging the discriminative nature of… | 30.75 | 2023-11-17 |
| CLIP (ViT-B/32) | Winoground: Probing Vision and Language Models fo… | 30.75 | 2022-04-07 |
| OFA large (ITM) | Simple Token-Level Confidence Improves Caption Co… | 30.75 | 2023-05-11 |
| KeyComp (GPT-3.5) | Prompting Large Vision-Language Models for Compos… | 30.30 | 2024-01-20 |
| CLIP (ViT-L/14) | SelfEval: Leveraging the discriminative nature of… | 30.25 | 2023-11-17 |
| ViLLA base | Winoground: Probing Vision and Language Models fo… | 30.00 | 2022-04-07 |
| syn-CLIP | Going Beyond Nouns With Vision & Language Models … | 30.00 | 2023-03-30 |
| syn-CyCLIP | Going Beyond Nouns With Vision & Language Models … | 30.00 | 2023-03-30 |
| NegCLIP | Incorporating Structured Representations into Pre… | 29.50 | 2023-05-10 |
| OFA large (TLC-A) | Simple Token-Level Confidence Improves Caption Co… | 29.25 | 2023-05-11 |
| ALBEF 4M | Measuring Progress in Fine-grained Vision-and-Lan… | 29.20 | 2023-05-12 |
| LDM-T5 (SelfEval) | SelfEval: Leveraging the discriminative nature of… | 29.00 | 2023-11-17 |
| CyCLIP | Going Beyond Nouns With Vision & Language Models … | 28.50 | 2023-03-30 |
| PDM-T5 (SelfEval) | SelfEval: Leveraging the discriminative nature of… | 28.25 | 2023-11-17 |
| COCA ViT-L14 (f.t on COCO) | What You See is What You Read? Improving Text-Ima… | 28.25 | 2023-05-17 |
| LLaVA-1.5-ZS-CoT | Compositional Chain-of-Thought Prompting for Larg… | 28.00 | 2023-11-27 |
| BLIP (ITC) | Revisiting the Role of Language Priors in Vision-… | 28.00 | 2023-06-02 |
| OFA large (ft SNLI-VE) | What You See is What You Read? Improving Text-Ima… | 27.70 | 2023-05-17 |
| OFA base (ITM) | Simple Token-Level Confidence Improves Caption Co… | 26.75 | 2023-05-11 |
| CLIP RN50x64 | What You See is What You Read? Improving Text-Ima… | 26.50 | 2023-05-17 |
| LLaVA-7B (GPTScore) | An Examination of the Compositionality of Large G… | 25.50 | 2023-08-21 |
| FLAVA (contrastive) | Winoground: Probing Vision and Language Models fo… | 25.25 | 2022-04-07 |
| Random chance | Winoground: Probing Vision and Language Models fo… | 25.00 | 2022-04-07 |
| LLaVA | Incorporating Structured Representations into Pre… | 24.80 | 2023-05-10 |
| OFA base (TLC-A) | Simple Token-Level Confidence Improves Caption Co… | 24.50 | 2023-05-11 |
| MiniGPT-4-7B (GPTScore) | An Examination of the Compositionality of Large G… | 24.50 | 2023-08-21 |
| ViLBERT base | Winoground: Probing Vision and Language Models fo… | 23.75 | 2022-04-07 |
| MiniGPT-4 | Incorporating Structured Representations into Pre… | 23.30 | 2023-05-10 |
| MiniGPT-4-7B (VisualGPTScore) | An Examination of the Compositionality of Large G… | 23.25 | 2023-08-21 |
| VSE++ (COCO, ResNet) | Winoground: Probing Vision and Language Models fo… | 22.75 | 2022-04-07 |
| OFA tiny (ITM) | Simple Token-Level Confidence Improves Caption Co… | 22.75 | 2023-05-11 |
| LDM-CLIP (SelfEval) | SelfEval: Leveraging the discriminative nature of… | 22.75 | 2023-11-17 |
| Gemini + CCoT | CoCoT: Contrastive Chain-of-Thought Prompting for… | 22.50 | 2024-01-05 |
| InstructBLIP-CCoT | Compositional Chain-of-Thought Prompting for Larg… | 21.00 | 2023-11-27 |
| VSRN (Flickr30k) | Winoground: Probing Vision and Language Models fo… | 20.00 | 2022-04-07 |
| VSE++ (Flickr30k, ResNet) | Winoground: Probing Vision and Language Models fo… | 20.00 | 2022-04-07 |
| VSE++ (Flickr30k, VGG) | Winoground: Probing Vision and Language Models fo… | 19.75 | 2022-04-07 |
| UniT (ITM finetuned) | Winoground: Probing Vision and Language Models fo… | 19.50 | 2022-04-07 |
| LXMERT | Winoground: Probing Vision and Language Models fo… | 19.25 | 2022-04-07 |
| TIFA | What You See is What You Read? Improving Text-Ima… | 19.00 | 2023-05-17 |
| VSE++ (COCO, VGG) | Winoground: Probing Vision and Language Models fo… | 18.75 | 2022-04-07 |
| VSRN (COCO) | Winoground: Probing Vision and Language Models fo… | 17.50 | 2022-04-07 |
| PDM-CLIP (SelfEval) | SelfEval: Leveraging the discriminative nature of… | 17.00 | 2023-11-17 |
| OFA tiny (TLC-A) | Simple Token-Level Confidence Improves Caption Co… | 16.50 | 2023-05-11 |
| VisualBERT base | Winoground: Probing Vision and Language Models fo… | 15.50 | 2022-04-07 |
| MiniGPT-4-7B (BERTScore) | An Examination of the Compositionality of Large G… | 14.00 | 2023-08-21 |
| LLaVA-7B (BERTScore) | An Examination of the Compositionality of Large G… | 13.50 | 2023-08-21 |
| InstructBLIP-ZS-CoT | Compositional Chain-of-Thought Prompting for Larg… | 9.30 | 2023-11-27 |
| InstructBLIP | Compositional Chain-of-Thought Prompting for Larg… | 7.00 | 2023-11-27 |