| PaLI-X-VPD | Visual Program Distillation: Distilling Tools and… | 66.80 | 2023-12-05 |
| PaLM-E-562B | PaLM-E: An Embodied Multimodal Language Model | 66.10 | 2023-03-06 |
| PaLI-X (Single-task FT) | PaLI-X: On Scaling up a Multilingual Vision and L… | 66.10 | 2023-05-29 |
| PaLI 17B | PaLI: A Jointly-Scaled Multilingual Language-Imag… | 64.50 | 2022-09-14 |
| Prophet | Prophet: Prompting Large Language Models with Com… | 62.50 | 2023-03-03 |
| RA-VQA-v2 (BLIP 2) | Fine-grained Late-interaction Multi-modal Retriev… | 62.08 | 2023-09-29 |
| A Simple Baseline for KB-VQA | A Simple Baseline for Knowledge-Based Visual Ques… | 61.20 | 2023-10-20 |
| PromptCap | PromptCap: Prompt-Guided Task-Aware Image Caption… | 60.40 | 2022-11-15 |
| ReVeaL WIT + CC12M + Wikidata + VQA-2 | REVEAL: Retrieval-Augmented Visual-Language Pre-T… | 59.10 | 2022-12-10 |
| Lyrics | Lyrics: Boosting Fine-grained Language-Vision Ali… | 58.20 | 2023-12-08 |
| REVIVE (Ensemble) | REVIVE: Regional Visual Representation Matters in… | 58.00 | 2022-06-02 |
| REVIVE (Single) | REVIVE: Regional Visual Representation Matters in… | 56.60 | 2022-06-02 |
| RA-VQA-v2 (T5-large) | Fine-grained Late-interaction Multi-modal Retriev… | 54.85 | 2023-09-29 |
| RA-VQA (T5-large) | Retrieval Augmented Visual Question Answering wit… | 54.48 | 2022-10-07 |
| VK-OOD | Differentiable Outlier Detection Enable Robust De… | 52.40 | 2023-02-11 |
| RA-VQA-FrDPR (T5-large) | Retrieval Augmented Visual Question Answering wit… | 51.22 | 2022-10-07 |
| Flamingo80B | Flamingo: a Visual Language Model for Few-Shot Le… | 50.60 | 2022-04-29 |
| HYDRA | HYDRA: A Hyper Agent for Dynamic Compositional Vi… | 48.60 | 2024-03-19 |
| PICa | An Empirical Study of GPT-3 for Few-Shot Knowledg… | 48.00 | 2021-09-10 |
| LaKo | LaKo: Knowledge-driven Visual Question Answering … | 47.01 | 2022-07-26 |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 45.90 | 2023-01-30 |
| Flamingo9B | Flamingo: a Visual Language Model for Few-Shot Le… | 44.70 | 2022-04-29 |
| VLC-BERT | VLC-BERT: Visual Question Answering with Contextu… | 43.10 | 2022-10-24 |
| T5(Tan and Bansal, 2019) + Prefixes | LaKo: Knowledge-driven Visual Question Answering … | 42.03 | 2022-07-26 |
| Flamingo3B | Flamingo: a Visual Language Model for Few-Shot Le… | 41.20 | 2022-04-29 |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 40.70 | 2023-01-30 |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 39.40 | 2023-01-30 |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 36.40 | 2023-01-30 |
| PNP-VQA | Plug-and-Play VQA: Zero-shot VQA by Conjoining La… | 35.90 | 2022-10-17 |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 31.70 | 2023-01-30 |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 30.20 | 2023-01-30 |
| FewVLM | A Good Prompt Is Worth Millions of Parameters: Lo… | 16.50 | 2021-10-16 |
| MetaLM | Language Models are General-Purpose Interfaces | 11.40 | 2022-06-13 |
| VLKD(ViT-B/16) | Enabling Multimodal Generation on CLIP via Vision… | 10.50 | 2021-11-16 |
| Frozen | Multimodal Few-Shot Learning with Frozen Language… | 5.90 | 2021-06-25 |