| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 152.50 | 2023-04-17 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 149.00 | 2023-05-29 |
| VirTex (ResNet-101) | VirTex: Learning Visual Representations from Text… | 94.00 | 2020-06-11 |
| BLIP-FuseCap | FuseCap: Leveraging Large Language Models for Enr… | 78.50 | 2023-05-28 |
| mPLUG | mPLUG: Effective and Efficient Vision-Language Le… | 46.50 | 2022-05-24 |
| OFA | OFA: Unifying Architectures, Tasks, and Modalitie… | 44.90 | 2022-02-07 |
| GIT | GIT: A Generative Image-to-text Transformer for V… | 44.10 | 2022-05-27 |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 43.70 | 2023-01-30 |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 43.50 | 2023-01-30 |
| ExpansionNet v2 (No VL pretraining) | Exploiting Multiple Sequence Lengths in Fast End … | 42.70 | 2022-08-13 |
| LEMON | Scaling Up Vision-Language Pre-training for Image… | 42.60 | 2021-11-24 |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLIP-2: Bootstrapping Language-Image Pre-training… | 42.40 | 2023-01-30 |
| GRIT (No VL pretraining, base) | GRIT: Faster and Better Image captioning Transfor… | 42.40 | 2022-07-20 |
| Prompt Tuning | Prompt Tuning for Generative Multimodal Pretraine… | 41.81 | 2022-08-04 |
| Oscar | Oscar: Object-Semantics Aligned Pre-training for … | 41.70 | 2020-04-13 |
| Xmodal-Ctx | Beyond a Pre-Trained Object Detector: Cross-Modal… | 41.40 | 2022-05-09 |
| Xmodal-Ctx + OSCAR | Beyond a Pre-Trained Object Detector: Cross-Modal… | 41.30 | 2022-05-09 |
| X-VLM (base) | Multi-Grained Vision Language Pre-Training: Align… | 41.30 | 2021-11-16 |
| VinVL | VinVL: Revisiting Visual Representations in Visio… | 41.00 | 2021-01-02 |
| CoCa | CoCa: Contrastive Captioners are Image-Text Found… | 40.90 | 2022-05-04 |
| SimVLM | SimVLM: Simple Visual Language Model Pretraining … | 40.60 | 2021-08-24 |
| Prismer | Prismer: A Vision-Language Model with Multi-Task … | 40.40 | 2023-03-04 |
| PTP-BLIP (14M) | Position-guided Text Prompt for Vision-Language P… | 40.10 | 2022-12-19 |
| L-Verse | L-Verse: Bidirectional Generation Between Image a… | 39.90 | 2021-11-22 |
| Xmodal-Ctx | Beyond a Pre-Trained Object Detector: Cross-Modal… | 39.70 | 2022-05-09 |
| X-Transformer | X-Linear Attention Networks for Image Captioning | 39.70 | 2020-03-31 |
| AoANet + VC | Visual Commonsense R-CNN | 39.50 | 2020-02-27 |
| Transformer_NSC | A Better Variant of Self-Critical Sequence Traini… | 39.40 | 2020-03-22 |
| Meshed-Memory Transformer | Meshed-Memory Transformer for Image Captioning | 39.10 | 2019-12-17 |
| CLIP Text Encoder (RL w/ CIDEr-reward) | Fine-grained Image Captioning with CLIP Reward | 38.20 | 2022-05-26 |
| RefineCap (w/ REINFORCE) | RefineCap: Concept-Aware Refinement for Image Cap… | 37.80 | 2021-09-08 |
| RDN | Reflective Decoding Network for Image Captioning | 37.30 | 2019-08-30 |
| SmallCap (d=16, Large) | SmallCap: Lightweight Image Captioning Prompted w… | 37.20 | 2022-09-30 |
| ClipCap (Transformer) | ClipCap: CLIP Prefix for Image Captioning | 33.53 | 2021-11-18 |
| ClipCap (MLP + GPT2 tuning) | ClipCap: CLIP Prefix for Image Captioning | 32.15 | 2021-11-18 |
| CapDec | Text-Only Training for Image Captioning using Noi… | 26.40 | 2022-11-01 |
| From Captions to Visual Concepts and Back | From Captions to Visual Concepts and Back | 25.70 | 2014-11-18 |
| LaDiC | LaDiC: Are Diffusion Models Really Inferior to Au… | 22.40 | 2024-04-16 |
| VLKD (ViT-B/16) | Enabling Multimodal Generation on CLIP via Vision… | 16.70 | 2021-11-16 |
| LaDiC (ours, 30 steps) | LaDiC: Are Diffusion Models Really Inferior to Au… | 0.38 | 2024-04-16 |