| ERNIE-ViL 2.0 | ERNIE-ViL 2.0: Multi-view Contrastive Learning fo… | 93.30 | 2022-09-30 |
| X2-VLM (large) | X$^2$-VLM: All-In-One Pre-trained Model For Visio… | 91.80 | 2022-11-22 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 91.00 | 2023-05-29 |
| X2-VLM (base) | X$^2$-VLM: All-In-One Pre-trained Model For Visio… | 90.40 | 2022-11-22 |
| BEiT-3 | Image as a Foreign Language: BEiT Pretraining for… | 90.30 | 2022-08-22 |
| OmniVL (14M) | OmniVL:One Foundation Model for Image-Language an… | 87.90 | 2022-09-15 |
| X-VLM (base) | Multi-Grained Vision Language Pre-Training: Align… | 86.90 | 2021-11-16 |
| VSE-Gradient | Dissecting Deep Metric Learning Losses for Image-… | 86.30 | 2022-10-21 |
| ALIGN | Scaling Up Visual and Vision-Language Representat… | 84.90 | 2021-02-11 |
| IAIS | Learning Relation Alignment for Calibrated Cross-… | 76.86 | 2021-05-28 |
| ViSTA | ViSTA: Vision and Scene Text Aggregation for Cros… | 75.80 | 2022-03-31 |
| 3SHNet | 3SHNet: Boosting Image-Sentence Retrieval via Vis… | 69.50 | 2024-04-26 |
| DSMD | Dynamic Self-adaptive Multiscale Distillation fro… | 68.40 | 2024-04-16 |
| ViLT-B/32 | ViLT: Vision-and-Language Transformer Without Con… | 64.40 | 2021-02-05 |
| RCAR | Plug-and-Play Regulators for Image-Text Matching | 62.60 | 2023-03-23 |
| SGRAF | Similarity Reasoning and Filtration for Image-Tex… | 58.50 | 2021-01-05 |
| GSMN | Graph Structured Network for Image-Text Matching | 57.40 | 2020-04-01 |
| Dual-Path (ResNet) | Dual-Path Convolutional Image-Text Embeddings wit… | 55.60 | 2017-11-15 |
| IMRAM | IMRAM: Iterative Matching with Recurrent Attentio… | 53.90 | 2020-03-08 |
| SCAN | Stacked Cross Attention for Image-Text Matching | 48.60 | 2018-03-21 |
| SCO (ResNet) | Learning Semantic Concepts and Order for Image an… | 41.10 | 2017-12-06 |
| VSE++ (ResNet) | VSE++: Improving Visual-Semantic Embeddings with … | 39.60 | 2017-07-18 |
| Dual-Path (ResNet) | Dual-Path Convolutional Image-Text Embeddings wit… | 39.10 | 2017-11-15 |