| Model | Paper | mIoU | Date |
|---|---|---|---|
| ViT-P (InternImage-H) | The Missing Point in Vision Transformers for Univ… | 63.60 | 2025-05-26 |
| ONE-PEACE | ONE-PEACE: Exploring One General Representation M… | 63.00 | 2023-05-18 |
| InternImage-H (M3I Pre-training) | InternImage: Exploring Large-Scale Vision Foundat… | 62.90 | 2022-11-10 |
| InternImage-H | InternImage: Exploring Large-Scale Vision Foundat… | 62.90 | 2022-11-10 |
| M3I Pre-training (InternImage-H) | Towards All-in-one Pre-training via Maximizing Mu… | 62.90 | 2022-11-17 |
| BEiT-3 | Image as a Foreign Language: BEiT Pretraining for… | 62.80 | 2022-08-22 |
| EVA | EVA: Exploring the Limits of Masked Visual Repres… | 62.30 | 2022-11-14 |
| ViT-P (OneFormer, InternImage-H) | The Missing Point in Vision Transformers for Univ… | 61.60 | 2025-05-26 |
| ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | Vision Transformer Adapter for Dense Predictions | 61.50 | 2022-05-17 |
| FD-SwinV2-G | Contrastive Learning Rivals Masked Image Modeling… | 61.40 | 2022-05-27 |
| RevCol-H (Mask2Former) | Reversible Column Networks | 61.00 | 2022-12-22 |
| Mask DINO (SwinL, multi-scale) | Mask DINO: Towards A Unified Transformer-based Fr… | 60.80 | 2022-06-06 |
| ViT-Adapter-L (Mask2Former, BEiT pretrain) | Vision Transformer Adapter for Dense Predictions | 60.50 | 2022-05-17 |
| DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | DINOv2: Learning Robust Visual Features without S… | 60.20 | 2023-04-14 |
| ViT-P (OneFormer, DiNAT-L) | The Missing Point in Vision Transformers for Univ… | 59.90 | 2025-05-26 |
| SwinV2-G (UperNet) | Swin Transformer V2: Scaling Up Capacity and Reso… | 59.90 | 2021-11-18 |
| PIIP-LH6B (UperNet) | Parameter-Inverted Image Pyramid Networks | 59.90 | 2024-06-06 |
| SERNet-Former | SERNet-Former: Semantic Segmentation by Efficient… | 59.35 | 2024-01-28 |
| FocalNet-L (Mask2Former) | Focal Modulation Networks | 58.50 | 2022-03-22 |
| RSSeg-ViT-L (BEiT pretrain) | Representation Separation for Semantic Segmentati… | 58.40 | 2022-12-28 |
| EoMT (DINOv2-L, single-scale, 512x512) | Your ViT is Secretly an Image Segmentation Model | 58.40 | 2025-03-24 |
| ViT-Adapter-L (UperNet, BEiT pretrain) | Vision Transformer Adapter for Dense Predictions | 58.40 | 2022-05-17 |
| SegViT-v2 (BEiT-v2-Large) | SegViTv2: Exploring Efficient and Continual Seman… | 58.20 | 2023-06-09 |
| SeMask (SeMask Swin-L FaPN-Mask2Former) | SeMask: Semantically Masked Transformers for Sema… | 58.20 | 2021-12-23 |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former) | SeMask: Semantically Masked Transformers for Sema… | 58.20 | 2021-12-23 |
| DiNAT-L (Mask2Former) | Dilated Neighborhood Attention Transformer | 58.10 | 2022-09-29 |
| HorNet-L (Mask2Former) | HorNet: Efficient High-Order Spatial Interactions… | 57.90 | 2022-07-28 |
| Mask2Former (SwinL-FaPN) | Masked-attention Mask Transformer for Universal I… | 57.70 | 2021-12-02 |
| FASeg (SwinL) | Dynamic Focus-aware Positional Queries for Semant… | 57.70 | 2022-04-04 |
| RR (BEiT-L) | Region Rebalance for Long-Tailed Semantic Segment… | 57.70 | 2022-04-05 |
| MOAT-4 (IN-22K pretraining, single-scale) | MOAT: Alternating Mobile Convolution and Attentio… | 57.60 | 2022-10-04 |
| Frozen Backbone, SwinV2-G-ext22K (Mask2Former) | Could Giant Pretrained Image Models Extract Unive… | 57.60 | 2022-11-03 |
| SeMask (SeMask Swin-L Mask2Former) | SeMask: Semantically Masked Transformers for Sema… | 57.50 | 2021-12-23 |
| Mask2Former (SwinL) | Masked-attention Mask Transformer for Universal I… | 57.30 | 2021-12-02 |
| SenFormer (BEiT-L) | Efficient Self-Ensemble for Semantic Segmentation | 57.10 | 2021-11-26 |
| BEiT-L (ViT+UperNet) | BEiT: BERT Pre-Training of Image Transformers | 57.00 | 2021-06-15 |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | SeMask: Semantically Masked Transformers for Sema… | 57.00 | 2021-12-23 |
| MetaPrompt-SD | Harnessing Diffusion Models for Visual Perception… | 56.80 | 2023-12-22 |
| FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | FaPN: Feature-aligned Pyramid Network for Dense I… | 56.70 | 2021-08-16 |
| MOAT-3 (IN-22K pretraining, single-scale) | MOAT: Alternating Mobile Convolution and Attentio… | 56.50 | 2022-10-04 |
| Mask2Former (Swin-L-FaPN) | Masked-attention Mask Transformer for Universal I… | 56.40 | 2021-12-02 |
| SeMask (SeMask Swin-L MaskFormer) | SeMask: Semantically Masked Transformers for Sema… | 56.20 | 2021-12-23 |
| dBOT ViT-L (CLIP) | Exploring Target Representations for Masked Autoe… | 56.20 | 2022-09-08 |
| TADP | Text-image Alignment for Diffusion-based Percepti… | 55.90 | 2023-09-29 |
| CSWin-L (UperNet, ImageNet-22k pretrain) | CSWin Transformer: A General Vision Transformer B… | 55.70 | 2021-07-01 |
| UniRepLKNet-XL | UniRepLKNet: A Universal Perception Large-Kernel … | 55.60 | 2023-11-27 |
| Focal-L (UperNet, ImageNet-22k pretrain) | Focal Self-attention for Local-Global Interaction… | 55.40 | 2021-07-01 |
| InternImage-XL | InternImage: Exploring Large-Scale Vision Foundat… | 55.30 | 2022-11-10 |
| dBOT ViT-L | Exploring Target Representations for Masked Autoe… | 55.20 | 2022-09-08 |
| Mask2Former (Swin-B) | Masked-attention Mask Transformer for Universal I… | 55.10 | 2021-12-02 |
| ConvNeXt V2-H (FCMAE) | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 55.00 | 2023-01-02 |
| UniRepLKNet-L++ | UniRepLKNet: A Universal Perception Large-Kernel … | 55.00 | 2023-11-27 |
| DiNAT-Large (UperNet) | Dilated Neighborhood Attention Transformer | 54.90 | 2022-09-29 |
| TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | TransNeXt: Robust Foveal Visual Perception for Vi… | 54.70 | 2023-11-28 |
| MOAT-2 (IN-22K pretraining, single-scale) | MOAT: Alternating Mobile Convolution and Attentio… | 54.70 | 2022-10-04 |
| CAE (ViT-L, UperNet) | Context Autoencoder for Self-Supervised Represent… | 54.70 | 2022-02-07 |
| VAN-B6 | Visual Attention Network | 54.70 | 2022-02-20 |
| DiNAT_s-Large (UperNet) | Dilated Neighborhood Attention Transformer | 54.60 | 2022-09-29 |
| DDP (Swin-L, step-3) | DDP: Diffusion Model for Dense Visual Prediction | 54.40 | 2023-03-30 |
| PatchDiverse + Swin-L (multi-scale test, UperNet, ImageNet22k pretrain) | Vision Transformers with Patch Diversification | 54.40 | 2021-04-26 |
| VOLO-D5 | VOLO: Vision Outlooker for Visual Recognition | 54.30 | 2021-06-24 |
| K-Net | K-Net: Towards Unified Image Segmentation | 54.30 | 2021-06-28 |
| GPaCo (Swin-L) | Generalized Parametric Contrastive Learning | 54.30 | 2022-09-26 |
| SenFormer (Swin-L) | Efficient Self-Ensemble for Semantic Segmentation | 54.20 | 2021-11-26 |
| Swin V2-H | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 54.20 | 2023-01-02 |
| InternImage-L | InternImage: Exploring Large-Scale Vision Foundat… | 54.10 | 2022-11-10 |
| TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | TransNeXt: Robust Foveal Visual Perception for Vi… | 54.10 | 2023-11-28 |
| ConvNeXt-XL++ | A ConvNet for the 2020s | 54.00 | 2022-01-10 |
| Sequential Ensemble (SegFormer) | Sequential Ensembling for Semantic Segmentation | 54.00 | 2022-10-08 |
| MogaNet-XL (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 54.00 | 2022-11-07 |
| UniRepLKNet-B++ | UniRepLKNet: A Universal Perception Large-Kernel … | 53.90 | 2023-11-27 |
| MaskFormer (Swin-B) | Per-Pixel Classification is Not All You Need for … | 53.80 | 2021-07-13 |
| ConvNeXt-L++ | A ConvNet for the 2020s | 53.70 | 2022-01-10 |
| SwinV2-G-HTC++ | Swin Transformer V2: Scaling Up Capacity and Reso… | 53.70 | 2021-11-18 |
| ConvNeXt V2-L | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 53.70 | 2023-01-02 |
| Seg-L-Mask/16 (MS) | Segmenter: Transformer for Semantic Segmentation | 53.63 | 2021-05-12 |
| MAE (ViT-L, UperNet) | Masked Autoencoders Are Scalable Vision Learners | 53.60 | 2021-11-11 |
| SeMask (SeMask Swin-L FPN) | SeMask: Semantically Masked Transformers for Sema… | 53.52 | 2021-12-23 |
| Swin-L (UperNet, ImageNet-22k pretrain) | Swin Transformer: Hierarchical Vision Transformer… | 53.50 | 2021-03-25 |
| Swin-L | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 53.50 | 2023-01-02 |
| TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | TransNeXt: Robust Foveal Visual Perception for Vi… | 53.40 | 2023-11-28 |
| ConvNeXt-B++ | A ConvNet for the 2020s | 53.10 | 2022-01-10 |
| PatchConvNet-L120 (UperNet) | Augmenting Convolutional networks with attention-… | 52.90 | 2021-12-27 |
| dBOT ViT-B (CLIP) | Exploring Target Representations for Masked Autoe… | 52.90 | 2022-09-08 |
| PatchConvNet-B120 (UperNet) | Augmenting Convolutional networks with attention-… | 52.80 | 2021-12-27 |
| Swin-B | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 52.80 | 2023-01-02 |
| UniRepLKNet-S++ | UniRepLKNet: A Universal Perception Large-Kernel … | 52.70 | 2023-11-27 |
| ConvNeXt V2-B | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 52.10 | 2023-01-02 |
| DeBiFormer-B (IN1k pretrain, UperNet 160k) | DeBiFormer: Vision Transformer with Deformable Ag… | 52.00 | 2024-10-11 |
| LV-ViT-L (UperNet, MS) | All Tokens Matter: Token Labeling for Training Be… | 51.80 | 2021-04-22 |
| SegFormer-B5 | SegFormer: Simple and Efficient Design for Semant… | 51.80 | 2021-05-31 |
| BiFormer-B (IN1k pretrain, UperNet 160k) | BiFormer: Vision Transformer with Bi-Level Routin… | 51.70 | 2023-03-15 |
| ConvNeXt V2-L (Supervised) | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 51.60 | 2023-01-02 |
| Light-Ham (VAN-Huge) | Is Attention Better Than Matrix Decomposition? | 51.50 | 2021-09-09 |
| DAT-B++ | DAT++: Spatially Dynamic Vision Transformer with … | 51.50 | 2023-09-04 |
| CrossFormer (ImageNet1k-pretrain, UperNet, multi-scale test) | CrossFormer: A Versatile Vision Transformer Hingi… | 51.40 | 2021-07-31 |
| InternImage-B | InternImage: Exploring Large-Scale Vision Foundat… | 51.30 | 2022-11-10 |
| DAT-S++ | DAT++: Spatially Dynamic Vision Transformer with … | 51.20 | 2023-09-04 |
| ActiveMLP-L (UperNet) | Active Token Mixer | 51.10 | 2022-03-11 |
| SegFormer-B4 | SegFormer: Simple and Efficient Design for Semant… | 51.10 | 2021-05-31 |
| PatchConvNet-B60 (UperNet) | Augmenting Convolutional networks with attention-… | 51.10 | 2021-12-27 |
| Light-Ham (VAN-Large) | Is Attention Better Than Matrix Decomposition? | 51.00 | 2021-09-09 |
| TEC (ViT-B, UperNet) | Towards Sustainable Self-supervised Learning | 51.00 | 2022-10-20 |
| UniRepLKNet-S | UniRepLKNet: A Universal Perception Large-Kernel … | 51.00 | 2023-11-27 |
| SeMask (SeMask Swin-B FPN) | SeMask: Semantically Masked Transformers for Sema… | 50.98 | 2021-12-23 |
| InternImage-S | InternImage: Exploring Large-Scale Vision Foundat… | 50.90 | 2022-11-10 |
| MogaNet-L (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 50.90 | 2022-11-07 |
| dBOT ViT-B | Exploring Target Representations for Masked Autoe… | 50.80 | 2022-09-08 |
| UperNet-BiFormer-S (IN1k pretrain, UperNet 160k) | BiFormer: Vision Transformer with Bi-Level Routin… | 50.80 | 2023-03-15 |
| UperNet Shuffle-B | Shuffle Transformer: Rethinking Spatial Shuffle f… | 50.50 | 2021-06-07 |
| ConvNeXt V1-L | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 50.50 | 2023-01-02 |
| DiNAT-Base (UperNet) | Dilated Neighborhood Attention Transformer | 50.40 | 2022-09-29 |
| ELSA-Swin-S | ELSA: Enhanced Local Self-Attention for Vision Tr… | 50.30 | 2021-12-23 |
| DAT-T++ | DAT++: Spatially Dynamic Vision Transformer with … | 50.30 | 2023-09-04 |
| SETR-MLA (160k, MS) | Rethinking Semantic Segmentation from a Sequence-… | 50.28 | 2020-12-31 |
| VAN-Large (HamNet) | Visual Attention Network | 50.20 | 2022-02-20 |
| HRViT-b3 (SegFormer, SS) | Multi-Scale High-Resolution Vision Transformer fo… | 50.20 | 2021-11-01 |
| Twins-SVT-L (UperNet, ImageNet-1k pretrain) | Twins: Revisiting the Design of Spatial Attention… | 50.20 | 2021-04-28 |
| MogaNet-B (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 50.10 | 2022-11-07 |
| iBOT (ViT-B/16) | iBOT: Image BERT Pre-Training with Online Tokeniz… | 50.00 | 2021-11-15 |
| Seg-B-Mask/16 (MS, ViT-B) | Segmenter: Transformer for Semantic Segmentation | 50.00 | 2021-05-12 |
| ConvNeXt-B | A ConvNet for the 2020s | 49.90 | 2022-01-10 |
| DiNAT-Small (UperNet) | Dilated Neighborhood Attention Transformer | 49.90 | 2022-09-29 |
| ConvNeXt V1-B | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 49.90 | 2023-01-02 |
| NAT-Base | Neighborhood Attention Transformer | 49.70 | 2022-04-14 |
| Swin-B (UperNet, ImageNet-1k pretrain) | Swin Transformer: Hierarchical Vision Transformer… | 49.70 | 2021-03-25 |
| Seg-B/8 (MS, ViT-B) | Segmenter: Transformer for Semantic Segmentation | 49.61 | 2021-05-12 |
| ConvNeXt-S | A ConvNet for the 2020s | 49.60 | 2022-01-10 |
| Light-Ham (VAN-Base) | Is Attention Better Than Matrix Decomposition? | 49.60 | 2021-09-09 |
| NAT-Small | Neighborhood Attention Transformer | 49.50 | 2022-04-14 |
| DaViT-B | DaViT: Dual Attention Vision Transformers | 49.40 | 2022-04-07 |
| DAT-B (UperNet) | Vision Transformer with Deformable Attention | 49.38 | 2022-01-03 |
| PatchConvNet-S60 (UperNet) | Augmenting Convolutional networks with attention-… | 49.30 | 2021-12-27 |
| ColorMAE-Green-ViTB-1600 | ColorMAE: Exploring data-independent masking stra… | 49.30 | 2024-07-17 |
| Shift-B (UperNet) | When Shift Operation Meets Vision Transformer: An… | 49.20 | 2022-01-26 |
| MogaNet-S (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 49.20 | 2022-11-07 |
| UniRepLKNet-T | UniRepLKNet: A Universal Perception Large-Kernel … | 49.10 | 2023-11-27 |
| DPT-Hybrid | Vision Transformers for Dense Prediction | 49.02 | 2021-03-24 |
| GC ViT-B | Global Context Vision Transformers | 49.00 | 2022-06-20 |
| A2MIM (ViT-B) | Architecture-Agnostic Masked Image Modeling -- Fr… | 49.00 | 2022-05-27 |
| EfficientViT-B3 (r512) | EfficientViT: Multi-Scale Linear Attention for Hi… | 49.00 | 2022-05-29 |
| DiNAT-Tiny (UperNet) | Dilated Neighborhood Attention Transformer | 48.80 | 2022-09-29 |
| HRViT-b2 (SegFormer, SS) | Multi-Scale High-Resolution Vision Transformer fo… | 48.76 | 2021-11-01 |
| NAT-Tiny | Neighborhood Attention Transformer | 48.40 | 2022-04-14 |
| XCiT-M24/8 (UperNet) | XCiT: Cross-Covariance Image Transformers | 48.40 | 2021-06-17 |
| ResNeSt-200 | ResNeSt: Split-Attention Networks | 48.36 | 2020-04-19 |
| DAT-S (UperNet) | Vision Transformer with Deformable Attention | 48.31 | 2022-01-03 |
| GC ViT-S | Global Context Vision Transformers | 48.30 | 2022-06-20 |
| InternImage-T | InternImage: Exploring Large-Scale Vision Foundat… | 48.10 | 2022-11-10 |
| VAN-Large | Visual Attention Network | 48.10 | 2022-02-20 |
| XCiT-S24/8 (UperNet) | XCiT: Cross-Covariance Image Transformers | 48.10 | 2021-06-17 |
| MaskFormer (ResNet-101) | Per-Pixel Classification is Not All You Need for … | 48.10 | 2021-07-13 |
| MAE (ViT-B, UperNet) | Masked Autoencoders Are Scalable Vision Learners | 48.10 | 2021-11-11 |
| HRNetV2 + OCR + RMI (PaddleClas pretrained) | Segmentation Transformer: Object-Contextual Repre… | 47.98 | 2019-09-24 |
| Shift-B | When Shift Operation Meets Vision Transformer: An… | 47.90 | 2022-01-26 |
| Shift-S | When Shift Operation Meets Vision Transformer: An… | 47.80 | 2022-01-26 |
| MogaNet-S (Semantic FPN) | MogaNet: Multi-order Gated Aggregation Network | 47.70 | 2022-11-07 |
| SeMask (SeMask Swin-S FPN) | SeMask: Semantically Masked Transformers for Sema… | 47.63 | 2021-12-23 |
| ResNeSt-269 | ResNeSt: Split-Attention Networks | 47.60 | 2020-04-19 |
| UperNet Shuffle-T | Shuffle Transformer: Rethinking Spatial Shuffle f… | 47.60 | 2021-06-07 |
| CondNet (ResNeSt-101) | CondNet: Conditional Classifier for Scene Segment… | 47.54 | 2021-09-21 |
| tiny-MOAT-3 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 47.50 | 2022-10-04 |
| CondNet (ResNet-101) | CondNet: Conditional Classifier for Scene Segment… | 47.38 | 2021-09-21 |
| DiNAT-Mini (UperNet) | Dilated Neighborhood Attention Transformer | 47.20 | 2022-09-29 |
| DCNAS | DCNAS: Densely Connected Neural Architecture Sear… | 47.12 | 2020-03-26 |
| XCiT-S24/8 (Semantic-FPN) | XCiT: Cross-Covariance Image Transformers | 47.10 | 2021-06-17 |
| ResNeSt-101 | ResNeSt: Split-Attention Networks | 46.91 | 2020-04-19 |
| XCiT-M24/8 (Semantic-FPN) | XCiT: Cross-Covariance Image Transformers | 46.90 | 2021-06-17 |
| HamNet (ResNet-101) | Is Attention Better Than Matrix Decomposition? | 46.80 | 2021-09-09 |
| Sequential Ensemble (DeepLabv3+) | Sequential Ensembling for Semantic Segmentation | 46.80 | 2022-10-08 |
| ConvNeXt-T | A ConvNet for the 2020s | 46.70 | 2022-01-10 |
| VAN-Base (Semantic-FPN) | Visual Attention Network | 46.70 | 2022-02-20 |
| XCiT-S12/8 (UperNet) | XCiT: Cross-Covariance Image Transformers | 46.60 | 2021-06-17 |
| GC ViT-T | Global Context Vision Transformers | 46.50 | 2022-06-20 |
| NAT-Mini | Neighborhood Attention Transformer | 46.40 | 2022-04-14 |
| DaViT-T | DaViT: Dual Attention Vision Transformers | 46.30 | 2022-04-07 |
| Shift-T | When Shift Operation Meets Vision Transformer: An… | 46.30 | 2022-01-26 |
| CPN (ResNet-101) | Context Prior for Scene Segmentation | 46.27 | 2020-04-03 |
| MultiMAE (ViT-B) | MultiMAE: Multi-modal Multi-task Masked Autoencod… | 46.20 | 2022-04-04 |
| PyConvSegNet-152 | Pyramidal Convolution: Rethinking Convolutional N… | 45.99 | 2020-06-20 |
| DNL | Disentangled Non-Local Neural Networks | 45.97 | 2020-06-11 |
| ACNet (ResNet-101) | Adaptive Context Network for Scene Parsing | 45.90 | 2019-11-05 |
| HRViT-b1 (SegFormer, SS) | Multi-Scale High-Resolution Vision Transformer fo… | 45.88 | 2021-11-01 |
| OCR (HRNetV2-W48) | Segmentation Transformer: Object-Contextual Repre… | 45.66 | 2019-09-24 |
| SPNet (ResNet-101) | Strip Pooling: Rethinking Spatial Pooling for Sce… | 45.60 | 2020-03-30 |
| Swin-T (UperNet) MoBY | Self-Supervised Learning with Swin Transformers | 45.58 | 2021-05-10 |
| DAT-T (UperNet) | Vision Transformer with Deformable Attention | 45.54 | 2022-01-03 |
| iBOT (ViT-S/16) | iBOT: Image BERT Pre-Training with Online Tokeniz… | 45.40 | 2021-11-15 |
| EANet (ResNet-101) | Beyond Self-attention: External Attention using T… | 45.33 | 2021-05-05 |
| OCR (ResNet-101) | Segmentation Transformer: Object-Contextual Repre… | 45.28 | 2019-09-24 |
| Asymmetric ALNN | Asymmetric Non-local Neural Networks for Semantic… | 45.24 | 2019-08-21 |
| Light-Ham (VAN-Small, D=256) | Is Attention Better Than Matrix Decomposition? | 45.20 | 2021-09-09 |
| LaU-regression-loss | Location-aware Upsampling for Semantic Segmentati… | 45.02 | 2019-11-13 |
| PSPNet | Pyramid Scene Parsing Network | 44.94 | 2016-12-04 |
| tiny-MOAT-2 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 44.90 | 2022-10-04 |
| EncNet | Context Encoding for Semantic Segmentation | 44.65 | 2018-03-23 |
| FastViT-MA36 | FastViT: A Fast Hybrid Vision Transformer using S… | 44.60 | 2023-03-24 |
| LaU-offset-loss | Location-aware Upsampling for Semantic Segmentati… | 44.55 | 2019-11-13 |
| EncNet + JPU | FastFCN: Rethinking Dilated Convolution in the Ba… | 44.34 | 2019-03-28 |
| XCiT-S12/8 (Semantic-FPN) | XCiT: Cross-Covariance Image Transformers | 44.20 | 2021-06-17 |
| Auto-DeepLab-L | Auto-DeepLab: Hierarchical Neural Architecture Se… | 43.98 | 2019-01-10 |
| DSSPN (ResNet-101) | Dynamic-structured Semantic Propagation Network | 43.68 | 2018-03-16 |
| PSPNet (ResNet-152) | Pyramid Scene Parsing Network | 43.51 | 2016-12-04 |
| PSPNet (ResNet-101) | Pyramid Scene Parsing Network | 43.29 | 2016-12-04 |
| HRNetV2 | High-Resolution Representations for Labeling Pixe… | 43.20 | 2019-04-09 |
| SeMask (SeMask Swin-T FPN) | SeMask: Semantically Masked Transformers for Sema… | 43.16 | 2021-12-23 |
| tiny-MOAT-1 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 43.10 | 2022-10-04 |
| VAN-Small | Visual Attention Network | 42.90 | 2022-02-20 |
| FastViT-SA36 | FastViT: A Fast Hybrid Vision Transformer using S… | 42.90 | 2023-03-24 |
| PoolFormer-M48 | MetaFormer Is Actually What You Need for Vision | 42.70 | 2021-11-22 |
| UperNet (ResNet-101) | Unified Perceptual Parsing for Scene Understanding | 42.66 | 2018-07-26 |
| tiny-MOAT-0 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 41.20 | 2022-10-04 |
| FastViT-SA24 | FastViT: A Fast Hybrid Vision Transformer using S… | 41.00 | 2023-03-24 |
| RefineNet | RefineNet: Multi-Path Refinement Networks for Hig… | 40.70 | 2016-11-20 |
| FBNetV5 | FBNetV5: Neural Architecture Search for Multiple … | 40.40 | 2021-11-19 |
| ConvMLP-L | ConvMLP: Hierarchical Convolutional MLPs for Visi… | 40.00 | 2021-09-09 |
| ConvMLP-M | ConvMLP: Hierarchical Convolutional MLPs for Visi… | 38.60 | 2021-09-09 |
| VAN-Tiny | Visual Attention Network | 38.50 | 2022-02-20 |
| A2MIM (ResNet-50) | Architecture-Agnostic Masked Image Modeling -- Fr… | 38.30 | 2022-05-27 |
| iBOT (ViT-B/16) (linear head) | iBOT: Image BERT Pre-Training with Online Tokeniz… | 38.30 | 2021-11-15 |
| FastViT-SA12 | FastViT: A Fast Hybrid Vision Transformer using S… | 38.00 | 2023-03-24 |
| SegFormer-B0 | SegFormer: Simple and Efficient Design for Semant… | 37.40 | 2021-05-31 |
| MUXNet-m + PPM | MUXConv: Information Multiplexing in Convolutiona… | 35.80 | 2020-03-31 |
| ConvMLP-S | ConvMLP: Hierarchical Convolutional MLPs for Visi… | 35.80 | 2021-09-09 |
| MUXNet-m + C1 | MUXConv: Information Multiplexing in Convolutiona… | 32.42 | 2020-03-31 |
| DilatedNet | Multi-Scale Context Aggregation by Dilated Convol… | 32.31 | 2015-11-23 |
| FCN | Fully Convolutional Networks for Semantic Segment… | 29.39 | 2014-11-14 |
| SegNet | SegNet: A Deep Convolutional Encoder-Decoder Arch… | 21.64 | 2015-11-02 |
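The scores in this table are consistent with mean IoU (mIoU) on the ADE20K validation set, the standard metric for this benchmark. As a reference for how the number behind each row is computed, here is a minimal sketch (my own illustration, not code from any of the listed papers): mIoU accumulates a single confusion matrix over every image in the split, then averages per-class IoU = TP / (TP + FP + FN) over the classes present. The function name `mean_iou` and the `ignore_index=255` convention are assumptions, not taken from the source.

```python
import numpy as np

def mean_iou(preds, gts, num_classes, ignore_index=255):
    """Dataset-level mean IoU: accumulate one confusion matrix over
    all images, then average per-class IoU over classes that appear
    in the ground truth. `ignore_index` marks unlabeled pixels."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        pred, gt = pred.ravel(), gt.ravel()
        valid = gt != ignore_index           # drop unlabeled pixels
        conf += np.bincount(
            num_classes * gt[valid] + pred[valid],
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)  # rows: GT class, cols: predicted class
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp               # predicted as class c, labeled otherwise
    fn = conf.sum(axis=1) - tp               # labeled class c, predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)   # guard division by zero for empty classes
    present = conf.sum(axis=1) > 0           # average only over classes in the GT
    return float(iou[present].mean())
```

For ADE20K scene parsing this would be called with `num_classes=150`. Note that the entries above differ in evaluation protocol (single- vs. multi-scale testing, crop size, pretraining data), so equal mIoU values do not imply identical setups.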