| Model | Paper | mIoU | Date |
|---|---|---|---|
| ViT-P (InternImage-H) | The Missing Point in Vision Transformers for Univ… | 63.60 | 2025-05-26 |
| ONE-PEACE | ONE-PEACE: Exploring One General Representation M… | 63.00 | 2023-05-18 |
| InternImage-H (M3I Pre-training) | InternImage: Exploring Large-Scale Vision Foundat… | 62.90 | 2022-11-10 |
| InternImage-H | InternImage: Exploring Large-Scale Vision Foundat… | 62.90 | 2022-11-10 |
| M3I Pre-training (InternImage-H) | Towards All-in-one Pre-training via Maximizing Mu… | 62.90 | 2022-11-17 |
| BEiT-3 | Image as a Foreign Language: BEiT Pretraining for… | 62.80 | 2022-08-22 |
| EVA | EVA: Exploring the Limits of Masked Visual Repres… | 62.30 | 2022-11-14 |
| ViT-P (OneFormer, InternImage-H) | The Missing Point in Vision Transformers for Univ… | 61.60 | 2025-05-26 |
| ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | Vision Transformer Adapter for Dense Predictions | 61.50 | 2022-05-17 |
| FD-SwinV2-G | Contrastive Learning Rivals Masked Image Modeling… | 61.40 | 2022-05-27 |
| RevCol-H (Mask2Former) | Reversible Column Networks | 61.00 | 2022-12-22 |
| Mask DINO (SwinL, multi-scale) | Mask DINO: Towards A Unified Transformer-based Fr… | 60.80 | 2022-06-06 |
| ViT-Adapter-L (Mask2Former, BEiT pretrain) | Vision Transformer Adapter for Dense Predictions | 60.50 | 2022-05-17 |
| DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | DINOv2: Learning Robust Visual Features without S… | 60.20 | 2023-04-14 |
| ViT-P (OneFormer, DiNAT-L) | The Missing Point in Vision Transformers for Univ… | 59.90 | 2025-05-26 |
| SwinV2-G (UperNet) | Swin Transformer V2: Scaling Up Capacity and Reso… | 59.90 | 2021-11-18 |
| PIIP-LH6B (UperNet) | Parameter-Inverted Image Pyramid Networks | 59.90 | 2024-06-06 |
| SERNet-Former | SERNet-Former: Semantic Segmentation by Efficient… | 59.35 | 2024-01-28 |
| FocalNet-L (Mask2Former) | Focal Modulation Networks | 58.50 | 2022-03-22 |
| RSSeg-ViT-L (BEiT pretrain) | Representation Separation for Semantic Segmentati… | 58.40 | 2022-12-28 |
| EoMT (DINOv2-L, single-scale, 512x512) | Your ViT is Secretly an Image Segmentation Model | 58.40 | 2025-03-24 |
| ViT-Adapter-L (UperNet, BEiT pretrain) | Vision Transformer Adapter for Dense Predictions | 58.40 | 2022-05-17 |
| SegViT-v2 (BEiT-v2-Large) | SegViTv2: Exploring Efficient and Continual Seman… | 58.20 | 2023-06-09 |
| SeMask (SeMask Swin-L FaPN-Mask2Former) | SeMask: Semantically Masked Transformers for Sema… | 58.20 | 2021-12-23 |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former) | SeMask: Semantically Masked Transformers for Sema… | 58.20 | 2021-12-23 |
| DiNAT-L (Mask2Former) | Dilated Neighborhood Attention Transformer | 58.10 | 2022-09-29 |
| HorNet-L (Mask2Former) | HorNet: Efficient High-Order Spatial Interactions… | 57.90 | 2022-07-28 |
| Mask2Former (SwinL-FaPN) | Masked-attention Mask Transformer for Universal I… | 57.70 | 2021-12-02 |
| FASeg (SwinL) | Dynamic Focus-aware Positional Queries for Semant… | 57.70 | 2022-04-04 |
| RR (BEiT-L) | Region Rebalance for Long-Tailed Semantic Segment… | 57.70 | 2022-04-05 |
| MOAT-4 (IN-22K pretraining, single-scale) | MOAT: Alternating Mobile Convolution and Attentio… | 57.60 | 2022-10-04 |
| Frozen Backbone, SwinV2-G-ext22K (Mask2Former) | Could Giant Pretrained Image Models Extract Unive… | 57.60 | 2022-11-03 |
| SeMask (SeMask Swin-L Mask2Former) | SeMask: Semantically Masked Transformers for Sema… | 57.50 | 2021-12-23 |
| Mask2Former (SwinL) | Masked-attention Mask Transformer for Universal I… | 57.30 | 2021-12-02 |
| SenFormer (BEiT-L) | Efficient Self-Ensemble for Semantic Segmentation | 57.10 | 2021-11-26 |
| BEiT-L (ViT+UperNet) | BEiT: BERT Pre-Training of Image Transformers | 57.00 | 2021-06-15 |
| SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | SeMask: Semantically Masked Transformers for Sema… | 57.00 | 2021-12-23 |
| MetaPrompt-SD | Harnessing Diffusion Models for Visual Perception… | 56.80 | 2023-12-22 |
| FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | FaPN: Feature-aligned Pyramid Network for Dense I… | 56.70 | 2021-08-16 |
| MOAT-3 (IN-22K pretraining, single-scale) | MOAT: Alternating Mobile Convolution and Attentio… | 56.50 | 2022-10-04 |
| Mask2Former (Swin-L-FaPN) | Masked-attention Mask Transformer for Universal I… | 56.40 | 2021-12-02 |
| SeMask (SeMask Swin-L MaskFormer) | SeMask: Semantically Masked Transformers for Sema… | 56.20 | 2021-12-23 |
| dBOT ViT-L (CLIP) | Exploring Target Representations for Masked Autoe… | 56.20 | 2022-09-08 |
| TADP | Text-image Alignment for Diffusion-based Percepti… | 55.90 | 2023-09-29 |
| CSWin-L (UperNet, ImageNet-22k pretrain) | CSWin Transformer: A General Vision Transformer B… | 55.70 | 2021-07-01 |
| UniRepLKNet-XL | UniRepLKNet: A Universal Perception Large-Kernel … | 55.60 | 2023-11-27 |
| Focal-L (UperNet, ImageNet-22k pretrain) | Focal Self-attention for Local-Global Interaction… | 55.40 | 2021-07-01 |
| InternImage-XL | InternImage: Exploring Large-Scale Vision Foundat… | 55.30 | 2022-11-10 |
| dBOT ViT-L | Exploring Target Representations for Masked Autoe… | 55.20 | 2022-09-08 |
| Mask2Former (Swin-B) | Masked-attention Mask Transformer for Universal I… | 55.10 | 2021-12-02 |
| ConvNeXt V2-H (FCMAE) | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 55.00 | 2023-01-02 |
| UniRepLKNet-L++ | UniRepLKNet: A Universal Perception Large-Kernel … | 55.00 | 2023-11-27 |
| DiNAT-Large (UperNet) | Dilated Neighborhood Attention Transformer | 54.90 | 2022-09-29 |
| TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | TransNeXt: Robust Foveal Visual Perception for Vi… | 54.70 | 2023-11-28 |
| MOAT-2 (IN-22K pretraining, single-scale) | MOAT: Alternating Mobile Convolution and Attentio… | 54.70 | 2022-10-04 |
| CAE (ViT-L, UperNet) | Context Autoencoder for Self-Supervised Represent… | 54.70 | 2022-02-07 |
| VAN-B6 | Visual Attention Network | 54.70 | 2022-02-20 |
| DiNAT_s-Large (UperNet) | Dilated Neighborhood Attention Transformer | 54.60 | 2022-09-29 |
| DDP (Swin-L, step-3) | DDP: Diffusion Model for Dense Visual Prediction | 54.40 | 2023-03-30 |
| PatchDiverse + Swin-L (multi-scale test, UperNet, ImageNet22k pretrain) | Vision Transformers with Patch Diversification | 54.40 | 2021-04-26 |
| VOLO-D5 | VOLO: Vision Outlooker for Visual Recognition | 54.30 | 2021-06-24 |
| K-Net | K-Net: Towards Unified Image Segmentation | 54.30 | 2021-06-28 |
| GPaCo (Swin-L) | Generalized Parametric Contrastive Learning | 54.30 | 2022-09-26 |
| SenFormer (Swin-L) | Efficient Self-Ensemble for Semantic Segmentation | 54.20 | 2021-11-26 |
| Swin V2-H | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 54.20 | 2023-01-02 |
| InternImage-L | InternImage: Exploring Large-Scale Vision Foundat… | 54.10 | 2022-11-10 |
| TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | TransNeXt: Robust Foveal Visual Perception for Vi… | 54.10 | 2023-11-28 |
| ConvNeXt-XL++ | A ConvNet for the 2020s | 54.00 | 2022-01-10 |
| Sequential Ensemble (SegFormer) | Sequential Ensembling for Semantic Segmentation | 54.00 | 2022-10-08 |
| MogaNet-XL (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 54.00 | 2022-11-07 |
| UniRepLKNet-B++ | UniRepLKNet: A Universal Perception Large-Kernel … | 53.90 | 2023-11-27 |
| MaskFormer (Swin-B) | Per-Pixel Classification is Not All You Need for … | 53.80 | 2021-07-13 |
| ConvNeXt-L++ | A ConvNet for the 2020s | 53.70 | 2022-01-10 |
| SwinV2-G-HTC++ | Swin Transformer V2: Scaling Up Capacity and Reso… | 53.70 | 2021-11-18 |
| ConvNeXt V2-L | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 53.70 | 2023-01-02 |
| Seg-L-Mask/16 (MS) | Segmenter: Transformer for Semantic Segmentation | 53.63 | 2021-05-12 |
| MAE (ViT-L, UperNet) | Masked Autoencoders Are Scalable Vision Learners | 53.60 | 2021-11-11 |
| SeMask (SeMask Swin-L FPN) | SeMask: Semantically Masked Transformers for Sema… | 53.52 | 2021-12-23 |
| Swin-L (UperNet, ImageNet-22k pretrain) | Swin Transformer: Hierarchical Vision Transformer… | 53.50 | 2021-03-25 |
| Swin-L | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 53.50 | 2023-01-02 |
| TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | TransNeXt: Robust Foveal Visual Perception for Vi… | 53.40 | 2023-11-28 |
| ConvNeXt-B++ | A ConvNet for the 2020s | 53.10 | 2022-01-10 |
| PatchConvNet-L120 (UperNet) | Augmenting Convolutional networks with attention-… | 52.90 | 2021-12-27 |
| dBOT ViT-B (CLIP) | Exploring Target Representations for Masked Autoe… | 52.90 | 2022-09-08 |
| PatchConvNet-B120 (UperNet) | Augmenting Convolutional networks with attention-… | 52.80 | 2021-12-27 |
| Swin-B | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 52.80 | 2023-01-02 |
| UniRepLKNet-S++ | UniRepLKNet: A Universal Perception Large-Kernel … | 52.70 | 2023-11-27 |
| ConvNeXt V2-B | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 52.10 | 2023-01-02 |
| DeBiFormer-B (IN1k pretrain, UperNet 160k) | DeBiFormer: Vision Transformer with Deformable Ag… | 52.00 | 2024-10-11 |
| LV-ViT-L (UperNet, MS) | All Tokens Matter: Token Labeling for Training Be… | 51.80 | 2021-04-22 |
| SegFormer-B5 | SegFormer: Simple and Efficient Design for Semant… | 51.80 | 2021-05-31 |
| BiFormer-B (IN1k pretrain, UperNet 160k) | BiFormer: Vision Transformer with Bi-Level Routin… | 51.70 | 2023-03-15 |
| ConvNeXt V2-L (Supervised) | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 51.60 | 2023-01-02 |
| Light-Ham (VAN-Huge) | Is Attention Better Than Matrix Decomposition? | 51.50 | 2021-09-09 |
| DAT-B++ | DAT++: Spatially Dynamic Vision Transformer with … | 51.50 | 2023-09-04 |
| CrossFormer (ImageNet1k-pretrain, UperNet, multi-scale test) | CrossFormer: A Versatile Vision Transformer Hingi… | 51.40 | 2021-07-31 |
| InternImage-B | InternImage: Exploring Large-Scale Vision Foundat… | 51.30 | 2022-11-10 |
| DAT-S++ | DAT++: Spatially Dynamic Vision Transformer with … | 51.20 | 2023-09-04 |
| ActiveMLP-L (UperNet) | Active Token Mixer | 51.10 | 2022-03-11 |
| SegFormer-B4 | SegFormer: Simple and Efficient Design for Semant… | 51.10 | 2021-05-31 |
| PatchConvNet-B60 (UperNet) | Augmenting Convolutional networks with attention-… | 51.10 | 2021-12-27 |
| Light-Ham (VAN-Large) | Is Attention Better Than Matrix Decomposition? | 51.00 | 2021-09-09 |
| TEC (ViT-B, UperNet) | Towards Sustainable Self-supervised Learning | 51.00 | 2022-10-20 |
| UniRepLKNet-S | UniRepLKNet: A Universal Perception Large-Kernel … | 51.00 | 2023-11-27 |
| SeMask (SeMask Swin-B FPN) | SeMask: Semantically Masked Transformers for Sema… | 50.98 | 2021-12-23 |
| InternImage-S | InternImage: Exploring Large-Scale Vision Foundat… | 50.90 | 2022-11-10 |
| MogaNet-L (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 50.90 | 2022-11-07 |
| dBOT ViT-B | Exploring Target Representations for Masked Autoe… | 50.80 | 2022-09-08 |
| UperNet-BiFormer-S (IN1k pretrain, UperNet 160k) | BiFormer: Vision Transformer with Bi-Level Routin… | 50.80 | 2023-03-15 |
| UperNet Shuffle-B | Shuffle Transformer: Rethinking Spatial Shuffle f… | 50.50 | 2021-06-07 |
| ConvNeXt V1-L | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 50.50 | 2023-01-02 |
| DiNAT-Base (UperNet) | Dilated Neighborhood Attention Transformer | 50.40 | 2022-09-29 |
| ELSA-Swin-S | ELSA: Enhanced Local Self-Attention for Vision Tr… | 50.30 | 2021-12-23 |
| DAT-T++ | DAT++: Spatially Dynamic Vision Transformer with … | 50.30 | 2023-09-04 |
| SETR-MLA (160k, MS) | Rethinking Semantic Segmentation from a Sequence-… | 50.28 | 2020-12-31 |
| VAN-Large (HamNet) | Visual Attention Network | 50.20 | 2022-02-20 |
| HRViT-b3 (SegFormer, SS) | Multi-Scale High-Resolution Vision Transformer fo… | 50.20 | 2021-11-01 |
| Twins-SVT-L (UperNet, ImageNet-1k pretrain) | Twins: Revisiting the Design of Spatial Attention… | 50.20 | 2021-04-28 |
| MogaNet-B (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 50.10 | 2022-11-07 |
| iBOT (ViT-B/16) | iBOT: Image BERT Pre-Training with Online Tokeniz… | 50.00 | 2021-11-15 |
| Seg-B-Mask/16 (MS, ViT-B) | Segmenter: Transformer for Semantic Segmentation | 50.00 | 2021-05-12 |
| ConvNeXt-B | A ConvNet for the 2020s | 49.90 | 2022-01-10 |
| DiNAT-Small (UperNet) | Dilated Neighborhood Attention Transformer | 49.90 | 2022-09-29 |
| ConvNeXt V1-B | ConvNeXt V2: Co-designing and Scaling ConvNets wi… | 49.90 | 2023-01-02 |
| NAT-Base | Neighborhood Attention Transformer | 49.70 | 2022-04-14 |
| Swin-B (UperNet, ImageNet-1k pretrain) | Swin Transformer: Hierarchical Vision Transformer… | 49.70 | 2021-03-25 |
| Seg-B/8 (MS, ViT-B) | Segmenter: Transformer for Semantic Segmentation | 49.61 | 2021-05-12 |
| ConvNeXt-S | A ConvNet for the 2020s | 49.60 | 2022-01-10 |
| Light-Ham (VAN-Base) | Is Attention Better Than Matrix Decomposition? | 49.60 | 2021-09-09 |
| NAT-Small | Neighborhood Attention Transformer | 49.50 | 2022-04-14 |
| DaViT-B | DaViT: Dual Attention Vision Transformers | 49.40 | 2022-04-07 |
| DAT-B (UperNet) | Vision Transformer with Deformable Attention | 49.38 | 2022-01-03 |
| PatchConvNet-S60 (UperNet) | Augmenting Convolutional networks with attention-… | 49.30 | 2021-12-27 |
| ColorMAE-Green-ViTB-1600 | ColorMAE: Exploring data-independent masking stra… | 49.30 | 2024-07-17 |
| Shift-B (UperNet) | When Shift Operation Meets Vision Transformer: An… | 49.20 | 2022-01-26 |
| MogaNet-S (UperNet) | MogaNet: Multi-order Gated Aggregation Network | 49.20 | 2022-11-07 |
| UniRepLKNet-T | UniRepLKNet: A Universal Perception Large-Kernel … | 49.10 | 2023-11-27 |
| DPT-Hybrid | Vision Transformers for Dense Prediction | 49.02 | 2021-03-24 |
| GC ViT-B | Global Context Vision Transformers | 49.00 | 2022-06-20 |
| A2MIM (ViT-B) | Architecture-Agnostic Masked Image Modeling -- Fr… | 49.00 | 2022-05-27 |
| EfficientViT-B3 (r512) | EfficientViT: Multi-Scale Linear Attention for Hi… | 49.00 | 2022-05-29 |
| DiNAT-Tiny (UperNet) | Dilated Neighborhood Attention Transformer | 48.80 | 2022-09-29 |
| HRViT-b2 (SegFormer, SS) | Multi-Scale High-Resolution Vision Transformer fo… | 48.76 | 2021-11-01 |
| NAT-Tiny | Neighborhood Attention Transformer | 48.40 | 2022-04-14 |
| XCiT-M24/8 (UperNet) | XCiT: Cross-Covariance Image Transformers | 48.40 | 2021-06-17 |
| ResNeSt-200 | ResNeSt: Split-Attention Networks | 48.36 | 2020-04-19 |
| DAT-S (UperNet) | Vision Transformer with Deformable Attention | 48.31 | 2022-01-03 |
| GC ViT-S | Global Context Vision Transformers | 48.30 | 2022-06-20 |
| InternImage-T | InternImage: Exploring Large-Scale Vision Foundat… | 48.10 | 2022-11-10 |
| VAN-Large | Visual Attention Network | 48.10 | 2022-02-20 |
| XCiT-S24/8 (UperNet) | XCiT: Cross-Covariance Image Transformers | 48.10 | 2021-06-17 |
| MaskFormer (ResNet-101) | Per-Pixel Classification is Not All You Need for … | 48.10 | 2021-07-13 |
| MAE (ViT-B, UperNet) | Masked Autoencoders Are Scalable Vision Learners | 48.10 | 2021-11-11 |
| HRNetV2 + OCR + RMI (PaddleClas pretrained) | Segmentation Transformer: Object-Contextual Repre… | 47.98 | 2019-09-24 |
| Shift-B | When Shift Operation Meets Vision Transformer: An… | 47.90 | 2022-01-26 |
| Shift-S | When Shift Operation Meets Vision Transformer: An… | 47.80 | 2022-01-26 |
| MogaNet-S (Semantic FPN) | MogaNet: Multi-order Gated Aggregation Network | 47.70 | 2022-11-07 |
| SeMask (SeMask Swin-S FPN) | SeMask: Semantically Masked Transformers for Sema… | 47.63 | 2021-12-23 |
| ResNeSt-269 | ResNeSt: Split-Attention Networks | 47.60 | 2020-04-19 |
| UperNet Shuffle-T | Shuffle Transformer: Rethinking Spatial Shuffle f… | 47.60 | 2021-06-07 |
| CondNet (ResNeSt-101) | CondNet: Conditional Classifier for Scene Segment… | 47.54 | 2021-09-21 |
| tiny-MOAT-3 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 47.50 | 2022-10-04 |
| CondNet (ResNet-101) | CondNet: Conditional Classifier for Scene Segment… | 47.38 | 2021-09-21 |
| DiNAT-Mini (UperNet) | Dilated Neighborhood Attention Transformer | 47.20 | 2022-09-29 |
| DCNAS | DCNAS: Densely Connected Neural Architecture Sear… | 47.12 | 2020-03-26 |
| XCiT-S24/8 (Semantic-FPN) | XCiT: Cross-Covariance Image Transformers | 47.10 | 2021-06-17 |
| ResNeSt-101 | ResNeSt: Split-Attention Networks | 46.91 | 2020-04-19 |
| XCiT-M24/8 (Semantic-FPN) | XCiT: Cross-Covariance Image Transformers | 46.90 | 2021-06-17 |
| HamNet (ResNet-101) | Is Attention Better Than Matrix Decomposition? | 46.80 | 2021-09-09 |
| Sequential Ensemble (DeepLabv3+) | Sequential Ensembling for Semantic Segmentation | 46.80 | 2022-10-08 |
| ConvNeXt-T | A ConvNet for the 2020s | 46.70 | 2022-01-10 |
| VAN-Base (Semantic-FPN) | Visual Attention Network | 46.70 | 2022-02-20 |
| XCiT-S12/8 (UperNet) | XCiT: Cross-Covariance Image Transformers | 46.60 | 2021-06-17 |
| GC ViT-T | Global Context Vision Transformers | 46.50 | 2022-06-20 |
| NAT-Mini | Neighborhood Attention Transformer | 46.40 | 2022-04-14 |
| DaViT-T | DaViT: Dual Attention Vision Transformers | 46.30 | 2022-04-07 |
| Shift-T | When Shift Operation Meets Vision Transformer: An… | 46.30 | 2022-01-26 |
| CPN (ResNet-101) | Context Prior for Scene Segmentation | 46.27 | 2020-04-03 |
| MultiMAE (ViT-B) | MultiMAE: Multi-modal Multi-task Masked Autoencod… | 46.20 | 2022-04-04 |
| PyConvSegNet-152 | Pyramidal Convolution: Rethinking Convolutional N… | 45.99 | 2020-06-20 |
| DNL | Disentangled Non-Local Neural Networks | 45.97 | 2020-06-11 |
| ACNet (ResNet-101) | Adaptive Context Network for Scene Parsing | 45.90 | 2019-11-05 |
| HRViT-b1 (SegFormer, SS) | Multi-Scale High-Resolution Vision Transformer fo… | 45.88 | 2021-11-01 |
| OCR (HRNetV2-W48) | Segmentation Transformer: Object-Contextual Repre… | 45.66 | 2019-09-24 |
| SPNet (ResNet-101) | Strip Pooling: Rethinking Spatial Pooling for Sce… | 45.60 | 2020-03-30 |
| Swin-T (UperNet) MoBY | Self-Supervised Learning with Swin Transformers | 45.58 | 2021-05-10 |
| DAT-T (UperNet) | Vision Transformer with Deformable Attention | 45.54 | 2022-01-03 |
| iBOT (ViT-S/16) | iBOT: Image BERT Pre-Training with Online Tokeniz… | 45.40 | 2021-11-15 |
| EANet (ResNet-101) | Beyond Self-attention: External Attention using T… | 45.33 | 2021-05-05 |
| OCR (ResNet-101) | Segmentation Transformer: Object-Contextual Repre… | 45.28 | 2019-09-24 |
| Asymmetric ALNN | Asymmetric Non-local Neural Networks for Semantic… | 45.24 | 2019-08-21 |
| Light-Ham (VAN-Small, D=256) | Is Attention Better Than Matrix Decomposition? | 45.20 | 2021-09-09 |
| LaU-regression-loss | Location-aware Upsampling for Semantic Segmentati… | 45.02 | 2019-11-13 |
| PSPNet | Pyramid Scene Parsing Network | 44.94 | 2016-12-04 |
| tiny-MOAT-2 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 44.90 | 2022-10-04 |
| EncNet | Context Encoding for Semantic Segmentation | 44.65 | 2018-03-23 |
| FastViT-MA36 | FastViT: A Fast Hybrid Vision Transformer using S… | 44.60 | 2023-03-24 |
| LaU-offset-loss | Location-aware Upsampling for Semantic Segmentati… | 44.55 | 2019-11-13 |
| EncNet + JPU | FastFCN: Rethinking Dilated Convolution in the Ba… | 44.34 | 2019-03-28 |
| XCiT-S12/8 (Semantic-FPN) | XCiT: Cross-Covariance Image Transformers | 44.20 | 2021-06-17 |
| Auto-DeepLab-L | Auto-DeepLab: Hierarchical Neural Architecture Se… | 43.98 | 2019-01-10 |
| DSSPN (ResNet-101) | Dynamic-structured Semantic Propagation Network | 43.68 | 2018-03-16 |
| PSPNet (ResNet-152) | Pyramid Scene Parsing Network | 43.51 | 2016-12-04 |
| PSPNet (ResNet-101) | Pyramid Scene Parsing Network | 43.29 | 2016-12-04 |
| HRNetV2 | High-Resolution Representations for Labeling Pixe… | 43.20 | 2019-04-09 |
| SeMask (SeMask Swin-T FPN) | SeMask: Semantically Masked Transformers for Sema… | 43.16 | 2021-12-23 |
| tiny-MOAT-1 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 43.10 | 2022-10-04 |
| VAN-Small | Visual Attention Network | 42.90 | 2022-02-20 |
| FastViT-SA36 | FastViT: A Fast Hybrid Vision Transformer using S… | 42.90 | 2023-03-24 |
| PoolFormer-M48 | MetaFormer Is Actually What You Need for Vision | 42.70 | 2021-11-22 |
| UperNet (ResNet-101) | Unified Perceptual Parsing for Scene Understanding | 42.66 | 2018-07-26 |
| tiny-MOAT-0 (IN-1K pretraining, single scale) | MOAT: Alternating Mobile Convolution and Attentio… | 41.20 | 2022-10-04 |
| FastViT-SA24 | FastViT: A Fast Hybrid Vision Transformer using S… | 41.00 | 2023-03-24 |
| RefineNet | RefineNet: Multi-Path Refinement Networks for Hig… | 40.70 | 2016-11-20 |
| FBNetV5 | FBNetV5: Neural Architecture Search for Multiple … | 40.40 | 2021-11-19 |
| ConvMLP-L | ConvMLP: Hierarchical Convolutional MLPs for Visi… | 40.00 | 2021-09-09 |
| ConvMLP-M | ConvMLP: Hierarchical Convolutional MLPs for Visi… | 38.60 | 2021-09-09 |
| VAN-Tiny | Visual Attention Network | 38.50 | 2022-02-20 |
| A2MIM (ResNet-50) | Architecture-Agnostic Masked Image Modeling -- Fr… | 38.30 | 2022-05-27 |
| iBOT (ViT-B/16) (linear head) | iBOT: Image BERT Pre-Training with Online Tokeniz… | 38.30 | 2021-11-15 |
| FastViT-SA12 | FastViT: A Fast Hybrid Vision Transformer using S… | 38.00 | 2023-03-24 |
| SegFormer-B0 | SegFormer: Simple and Efficient Design for Semant… | 37.40 | 2021-05-31 |
| MUXNet-m + PPM | MUXConv: Information Multiplexing in Convolutiona… | 35.80 | 2020-03-31 |
| ConvMLP-S | ConvMLP: Hierarchical Convolutional MLPs for Visi… | 35.80 | 2021-09-09 |
| MUXNet-m + C1 | MUXConv: Information Multiplexing in Convolutiona… | 32.42 | 2020-03-31 |
| DilatedNet | Multi-Scale Context Aggregation by Dilated Convol… | 32.31 | 2015-11-23 |
| FCN | Fully Convolutional Networks for Semantic Segment… | 29.39 | 2014-11-14 |
| SegNet | SegNet: A Deep Convolutional Encoder-Decoder Arch… | 21.64 | 2015-11-02 |
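The scores in this table are consistent with mean IoU (mIoU) on the ADE20K validation set, the standard metric for this benchmark. As a reference for how the number behind each row is computed, here is a minimal sketch (my own illustration, not code from any of the listed papers): mIoU accumulates a single confusion matrix over every image in the split, then averages per-class IoU = TP / (TP + FP + FN) over the classes present. The function name `mean_iou` and the `ignore_index=255` convention are assumptions, not taken from the source.

```python
import numpy as np

def mean_iou(preds, gts, num_classes, ignore_index=255):
    """Dataset-level mean IoU: accumulate one confusion matrix over
    all images, then average per-class IoU over classes that appear
    in the ground truth. `ignore_index` marks unlabeled pixels."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        pred, gt = pred.ravel(), gt.ravel()
        valid = gt != ignore_index           # drop unlabeled pixels
        conf += np.bincount(
            num_classes * gt[valid] + pred[valid],
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)  # rows: GT class, cols: predicted class
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp               # predicted as class c, labeled otherwise
    fn = conf.sum(axis=1) - tp               # labeled class c, predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)   # guard division by zero for empty classes
    present = conf.sum(axis=1) > 0           # average only over classes in the GT
    return float(iou[present].mean())
```

For ADE20K scene parsing this would be called with `num_classes=150`. Note that the entries above differ in evaluation protocol (single- vs. multi-scale testing, crop size, pretraining data), so equal mIoU values do not imply identical setups.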