Venue: Neural Information Processing Systems
Designing computationally efficient network architectures remains an ongoing need in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that operates with linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks built around the 2D Selective Scan (SS2D) module. By traversing four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, facilitating the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input-scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
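To make the four-route traversal concrete, the following is a minimal PyTorch-style sketch of the cross-scan/cross-merge idea described in the abstract. The function names `cross_scan` and `cross_merge`, and the summation-based merge, are illustrative assumptions rather than the released implementation; the 1D selective scan applied to each route is omitted.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D scanning routes.

    Returns a (B, 4, C, H*W) tensor: row-major, column-major, and their reversals.
    """
    row_major = x.flatten(2)                              # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)              # top-to-bottom, left-to-right
    routes = torch.stack([row_major, col_major], dim=1)   # (B, 2, C, L)
    return torch.cat([routes, routes.flip([-1])], dim=1)  # add reversed routes -> (B, 4, C, L)

def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold four scanned sequences (B, 4, C, H*W) back into a (B, C, H, W) map by summation."""
    B, _, C, L = y.shape
    y = y[:, :2] + y[:, 2:].flip([-1])                    # undo the two reversed routes
    y_row = y[:, 0]                                       # already in row-major order
    y_col = y[:, 1].view(B, C, W, H).transpose(2, 3).flatten(2)  # back to row-major order
    return (y_row + y_col).view(B, C, H, W)
```

Since each of the four routes is processed by a 1D selective scan of length H*W, the overall cost stays linear in the number of image tokens.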
The paper introduces VMamba, a vision backbone that employs Visual State-Space (VSS) blocks for efficient visual representation learning with linear time complexity. VMamba addresses the computational-complexity challenges of traditional models, such as convolutional neural networks and Vision Transformers, especially at large spatial resolutions. The core mechanism is the 2D Selective Scan (SS2D), which enables effective gathering of contextual information from visual data. VMamba demonstrates superior performance on benchmark tasks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K, outperforming existing models while maintaining a significant advantage in computational efficiency. The paper also discusses enhancements made to improve inference speed and scalability, alongside visualizations of the model's effective receptive field and activation maps.
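The linear time complexity stems from the 1D selective scan inherited from Mamba. Below is a naive reference sketch of the discretized state-space recurrence, assuming the standard Mamba-style parameterization with input-dependent step sizes and projections; the name `selective_scan_ref` is hypothetical, and real implementations replace this Python loop with a hardware-aware parallel scan.

```python
import torch

def selective_scan_ref(u, delta, A, B, C, D):
    """Naive reference of the 1D selective scan (linear in sequence length L).

    u:     (B, D_in, L)  input sequence
    delta: (B, D_in, L)  input-dependent step sizes
    A:     (D_in, N)     state matrix
    B, C:  (B, N, L)     input-dependent projection parameters
    D:     (D_in,)       skip connection
    """
    Bsz, Din, L = u.shape
    N = A.shape[1]
    h = u.new_zeros(Bsz, Din, N)                   # hidden state
    ys = []
    for t in range(L):
        dt = delta[:, :, t].unsqueeze(-1)          # (B, D_in, 1)
        A_bar = torch.exp(dt * A)                  # zero-order-hold discretization
        B_bar = dt * B[:, :, t].unsqueeze(1)       # simplified discretization (B_bar ~= dt * B)
        h = A_bar * h + B_bar * u[:, :, t].unsqueeze(-1)
        y = (h * C[:, :, t].unsqueeze(1)).sum(-1)  # (B, D_in)
        ys.append(y)
    return torch.stack(ys, dim=-1) + D.unsqueeze(-1) * u   # (B, D_in, L)
```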
This paper employs the following methods:
- Visual State-Space (VSS) blocks (a structural sketch follows this list)
- 2D Selective Scan (SS2D)
- VMamba-Tiny
- VMamba-Small
- VMamba-Base
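As referenced above, here is a minimal sketch of how a VSS block could compose SS2D with a standard pre-norm residual design. The exact gating, depthwise-convolution, and normalization details of the released VMamba code may differ; `VSSBlock` below simply illustrates SS2D acting as a token mixer followed by an MLP.

```python
import torch
import torch.nn as nn

class VSSBlock(nn.Module):
    """Illustrative Visual State-Space block: pre-norm SS2D mixer + MLP, both residual.

    `ss2d` is assumed to map (B, H, W, C) -> (B, H, W, C); the released VMamba
    implementation may arrange these components differently.
    """
    def __init__(self, dim: int, ss2d: nn.Module, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.ss2d = ss2d                      # 2D selective-scan token mixer
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        x = x + self.ss2d(self.norm1(x))      # contextualize tokens via SS2D
        x = x + self.mlp(self.norm2(x))       # channel mixing
        return x
```

Stacking such blocks at progressively reduced spatial resolutions yields the VMamba-Tiny/Small/Base variants listed above.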
The following datasets were used in this research:
- ImageNet-1K (image classification)
- COCO (object detection)
- ADE20K (semantic segmentation)
The paper reports the following key results:
- VMamba-Base achieves a top-1 accuracy of 83.9% on ImageNet-1K, surpassing Swin by +0.4%.
- VMamba-Tiny/Small/Base achieve mAPs of 47.3%/48.7%/49.2% for object detection on COCO, outpacing Swin by 4.6%/3.9%/2.3%.
- VMamba-Tiny/Small/Base achieve mIoUs of 47.9%/50.6%/51.0% for semantic segmentation on ADE20K, surpassing Swin by 3.4%/3.0%/2.9%.
The authors identified the following limitations:
- Compatibility of pre-training methods with SSM-based architectures remains unexplored.
- Limited computational resources prevented exploration of VMamba's architecture at a larger scale.
The following computational resources were used:
- Number of GPUs: 8
- GPU Type: NVIDIA Tesla-A100
Keywords: vision backbone, State Space Model, SS2D, linear complexity, visual perception