
VMamba: Visual State Space Model

Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu (2024)

Paper Information
arXiv ID: 2401.10166
Venue: Neural Information Processing Systems
Domain: Computer vision
SOTA Claim: Yes
Code: https://github.com/MzeroMiko/VMamba
Reproducibility: 8/10

Abstract

Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
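
To make the four-route scanning concrete, here is a minimal PyTorch-style sketch of how a (B, C, H, W) feature map could be unfolded into four 1D sequences and folded back onto the grid. The function names (`cross_scan`, `cross_merge`) and the shapes are illustrative assumptions, not the official VMamba API.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D token sequences:
    row-major, column-major, and the reversals of both."""
    row_major = x.flatten(2)                              # (B, C, H*W): rows, left to right
    col_major = x.transpose(2, 3).flatten(2)              # (B, C, H*W): columns, top to bottom
    routes = torch.stack([row_major, col_major], dim=1)   # (B, 2, C, H*W)
    return torch.cat([routes, routes.flip(-1)], dim=1)    # (B, 4, C, H*W), reversed routes appended

def cross_merge(routes: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold the four scanned sequences back onto the 2D grid and sum them."""
    B, _, C, _ = routes.shape
    fwd = routes[:, :2] + routes[:, 2:].flip(-1)          # undo the reversals, (B, 2, C, H*W)
    row = fwd[:, 0].view(B, C, H, W)                      # row-major route back to 2D
    col = fwd[:, 1].view(B, C, W, H).transpose(2, 3)      # column-major route back to 2D
    return row + col
```

In SS2D, each of the four sequences is processed by a 1D selective scan before the results are merged back onto the 2D grid, which is what lets a sequential scan gather context from four directions.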

Summary

The paper introduces VMamba, a vision backbone that employs Visual State-Space (VSS) blocks to perform visual representation learning in linear time complexity. VMamba targets the trade-off faced by established backbones: Vision Transformers provide global receptive fields but their self-attention scales quadratically with spatial resolution, while convolutional networks scale linearly but aggregate context only locally. The core mechanism is the 2D Selective Scan (SS2D), which unfolds feature maps along four scanning routes so that a 1D selective scan can gather contextual information across the whole image. VMamba delivers strong results on image classification (ImageNet-1K), object detection (COCO), and semantic segmentation (ADE20K), outperforming existing backbones while retaining a clear advantage in computational efficiency. The paper also describes architectural and implementation enhancements that improve inference speed and scalability, along with visualizations of the effective receptive field and activation maps.
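
As a rough illustration of where the linear time complexity comes from, below is a deliberately simple, sequential reference for the selective-scan (S6) recurrence that SS2D applies along each scanning route. The shapes, the diagonal state matrix, and the parameter names are assumptions made for clarity; the released implementation replaces this Python loop with a fused, hardware-aware parallel scan.

```python
import torch

def selective_scan_1d(x, delta, A, B, C, D):
    """Sequential reference for the selective scan over one scanning route.

    x:     (L, d)  tokens along the route
    delta: (L, d)  input-dependent step sizes
    A:     (d, N)  per-channel (diagonal) state matrix, typically negative for stability
    B, C:  (L, N)  input-dependent input/output projections
    D:     (d,)    skip connection
    """
    L, d = x.shape
    h = x.new_zeros(d, A.shape[1])                                  # hidden state, (d, N)
    ys = []
    for t in range(L):
        A_bar = torch.exp(delta[t, :, None] * A)                    # discretized transition, (d, N)
        Bx_bar = delta[t, :, None] * B[t][None, :] * x[t, :, None]  # discretized input, (d, N)
        h = A_bar * h + Bx_bar                                      # constant-size state update
        ys.append(h @ C[t] + D * x[t])                              # readout plus skip, (d,)
    return torch.stack(ys)                                          # (L, d)
```

Each token triggers exactly one constant-size state update, so the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.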

Methods

This paper employs the following methods:

  • Visual State-Space (VSS) blocks (a block-level sketch follows this list)
  • 2D Selective Scan (SS2D)

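Each VSS block wraps the SS2D operator in a pre-norm residual layer, and stacking such blocks across downsampled stages yields the VMamba backbone. The sketch below shows only this structural skeleton under assumed names (`VSSBlockSketch`, the `ss2d` argument); the released blocks additionally include gating, a depthwise convolution, and tuned projection sizes that are omitted here.

```python
import torch
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    """Structural sketch of a Visual State-Space (VSS) block:
    a pre-norm residual wrapper around an SS2D-style token mixer."""

    def __init__(self, dim: int, ss2d: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, dim)
        self.ss2d = ss2d                      # four-route scan -> selective scan -> merge
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) channel-last feature map
        residual = x
        x = self.out_proj(self.ss2d(self.in_proj(self.norm(x))))
        return residual + x

# Quick shape check with a stand-in mixer (nn.Identity replaces the real SS2D operator):
block = VSSBlockSketch(dim=96, ss2d=nn.Identity())
out = block(torch.randn(2, 56, 56, 96))       # -> (2, 56, 56, 96)
```
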
Models Used

  • VMamba-Tiny
  • VMamba-Small
  • VMamba-Base

Datasets

The following datasets were used in this research:

  • ImageNet-1K
  • COCO
  • ADE20K

Evaluation Metrics

  • Top-1 accuracy
  • mAP
  • mIoU

Results

  • VMamba-Base achieves a top-1 accuracy of 83.9% on ImageNet-1K, surpassing Swin by +0.4%.
  • VMamba-Tiny/Small/Base achieve mAPs of 47.3%/48.7%/49.2% in object detection on COCO, outpacing Swin by 4.6%/3.9%/2.3%.
  • VMamba-Tiny/Small/Base achieve mIoUs of 47.9%/50.6%/51.0% on ADE20K, surpassing Swin by 3.4%/3.0%/2.9%.

Limitations

The authors identified the following limitations:

  • Compatibility of pre-training methods with SSM-based architectures remains unexplored.
  • Limited computational resources prevented exploration of VMamba's architecture at a larger scale.

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: NVIDIA Tesla-A100

Keywords

vision backbone, State Space Model, SS2D, linear complexity, visual perception

External Resources

  • Official source code: https://github.com/MzeroMiko/VMamba