Venue: Neural Information Processing Systems
Designing computationally efficient network architectures remains an ongoing need in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that operates with linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks built around the 2D Selective Scan (SS2D) module. By traversing four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, facilitating the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input-scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
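To make the four-route traversal concrete, the following is a minimal PyTorch-style sketch of the cross-scan/cross-merge idea described in the abstract. The function names `cross_scan` and `cross_merge`, and the summation-based merge, are illustrative assumptions rather than the released implementation; the 1D selective scan applied to each route is omitted.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D scanning routes.

    Returns a (B, 4, C, H*W) tensor: row-major, column-major, and their reversals.
    """
    row_major = x.flatten(2)                              # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)              # top-to-bottom, left-to-right
    routes = torch.stack([row_major, col_major], dim=1)   # (B, 2, C, L)
    return torch.cat([routes, routes.flip([-1])], dim=1)  # add reversed routes -> (B, 4, C, L)

def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold four scanned sequences (B, 4, C, H*W) back into a (B, C, H, W) map by summation."""
    B, _, C, L = y.shape
    y = y[:, :2] + y[:, 2:].flip([-1])                    # undo the two reversed routes
    y_row = y[:, 0]                                       # already in row-major order
    y_col = y[:, 1].view(B, C, W, H).transpose(2, 3).flatten(2)  # back to row-major order
    return (y_row + y_col).view(B, C, H, W)
```

Since each of the four routes is processed by a 1D selective scan of length H*W, the overall cost stays linear in the number of image tokens.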
The paper introduces VMamba, a vision backbone that employs Visual State-Space (VSS) blocks for efficient visual representation learning with linear time complexity. VMamba addresses the computational-complexity challenges of traditional models, such as convolutional neural networks and Vision Transformers, especially at large spatial resolutions. The core mechanism is the 2D Selective Scan (SS2D), which enables effective gathering of contextual information from visual data. VMamba demonstrates superior performance on benchmark tasks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K, outperforming existing models while maintaining a significant advantage in computational efficiency. The paper also discusses enhancements made to improve inference speed and scalability, alongside visualizations of the model's effective receptive field and activation maps.
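The linear time complexity stems from the 1D selective scan inherited from Mamba. Below is a naive reference sketch of the discretized state-space recurrence, assuming the standard Mamba-style parameterization with input-dependent step sizes and projections; the name `selective_scan_ref` is hypothetical, and real implementations replace this Python loop with a hardware-aware parallel scan.

```python
import torch

def selective_scan_ref(u, delta, A, B, C, D):
    """Naive reference of the 1D selective scan (linear in sequence length L).

    u:     (B, D_in, L)  input sequence
    delta: (B, D_in, L)  input-dependent step sizes
    A:     (D_in, N)     state matrix
    B, C:  (B, N, L)     input-dependent projection parameters
    D:     (D_in,)       skip connection
    """
    Bsz, Din, L = u.shape
    N = A.shape[1]
    h = u.new_zeros(Bsz, Din, N)                   # hidden state
    ys = []
    for t in range(L):
        dt = delta[:, :, t].unsqueeze(-1)          # (B, D_in, 1)
        A_bar = torch.exp(dt * A)                  # zero-order-hold discretization
        B_bar = dt * B[:, :, t].unsqueeze(1)       # simplified discretization (B_bar ~= dt * B)
        h = A_bar * h + B_bar * u[:, :, t].unsqueeze(-1)
        y = (h * C[:, :, t].unsqueeze(1)).sum(-1)  # (B, D_in)
        ys.append(y)
    return torch.stack(ys, dim=-1) + D.unsqueeze(-1) * u   # (B, D_in, L)
```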
This paper employs the following methods:
- Visual State-Space (VSS) blocks (a structural sketch follows this list)
- 2D Selective Scan (SS2D)
- VMamba-Tiny
- VMamba-Small
- VMamba-Base
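As referenced above, here is a minimal sketch of how a VSS block could compose SS2D with a standard pre-norm residual design. The exact gating, depthwise-convolution, and normalization details of the released VMamba code may differ; `VSSBlock` below simply illustrates SS2D acting as a token mixer followed by an MLP.

```python
import torch
import torch.nn as nn

class VSSBlock(nn.Module):
    """Illustrative Visual State-Space block: pre-norm SS2D mixer + MLP, both residual.

    `ss2d` is assumed to map (B, H, W, C) -> (B, H, W, C); the released VMamba
    implementation may arrange these components differently.
    """
    def __init__(self, dim: int, ss2d: nn.Module, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.ss2d = ss2d                      # 2D selective-scan token mixer
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        x = x + self.ss2d(self.norm1(x))      # contextualize tokens via SS2D
        x = x + self.mlp(self.norm2(x))       # channel mixing
        return x
```

Stacking such blocks at progressively reduced spatial resolutions yields the VMamba-Tiny/Small/Base variants listed above.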
The following datasets were used in this research:
- ImageNet-1K (image classification)
- COCO (object detection)
- ADE20K (semantic segmentation)
The paper reports the following key results:
- VMamba-Base achieves a top-1 accuracy of 83.9% on ImageNet-1K, surpassing Swin by +0.4%.
- VMamba-Tiny/Small/Base achieve mAPs of 47.3%/48.7%/49.2% for object detection on COCO, outpacing Swin by 4.6%/3.9%/2.3%.
- VMamba-Tiny/Small/Base achieve mIoUs of 47.9%/50.6%/51.0% for semantic segmentation on ADE20K, surpassing Swin by 3.4%/3.0%/2.9%.
The authors identified the following limitations:
- Compatibility of pre-training methods with SSM-based architectures remains unexplored.
- Limited computational resources prevented exploration of VMamba's architecture at a larger scale.
The following computational resources were used:
- Number of GPUs: 8
- GPU Type: NVIDIA Tesla-A100
Keywords: vision backbone, State Space Model, SS2D, linear complexity, visual perception