Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
This paper introduces LocalMamba, a visual state space model designed to enhance the processing of visual data. By addressing limitations in traditional sequence modeling approaches, LocalMamba employs a novel local scanning technique that involves segmenting images into distinct windows. This method preserves local 2D dependencies crucial for image interpretation, thus improving performance in vision tasks. The authors present a dynamic method to optimize scan patterns for different layers, leading to improved modeling of local features while maintaining a global perspective. The study reports experimental results demonstrating LocalMamba's superior accuracy compared to existing models on tasks such as image classification, object detection, and semantic segmentation, particularly outperforming Vim and VMamba on benchmarks. Key contributions include a new scanning methodology, an adaptive scan direction search algorithm, and architecture variants for different modeling needs, establishing a foundation for future advancements in visual state space modeling.
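To make the local scanning idea concrete, below is a minimal PyTorch-style sketch of how tokens could be reordered so that each window's tokens stay adjacent in the scan sequence. The function names, the 2x2 window size, and the tensor layout are illustrative assumptions, not the released LocalMamba code.

```python
import torch

def local_scan(x, window_size=2):
    """Reorder image tokens so tokens inside each local window are contiguous
    in the scan sequence (illustrative sketch, not the official implementation).

    x: (B, C, H, W) feature map; H and W are assumed divisible by window_size.
    Returns: (B, C, H*W) token sequence scanned window by window.
    """
    B, C, H, W = x.shape
    w = window_size
    # Split the spatial grid into (H//w) x (W//w) windows of size w x w.
    x = x.view(B, C, H // w, w, W // w, w)
    # Scan order: window row, window column, then rows/columns inside each window.
    x = x.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H * W)
    return x

def global_scan(x):
    """Plain row-major flattening used by vanilla Vision Mamba."""
    B, C, H, W = x.shape
    return x.view(B, C, H * W)

# Toy example on a 4x4 grid of token indices 0..15.
tokens = torch.arange(16).float().view(1, 1, 4, 4)
print(global_scan(tokens).long())  # 0..15 in row-major order
print(local_scan(tokens).long())   # [0, 1, 4, 5, 2, 3, 6, 7, ...]
```

In this toy example, vertically adjacent tokens such as 0 and 4 end up a full image row apart under the plain flatten, whereas the windowed scan keeps all four tokens of each 2x2 window consecutive, which is the local-dependency property the paper targets.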
This paper employs the following methods:
- Local scanning
- Scan direction search (see the sketch after this list)
- SCAttn module
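The per-layer scan-direction search described in the abstract can be illustrated with a short, hedged sketch in the spirit of differentiable architecture search: each layer holds learnable weights over a set of candidate scan directions and, after search, keeps only the top-weighted ones. The class name, candidate interface, and weighted-sum relaxation below are assumptions for illustration, not the paper's exact search procedure.

```python
import torch
import torch.nn as nn

class ScanDirectionSearch(nn.Module):
    """Hedged sketch of a per-layer scan-direction search.
    Candidate scans are plain callables; the softmax relaxation over learnable
    weights is an assumption inspired by differentiable architecture search.
    """
    def __init__(self, candidates):
        super().__init__()
        # candidates: list of callables, each mapping (B, C, H, W) -> (B, C, L)
        self.candidates = candidates
        # One learnable architecture weight per candidate scan direction.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        # During search, the layer output is a softmax-weighted sum over all
        # candidate scans; after search, only the top-weighted scans are kept.
        weights = torch.softmax(self.alpha, dim=0)
        outs = [scan(x) for scan in self.candidates]
        return sum(w * o for w, o in zip(weights, outs))

# Example usage (with the local_scan / global_scan sketches above as candidates):
# layer = ScanDirectionSearch([global_scan, lambda x: local_scan(x, 2)])
# y = layer(torch.randn(1, 8, 4, 4))   # (1, 8, 16)
```

In the paper's setting the candidate set would include horizontal, vertical, and windowed local scans (plus flipped variants), and the SCAttn module would then fuse the outputs of the selected directions; here the candidates are simply callables mapping a feature map to a token sequence.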
The following datasets were used in this research:
- ImageNet
- MSCOCO 2017
- ADE20K
The following evaluation metrics were used:
- Top-1 Accuracy
- mIoU
- Box AP
- Mask AP
The paper reports the following results:
- LocalMamba significantly outperforms Vim-Ti by 3.1% on ImageNet
- LocalVim-T achieves 76.2% accuracy with 1.5G FLOPs
- LocalVMamba-T achieves 82.7% accuracy, surpassing Swin-T by 1.4%
- LocalVim-S improves over Vim-S by 1.5 mIoU
The authors identified the following limitations:
- The computational framework of SSMs is more intricate than CNNs and ViTs, complicating efficient parallel execution
- Number of GPUs: None specified
- GPU Type: None specified
- Compute Requirements: 300 epochs with a base batch size of 1024, AdamW optimizer, cosine annealing learning rate schedule with an initial value of 1e-3 and a 20-epoch warmup
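For reference, a minimal PyTorch sketch of the reported training recipe (AdamW, base learning rate 1e-3, 20-epoch warmup, cosine annealing over 300 epochs) is shown below; the weight-decay value and the warmup/scheduler composition are assumptions, as the section above does not specify them.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer_and_scheduler(model, epochs=300, warmup_epochs=20,
                                  base_lr=1e-3, weight_decay=0.05):
    """Optimizer/schedule sketch matching the reported recipe.
    weight_decay is an assumed value; the warmup start factor is also assumed.
    """
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    # 20-epoch linear warmup followed by cosine annealing for the remaining epochs.
    warmup = LinearLR(optimizer, start_factor=1e-2, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler
```

The scheduler would typically be stepped once per epoch alongside a standard ImageNet training loop.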