Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
This paper introduces LocalMamba, a visual state space model designed to enhance the processing of visual data. By addressing limitations in traditional sequence modeling approaches, LocalMamba employs a novel local scanning technique that involves segmenting images into distinct windows. This method preserves local 2D dependencies crucial for image interpretation, thus improving performance in vision tasks. The authors present a dynamic method to optimize scan patterns for different layers, leading to improved modeling of local features while maintaining a global perspective. The study reports experimental results demonstrating LocalMamba's superior accuracy compared to existing models on tasks such as image classification, object detection, and semantic segmentation, particularly outperforming Vim and VMamba on benchmarks. Key contributions include a new scanning methodology, an adaptive scan direction search algorithm, and architecture variants for different modeling needs, establishing a foundation for future advancements in visual state space modeling.
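To make the local scanning idea concrete, below is a minimal PyTorch-style sketch of how tokens could be reordered so that each window's tokens stay adjacent in the scan sequence. The function names, the 2x2 window size, and the tensor layout are illustrative assumptions, not the released LocalMamba code.

```python
import torch

def local_scan(x, window_size=2):
    """Reorder image tokens so tokens inside each local window are contiguous
    in the scan sequence (illustrative sketch, not the official implementation).

    x: (B, C, H, W) feature map; H and W are assumed divisible by window_size.
    Returns: (B, C, H*W) token sequence scanned window by window.
    """
    B, C, H, W = x.shape
    w = window_size
    # Split the spatial grid into (H//w) x (W//w) windows of size w x w.
    x = x.view(B, C, H // w, w, W // w, w)
    # Scan order: window row, window column, then rows/columns inside each window.
    x = x.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H * W)
    return x

def global_scan(x):
    """Plain row-major flattening used by vanilla Vision Mamba."""
    B, C, H, W = x.shape
    return x.view(B, C, H * W)

# Toy example on a 4x4 grid of token indices 0..15.
tokens = torch.arange(16).float().view(1, 1, 4, 4)
print(global_scan(tokens).long())  # 0..15 in row-major order
print(local_scan(tokens).long())   # [0, 1, 4, 5, 2, 3, 6, 7, ...]
```

In this toy example, vertically adjacent tokens such as 0 and 4 end up a full image row apart under the plain flatten, whereas the windowed scan keeps all four tokens of each 2x2 window consecutive, which is the local-dependency property the paper targets.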
This paper employs the following methods:
- Local scanning
- Scan direction search (see the sketch after this list)
- SCAttn module
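The per-layer scan-direction search described in the abstract can be illustrated with a short, hedged sketch in the spirit of differentiable architecture search: each layer holds learnable weights over a set of candidate scan directions and, after search, keeps only the top-weighted ones. The class name, candidate interface, and weighted-sum relaxation below are assumptions for illustration, not the paper's exact search procedure.

```python
import torch
import torch.nn as nn

class ScanDirectionSearch(nn.Module):
    """Hedged sketch of a per-layer scan-direction search.
    Candidate scans are plain callables; the softmax relaxation over learnable
    weights is an assumption inspired by differentiable architecture search.
    """
    def __init__(self, candidates):
        super().__init__()
        # candidates: list of callables, each mapping (B, C, H, W) -> (B, C, L)
        self.candidates = candidates
        # One learnable architecture weight per candidate scan direction.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        # During search, the layer output is a softmax-weighted sum over all
        # candidate scans; after search, only the top-weighted scans are kept.
        weights = torch.softmax(self.alpha, dim=0)
        outs = [scan(x) for scan in self.candidates]
        return sum(w * o for w, o in zip(weights, outs))

# Example usage (with the local_scan / global_scan sketches above as candidates):
# layer = ScanDirectionSearch([global_scan, lambda x: local_scan(x, 2)])
# y = layer(torch.randn(1, 8, 4, 4))   # (1, 8, 16)
```

In the paper's setting the candidate set would include horizontal, vertical, and windowed local scans (plus flipped variants), and the SCAttn module would then fuse the outputs of the selected directions; here the candidates are simply callables mapping a feature map to a token sequence.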
The following datasets were used in this research:
- ImageNet
- MSCOCO 2017
- ADE20K
The following evaluation metrics were used:
- Top-1 Accuracy
- mIoU
- Box AP
- Mask AP
The paper reports the following results:
- LocalMamba significantly outperforms Vim-Ti by 3.1% on ImageNet
- LocalVim-T achieves 76.2% accuracy with 1.5G FLOPs
- LocalVMamba-T achieves 82.7% accuracy, surpassing Swin-T by 1.4%
- LocalVim-S improves over Vim-S by 1.5 mIoU
The authors identified the following limitations:
- The computational framework of SSMs is more intricate than CNNs and ViTs, complicating efficient parallel execution
- Number of GPUs: None specified
- GPU Type: None specified
- Compute Requirements: 300 epochs with a base batch size of 1024, AdamW optimizer, cosine annealing learning rate schedule with an initial value of 1e-3 and a 20-epoch warmup
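For reference, a minimal PyTorch sketch of the reported training recipe (AdamW, base learning rate 1e-3, 20-epoch warmup, cosine annealing over 300 epochs) is shown below; the weight-decay value and the warmup/scheduler composition are assumptions, as the section above does not specify them.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer_and_scheduler(model, epochs=300, warmup_epochs=20,
                                  base_lr=1e-3, weight_decay=0.05):
    """Optimizer/schedule sketch matching the reported recipe.
    weight_decay is an assumed value; the warmup start factor is also assumed.
    """
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    # 20-epoch linear warmup followed by cosine annealing for the remaining epochs.
    warmup = LinearLR(optimizer, start_factor=1e-2, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler
```

The scheduler would typically be stepped once per epoch alongside a standard ImageNet training loop.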