
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Lianghui Zhu (School of EIC, Huazhong University of Science & Technology), Bencheng Liao (School of EIC and Institute of Artificial Intelligence, Huazhong University of Science & Technology), Qian Zhang (Horizon Robotics), Xinlong Wang <[email protected]> (Beijing Academy of Artificial Intelligence), Wenyu Liu (School of EIC, Huazhong University of Science & Technology), Xinggang Wang (School of EIC, Huazhong University of Science & Technology) (2024)

Paper Information
  • arXiv ID: 2401.09417
  • Venue: International Conference on Machine Learning
  • Domain: Computer vision
  • SOTA Claim: Yes
  • Code: https://github.com/hustvl/Vim
  • Reproducibility: 8/10

Abstract

On ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images, and it has great potential to be the next-generation backbone for vision foundation models. Code and models are released at https://github.com/hustvl/Vim.

Summary

This paper introduces Vision Mamba (Vim), a new visual backbone that uses bidirectional state space models (SSMs) for efficient visual representation learning. The authors argue that reliance on self-attention is unnecessary for visual representation learning and present Vim, which splits an image into a sequence of patch tokens, marks them with position embeddings, and processes them with bidirectional SSMs to capture global visual context at low computational cost. Vim shows superior performance on image classification (ImageNet), semantic segmentation (ADE20K), and object detection (COCO). The key contributions include significant gains in speed and memory efficiency over established models such as DeiT; the paper reports that Vim is 2.8× faster and saves 86.8% GPU memory when processing high-resolution images. The authors position Vim as a potential next-generation vision backbone suitable for various downstream tasks and emphasize its adaptability to unsupervised learning and multimodal scenarios.
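
As a rough illustration of the pipeline described above, here is a minimal PyTorch sketch (not the authors' released implementation; the class names `TinyVim`, `BidirectionalBlock`, and `SimpleSSM` are hypothetical): an image is split into patch tokens, position embeddings are added, and each block scans the token sequence forward and backward, with a toy diagonal SSM standing in for the Mamba layer.

```python
# Minimal sketch of a Vim-style bidirectional backbone (illustrative only).
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t (per channel)."""

    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))    # per-channel decay parameter
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                              # x: (B, L, D)
        a = torch.sigmoid(self.log_a)                  # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.size(1)):                     # sequential scan over tokens
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)


class BidirectionalBlock(nn.Module):
    """Scan the token sequence forward and backward, then sum the two results."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = SimpleSSM(dim)
        self.bwd = SimpleSSM(dim)

    def forward(self, x):
        z = self.norm(x)
        out_fwd = self.fwd(z)
        out_bwd = self.bwd(z.flip(1)).flip(1)          # backward scan on the reversed sequence
        return x + out_fwd + out_bwd                   # residual connection


class TinyVim(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=2, num_classes=1000):
        super().__init__()
        n_tokens = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.blocks = nn.Sequential(*[BidirectionalBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                            # img: (B, 3, H, W)
        x = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, L, D) patch tokens
        x = x + self.pos_embed                         # position embeddings mark the tokens
        x = self.blocks(x)
        return self.head(x.mean(dim=1))                # mean-pooled classification


logits = TinyVim()(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
print(logits.shape)
```

The released Vim block is considerably richer (it builds on Mamba's gated, convolution-augmented design), but the patch-token pipeline and the bidirectional-scan-plus-residual structure above are the parts this sketch aims to convey.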

Methods

This paper employs the following methods (a minimal sketch of the underlying SSM recurrence follows the list):

  • Bidirectional State Space Model (SSM)
  • Mamba
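
For context, the core of an SSM is a discretized linear recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C * h_t. The sketch below is a generic, minimal illustration of that recurrence with a diagonal state matrix and a zero-order-hold-style discretization; it is not Mamba's hardware-aware selective-scan kernel, and all names and shapes are illustrative assumptions.

```python
# Generic discretized SSM recurrence with a diagonal state matrix (illustrative only;
# not the fused, hardware-aware selective scan used by Mamba).
import numpy as np

def ssm_scan(x, A_diag, B, C, delta):
    """
    x:      (L, d) input sequence
    A_diag: (n,)   diagonal of the continuous state matrix (negative for stability)
    B:      (n, d) input matrix
    C:      (d, n) output matrix
    delta:  float  discretization step
    Discretization (zero-order-hold style):
        A_bar = exp(delta * A),  B_bar ~= delta * B
        h_t   = A_bar * h_{t-1} + B_bar @ x_t,   y_t = C @ h_t
    """
    A_bar = np.exp(delta * A_diag)           # elementwise, since A is diagonal
    B_bar = delta * B
    h = np.zeros_like(A_diag)
    ys = []
    for x_t in x:                            # sequential scan over the sequence
        h = A_bar * h + B_bar @ x_t
        ys.append(C @ h)
    return np.stack(ys)

L, d, n = 8, 4, 16
y = ssm_scan(np.random.randn(L, d),
             A_diag=-np.linspace(0.5, 2.0, n),
             B=np.random.randn(n, d),
             C=np.random.randn(d, n),
             delta=0.1)
print(y.shape)                               # (8, 4)
```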

Models Used

  • Vision Mamba (Vim)
  • DeiT

Datasets

The following datasets were used in this research:

  • ImageNet
  • COCO
  • ADE20K

Evaluation Metrics

  • Accuracy
  • mIoU (see the computation sketch after this list)
  • AP
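
Below is a generic sketch of how two of these metrics are commonly computed (top-1 accuracy and mIoU). It is not the paper's evaluation pipeline; standard mIoU implementations accumulate intersections and unions over the whole dataset rather than per image as in this toy version.

```python
# Generic metric sketches (not the paper's evaluation code).
import numpy as np

def top1_accuracy(pred_labels, gt_labels):
    """Fraction of samples whose predicted class matches the ground truth."""
    return float(np.mean(pred_labels == gt_labels))

def mean_iou(pred_mask, gt_mask, num_classes):
    """Mean intersection-over-union over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred_mask == c, gt_mask == c).sum()
        union = np.logical_or(pred_mask == c, gt_mask == c).sum()
        if union > 0:                        # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(64, 64))
gt = np.random.randint(0, 3, size=(64, 64))
print(top1_accuracy(pred.ravel(), gt.ravel()), mean_iou(pred, gt, num_classes=3))
```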

Results

  • Vim outperforms DeiT on ImageNet classification.
  • Vim achieves superior performance on ADE20K semantic segmentation and COCO object detection.
  • Vim is 2.8× faster than DeiT with 86.8% GPU memory savings when processing high-resolution images (a measurement sketch follows this list).
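
The efficiency claim refers to batch inference for feature extraction on 1248×1248 images. The sketch below shows one way such throughput and peak-memory numbers can be measured in PyTorch; the batch size, iteration count, and the `vim_model` / `deit_model` placeholders are illustrative assumptions, not the paper's benchmarking protocol.

```python
# Hedged sketch: measure peak GPU memory and throughput of batch inference
# at 1248x1248 resolution. Batch size and model choice are illustrative.
import time
import torch

@torch.no_grad()
def benchmark(model, batch_size=4, resolution=1248, iters=10, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    for _ in range(2):                       # warm-up to exclude one-time allocator effects
        model(x)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    imgs_per_sec = batch_size * iters / elapsed
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return imgs_per_sec, peak_mem_gb

# Usage (assuming `vim_model` and `deit_model` are instantiated backbones):
# for name, m in [("Vim", vim_model), ("DeiT", deit_model)]:
#     ips, mem = benchmark(m)
#     print(f"{name}: {ips:.1f} img/s, peak memory {mem:.2f} GB")
```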

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A800

Keywords

state space models, vision transformers, bidirectional SSM, visual representation learning, efficient backbone
