
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Lianghui Zhu (School of EIC, Huazhong University of Science & Technology), Bencheng Liao (School of EIC and Institute of Artificial Intelligence, Huazhong University of Science & Technology), Qian Zhang (Horizon Robotics), Xinlong Wang <[email protected]> (Beijing Academy of Artificial Intelligence), Wenyu Liu (School of EIC, Huazhong University of Science & Technology), Xinggang Wang (School of EIC, Huazhong University of Science & Technology) (2024)

Paper Information
  • arXiv ID: 2401.09417
  • Venue: International Conference on Machine Learning
  • Domain: Computer vision
  • SOTA Claim: Yes
  • Code: https://github.com/hustvl/Vim
  • Reproducibility: 8/10

Abstract

On ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images, and it has great potential to be the next-generation backbone for vision foundation models. Code and models are released at https://github.com/hustvl/Vim.

Summary

This paper introduces Vision Mamba (Vim), a new visual backbone that uses bidirectional state space models (SSMs) for efficient visual representation learning. The authors argue that reliance on self-attention is unnecessary for visual representation learning and present Vim, which splits an image into a sequence of patch tokens, marks them with position embeddings, and processes them with bidirectional SSMs to capture global visual context at low computational cost. Vim shows superior performance on image classification (ImageNet), semantic segmentation (ADE20K), and object detection (COCO). The key contributions include significant gains in speed and memory efficiency over established models such as DeiT; the paper reports that Vim is 2.8× faster and saves 86.8% GPU memory when processing high-resolution images. The authors position Vim as a potential next-generation vision backbone suitable for various downstream tasks and emphasize its adaptability to unsupervised learning and multimodal scenarios.
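
As a rough illustration of the pipeline described above, here is a minimal PyTorch sketch (not the authors' released implementation; the class names `TinyVim`, `BidirectionalBlock`, and `SimpleSSM` are hypothetical): an image is split into patch tokens, position embeddings are added, and each block scans the token sequence forward and backward, with a toy diagonal SSM standing in for the Mamba layer.

```python
# Minimal sketch of a Vim-style bidirectional backbone (illustrative only).
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t (per channel)."""

    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))    # per-channel decay parameter
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                              # x: (B, L, D)
        a = torch.sigmoid(self.log_a)                  # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.size(1)):                     # sequential scan over tokens
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)


class BidirectionalBlock(nn.Module):
    """Scan the token sequence forward and backward, then sum the two results."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = SimpleSSM(dim)
        self.bwd = SimpleSSM(dim)

    def forward(self, x):
        z = self.norm(x)
        out_fwd = self.fwd(z)
        out_bwd = self.bwd(z.flip(1)).flip(1)          # backward scan on the reversed sequence
        return x + out_fwd + out_bwd                   # residual connection


class TinyVim(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=2, num_classes=1000):
        super().__init__()
        n_tokens = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.blocks = nn.Sequential(*[BidirectionalBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                            # img: (B, 3, H, W)
        x = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, L, D) patch tokens
        x = x + self.pos_embed                         # position embeddings mark the tokens
        x = self.blocks(x)
        return self.head(x.mean(dim=1))                # mean-pooled classification


logits = TinyVim()(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
print(logits.shape)
```

The released Vim block is considerably richer (it builds on Mamba's gated, convolution-augmented design), but the patch-token pipeline and the bidirectional-scan-plus-residual structure above are the parts this sketch aims to convey.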

Methods

This paper employs the following methods (a minimal sketch of the underlying SSM recurrence follows the list):

  • Bidirectional State Space Model (SSM)
  • Mamba
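
For context, the core of an SSM is a discretized linear recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C * h_t. The sketch below is a generic, minimal illustration of that recurrence with a diagonal state matrix and a zero-order-hold-style discretization; it is not Mamba's hardware-aware selective-scan kernel, and all names and shapes are illustrative assumptions.

```python
# Generic discretized SSM recurrence with a diagonal state matrix (illustrative only;
# not the fused, hardware-aware selective scan used by Mamba).
import numpy as np

def ssm_scan(x, A_diag, B, C, delta):
    """
    x:      (L, d) input sequence
    A_diag: (n,)   diagonal of the continuous state matrix (negative for stability)
    B:      (n, d) input matrix
    C:      (d, n) output matrix
    delta:  float  discretization step
    Discretization (zero-order-hold style):
        A_bar = exp(delta * A),  B_bar ~= delta * B
        h_t   = A_bar * h_{t-1} + B_bar @ x_t,   y_t = C @ h_t
    """
    A_bar = np.exp(delta * A_diag)           # elementwise, since A is diagonal
    B_bar = delta * B
    h = np.zeros_like(A_diag)
    ys = []
    for x_t in x:                            # sequential scan over the sequence
        h = A_bar * h + B_bar @ x_t
        ys.append(C @ h)
    return np.stack(ys)

L, d, n = 8, 4, 16
y = ssm_scan(np.random.randn(L, d),
             A_diag=-np.linspace(0.5, 2.0, n),
             B=np.random.randn(n, d),
             C=np.random.randn(d, n),
             delta=0.1)
print(y.shape)                               # (8, 4)
```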

Models Used

  • Vision Mamba (Vim)
  • DeiT

Datasets

The following datasets were used in this research:

  • ImageNet
  • COCO
  • ADE20K

Evaluation Metrics

  • Accuracy
  • mIoU (see the computation sketch after this list)
  • AP
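
Below is a generic sketch of how two of these metrics are commonly computed (top-1 accuracy and mIoU). It is not the paper's evaluation pipeline; standard mIoU implementations accumulate intersections and unions over the whole dataset rather than per image as in this toy version.

```python
# Generic metric sketches (not the paper's evaluation code).
import numpy as np

def top1_accuracy(pred_labels, gt_labels):
    """Fraction of samples whose predicted class matches the ground truth."""
    return float(np.mean(pred_labels == gt_labels))

def mean_iou(pred_mask, gt_mask, num_classes):
    """Mean intersection-over-union over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred_mask == c, gt_mask == c).sum()
        union = np.logical_or(pred_mask == c, gt_mask == c).sum()
        if union > 0:                        # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(64, 64))
gt = np.random.randint(0, 3, size=(64, 64))
print(top1_accuracy(pred.ravel(), gt.ravel()), mean_iou(pred, gt, num_classes=3))
```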

Results

  • Vim outperforms DeiT on ImageNet classification.
  • Vim achieves superior performance on ADE20K semantic segmentation and COCO object detection.
  • Vim is 2.8× faster than DeiT with 86.8% GPU memory savings when processing high-resolution images (a measurement sketch follows this list).
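
The efficiency claim refers to batch inference for feature extraction on 1248×1248 images. The sketch below shows one way such throughput and peak-memory numbers can be measured in PyTorch; the batch size, iteration count, and the `vim_model` / `deit_model` placeholders are illustrative assumptions, not the paper's benchmarking protocol.

```python
# Hedged sketch: measure peak GPU memory and throughput of batch inference
# at 1248x1248 resolution. Batch size and model choice are illustrative.
import time
import torch

@torch.no_grad()
def benchmark(model, batch_size=4, resolution=1248, iters=10, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    for _ in range(2):                       # warm-up to exclude one-time allocator effects
        model(x)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    imgs_per_sec = batch_size * iters / elapsed
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return imgs_per_sec, peak_mem_gb

# Usage (assuming `vim_model` and `deit_model` are instantiated backbones):
# for name, m in [("Vim", vim_model), ("DeiT", deit_model)]:
#     ips, mem = benchmark(m)
#     print(f"{name}: {ips:.1f} img/s, peak memory {mem:.2f} GB")
```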

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A800

Keywords

state space models, vision transformers, bidirectional SSM, visual representation learning, efficient backbone
