Lianghui Zhu (School of EIC, Huazhong University of Science & Technology), Bencheng Liao (School of EIC and Institute of Artificial Intelligence, Huazhong University of Science & Technology), Qian Zhang (Horizon Robotics), Xinlong Wang <[email protected]> (Beijing Academy of Artificial Intelligence), Wenyu Liu (School of EIC, Huazhong University of Science & Technology), Xinggang Wang (School of EIC, Huazhong University of Science & Technology) (2024)
This paper introduces Vision Mamba (Vim), a new visual backbone that uses bidirectional state space models (SSMs) for efficient visual representation learning. The authors argue that self-attention is not necessary for strong visual representation learning and present Vim, which integrates bidirectional SSMs to capture global visual context while keeping computational and memory costs low. Vim achieves strong performance on image classification (ImageNet), semantic segmentation (ADE20K), and object detection (COCO). Key contributions include significant gains in speed and memory efficiency over established models such as DeiT: Vim is reported to be 2.8× faster while saving 86.8% GPU memory when extracting features from high-resolution images. The paper positions Vim as a potential next-generation vision backbone for a range of downstream tasks and emphasizes its suitability for unsupervised learning and multimodal scenarios.
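For intuition, the sketch below shows how a bidirectional SSM block can process a sequence of image patch tokens: the same sequence is scanned forward and backward, and the two outputs are merged into a residual update. This is a minimal illustrative sketch only; the class names (SimpleSSM, BiSSMBlock), the toy diagonal recurrence, and all hyperparameters are assumptions made here for clarity, not the paper's implementation, which builds on Mamba-style selective scans with position embeddings and hardware-aware kernels.

```python
# Minimal PyTorch sketch of a bidirectional SSM block over patch tokens.
# Illustrative only: names, the toy diagonal recurrence, and sizes are
# assumptions, not the Vim paper's actual architecture or kernels.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal state space layer applied as a recurrence over a token sequence."""

    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, state_dim) * 0.01)  # state transition (log-space)
        self.B = nn.Parameter(torch.randn(dim, state_dim) * 0.01)  # input projection
        self.C = nn.Parameter(torch.randn(dim, state_dim) * 0.01)  # output projection

    def forward(self, x):  # x: (batch, length, dim)
        B_, L, D = x.shape
        A = -torch.exp(self.A)                    # negative values keep the recurrence stable
        h = x.new_zeros(B_, D, self.A.shape[1])   # hidden state per channel
        ys = []
        for t in range(L):                        # sequential scan over tokens
            h = h * torch.exp(A) + x[:, t, :, None] * self.B
            ys.append((h * self.C).sum(-1))
        return torch.stack(ys, dim=1)


class BiSSMBlock(nn.Module):
    """Scans the token sequence forward and backward, then merges both directions."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd_ssm = SimpleSSM(dim)
        self.bwd_ssm = SimpleSSM(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, num_patches, dim)
        residual = x
        x = self.norm(x)
        y_fwd = self.fwd_ssm(x)                   # forward scan
        y_bwd = self.bwd_ssm(x.flip(1)).flip(1)   # backward scan
        return residual + self.proj(y_fwd + y_bwd)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 192)  # e.g. 14x14 patches, embedding dim 192
    block = BiSSMBlock(192)
    print(block(tokens).shape)         # torch.Size([2, 196, 192])
```

In practice the sequential Python loop would be replaced by a parallel, hardware-aware scan; that subquadratic scan over tokens, rather than quadratic self-attention, is what the paper credits for Vim's speed and memory advantage at high resolution.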
This paper employs the following methods: bidirectional state space models (SSMs) within the Vision Mamba (Vim) backbone.
The following datasets were used in this research: ImageNet (image classification), ADE20K (semantic segmentation), and COCO (object detection).
The authors identified the following limitations: