
SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer, Meta FAIR (2024)

Paper Information

  • arXiv ID: 2408.00714
  • Venue: arXiv.org
  • Domain: computer vision
  • SOTA Claim: Yes
  • Code: released by the authors (per the abstract)
  • Reproducibility: 8/10

Abstract

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.
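
Since the abstract notes that the model and code are released, a rough illustration of the promptable interface (a single point click prompting a mask on one image) is sketched below. It follows the interface style of the public sam2 repository; the config name, checkpoint path, and image file are placeholders, and exact module paths may differ between releases.

```python
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths: use whichever config/checkpoint pair you downloaded.
model_cfg = "sam2_hiera_l.yaml"
checkpoint = "./checkpoints/sam2_hiera_large.pt"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # One positive point click (x, y) as the prompt.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )

print(masks.shape, scores)  # e.g. several candidate masks with predicted IoU scores
```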

Summary

The paper introduces Segment Anything Model 2 (SAM 2), a foundation model for promptable visual segmentation in both images and videos. A data engine that improves the model and data through user interaction was used to collect the largest video segmentation dataset to date, SA-V, comprising 35.5 million masks across 50,900 videos. SAM 2 uses a simple transformer architecture with streaming memory for real-time video processing and shows improved accuracy and speed over its predecessor, SAM. In video segmentation it reaches better accuracy with 3× fewer interactions than prior approaches, and in image segmentation it is more accurate and 6× faster than SAM. The model, dataset, and training code are made publicly available to support future research in this area.
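
The "streaming memory" idea can be pictured as each new frame's features cross-attending to a small FIFO bank of features from recently processed frames, so the current frame is conditioned on past predictions without reprocessing the whole video. The sketch below is a toy illustration of that pattern only, not the authors' implementation; SAM 2's actual memory attention stacks self- and cross-attention blocks and also stores object pointers, and all names here (StreamingMemoryAttention, max_memories, etc.) are invented for illustration.

```python
import torch
import torch.nn as nn


class StreamingMemoryAttention(nn.Module):
    """Toy cross-attention over a rolling FIFO bank of past-frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 8, max_memories: int = 6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.max_memories = max_memories
        self.memory_bank: list[torch.Tensor] = []  # features of recent frames

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_tokens, dim) features of the current frame.
        if self.memory_bank:
            memory = torch.cat(self.memory_bank, dim=1)
            attended, _ = self.cross_attn(frame_tokens, memory, memory)
            frame_tokens = self.norm(frame_tokens + attended)
        # Store the conditioned frame features as a new memory, dropping the oldest.
        self.memory_bank.append(frame_tokens.detach())
        if len(self.memory_bank) > self.max_memories:
            self.memory_bank.pop(0)
        return frame_tokens


# Stream frames one at a time, as a real-time video predictor would.
mem_attn = StreamingMemoryAttention()
for _ in range(4):                            # pretend video of 4 frames
    tokens = torch.randn(1, 32 * 32, 256)     # hypothetical per-frame feature tokens
    out = mem_attn(tokens)
print(out.shape)  # torch.Size([1, 1024, 256])
```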

Methods

This paper employs the following methods:

  • Transformer
  • Memory Attention

Models Used

  • SAM 2

Datasets

The following datasets were used in this research:

  • SA-V

Evaluation Metrics

  • J&F (region similarity J and contour accuracy F, averaged; the region term is sketched below)
  • mIoU (mean intersection over union)
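
For reference: J is region similarity (mask IoU), F is contour accuracy (a boundary precision/recall F-measure), and J&F reports their average; mIoU averages IoU over a set of masks. The sketch below covers only the region terms (the boundary F-measure is omitted for brevity), and the function names are illustrative.

```python
import numpy as np


def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0


def mean_iou(preds: list, gts: list) -> float:
    """mIoU: IoU averaged over predicted / ground-truth mask pairs."""
    return float(np.mean([region_similarity_j(p, g) for p, g in zip(preds, gts)]))


# Toy check on two 4x4 masks whose foregrounds overlap on one row.
pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True   # top two rows
gt = np.zeros((4, 4), dtype=bool); gt[1:3, :] = True      # middle two rows
print(region_similarity_j(pred, gt))  # 4 / 12 = 0.333...
```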

Results

  • Better accuracy in video segmentation than prior approaches, achieved with 3× fewer interactions
  • More accurate and 6× faster image segmentation than its predecessor, SAM

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 256
  • GPU Type: NVIDIA A100

Keywords

video segmentation, image segmentation, promptable models, dataset, transformer architecture
