
SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer, Meta FAIR (2024)

Paper Information

  • arXiv ID: 2408.00714
  • Venue: arXiv.org
  • Domain: computer vision
  • SOTA Claim: Yes
  • Code: released by the authors (per the abstract)
  • Reproducibility: 8/10

Abstract

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.
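
Since the abstract notes that the model and code are released, a rough illustration of the promptable interface (a single point click prompting a mask on one image) is sketched below. It follows the interface style of the public sam2 repository; the config name, checkpoint path, and image file are placeholders, and exact module paths may differ between releases.

```python
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths: use whichever config/checkpoint pair you downloaded.
model_cfg = "sam2_hiera_l.yaml"
checkpoint = "./checkpoints/sam2_hiera_large.pt"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # One positive point click (x, y) as the prompt.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )

print(masks.shape, scores)  # e.g. several candidate masks with predicted IoU scores
```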

Summary

The paper introduces Segment Anything Model 2 (SAM 2), a foundation model for promptable visual segmentation in both images and videos. A data engine that improves the model and data through user interaction was used to collect the largest video segmentation dataset to date, SA-V, comprising 35.5 million masks across 50,900 videos. SAM 2 uses a simple transformer architecture with streaming memory for real-time video processing and shows improved accuracy and speed over its predecessor, SAM. In video segmentation it reaches better accuracy with 3× fewer interactions than prior approaches, and in image segmentation it is more accurate and 6× faster than SAM. The model, dataset, and training code are made publicly available to support future research in this area.
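
The "streaming memory" idea can be pictured as each new frame's features cross-attending to a small FIFO bank of features from recently processed frames, so the current frame is conditioned on past predictions without reprocessing the whole video. The sketch below is a toy illustration of that pattern only, not the authors' implementation; SAM 2's actual memory attention stacks self- and cross-attention blocks and also stores object pointers, and all names here (StreamingMemoryAttention, max_memories, etc.) are invented for illustration.

```python
import torch
import torch.nn as nn


class StreamingMemoryAttention(nn.Module):
    """Toy cross-attention over a rolling FIFO bank of past-frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 8, max_memories: int = 6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.max_memories = max_memories
        self.memory_bank: list[torch.Tensor] = []  # features of recent frames

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_tokens, dim) features of the current frame.
        if self.memory_bank:
            memory = torch.cat(self.memory_bank, dim=1)
            attended, _ = self.cross_attn(frame_tokens, memory, memory)
            frame_tokens = self.norm(frame_tokens + attended)
        # Store the conditioned frame features as a new memory, dropping the oldest.
        self.memory_bank.append(frame_tokens.detach())
        if len(self.memory_bank) > self.max_memories:
            self.memory_bank.pop(0)
        return frame_tokens


# Stream frames one at a time, as a real-time video predictor would.
mem_attn = StreamingMemoryAttention()
for _ in range(4):                            # pretend video of 4 frames
    tokens = torch.randn(1, 32 * 32, 256)     # hypothetical per-frame feature tokens
    out = mem_attn(tokens)
print(out.shape)  # torch.Size([1, 1024, 256])
```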

Methods

This paper employs the following methods:

  • Transformer
  • Memory Attention

Models Used

  • SAM 2

Datasets

The following datasets were used in this research:

  • SA-V

Evaluation Metrics

  • J&F (region similarity J and contour accuracy F, averaged; the region term is sketched below)
  • mIoU (mean intersection over union)
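
For reference: J is region similarity (mask IoU), F is contour accuracy (a boundary precision/recall F-measure), and J&F reports their average; mIoU averages IoU over a set of masks. The sketch below covers only the region terms (the boundary F-measure is omitted for brevity), and the function names are illustrative.

```python
import numpy as np


def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0


def mean_iou(preds: list, gts: list) -> float:
    """mIoU: IoU averaged over predicted / ground-truth mask pairs."""
    return float(np.mean([region_similarity_j(p, g) for p, g in zip(preds, gts)]))


# Toy check on two 4x4 masks whose foregrounds overlap on one row.
pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True   # top two rows
gt = np.zeros((4, 4), dtype=bool); gt[1:3, :] = True      # middle two rows
print(region_similarity_j(pred, gt))  # 4 / 12 = 0.333...
```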

Results

  • Better accuracy in video segmentation than prior approaches, achieved with 3× fewer interactions
  • More accurate and 6× faster image segmentation than its predecessor, SAM

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 256
  • GPU Type: NVIDIA A100

Keywords

video segmentation, image segmentation, promptable models, dataset, transformer architecture
