Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer, Meta FAIR (2024)
The paper introduces Segment Anything Model 2 (SAM 2), a foundation model for promptable visual segmentation in both images and videos. The authors built a data engine to collect the largest video segmentation dataset to date, comprising 35.5 million masks across 50,900 videos. SAM 2 itself is a transformer architecture with streaming memory that processes video in real time, and it improves on its predecessor in both segmentation accuracy and speed. In video segmentation, it reaches better accuracy while requiring fewer interactions than prior methods, making it a significant contribution to video segmentation research. The dataset and model are publicly available to support future research in this area.
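Because the model and code are released publicly (github.com/facebookresearch/sam2), interactive video segmentation can be sketched roughly as below. This is a minimal sketch, assuming the video-predictor API from the released repository; the function names (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video) and the config/checkpoint paths should be checked against the current codebase rather than taken as definitive.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint paths are assumptions; adjust to your local setup.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode():
    # Initialize streaming inference state over a directory of video frames.
    state = predictor.init_state(video_path="./video_frames")

    # Prompt object 1 on frame 0 with a single positive click at (x, y).
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive click
    )

    # Propagate the prompted mask through the video using the memory bank.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```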
This paper employs the following methods: promptable visual segmentation with point, box, and mask prompts; a transformer architecture with streaming memory for real-time video processing (a conceptual sketch follows); and an interactive, model-in-the-loop data engine for collecting video segmentation annotations.
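The streaming-memory idea summarized above can be illustrated with a toy cross-attention block: features of the current frame attend to a bounded bank of features from previously processed frames, so per-frame cost stays constant as the video streams in. This is an illustrative sketch only; the layer sizes, the FIFO memory policy, and the class name StreamingMemoryAttention are assumptions for exposition, not the authors' actual module.

```python
import torch
import torch.nn as nn

class StreamingMemoryAttention(nn.Module):
    """Toy streaming-memory block: current-frame tokens cross-attend
    to a bounded memory bank of tokens from earlier frames."""

    def __init__(self, dim=256, num_heads=8, max_memories=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.max_memories = max_memories
        self.memory = []  # list of (batch, tokens, dim) tensors from past frames

    def forward(self, frame_tokens):
        # frame_tokens: (batch, tokens, dim) features of the current frame.
        if self.memory:
            mem = torch.cat(self.memory, dim=1)        # (batch, M * tokens, dim)
            attended, _ = self.attn(frame_tokens, mem, mem)
            frame_tokens = self.norm(frame_tokens + attended)
        # Store a detached copy and keep the bank bounded (FIFO), so the
        # per-frame cost does not grow with video length.
        self.memory.append(frame_tokens.detach())
        if len(self.memory) > self.max_memories:
            self.memory.pop(0)
        return frame_tokens

# Example: process a short stream of per-frame feature maps.
block = StreamingMemoryAttention()
for _ in range(4):
    tokens = torch.randn(1, 4096, 256)  # e.g. a 64x64 feature map, flattened
    out = block(tokens)                 # (1, 4096, 256)
```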
The following datasets were used in this research: SA-V, the video segmentation dataset collected with the paper's data engine (35.5 million masks across 50,900 videos), and the SA-1B image segmentation dataset from the original SAM for image-level training.
The authors identified the following limitations: the model can lose track of objects across shot changes, after long occlusions, in crowded scenes with similar-looking objects, or in very long videos; it can miss fine details of fast-moving objects; and multiple objects are segmented independently, without communication between object predictions.