Mask R-CNN

Kaiming He Facebook AI Research (FAIR), Georgia Gkioxari Facebook AI Research (FAIR), Piotr Dollár Facebook AI Research (FAIR), Ross Girshick Facebook AI Research (FAIR) (2017)

Paper Information
arXiv ID
1703.06870
Domain
Artificial Intelligence, Computer Vision
SOTA Claim
Yes
Code
Reproducibility
8/10

Abstract

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.
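The parallel mask branch mentioned in the abstract can be illustrated with a small sketch of a per-RoI mask loss. In the paper, the total loss on each RoI is L = L_cls + L_box + L_mask, the mask branch outputs one m x m mask per class, and L_mask is the average per-pixel binary cross-entropy computed only on the mask of the ground-truth class. The function and variable names below are hypothetical, and plain Python lists stand in for tensors; this is an illustration under those assumptions, not the authors' code.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class's mask.

    mask_logits: list of K masks (one per class), each an m x m list of logits.
    gt_mask:     m x m list of 0/1 targets.
    gt_class:    index k of the ground-truth class; the other K-1 masks
                 do not contribute to the loss.
    (All names here are hypothetical.)
    """
    logits = mask_logits[gt_class]
    total, count = 0.0, 0
    for row_logits, row_gt in zip(logits, gt_mask):
        for z, y in zip(row_logits, row_gt):
            # Clamp the probability so log() never sees exactly 0.
            p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# Toy example: K = 2 classes, 2 x 2 masks.
toy_logits = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0: uninformative logits
    [[4.0, -4.0], [4.0, -4.0]],  # class 1: confident and correct
]
toy_gt = [[1, 0], [1, 0]]
loss_class0 = mask_loss(toy_logits, toy_gt, gt_class=0)
loss_class1 = mask_loss(toy_logits, toy_gt, gt_class=1)
```

Because only the ground-truth class's mask enters the loss, the masks do not compete across classes; classification is left to the box branch.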

Summary

The paper introduces Mask R-CNN, a framework for object instance segmentation that extends Faster R-CNN with a mask prediction branch running in parallel with the existing bounding-box branch. It produces a high-quality segmentation mask for each detected instance while adding only a small overhead to Faster R-CNN, running at 5 fps. The authors report top results in all three tracks of the COCO challenges: instance segmentation, bounding-box object detection, and person keypoint detection. The paper stresses that mask prediction needs precise pixel-level alignment and introduces the RoIAlign layer, which avoids the coarse spatial quantization of RoIPool and thereby improves mask accuracy. Comparisons with existing single-model entries, including the COCO 2016 challenge winners, show Mask R-CNN outperforming them on every task. The authors state that code will be made available.
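The difference between RoIPool-style coordinate rounding and RoIAlign-style bilinear sampling can be sketched on a toy feature map. This is a simplified illustration rather than the paper's implementation: it looks up a single sample point per call instead of aggregating several sample points per RoI bin, it assumes the coordinates lie inside the feature map, and the function names are hypothetical.

```python
import math

def sample_quantized(feature, y, x):
    """RoIPool-style lookup: round the continuous coordinate to a grid cell."""
    return feature[int(round(y))][int(round(x))]

def sample_bilinear(feature, y, x):
    """RoIAlign-style lookup: bilinearly interpolate the four nearest cells.

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

# Toy 3 x 3 feature map where the value at (row, col) is 30*row + 10*col.
feature = [
    [0.0, 10.0, 20.0],
    [30.0, 40.0, 50.0],
    [60.0, 70.0, 80.0],
]

# A sample point that falls between grid cells.
quantized_value = sample_quantized(feature, 0.25, 1.75)  # snaps to cell (0, 2)
bilinear_value = sample_bilinear(feature, 0.25, 1.75)    # blends four neighbours
```

On this linear toy map the interpolated value matches the underlying function at the sample point, while the rounded lookup returns the value of a neighbouring cell. That misalignment is the kind of error the paper attributes to quantization.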

Methods

This paper employs the following methods:

  • Mask R-CNN
  • RoIAlign

Models Used

  • Faster R-CNN
  • ResNet-50
  • ResNet-101
  • ResNeXt-101
  • Feature Pyramid Network (FPN)

Datasets

The following datasets were used in this research:

  • COCO
  • Cityscapes

Evaluation Metrics

  • AP (Average Precision)
  • AP50 (AP at an IoU threshold of 0.50)
  • AP75 (AP at an IoU threshold of 0.75)
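The metrics above differ only in the IoU threshold at which a predicted mask counts as a match for a ground-truth mask; the plain COCO AP averages over thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below covers only the matching step for a single mask pair, under the assumption of binary masks stored as nested lists. A full AP computation also ranks predictions by score and integrates a precision-recall curve. The function names are hypothetical.

```python
def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks have no defined overlap; treat it as 0 here.
    return inter / union if union else 0.0

def is_match(pred, gt, threshold):
    """True if the prediction overlaps the ground truth at or above threshold."""
    return mask_iou(pred, gt) >= threshold

# Ground truth covers the top two rows (8 pixels).
gt = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Prediction covers 6 of those pixels plus 2 pixels outside the object.
pred = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

iou = mask_iou(pred, gt)             # 6 shared pixels / 10 pixels in the union
match_at_50 = is_match(pred, gt, 0.50)
match_at_75 = is_match(pred, gt, 0.75)
```

The same prediction counts toward AP50 but not toward AP75, which is why AP75 is the stricter localization metric of the two.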

Results

  • State-of-the-art performance on COCO dataset
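  • Achieved 35.7 mask AP on COCO test-dev with a ResNet-101-FPN backbone
  • RoIAlign improved mask accuracy by a relative 10% to 50%, with larger gains under stricter localization metrics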

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: None specified

Keywords

Mask R-CNN, object detection, instance segmentation, deep learning, CNN, RoIAlign
