Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik (UC Berkeley, 2013)
This paper introduces R-CNN (Regions with CNN features), an object detection approach that substantially improves mean Average Precision (mAP) on the PASCAL VOC and ILSVRC detection datasets. The authors identify two key factors behind its success: applying high-capacity convolutional neural networks (CNNs) to bottom-up region proposals, and training those networks by supervised pre-training on an auxiliary classification task followed by domain-specific fine-tuning. R-CNN achieves 53.3% mAP on VOC 2012, a relative improvement of more than 30% over the previous best result, and 31.4% mAP on the ILSVRC 2013 detection dataset. The system consists of three modules: region proposal generation with selective search, feature extraction with a CNN, and classification with class-specific linear SVMs. The authors also argue that detection is efficient in both computation and memory compared to previous methods, because the CNN features are shared across all classes and are relatively low-dimensional. The approach is additionally evaluated on semantic segmentation, where it reaches an average segmentation accuracy of 47.9% on the VOC 2011 test set. More broadly, the results support a general paradigm: when labeled data for the target task is scarce, pre-training on an auxiliary task with abundant data and then fine-tuning on the target domain can yield large gains.
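The three-module structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: `propose_regions` is a hypothetical sliding-window stand-in for selective search, `extract_features` is a hypothetical two-number descriptor standing in for the CNN, and the SVM weights are made up for the toy image. Only the data flow (proposals, then one shared feature vector per region, then one linear score per class) is meant to mirror the summary.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), upper bounds exclusive
Image = List[List[float]]         # grayscale image as rows of pixel values
Svm = Tuple[List[float], float]   # (weights, bias) of one linear classifier


def propose_regions(image: Image, size: int, stride: int) -> List[Box]:
    """Module 1 stand-in: a sliding-window grid (assumption, NOT selective search)."""
    height, width = len(image), len(image[0])
    return [
        (x, y, x + size, y + size)
        for y in range(0, height - size + 1, stride)
        for x in range(0, width - size + 1, stride)
    ]


def extract_features(image: Image, box: Box) -> List[float]:
    """Module 2 stand-in: mean and max intensity (assumption, NOT a CNN)."""
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return [sum(pixels) / len(pixels), max(pixels)]


def svm_score(features: List[float], svm: Svm) -> float:
    """Module 3: linear decision value w . f + b for one class."""
    weights, bias = svm
    return sum(f * w for f, w in zip(features, weights)) + bias


def detect(image: Image, svms: Dict[str, Svm],
           size: int = 2, stride: int = 2) -> List[Tuple[str, Box, float]]:
    """Run all three modules and return positive detections, best first."""
    detections = []
    for box in propose_regions(image, size, stride):
        # Features are computed once per region and reused for every class.
        features = extract_features(image, box)
        for label, svm in svms.items():
            score = svm_score(features, svm)
            if score > 0:
                detections.append((label, box, score))
    return sorted(detections, key=lambda d: d[2], reverse=True)


# Toy input: a 4x4 image with a bright 2x2 patch in the top-left corner.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
svms = {"bright": ([1.0, 0.0], -0.5)}  # hypothetical class and weights
detections = detect(image, svms)
print(detections)
```

On this toy image only the top-left window scores above zero for the hypothetical "bright" class. The sketch omits steps that a real detector would need, such as resizing regions to the network's input size and suppressing overlapping detections; the per-class loop touching only a dot product is the point of the shared-feature design.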
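This paper employs the following methods: R-CNN, selective search region proposals, CNN feature extraction with supervised pre-training and domain-specific fine-tuning, and class-specific linear SVMs.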
The following datasets were used in this research: PASCAL VOC (2011 and 2012) and the ILSVRC 2013 detection dataset.
The authors identified the following limitations: