← ML Research Wiki / 1311.2524

Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5)

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik (UC Berkeley, 2013)

Paper Information
arXiv ID
1311.2524
Venue
2014 IEEE Conference on Computer Vision and Pattern Recognition
Domain
Computer vision
SOTA Claim
Yes
Reproducibility
8/10

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at

Summary

This paper introduces R-CNN (Regions with CNN features), a novel approach for object detection that significantly improves mean Average Precision (mAP) on the PASCAL VOC and ILSVRC detection datasets. The authors identify two key factors behind R-CNN's success: applying high-capacity convolutional neural networks (CNNs) to bottom-up region proposals, and training effectively via supervised pre-training on an auxiliary classification task followed by domain-specific fine-tuning. R-CNN achieves 53.3% mAP on VOC 2012, surpassing previous state-of-the-art methods, and reaches 31.4% mAP on the ILSVRC 2013 detection dataset. This performance is attributed not only to the CNN itself but also to the use of selective search for generating region proposals and class-specific linear SVMs for classification. The paper describes R-CNN's three modules: region proposal generation, CNN feature extraction, and SVM classification. The authors also highlight R-CNN's efficiency in computation and memory relative to previous methods. Performance on semantic segmentation is evaluated as well, achieving 47.9% average accuracy on the VOC 2011 test set. The findings underscore a paradigm wherein leveraging abundant data for auxiliary tasks can improve performance in data-scarce scenarios.
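
The three modules can be sketched end to end. The code below is an illustrative toy, not the paper's implementation: random boxes stand in for selective search, and a fixed random projection stands in for the CNN (all function names, the 32x32 warp, and the 16-d features are assumptions; the paper warps regions to 227x227 and extracts 4096-d activations from a pre-trained, fine-tuned CNN).

```python
import numpy as np

def propose_regions(image, n=2000, seed=0):
    """Stand-in for selective search: n random (x1, y1, x2, y2) boxes."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    x1 = rng.integers(0, w - 1, n)
    y1 = rng.integers(0, h - 1, n)
    x2 = rng.integers(x1 + 1, w, n)  # guarantees x2 > x1
    y2 = rng.integers(y1 + 1, h, n)  # guarantees y2 > y1
    return np.stack([x1, y1, x2, y2], axis=1)

def extract_features(image, boxes, warp=32, dim=16, seed=0):
    """Stand-in for the CNN: warp each crop to a fixed size (nearest
    neighbour), then embed it with a fixed random projection."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((warp * warp, dim))
    feats = []
    for x1, y1, x2, y2 in boxes:
        crop = image[y1:y2, x1:x2]
        ys = np.linspace(0, crop.shape[0] - 1, warp).astype(int)
        xs = np.linspace(0, crop.shape[1] - 1, warp).astype(int)
        feats.append(crop[np.ix_(ys, xs)].reshape(-1) @ W)
    return np.stack(feats)

def score_regions(feats, svm_w, svm_b):
    """Class-specific linear SVMs: one score per (region, class)."""
    return feats @ svm_w + svm_b

rng = np.random.default_rng(1)
image = rng.random((64, 64))
boxes = propose_regions(image, n=10, seed=1)
feats = extract_features(image, boxes)
scores = score_regions(feats, rng.standard_normal((16, 3)), np.zeros(3))
print(scores.shape)  # (10, 3): 10 proposals scored for 3 classes
```

The key design point this mirrors is the decoupling: proposals, features, and classifiers are trained and run as separate stages, which is what lets the CNN be pre-trained on a large auxiliary dataset and only fine-tuned for detection.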

Methods

This paper employs the following methods:

  • Convolutional Neural Network (CNN)
  • Selective Search
  • SVM (Support Vector Machine)
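
At test time, R-CNN scores every proposal with the class-specific SVMs and then applies greedy non-maximum suppression per class, rejecting a region whose intersection-over-union (IoU) overlap with a higher-scoring selected region exceeds a threshold. A minimal sketch of that standard step (the `iou` and `nms` names and the 0.3 threshold here are illustrative; the paper learns the threshold per class):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS: visit boxes best-first; keep a box only if it does not
    overlap any already-kept box by more than thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]: the second box overlaps the first
```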

Models Used

  • R-CNN
  • OverFeat

Datasets

The following datasets were used in this research:

  • PASCAL VOC
  • ILSVRC2013

Evaluation Metrics

  • mAP
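
mAP is the mean over classes of average precision (AP), the area under each class's precision-recall curve computed from score-ranked detections. A simplified, non-interpolated AP is sketched below (the official VOC tool uses interpolated precision, and matching detections to ground truth with an IoU test happens upstream; here `labels[i] = 1` marks an already-matched true positive):

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class: accumulate precision at each
    recall step as detections are consumed in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            ap += tp / (tp + fp) / total_pos  # precision at this recall step
        else:
            fp += 1
    return ap

def mean_ap(per_class_aps):
    """mAP: unweighted mean of the per-class APs."""
    return sum(per_class_aps) / len(per_class_aps)

# One class, four detections, three of which are true positives:
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1]))  # 29/36 ~ 0.806
```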

Results

  • Achieved mAP of 53.3% on PASCAL VOC 2012
  • Achieved mAP of 31.4% on the ILSVRC2013 detection dataset, substantially outperforming OverFeat (24.3% mAP)

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 1
  • GPU Type: NVIDIA Tesla K20

Keywords

CNN, object detection, region proposals, deep learning, visual recognition
