← ML Research Wiki / 1311.2524

Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5)

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik (UC Berkeley, 2013)

Paper Information
arXiv ID
1311.2524
Venue
2014 IEEE Conference on Computer Vision and Pattern Recognition
Domain
Computer vision
SOTA Claim
Yes
Reproducibility
8/10

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at

Summary

This paper introduces R-CNN (Regions with CNN features), a novel approach for object detection that significantly improves mean Average Precision (mAP) on the PASCAL VOC and ILSVRC detection datasets. The authors identify two key factors behind R-CNN's success: applying high-capacity convolutional neural networks (CNNs) to bottom-up region proposals, and training effectively via supervised pre-training on an auxiliary classification task followed by domain-specific fine-tuning. R-CNN achieves 53.3% mAP on VOC 2012, surpassing previous state-of-the-art methods, and reaches 31.4% mAP on the ILSVRC 2013 detection dataset. This performance is attributed not only to the CNN itself but also to the use of selective search for generating region proposals and class-specific linear SVMs for classification. The paper describes R-CNN's three modules: region proposal generation, CNN feature extraction, and SVM classification. The authors also highlight R-CNN's efficiency in computation and memory relative to previous methods. Performance on semantic segmentation is evaluated as well, achieving 47.9% average accuracy on the VOC 2011 test set. The findings underscore a paradigm wherein leveraging abundant data for auxiliary tasks can improve performance in data-scarce scenarios.
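
The three modules can be sketched end to end. The code below is an illustrative toy, not the paper's implementation: random boxes stand in for selective search, and a fixed random projection stands in for the CNN (all function names, the 32x32 warp, and the 16-d features are assumptions; the paper warps regions to 227x227 and extracts 4096-d activations from a pre-trained, fine-tuned CNN).

```python
import numpy as np

def propose_regions(image, n=2000, seed=0):
    """Stand-in for selective search: n random (x1, y1, x2, y2) boxes."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    x1 = rng.integers(0, w - 1, n)
    y1 = rng.integers(0, h - 1, n)
    x2 = rng.integers(x1 + 1, w, n)  # guarantees x2 > x1
    y2 = rng.integers(y1 + 1, h, n)  # guarantees y2 > y1
    return np.stack([x1, y1, x2, y2], axis=1)

def extract_features(image, boxes, warp=32, dim=16, seed=0):
    """Stand-in for the CNN: warp each crop to a fixed size (nearest
    neighbour), then embed it with a fixed random projection."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((warp * warp, dim))
    feats = []
    for x1, y1, x2, y2 in boxes:
        crop = image[y1:y2, x1:x2]
        ys = np.linspace(0, crop.shape[0] - 1, warp).astype(int)
        xs = np.linspace(0, crop.shape[1] - 1, warp).astype(int)
        feats.append(crop[np.ix_(ys, xs)].reshape(-1) @ W)
    return np.stack(feats)

def score_regions(feats, svm_w, svm_b):
    """Class-specific linear SVMs: one score per (region, class)."""
    return feats @ svm_w + svm_b

rng = np.random.default_rng(1)
image = rng.random((64, 64))
boxes = propose_regions(image, n=10, seed=1)
feats = extract_features(image, boxes)
scores = score_regions(feats, rng.standard_normal((16, 3)), np.zeros(3))
print(scores.shape)  # (10, 3): 10 proposals scored for 3 classes
```

The key design point this mirrors is the decoupling: proposals, features, and classifiers are trained and run as separate stages, which is what lets the CNN be pre-trained on a large auxiliary dataset and only fine-tuned for detection.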

Methods

This paper employs the following methods:

  • Convolutional Neural Network (CNN)
  • Selective Search
  • SVM (Support Vector Machine)
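
At test time, R-CNN scores every proposal with the class-specific SVMs and then applies greedy non-maximum suppression per class, rejecting a region whose intersection-over-union (IoU) overlap with a higher-scoring selected region exceeds a threshold. A minimal sketch of that standard step (the `iou` and `nms` names and the 0.3 threshold here are illustrative; the paper learns the threshold per class):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS: visit boxes best-first; keep a box only if it does not
    overlap any already-kept box by more than thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]: the second box overlaps the first
```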

Models Used

  • R-CNN
  • OverFeat

Datasets

The following datasets were used in this research:

  • PASCAL VOC
  • ILSVRC2013

Evaluation Metrics

  • mAP
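
mAP is the mean over classes of average precision (AP), the area under each class's precision-recall curve computed from score-ranked detections. A simplified, non-interpolated AP is sketched below (the official VOC tool uses interpolated precision, and matching detections to ground truth with an IoU test happens upstream; here `labels[i] = 1` marks an already-matched true positive):

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class: accumulate precision at each
    recall step as detections are consumed in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            ap += tp / (tp + fp) / total_pos  # precision at this recall step
        else:
            fp += 1
    return ap

def mean_ap(per_class_aps):
    """mAP: unweighted mean of the per-class APs."""
    return sum(per_class_aps) / len(per_class_aps)

# One class, four detections, three of which are true positives:
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1]))  # 29/36 ~ 0.806
```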

Results

  • Achieved mAP of 53.3% on PASCAL VOC 2012
  • Achieved mAP of 31.4% on the ILSVRC2013 detection dataset, substantially outperforming OverFeat (24.3% mAP)

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 1
  • GPU Type: NVIDIA Tesla K20

Keywords

CNN, object detection, region proposals, deep learning, visual recognition
