
Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár, California Institute of Technology, University of California at Irvine (2014)

Paper Information
arXiv ID
1405.0312
Venue
European Conference on Computer Vision
Domain
Computer vision
SOTA Claim
Yes
Reproducibility
7/10

Abstract

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

Summary

This paper presents the Microsoft Common Objects in Context (MS COCO) dataset, aimed at advancing object recognition within the broader scope of scene understanding. The dataset features images of everyday scenes with common objects in their natural contexts, supporting precise object localization through per-instance segmentation. MS COCO consists of 91 object categories and 2.5 million labeled instances spread across 328,000 images. It uniquely facilitates the study of non-iconic views, contextual reasoning between objects, and precise 2D localization. It contrasts with other prominent datasets such as PASCAL and ImageNet in its higher instance density per image and richer contextual information. A detailed statistical analysis compares MS COCO to existing datasets and outlines the challenges of gathering non-iconic images effectively. The dataset was created through crowdsourcing on Amazon Mechanical Turk, with purpose-built user interfaces to ensure accurate labeling and segmentation. The paper also discusses the implications of the dataset for training and evaluating modern image recognition systems, alongside proposed future enhancements such as including 'stuff' categories and additional annotations for better performance evaluation.
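The per-instance annotations described above are distributed in what became the standard COCO JSON layout (top-level `images`, `annotations`, and `categories` lists; boxes as `[x, y, width, height]`). A minimal sketch of indexing such a file by image, with invented example values for illustration:

```python
# A tiny annotation dict in the COCO JSON layout; file names, ids, and
# coordinates here are made up for illustration.
coco = {
    "images": [{"id": 1, "file_name": "kitchen.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 44,
         "bbox": [120.0, 200.0, 80.0, 60.0],
         "segmentation": [[120, 200, 200, 200, 200, 260, 120, 260]],
         "area": 4800.0, "iscrowd": 0},
    ],
    "categories": [{"id": 44, "name": "bottle", "supercategory": "kitchen"}],
}

# Group annotations by image id, as a dataset loader typically would.
by_image = {}
for ann in coco["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(ann)

for img in coco["images"]:
    anns = by_image.get(img["id"], [])
    print(img["file_name"], len(anns), "instance(s)")
```

In practice one would parse the released annotation files (e.g. with the official COCO API) rather than an inline dict; the grouping step is the same.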

Methods

This paper employs the following methods:

  • Crowdsourcing
  • Instance Segmentation
  • Contextual Reasoning
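The instance segmentation masks are stored either as polygons or, for crowd regions, as uncompressed run-length encodings. A minimal decoder sketch, assuming the COCO convention that runs are laid out in column-major order and the first run counts zeros (function and variable names here are our own):

```python
def rle_decode(counts, height, width):
    """Decode a COCO-style uncompressed RLE into a binary mask.

    `counts` alternates runs of 0s and 1s, starting with 0s, with pixels
    laid out in column-major (Fortran) order, as in the COCO mask format.
    """
    flat = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    assert len(flat) == height * width, "runs must cover the whole image"
    # Undo the column-major layout: pixel (row r, col c) is flat[c * height + r].
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]

# Toy 3x3 example: two background pixels, two foreground, and so on.
mask = rle_decode([2, 2, 2, 2, 1], 3, 3)
```

The official COCO API decodes a compressed variant of this encoding; the run/column-major logic is the same.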

Models Used

  • DPMv5-P
  • DPMv5-C

Datasets

The following datasets were used in this research:

  • ImageNet
  • PASCAL VOC
  • SUN
  • MS COCO

Evaluation Metrics

  • None specified

Results

  • The MS COCO dataset contains 2.5 million labeled instances in 328,000 images.
  • Models trained on MS COCO perform better on everyday scenes than those trained on prior datasets.
  • MS COCO has an average of 7.7 object instances per image compared to lower counts in other datasets.
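The instances-per-image statistic above can be reproduced from any COCO-style annotation list; a small sketch (field names follow the COCO layout, the toy data is invented):

```python
from collections import defaultdict

def density_stats(annotations, num_images):
    """Average instances per image and distinct categories per image,
    from a COCO-style annotation list."""
    instances = defaultdict(int)
    categories = defaultdict(set)
    for ann in annotations:
        instances[ann["image_id"]] += 1
        categories[ann["image_id"]].add(ann["category_id"])
    # Images with no annotations still count toward the denominator.
    avg_instances = sum(instances.values()) / num_images
    avg_categories = sum(len(s) for s in categories.values()) / num_images
    return avg_instances, avg_categories

# Toy example: two annotations on image 1, one on image 2.
anns = [{"image_id": 1, "category_id": 1},
        {"image_id": 1, "category_id": 2},
        {"image_id": 2, "category_id": 1}]
print(density_stats(anns, 2))  # → (1.5, 1.5)
```

Run over the full MS COCO annotations, the first value is the 7.7 instances-per-image figure reported in the paper.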

Limitations

The authors identified the following limitations:

  • The dataset only includes 'thing' categories and does not yet label 'stuff' categories.
  • Initial segmentation quality varied due to the complexity of the task and varying annotator quality.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

dataset, object recognition, scene understanding, segmentation, detection
