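The R-CNN Model

Paper: Rich feature hierarchies for accurate object detection and semantic segmentation

This article follows up on papers referenced in other articles on object detection, and draws on several of those articles together with the authors' original paper.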

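First, a question: what is object detection?

The four images above are numbered Image 1 to Image 4, from left to right.

Image 1: classification

Image 2: classification + localization of a single object

Image 3: object detection = classification + localization

Image 4: instance segmentation

Object detection, then, is classification + localization.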

The steps of the R-CNN model

(1) takes an input image,

(2) extracts around 2000 bottom-up region proposals,

(3) computes features for each proposal using a large convolutional neural network (CNN), and then

(4) classifies each region using class-specific linear SVMs.

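Following the article 深度学习目标检测模型全面综述：Faster R-CNN、R-FCN和SSD (a survey of deep learning object detection models: Faster R-CNN, R-FCN, and SSD),

the model can be summarized as follows:

Using the Selective Search algorithm, which generates around 2000 region proposals, R-CNN scans the input image for possible objects.

A convolutional neural network (CNN) is run on each region proposal.

The output of each CNN is fed into: (a) a support vector machine (SVM), which classifies the region, and (b) a linear regressor, which tightens the bounding box around the object, if such an object exists.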
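
The summary above can be sketched as a small driver function. This is only an illustrative skeleton, not the authors' code: `propose_regions`, `extract_features`, and `classifiers` are hypothetical placeholders for Selective Search, the CNN, and the per-class SVMs, and the bounding-box regressor is left out to keep the sketch short.

```python
def detect(image, propose_regions, extract_features, classifiers):
    """Run an R-CNN-style pipeline with pluggable (hypothetical) components.

    propose_regions(image)       -> list of boxes (x1, y1, x2, y2)
    extract_features(image, box) -> feature vector for that region
    classifiers                  -> dict: class name -> callable(feature) -> score
    """
    detections = []
    for box in propose_regions(image):
        # One feature vector per proposal (the CNN step).
        feature = extract_features(image, box)
        # Every class-specific classifier scores the same feature vector.
        for class_name, classifier in classifiers.items():
            detections.append((box, class_name, classifier(feature)))
    return detections
```

Passing the three stages in as callables keeps the skeleton independent of any particular proposal method or network.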

The strengths of the model, as stated by the authors in the paper

Our approach combines two key insights:

(1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and

(2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

Figure 1: Object detection system overview.

Details:

At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. (This passage is extremely information-dense.)
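
To make the warping step concrete, here is a minimal sketch that crops a proposal and stretches it to a fixed size without preserving its aspect ratio. It assumes the image is a plain list of pixel rows and uses nearest-neighbour sampling. Both are simplifications chosen for illustration, and the function name `warp_region` is hypothetical.

```python
def warp_region(image, box, out_h, out_w):
    """Crop `box` from `image` and stretch it to out_h x out_w.

    image: list of rows, each row a list of pixel values.
    box:   (x1, y1, x2, y2) in inclusive pixel coordinates.
    The aspect ratio is deliberately not preserved.
    """
    x1, y1, x2, y2 = box
    crop_h = y2 - y1 + 1
    crop_w = x2 - x1 + 1
    warped = []
    for r in range(out_h):
        # Map each output row back to a source row inside the box.
        src_r = y1 + min(crop_h - 1, r * crop_h // out_h)
        row = []
        for c in range(out_w):
            # Same mapping for columns, with its own scale factor.
            src_c = x1 + min(crop_w - 1, c * crop_w // out_w)
            row.append(image[src_r][src_c])
        warped.append(row)
    return warped
```

Because rows and columns are scaled independently, a tall, narrow proposal and a short, wide one both end up with the same output shape, which is what a CNN with a fixed input size requires.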

The paper's second contribution

The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce.

The three modules of R-CNN

Our object detection system consists of three modules.

The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.

The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.

The third module is a set of class-specific linear SVMs.

Test-time detection (extremely information-dense)

At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments).

We warp each proposal and forward propagate it through the CNN in order to compute features.

Then, for each class, we score each extracted feature vector using the SVM trained for that class.
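
Scoring with linear SVMs amounts to one dot product per (region, class) pair. The sketch below assumes each class's SVM is stored as a weight vector plus a bias. The name `score_regions` and the data layout are illustrative assumptions, not taken from the paper.

```python
def score_regions(features, weights, biases):
    """Return scores[i][k], the score of region i under the class-k linear SVM.

    features: list of feature vectors, one per region proposal.
    weights:  list of weight vectors, one per class.
    biases:   list of bias terms, one per class.
    """
    scores = []
    for feature in features:
        row = []
        for w, b in zip(weights, biases):
            # Linear SVM decision value: w . x + b
            row.append(sum(f * wi for f, wi in zip(feature, w)) + b)
        scores.append(row)
    return scores
```

Stacking the features and weights into matrices turns the whole step into a single matrix product, which is why scoring stays cheap even with many classes.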

Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
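
A minimal sketch of greedy non-maximum suppression for a single class follows. It assumes boxes are `(x1, y1, x2, y2)` tuples with `x2 > x1` and `y2 > y1`. The threshold is passed in as a parameter here, whereas the paper learns it. The function names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject box i if it overlaps an already selected, higher-scoring
        # box by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Since suppression is applied to each class independently, `greedy_nms` would be called once per class, using that class's scores.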

Bounding-box regression

We use a simple bounding-box regression stage to improve localization performance. After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor.

This is similar in spirit to the bounding-box regression used in deformable part models [17]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.
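
The passage quoted above does not spell out how a predicted correction is applied to a proposal. The sketch below follows the parameterization given in the paper's appendix on bounding-box regression: a scale-invariant shift of the box center and a log-space scaling of its width and height. The deltas would come from the class-specific regressor. Here they are simply passed in, and the function name is hypothetical.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Turn a proposal and predicted deltas into a refined box.

    proposal: (px, py, pw, ph) = center x, center y, width, height.
    deltas:   (dx, dy, dw, dh) as predicted by the regressor.
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px        # shift the center, scaled by the box width
    gy = ph * dy + py        # shift the center, scaled by the box height
    gw = pw * math.exp(dw)   # rescale width in log space
    gh = ph * math.exp(dh)   # rescale height in log space
    return (gx, gy, gw, gh)
```

With all deltas equal to zero the proposal comes back unchanged, so the regressor only has to learn corrections relative to the Selective Search box.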