Cropping may lose some information about the object.
Warping may change the object's appearance.
FC layers need a fixed-length input, while conv layers can adapt to an arbitrary input size. In fact, the convolutional layers of a CNN do not require a fixed-size image; only the fully connected layers require a fixed-size input.
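The contrast between the two layer types can be seen in a toy example. This is a minimal sketch in plain Python, not taken from the slides: the helper names `conv2d_valid`, `fully_connected` and `flatten` and the toy sizes are assumptions made for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def fully_connected(vector, weights):
    """Dense layer: every weight row must match the input length exactly."""
    if any(len(row) != len(vector) for row in weights):
        raise ValueError("FC layer expects a fixed-length input")
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def flatten(feature_map):
    """Concatenate the rows of a 2-D feature map into one vector."""
    return [v for row in feature_map for v in row]


kernel = [[1, 0], [0, -1]]               # toy 2x2 filter (hypothetical)
small = [[1] * 4 for _ in range(4)]      # 4x4 input
large = [[1] * 6 for _ in range(5)]      # 5 rows x 6 columns input

fm_small = conv2d_valid(small, kernel)   # 3x3 feature map
fm_large = conv2d_valid(large, kernel)   # 4x5 feature map

fc_weights = [[0.5] * 9]                 # FC layer sized for 9 inputs (3x3)
fc_out = fully_connected(flatten(fm_small), fc_weights)   # accepted
# fully_connected(flatten(fm_large), fc_weights) would raise ValueError
```

The same filter runs on both inputs and simply produces feature maps of different sizes, whereas the FC weights only fit the 9-element vector. This is the mismatch that forces cropping or warping when no size-normalizing layer sits between the two.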
SPP-Net: Training for Detection(1)
Step 1. Generate an image pyramid and extract the conv feature map of the whole image.
The spatial pyramid uses {6x6, 3x3, 2x2, 1x1} bins, for a total of 36+9+4+1 = 50 features per feature-map channel.
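The 50-bin count can be checked with a small sketch of pyramid pooling over a single channel. This is an illustrative plain-Python version under assumptions: the function name `spp_pool`, max pooling within each bin, and the floor/ceil bin boundaries are choices made here, not details stated on the slides.

```python
def spp_pool(feature_map, levels=(6, 3, 2, 1)):
    """Max-pool one 2-D channel into n x n bins for every pyramid level."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end, so the bins cover
            # the whole map and are never empty, even when h < n.
            r0 = (i * h) // n
            r1 = ((i + 1) * h + n - 1) // n
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two feature maps of different sizes (values are arbitrary test data).
map_a = [[r * 13 + c for c in range(13)] for r in range(13)]   # 13x13
map_b = [[r * 17 + c for c in range(17)] for r in range(10)]   # 10x17

vec_a = spp_pool(map_a)
vec_b = spp_pool(map_b)
print(len(vec_a), len(vec_b))   # both lengths are 36 + 9 + 4 + 1
```

Both inputs yield a vector of the same length regardless of their spatial size, which is what lets the FC layers follow directly. With 256 channels in the last conv layer, for example, the concatenated vector would have 256 x 50 = 12,800 entries.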
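The main improvement is that SPP-net computes the feature map of the whole image in a single pass, which greatly reduces the cost of computing features for the proposals.
Specifically, the image is resized to each scale s ∈ {480, 576, 688, 864, 1200}. For each region, the scale in this set at which the region is closest to 224x224 is chosen, and the features are extracted from the corresponding feature map.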
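The scale choice can be sketched as a small search over the five scales. This is a hedged illustration: it assumes s is the length of the image's shorter side after resizing and that "closest to 224x224" is measured by pixel count; the function name `best_scale` is hypothetical.

```python
SCALES = (480, 576, 688, 864, 1200)
TARGET_AREA = 224 * 224


def best_scale(img_w, img_h, box_w, box_h, scales=SCALES):
    """Return the scale s at which the resized proposal's pixel count
    is closest to 224x224.

    Assumption: the image is resized so that its shorter side equals s.
    """
    short_side = min(img_w, img_h)

    def area_gap(s):
        factor = s / short_side                 # resize factor at scale s
        scaled_area = box_w * box_h * factor * factor
        return abs(scaled_area - TARGET_AREA)

    return min(scales, key=area_gap)


# A small proposal is matched to a large scale, a big one to a small scale.
small_box_scale = best_scale(500, 375, box_w=100, box_h=80)
large_box_scale = best_scale(500, 375, box_w=400, box_h=300)
print(small_box_scale, large_box_scale)
```

Small objects are therefore read from the feature map of an enlarged image and large objects from a reduced one, so every proposal is seen at roughly the same effective size.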
[Figure: spatial pyramid pooling layer with {6x6, 3x3, 2x2, 1x1} bins; image scales s ∈ {480, 576, 688, 864, 1200}; panel labels 224x2, 224x3, 224x6]
SPP-Net: Training for Detection(2)
Step 2. For each proposal, walk the image pyramid and find the projected version whose number of pixels is closest to 224x224. (For scale invariance in training.)
Step 3. Find the corresponding region of the Conv5 feature map and use the SPP layer to pool it to a fixed size.
Step 4. Once the features of all proposals have been collected, fine-tune the FC layers only.
Step 5. Train the class-specific SVMs.
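Step 3 relies on mapping a proposal from image coordinates onto the Conv5 feature map. The sketch below is one possible way to do this, under assumptions not stated on the slides: a total network stride of 16 pixels up to Conv5, and a rounding rule of floor(x/S)+1 for the left/top edge and ceil(x/S)-1 for the right/bottom edge. The names `project_box` and `crop_window` are hypothetical.

```python
def project_box(x0, y0, x1, y1, stride=16):
    """Map an image-space box to inclusive feature-map cell indices.

    Assumed rounding rule: floor(x / stride) + 1 for left/top and
    ceil(x / stride) - 1 for right/bottom.
    """
    left = x0 // stride + 1
    top = y0 // stride + 1
    right = -(-x1 // stride) - 1       # integer ceil, then minus one
    bottom = -(-y1 // stride) - 1
    # Keep at least one cell for very small proposals.
    right = max(right, left)
    bottom = max(bottom, top)
    return left, top, right, bottom


def crop_window(feature_map, box):
    """Cut the projected window (inclusive indices) out of one channel."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in feature_map[top:bottom + 1]]


# Toy Conv5 channel: 20 rows x 30 columns of arbitrary values.
conv5 = [[r * 30 + c for c in range(30)] for r in range(20)]

box = project_box(32, 48, 255, 200)    # proposal in image pixels
window = crop_window(conv5, box)
print(box, len(window), len(window[0]))
```

The cropped window has a size that depends on the proposal, so it would then be passed to the SPP layer to obtain the fixed-length vector the FC layers need. A full implementation would also clamp the indices to the feature-map bounds.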
SPP-Net: Training for Detection:
Almost the same as R-CNN, except Step 3.
Speed: 64x faster than R-CNN using one scale, and 24x faster using the five-scale pyramid. mAP: +1.2
mAP vs R-CNN
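SPP-Net: Limitations:
1. Training is split into multiple stages and is not an end-to-end process.
2. Training costs too much disk space and time.
3. SPP-net training fine-tunes only the fully connected layers (detection needs location information in addition to semantic information, and the multiple pooling layers blur the location information).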