YOLO v3

xiaoxiao, 2022-07-07

Paper: YOLOv3: An Incremental Improvement (arXiv technical report, 2018)
Code: eriklindernoren/PyTorch-YOLOv3
Jupyter code walkthrough notes: YOLOv3 Darknet

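Contents

Darknet-53
Multi-scale feature detection
Composition of the loss
Additional details
    Object classification: softmax replaced by logistic
    Anchor assignment
    Dividing the grid and setting 3 anchors per cell
    Locating predicted boxes on the gridded feature map
    Per-class NMS
Code issues
References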

Darknet-53

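YOLOv3 uses Darknet-53 as its backbone (53 convolutional layers). Borrowing from residual networks, it adds shortcut connections between some layers. It drops the max-pooling used in YOLOv2 and instead reduces feature-map resolution by increasing the stride of the convolutional layers.

[Figure: Darknet-53 network architecture]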
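As a rough check of how stride-2 convolutions take over the role of pooling, the sketch below applies the standard convolution output-size formula five times. The 416×416 input and the 3×3 kernel with padding 1 are assumptions made for illustration, and `conv_out_size` is a hypothetical helper, not a function from the repository.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula (integer floor)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 416  # assumed input resolution
sizes = [size]
for _ in range(5):  # five stride-2 convolutions, each halving the resolution
    size = conv_out_size(size)
    sizes.append(size)

print(sizes)  # [416, 208, 104, 52, 26, 13]
```

The last three sizes (52, 26, 13) are the grid sizes of the three detection scales discussed in the next section.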

Multi-scale feature detection

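Unlike the passthrough structure in YOLOv2, YOLOv3 goes a step further and detects objects on feature maps at three different scales. Each of the three outputs has shape $S \times S \times 3 \times (box_{atr} + class_{num})$, where $S$ is the grid size at that scale, 3 is the number of anchors per cell, $box_{atr} = 5$ (four box coordinates plus one objectness score), and $class_{num}$ is the number of classes.

[Two figures illustrating the details of the multi-scale outputs; their source should be among the references]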
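To make the output shape concrete, here is a small sketch that evaluates it for each scale. The 416×416 input, the 80 classes, and the strides 32/16/8 are assumptions chosen to match the 13/26/52 grids used later in this post; `head_shapes` is a hypothetical helper.

```python
def head_shapes(img_size=416, num_classes=80, num_anchors=3, strides=(32, 16, 8)):
    """Return (S, S, anchors, box_attrs + classes) for each detection scale."""
    box_attrs = 5  # x, y, w, h, objectness
    shapes = []
    for stride in strides:
        s = img_size // stride
        shapes.append((s, s, num_anchors, box_attrs + num_classes))
    return shapes


shapes = head_shapes()
total_boxes = sum(s * s * a for s, _, a, _ in shapes)

for shape in shapes:
    print(shape)      # (13, 13, 3, 85), (26, 26, 3, 85), (52, 52, 3, 85)
print(total_boxes)    # 10647 candidate boxes per image
```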

Composition of the loss

Once you understand how the targets are built, the composition of the loss becomes clear.

    import math

    import numpy as np
    import torch

    # bbox_iou is the IoU helper from the same repository.


    def build_targets(pred_boxes, target, anchors, num_anchors, num_classes, dim, ignore_thres, img_dim):
        nB = target.size(0)  # batch size, e.g. 16
        nA = num_anchors     # number of anchors, 3
        nC = num_classes     # number of classes in the dataset, e.g. 80
        # dim: grid size of the feature map, e.g. 13

        # Initialize the target tensors
        mask = torch.zeros(nB, nA, dim, dim)               # [16,3,13,13], all 0
        conf_mask = torch.ones(nB, nA, dim, dim)           # [16,3,13,13], all 1
        tx = torch.zeros(nB, nA, dim, dim)                 # [16,3,13,13], all 0
        ty = torch.zeros(nB, nA, dim, dim)                 # [16,3,13,13], all 0
        tw = torch.zeros(nB, nA, dim, dim)                 # [16,3,13,13], all 0
        th = torch.zeros(nB, nA, dim, dim)                 # [16,3,13,13], all 0
        tconf = torch.zeros(nB, nA, dim, dim)              # [16,3,13,13], all 0
        tcls = torch.zeros(nB, nA, dim, dim, num_classes)  # [16,3,13,13,80], all 0

        # Counters used to compute recall over one batch
        nGT = 0       # number of ground-truth (GT) boxes
        nCorrect = 0  # number of GT boxes whose responsible prediction has IoU > 0.5

        # Loop over the images in the batch
        for b in range(nB):
            # Loop over the objects of one image
            for t in range(target.shape[1]):
                if target[b, t].sum() == 0:
                    # An all-zero row means all objects have been visited.
                    # (Annotator's note: `break` would be better than `continue` here.)
                    continue
                nGT += 1
                # Convert to position relative to box.
                # GT coordinates in target [16,50,5] are normalized to 0-1, so
                # multiplying by dim maps them to the 13x13 grid scale.
                gx = target[b, t, 1] * dim
                gy = target[b, t, 2] * dim
                gw = target[b, t, 3] * dim
                gh = target[b, t, 4] * dim
                # Get grid box indices (floor), i.e. the top-left offset of the cell
                gi = int(gx)
                gj = int(gy)
                # Get shape of gt box [1,4]
                gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)
                # Get shape of anchor box [3,4]: first two columns are 0,
                # last two columns are the w and h of the three anchors
                anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((len(anchors), 2)), np.array(anchors)), 1))
                # Calculate iou between the gt box and the 3 anchor shapes
                anch_ious = bbox_iou(gt_box, anchor_shapes)
                # Where the overlap is larger than threshold set mask to zero (ignore):
                # at the object's cell, anchors whose IoU exceeds the threshold are
                # excluded from training; the responsible anchor is set back to 1 below.
                # Anchors below the threshold (0.5) are treated as background.
                conf_mask[b, anch_ious > ignore_thres, gj, gi] = 0
                # Find the best matching anchor box (index of the largest IoU)
                best_n = np.argmax(anch_ious)
                # Get ground truth box [1,4]
                gt_box = torch.FloatTensor(np.array([gx, gy, gw, gh])).unsqueeze(0)
                # Get the best prediction [1,4].
                # pred_boxes are the predicted boxes on the 13x13 scale; pred_box is
                # the prediction of the anchor with the largest IoU with the GT box.
                pred_box = pred_boxes[b, best_n, gj, gi].unsqueeze(0)
                # Masks: set the responsible position to 1
                mask[b, best_n, gj, gi] = 1
                # Overall idea (similar in spirit to NMS): the responsible position is 1,
                # competing anchors at the same cell are 0. Background plus the best-IoU
                # box play the role of negatives plus positives in an RPN, because both
                # count toward the loss.
                conf_mask[b, best_n, gj, gi] = 1
                # Coordinates: gx - gi and gy - gj are the object center inside its
                # cell (between 0 and 1), written at the position that has a GT box
                tx[b, best_n, gj, gi] = gx - gi
                ty[b, best_n, gj, gi] = gy - gj
                # Width and height.
                # In the paper, box size = anchor size * exp(prediction) on the grid
                # scale, so the target is log(GT size / anchor size + 1e-16).
                tw[b, best_n, gj, gi] = math.log(gw / anchors[best_n][0] + 1e-16)
                th[b, best_n, gj, gi] = math.log(gh / anchors[best_n][1] + 1e-16)
                # One-hot encoding of label
                tcls[b, best_n, gj, gi, int(target[b, t, 0])] = 1
                # Calculate iou between ground truth and best matching prediction
                iou = bbox_iou(gt_box, pred_box, x1y1x2y2=False)
                # The object center falls in this cell, so this cell predicts the object
                tconf[b, best_n, gj, gi] = 1

                if iou > 0.5:
                    nCorrect += 1

        # nGT        number of GT boxes in the batch
        # nCorrect   number of objects predicted correctly in the batch
        # mask       [16,3,13,13] init 0; 1 at the position of the best-matching anchor
        # conf_mask  [16,3,13,13] init 1; 0 for ignored anchors at an object's cell,
        #            1 again for the responsible anchor
        # tx, ty     [16,3,13,13] init 0; object center within the cell where a GT box exists
        # tw, th     [16,3,13,13] init 0; GT w and h converted to the network's output space
        # tconf      [16,3,13,13] init 0; 1 where a cell is responsible for an object
        # tcls       [16,3,13,13,80] init 0; one-hot, 1 at the true class
        return nGT, nCorrect, mask, conf_mask, tx, ty, tw, th, tconf, tcls

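From the targets we can see that the YOLOv3 loss can still be roughly split into two parts:

1. Boxes predicted to contain an object (the anchor with the largest IoU with gt_box): losses on tx, ty, tw, th, tconf, and tcls.
2. Boxes treated as background (IoU with gt_box below the threshold): predicting an object here means predicting one where there is none, so their confidence must count toward the loss. The code merges the with-object and without-object confidence positions into a single conf_mask.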
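The following sketch illustrates how the two masks select which positions contribute to each loss term. It assumes mean squared error for coordinates and binary cross-entropy for confidence, which is a common pairing but is not confirmed by the text above. `masked_mse` and `masked_bce` are hypothetical helpers working on flattened Python lists rather than tensors.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over the positions where mask is non-zero."""
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


def masked_bce(pred, target, mask, eps=1e-12):
    """Binary cross-entropy over the positions where mask is non-zero."""
    terms = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
             for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


# Four flattened positions: index 0 is responsible for an object,
# index 1 is ignored, indices 2 and 3 are background.
mask = [1, 0, 0, 0]       # coordinate and class losses: object positions only
conf_mask = [1, 0, 1, 1]  # confidence loss: object plus background positions
tx = [0.3, 0.0, 0.0, 0.0]
tconf = [1.0, 0.0, 0.0, 0.0]
pred_x = [0.5, 0.9, 0.1, 0.7]
pred_conf = [0.8, 0.6, 0.1, 0.2]

loss_x = masked_mse(pred_x, tx, mask)                # only index 0 contributes
loss_conf = masked_bce(pred_conf, tconf, conf_mask)  # indices 0, 2, 3 contribute
print(loss_x, loss_conf)
```

Note that the ignored position (index 1) affects neither term, even though its predicted confidence is high.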

Additional details

Object classification: softmax replaced by logistic

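Class prediction no longer uses softmax; independent logistic (sigmoid) outputs are used instead. This supports multi-label objects (for example, one person carrying both the Woman and the Person label).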
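A small numeric sketch of why this matters: with softmax, two equally strong classes must share the probability mass, while independent sigmoids let both pass a threshold. The logits and the 0.5 threshold below are made-up values for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [3.0, 3.0, -4.0]  # hypothetical scores for Person, Woman, Car
soft = softmax(logits)
indep = [sigmoid(z) for z in logits]

threshold = 0.5
soft_labels = [i for i, p in enumerate(soft) if p > threshold]
indep_labels = [i for i, p in enumerate(indep) if p > threshold]

print(soft_labels)   # []      neither class reaches 0.5, they split the mass
print(indep_labels)  # [0, 1]  both Person and Woman are predicted
```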

Anchor assignment

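The nine anchors obtained by clustering are split evenly across the three feature scales. Deeper feature maps are smaller and have larger receptive fields, so they are assigned the larger anchors. The mask in the darknet config is the index of the anchors used by each scale.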

| Feature   | Feature_size | Anchors_size                   | Anchors_num |
|-----------|--------------|--------------------------------|-------------|
| Feature 1 | 13 × 13      | [116,90], [156,198], [373,326] | 13 × 13 × 3 |
| Feature 2 | 26 × 26      | [30,61], [62,45], [59,119]     | 26 × 26 × 3 |
| Feature 3 | 52 × 52      | [10,13], [16,30], [33,23]      | 52 × 52 × 3 |
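The assignment can be sketched as a lookup through the mask indices. The anchor values and masks below follow the standard yolov3.cfg; the strides 32/16/8 assume a 416×416 input, and `anchors_for_head` is a hypothetical helper.

```python
# Nine clustered anchors (w, h) in input-image pixels, sorted from small to large.
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

# (stride, mask) per detection scale: 13x13, 26x26, 52x52 grids.
heads = [(32, [6, 7, 8]), (16, [3, 4, 5]), (8, [0, 1, 2])]


def anchors_for_head(stride, mask):
    """Pick the anchors named by mask and rescale them to grid units."""
    picked = [anchors[i] for i in mask]
    scaled = [(w / stride, h / stride) for w, h in picked]
    return picked, scaled


for stride, mask in heads:
    picked, scaled = anchors_for_head(stride, mask)
    print(stride, picked, scaled)
```

Dividing by the stride expresses each anchor in grid cells, which is the unit used when targets and predictions are compared on the feature map.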

Dividing the grid and setting 3 anchors per cell

    # grid_x and grid_y hold the top-left coordinates of each cell on the feature map
    # grid_x: [16,3,13,13]; every row contains 0..12, 13 rows in total
    grid_x = torch.linspace(0, g_dim - 1, g_dim).repeat(g_dim, 1).repeat(bs * self.num_anchors, 1, 1).view(x.shape).type(FloatTensor)
    # grid_y: [16,3,13,13]; every column contains 0..12, 13 columns (because of the transpose)
    grid_y = torch.linspace(0, g_dim - 1, g_dim).repeat(g_dim, 1).t().repeat(bs * self.num_anchors, 1, 1).view(y.shape).type(FloatTensor)
    # Rescale the anchors from input-image pixels to the feature-map scale
    scaled_anchors = [(a_w / stride, a_h / stride) for a_w, a_h in self.anchors]
    anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))  # [3,1] widths of the 3 anchors
    anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))  # [3,1] heights of the 3 anchors
    anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, g_dim * g_dim).view(w.shape)  # [16,3,13,13]
    anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, g_dim * g_dim).view(h.shape)  # [16,3,13,13]
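The layout produced by the `linspace`/`repeat`/`t()` calls above is easier to see on a tiny grid. This sketch rebuilds the same pattern with nested lists for a single 3×3 plane; `make_grid` is a hypothetical helper, and the batch and anchor dimensions are left out.

```python
def make_grid(g_dim):
    """Return (grid_x, grid_y) indexed as grid[row][col], i.e. grid[gj][gi]."""
    grid_x = [[col for col in range(g_dim)] for _ in range(g_dim)]
    grid_y = [[row for _ in range(g_dim)] for row in range(g_dim)]
    return grid_x, grid_y


grid_x, grid_y = make_grid(3)
print(grid_x)  # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]  every row counts the columns
print(grid_y)  # [[0, 0, 0], [1, 1, 1], [2, 2, 2]]  every column counts the rows
```

At position `[gj][gi]` the two grids give back `(gi, gj)`, the top-left corner of that cell.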

Locating predicted boxes on the gridded feature map

    # Add offset and scale with anchors
    pred_boxes = FloatTensor(prediction[..., :4].shape)  # new tensor [16,3,13,13,4]
    # pred_boxes holds the predicted boxes on the 13x13 feature-map scale.
    # x, y are the predicted in-cell coordinates (0-1 after sigmoid);
    # grid_x, grid_y are the top-left offsets of the cells (0-12).
    pred_boxes[..., 0] = x.data + grid_x
    pred_boxes[..., 1] = y.data + grid_y
    # w, h are the predicted log-scale offsets relative to the anchors;
    # anchor_w, anchor_h are the 3 anchors of each cell.
    pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
    pred_boxes[..., 3] = torch.exp(h.data) * anchor_h
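For a single box, the decoding above and the width/height encoding used in build_targets are inverses of each other. The sketch below checks that round trip with scalar values. `decode_box` and `encode_wh` are hypothetical helpers; unlike the snippet above, `decode_box` takes the raw tx, ty and applies the sigmoid itself. The anchor (116, 90) at stride 32 and the box values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(tx, ty, tw, th, gi, gj, anchor_w, anchor_h):
    """Map raw network outputs to a box (center x, center y, w, h) in grid units."""
    bx = sigmoid(tx) + gi
    by = sigmoid(ty) + gj
    bw = math.exp(tw) * anchor_w
    bh = math.exp(th) * anchor_h
    return bx, by, bw, bh


def encode_wh(gw, gh, anchor_w, anchor_h):
    """Inverse of the width/height decoding, as used when building targets."""
    return math.log(gw / anchor_w + 1e-16), math.log(gh / anchor_h + 1e-16)


anchor_w, anchor_h = 116 / 32, 90 / 32         # anchor rescaled to the 13x13 grid
tw, th = encode_wh(4.0, 3.0, anchor_w, anchor_h)
box = decode_box(0.0, 0.0, tw, th, gi=6, gj=4,
                 anchor_w=anchor_w, anchor_h=anchor_h)
print(box)  # center at (6.5, 4.5); width and height recover 4.0 and 3.0 up to rounding
```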

Per-class NMS

    def non_max_suppression(prediction, num_classes, conf_thres=0.5, nms_thres=0.4):
        """
        Removes detections with lower object confidence score than 'conf_thres' and performs
        Non-Maximum Suppression to further filter detections.
        Returns detections with shape:
            (x1, y1, x2, y2, object_conf, class_score, class_pred)
        """

        # From (center x, center y, width, height) to (x1, y1, x2, y2)
        box_corner = prediction.new(prediction.shape)
        box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
        box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
        box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
        box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
        prediction[:, :, :4] = box_corner[:, :, :4]

        output = [None for _ in range(len(prediction))]
        for image_i, image_pred in enumerate(prediction):
            # Filter out confidence scores below threshold
            conf_mask = (image_pred[:, 4] >= conf_thres).squeeze()
            image_pred = image_pred[conf_mask]
            # If none are remaining => process next image
            if not image_pred.size(0):
                continue
            # Get score and class with highest confidence
            class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)
            # Detections ordered as (x1, y1, x2, y2, obj_conf, class_conf, class_pred)
            detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)
            # Iterate through all predicted classes
            unique_labels = detections[:, -1].cpu().unique()
            if prediction.is_cuda:
                unique_labels = unique_labels.cuda()
            for c in unique_labels:
                # Get the detections with the particular class
                detections_class = detections[detections[:, -1] == c]
                # Sort the detections by maximum objectness confidence
                _, conf_sort_index = torch.sort(detections_class[:, 4], descending=True)
                detections_class = detections_class[conf_sort_index]
                # Perform non-maximum suppression
                max_detections = []
                while detections_class.size(0):
                    # Get detection with highest confidence and save as max detection
                    max_detections.append(detections_class[0].unsqueeze(0))
                    # Stop if we're at the last detection
                    if len(detections_class) == 1:
                        break
                    # Get the IOUs for all boxes with lower confidence
                    ious = bbox_iou(max_detections[-1], detections_class[1:])
                    # Remove detections with IoU >= NMS threshold
                    detections_class = detections_class[1:][ious < nms_thres]

                max_detections = torch.cat(max_detections).data
                # Add max detections to outputs
                output[image_i] = max_detections if output[image_i] is None else torch.cat((output[image_i], max_detections))

        return output
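The per-class loop is the core of the function above. The following is a minimal pure-Python sketch of that loop on plain tuples, not the repository's implementation: `iou` and `nms_per_class` are hypothetical helpers, boxes are already in (x1, y1, x2, y2) form, and the confidence filtering step is omitted.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_per_class(dets, nms_thres=0.4):
    """dets: list of (x1, y1, x2, y2, score, cls). Returns the kept detections."""
    kept = []
    for c in sorted({d[5] for d in dets}):
        # Highest score first, within one class only
        candidates = sorted((d for d in dets if d[5] == c),
                            key=lambda d: d[4], reverse=True)
        while candidates:
            best = candidates.pop(0)
            kept.append(best)
            # Drop same-class boxes that overlap the kept box too much
            candidates = [d for d in candidates if iou(best, d) < nms_thres]
    return kept


dets = [
    (0, 0, 10, 10, 0.9, 0),
    (1, 1, 11, 11, 0.8, 0),    # IoU 81/119 with the first box, same class: suppressed
    (0, 0, 10, 10, 0.7, 1),    # same location, different class: kept
    (20, 20, 30, 30, 0.6, 0),  # no overlap: kept
]
kept = nms_per_class(dets)
print(kept)
```

Because suppression runs separately for each class, a box of class 1 survives even when it coincides with a higher-scoring box of class 0.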

Code issues

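Problems encountered when running on Colab

Problem 1: TypeError: 'NoneType' object is not subscriptable

Fix: at line 128 of datasets.py, change

    if self.augment:

to:

    if self.augment and targets is not None:

    del checkpoint

PyTorch: reducing GPU memory consumption, optimizing memory usage, and avoiding out-of-memory errors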

    torch.backends.cudnn.benchmark = True

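Setting this flag lets cuDNN's built-in auto-tuner search for the most efficient algorithms for the current configuration, which improves runtime efficiency. Note: 1) if the dimensions and type of the network's input change little, this increases efficiency; 2) if the input changes at every iteration, cuDNN searches for the best configuration each time, which reduces efficiency instead.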

References

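[1] yolo系列之yolo v3【深度解析】 (YOLO series: YOLO v3, an in-depth analysis)
[2] YOLOv3 资源合集 (YOLOv3 resource collection)
[3] 使用PyTorch从零开始实现YOLO-V3目标检测算法 (Implementing the YOLO-V3 object detection algorithm from scratch in PyTorch)
[4] Pytorch版本yolov3源码阅读 (Reading the source code of the PyTorch version of yolov3)
[5] 目标检测-基于Pytorch实现Yolov3（1）- 搭建模型 (Object detection: implementing Yolov3 in PyTorch (1), building the model)
[6] PyTorch : Understanding Graphs, Automatic Differentiation and Autograd
[7] How to implement a YOLO (v3) object detector from scratch in PyTorch
[8] darknet 所有层功能说明 (Description of all darknet layer functions)
[9] PyTorch 中的 ModuleList 和 Sequential: 区别和使用场景 (ModuleList and Sequential in PyTorch: differences and use cases)
[10] pytorch 入坑三：nn module (Getting into PyTorch, part 3: nn module)
[11] 史上最详细的Pytorch版yolov3代码中文注释详解（一） (The most detailed Chinese annotations of the PyTorch yolov3 code, part 1)
[12] PyTorch-YOLOv3错误集锦 (Collection of PyTorch-YOLOv3 errors)
[13] PyTorch实现yolov3代码详细解密 (Detailed walkthrough of a PyTorch yolov3 implementation)
[14] 物体检测One-Stage开山之作YOLO从v1到v3 (YOLO from v1 to v3, the pioneering one-stage object detector)
[15] YOLO v3网络结构分析 (Analysis of the YOLO v3 network structure)
[16] 为什么 YOLOv3 用了 Focal Loss 后 mAP 反而掉了？ (Why does mAP drop after YOLOv3 uses Focal Loss?)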
