Research on Deep Learning Based Object Detection
赵旭 (Zhao Xu)
2019-05
Pages: 136
Degree Type: Doctoral
Chinese Abstract

Object detection is an important branch of computer vision research and a fundamental step for high-level tasks such as object tracking, object recognition, and action recognition. It is widely applied in intelligent surveillance, intelligent transportation, autonomous driving, smart retail, human-computer interaction, robotics, and many other fields. Depending on the category and number of categories of the targets to be detected, research on object detection covers two aspects: generic object detection and specific object detection. Generic object detection detects objects of multiple categories simultaneously, studies the problems common to many categories, and designs general-purpose algorithms. Specific object detection targets a single category of objects, object parts, or texture patterns, and designs algorithms tailored to the category of interest by considering its particular properties. Generic object detection algorithms can generally be applied to detect specific targets, but to pursue better detection of a specific target, the algorithm must be designed around the concrete characteristics of that target.

Object detection has a long research history. Since deep learning came into wide use in computer vision, many deep learning based object detection methods have been proposed and the field has made great progress, but many problems and challenges remain. In generic object detection, small objects are hard to detect, objects of the same category vary greatly, and objects of different categories can be highly similar. In specific object detection, each target type has its own problems; for example, there is a gap between the speed of face detection algorithms and practical requirements, and in scene text detection it is difficult to distinguish scene text from the background. This dissertation studies these problems in depth.

The main research content and contributions of this dissertation are summarized as follows:

1. In generic object detection, single-stage methods suffer from three problems: insufficient use of context information, weak feature representations for small objects, and weak discriminative power in the prediction branch. To address them, this dissertation proposes a single-stage object detection method based on feature enhancement and multi-path fusion. First, a multi-scale context module built from dilated convolutions with different dilation rates explicitly extracts context information. Second, a multi-layer feature fusion module merges feature maps from convolutional layers of different depths, producing high-resolution, semantically strong feature maps that represent small objects better and improve small-object detection accuracy. Third, a prediction branch composed of multiple paths fuses features from different paths to strengthen the model's discriminative ability. All three strategies are designed with efficiency in mind. Experiments show that they significantly improve accuracy while preserving the speed advantage of single-stage detectors, surpassing the best contemporaneous algorithms on several benchmark datasets.

2. In specific object detection, face detection algorithms struggle to run in real time on embedded devices. This dissertation proposes a lightweight face detection algorithm based on a structurally sparse, information-enhanced network. The lightweight network is designed with two kinds of strategies: first, convolution factorization and successive downsampling convolutions build a network structure with low computation cost yet sufficient capacity; then, an efficient head-shoulder context module, a non-negative saturating activation function, and the focal loss improve the accuracy of the lightweight network with only a very small increase in computation. Experimental results show that the resulting lightweight face detector runs in real time at VGA resolution on an embedded device with an ARM Cortex-A53 @ 1.4 GHz, far faster than other face detection algorithms, while also surpassing the accuracy of algorithms with a comparable computation budget.

3. In specific object detection, segmentation based scene text detection methods suffer from the fact that background pixels inside a text bounding box are also labeled as foreground in the segmentation annotation. This dissertation proposes a new loss function, the Elite Loss, to address this. The loss increases the weight of pixels in text-stroke regions inside the box and decreases the weight of background pixels inside the box, reducing the false detections in background regions outside the box that are caused by labeling in-box background pixels as foreground. Two effective weight-generation methods are designed: one transforms the low-level features learned by the network's first convolutional layer, and the other normalizes the foreground confidence scores output by the network. Experimental results show that the Elite Loss effectively reduces the false detections of segmentation based text detectors and improves their precision, bringing detection accuracy to the contemporaneous state of the art.

4. In specific object detection, the backbone networks commonly used by current scene text detection algorithms come from image classification tasks and are not well suited to text objects. This dissertation proposes a backbone network designed for text detection: the cross receptive field network (CrossNet). First, a "cross receptive field module" composed of convolution paths that start with rectangular kernels is designed as the basic building block, so that the receptive field of a backbone stacked from this module better matches the spatial distribution of text regions. Then, guidelines for setting a reasonable width and depth for a text detection backbone are discussed. Based on these two contributions, a cross receptive field network better suited to text detection is built. Experiments show that, used as the backbone for text detection, it considerably outperforms classification-derived networks with a similar computation cost or parameter count.
 

English Abstract


Object detection is one of the most important research fields in computer vision. It is the fundamental step for many high-level tasks, such as object tracking, object recognition, and action recognition. Moreover, it is widely applied in intelligent surveillance, intelligent transportation, autonomous driving, intelligent retail, human-computer interaction, robotics, etc. Depending on the categories and the number of categories of the objects, research on object detection consists of two sub-areas: generic object detection and specific object detection. Generic object detection aims to detect objects of multiple categories at the same time, so universal algorithms are designed for the problems these categories share. Specific object detection aims to detect a single category of objects, object parts, or texture patterns, so algorithms are designed with special consideration for the particular properties of that category. Generally, generic object detection methods can be applied to detect specific objects. However, to achieve better performance in practical applications, tailored algorithms should be designed for specific object detection.

Object detection has been studied for several decades. Since deep learning became widely applied in computer vision, many deep learning based object detection methods have been proposed, and they have greatly improved object detection performance. However, many problems and challenges remain. In generic object detection, representative challenges include small-object detection, large intra-class variation, and high inter-class similarity. In specific object detection, the problems vary across detection scenarios; representative examples are the gap between the runtime efficiency of face detection and practical requirements, and the difficulty of separating scene text from cluttered backgrounds. To address these problems, this dissertation proposes several effective methods.

The main contributions of this dissertation are summarized as follows:

1. To further improve the performance of the single-stage object detector while maintaining its advantage in runtime efficiency, this dissertation proposes three strategies. First, it designs a multi-scale context module with dilated convolutions to introduce context information into the learned features. Second, it designs a multi-path prediction head that increases discrimination ability by merging information from different paths. Third, it adopts a top-down feature map merging module to generate semantically stronger features for small objects. The structures of the proposed modules are designed to be as efficient as possible. Experimental results show that the proposed strategies improve the performance of the single-stage detector by a large margin, and the improved detector achieves a new state of the art in both speed and accuracy.
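As a rough illustration of why dilated convolutions suit a multi-scale context module: the effective kernel size of a k×k convolution with dilation rate d is d(k−1)+1, so parallel branches with different rates see contexts of different sizes at the same computation cost. The dilation rates below are assumptions for illustration; the abstract does not list the actual configuration.

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return d * (k - 1) + 1

def receptive_field(layers):
    """Receptive field of a stack of conv layers, each (kernel, dilation, stride)."""
    rf, jump = 1, 1
    for k, d, s in layers:
        rf += (effective_kernel(k, d) - 1) * jump
        jump *= s
    return rf

# A hypothetical multi-scale context module: parallel 3x3 branches with
# dilation rates 1, 2, and 4 (illustrative, not the dissertation's exact rates).
for d in (1, 2, 4):
    print(f"dilation {d}: effective kernel {effective_kernel(3, d)}")  # 3, 5, 9

# Stacking the same three dilated 3x3 convs (stride 1) in series instead:
print("stacked receptive field:", receptive_field([(3, 1, 1), (3, 2, 1), (3, 4, 1)]))
```

Each branch stays a 3×3 convolution in cost, yet the largest branch covers a 9×9 context, which is how the module gathers multi-scale context cheaply.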

2. To solve the problem that the runtime efficiency of face detection methods cannot satisfy practical requirements, this dissertation proposes an efficient face detector, EagleEye. EagleEye is built using five strategies. First, the convolution factorization strategy is adopted in the design of all network layers, and successive downsampling convolutions are employed at the beginning of the network. Then, the detection network is further improved with a head-shoulder context module, an information-preserving activation function, and the Focal Loss, without adding much computation cost. Experimental results show that the five strategies help EagleEye achieve a good balance between efficiency and accuracy: it outperforms other face detectors with the same order of computation cost in both runtime efficiency and accuracy.
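The savings behind a convolution factorization strategy can be sketched with a simple MAC (multiply-accumulate) count. This assumes the common depthwise-separable factorization (a depthwise k×k followed by a 1×1 pointwise convolution); the channel and resolution numbers are illustrative, not taken from EagleEye.

```python
def conv_macs(h, w, cin, cout, k):
    """MACs of a standard k x k convolution on an h x w output feature map."""
    return h * w * cin * cout * k * k

def depthwise_separable_macs(h, w, cin, cout, k):
    """MACs of a depthwise k x k conv followed by a 1x1 pointwise conv."""
    return h * w * cin * k * k + h * w * cin * cout

# Illustrative numbers (not from the dissertation): a 3x3 layer with
# 64 -> 64 channels on a 160x120 map, roughly VGA after two downsamplings.
std = conv_macs(160, 120, 64, 64, 3)
sep = depthwise_separable_macs(160, 120, 64, 64, 3)
print(f"standard: {std:,} MACs, factorized: {sep:,} MACs, ratio {std / sep:.1f}x")
```

For these numbers the factorized layer needs roughly 8x fewer MACs, which is the kind of reduction that makes real-time VGA inference plausible on a Cortex-A53-class CPU.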

3. To solve the false-detection problem caused by regarding the background pixels inside text bounding boxes as foreground in current segmentation based text detection methods, this dissertation proposes an effective loss function, the Elite Loss. For segmentation based methods with a regression step, each pixel location on the output feature map is an independent predicting unit. Instead of treating all predicting units equally, the Elite Loss re-weights them by their contribution to the detector's performance. It forces the detector to learn better features for the elite predicting units, which usually lie on text strokes and capture the intrinsic characteristics of text regions better. The Elite Loss is flexible and effective, and it can be easily integrated into current popular text detectors. This dissertation gives two forms of the Elite Loss: a heuristic form and an adaptive form. Extensive experiments on various datasets demonstrate the effectiveness of the adaptive Elite Loss.
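A minimal sketch of the re-weighting idea behind the adaptive form, assuming the weights come from normalizing the network's foreground confidence inside annotated text boxes (the abstract does not specify the exact normalization, and `weighted_bce` plus the toy 1-D score map are hypothetical):

```python
import numpy as np

def weighted_bce(scores, labels, weights, eps=1e-7):
    """Per-pixel binary cross-entropy, re-weighted and averaged."""
    p = np.clip(scores, eps, 1 - eps)
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(np.sum(weights * loss) / np.sum(weights))

def adaptive_elite_weights(scores, labels):
    """Up-weight in-box pixels the network already scores as foreground (likely
    strokes) and down-weight low-scoring in-box background; pixels outside the
    annotated boxes keep weight 1."""
    w = np.ones_like(scores)
    box = labels == 1
    if box.any():
        s = scores[box]
        w[box] = s / (s.max() + 1e-7)  # normalized foreground confidence
    return w

# Toy 1-D "map": four pixels inside a text box (two strokes, two background),
# two pixels outside the box.
scores = np.array([0.9, 0.8, 0.2, 0.1, 0.3, 0.05])
labels = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
w = adaptive_elite_weights(scores, labels)
print("weights:", np.round(w, 2))
print("elite loss:", round(weighted_bce(scores, labels, w), 3))
```

The in-box background pixels (scores 0.2 and 0.1) receive small weights, so their conflicting "foreground" labels contribute little to the loss, which is the mechanism the dissertation credits for reducing out-of-box false detections.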


4. To solve the problem that the backbone networks of current deep learning based scene text detection methods, which are usually designed for image classification, are not well suited to text objects, this dissertation proposes a new backbone network specially designed for scene text detection, CrossNet. The key to CrossNet is its three-path block, the CrossRecepBlock. Considering that scene text usually appears in elongated rectangular shapes, the CrossRecepBlock uses rectangular convolution kernels instead of the common square ones, guiding the network to learn effective features with receptive fields better suited to scene text. This dissertation also discusses guidelines for setting a suitable width and depth for the text detection backbone. Based on the CrossRecepBlock and these guidelines, CrossNet is built. Experimental results show that CrossNet largely outperforms counterpart classification networks with similar parameter counts or FLOPs.
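The receptive-field argument for rectangular kernels can be illustrated per axis: stacking 1×5 kernels grows the receptive field only horizontally, matching wide text lines, while square 3×3 kernels grow it equally in both directions. The kernel sizes here are assumptions for illustration, not the actual CrossRecepBlock configuration.

```python
def rf_1d(layers):
    """1-D receptive field of stacked convolutions, each (kernel, stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Hypothetical paths after stacking two blocks (stride 1 throughout):
# a horizontal path with 1x5 kernels vs. a square path with 3x3 kernels.
horiz = (rf_1d([(1, 1), (1, 1)]), rf_1d([(5, 1), (5, 1)]))   # (height, width)
square = (rf_1d([(3, 1), (3, 1)]), rf_1d([(3, 1), (3, 1)]))
print("1x5 path RF (h, w):", horiz)   # elongated, matches wide text lines
print("3x3 path RF (h, w):", square)  # square, as in classification backbones
```

With equal depth, the 1×5 path reaches a 1×9 receptive field while the square path reaches 5×5, so a block mixing such paths can cover wide text regions without the wasted vertical context of a purely square design.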

Keywords: Object Detection; Generic Object Detection; Face Detection; Scene Text Detection; Deep Learning
Language: Chinese
Sub-direction Classification: Object Detection, Tracking and Recognition
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23952
Collection: 紫东太初大模型研究中心_图像与视频分析 (Image and Video Analysis, Zidong Taichu Research Center of Large Models)
Recommended Citation
GB/T 7714
赵旭. 基于深度学习的目标检测技术研究[D]. 北京: 中国科学院大学, 2019.
Files in This Item
File Name/Size | Document Type | Version Type | Access Type | License
博士学位论文-赵旭.pdf (7867 KB) | Dissertation | | Open Access | CC BY-NC-SA