Research on Object Detection Based on Feature Learning

	Research on Object Detection Based on Feature Learning (基于特征学习的目标检测技术研究)
	Zhu Yousong (朱优松)
	2019-05-28
Pages	140
Degree type	Doctoral
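Chinese abstract (translated)	The rapid development of multimedia and the Internet, together with the widespread adoption of mobile devices, has led to explosive growth in digital images and video. Processing redundant image data efficiently and discovering useful information in it is becoming increasingly important, and image- and video-based object detection provides technical support for managing such large-scale data effectively. Designing simple, efficient, and robust object detection algorithms that help users extract useful information from massive image collections therefore has real practical value. Object detection can essentially be viewed as a classification task over candidate object regions, so an efficient and accurate feature representation is a key factor in performance. Traditional methods usually rely on handcrafted features such as Haar-like features, Local Binary Patterns (LBP), and the Histogram of Oriented Gradients (HOG). To strengthen the representation further, several handcrafted features are often combined, as in Integral Channel Features (ICF) and Aggregated Channel Features (ACF). However, handcrafted features tend to generalize poorly and are usually designed for specific targets such as faces or pedestrians. Their representational power is also comparatively weak: they lack a connection to high-level semantics and struggle to bridge the "semantic gap". In recent years, the success of Convolutional Neural Networks (CNNs) in image classification has demonstrated their strong discriminative and representational power. Although CNN-based object detection has been highly successful, it still faces a series of problems, such as variation in object scale, occlusion, deformation, diversity of pose, and interference from complex backgrounds. In addition, deep networks are generally computationally expensive and complex, which makes them hard to apply in practice, while small shallow networks have low capacity and run fast but tend to give lower accuracy. Building on deep learning, this dissertation therefore addresses specific problems by designing suitable network structures and strategies that explicitly guide the network's feature learning, so that the network learns a more accurate and robust semantic feature representation and detection performance improves greatly. The main results and contributions are summarized as follows:
1. To address the diversity of pedestrian scales in surveillance scenes and the limited ability of CNNs to describe objects at different scales, a scale-adaptive deconvolutional regression network is proposed. The network introduces deconvolution layers to upsample the deep convolutional feature maps, restoring their spatial resolution while retaining the high-level semantic information of the features, and forms a feature pyramid inside the network. Pedestrian features are then extracted adaptively from feature maps of different resolutions according to the scale of each candidate pedestrian, which effectively improves detection across pedestrian scales. Because a CNN abstracts features at different semantic levels in different layers, the network also fuses convolutional features from multiple depths to supply both global semantics and local detail, improving the discriminative power of the pedestrian features and the final classification performance. Extensive experiments validate the method, which achieved the best detection results of its time on several pedestrian datasets.
2. To address the diversity of object states in general scenes, such as intra-class appearance variation, inter-class interference, and occlusion, truncation, and deformation, a deep coupling network is proposed. Through a simple and efficient coupling structure, the network simultaneously learns information about an object at different levels (global structure, local parts, and context) and couples these three kinds of information into a comprehensive feature representation, greatly improving generic object detection. By designing different coupling strategies and normalization methods, the network can be trained and run end to end and fully exploits the complementary strengths of the different features, which improves the detector's ability to model objects in complex scenes. Experiments show that the method far outperforms classical deep learning detectors on several generic object detection datasets and achieved the best single-model performance of its time.
3. To address complex backgrounds and crowded scenes in natural images, a fully convolutional attention coupling network is proposed. The network integrates an attention mechanism with the structural information of the object itself, so that it automatically attends to foreground regions and strengthens the feature representation of foreground objects while suppressing interference from the background or from similar regions. Specifically, a cascaded attention module produces a series of class-agnostic attention maps that progressively guide the network toward the regions to be detected. The attention maps are applied to the base convolutional features to explicitly enhance the features of those regions and to guide feature learning in the base network, improving its perception of objects in complex scenes. The cascaded attention module can be embedded easily into existing convolutional neural networks and is optimized jointly through pixel-wise supervision and the detection loss. Experiments show that the method effectively improves the performance of existing object detectors.
4. Detectors built on large networks perform well but have many parameters and high model complexity, whereas small networks give low accuracy. To address this, an object detection framework based on knowledge distillation is proposed. The framework uses the strongly discriminative features learned by a large network to guide the learning of a small network, improving the small network's detection performance and making it well suited to resource-constrained scenarios. Specifically, a weighted-mask distillation scheme integrates global feature-map distillation with local region feature distillation, helping the model focus more on local regions that contain objects during training while still retaining whole-image information. This makes the knowledge transfer more targeted and speeds up convergence of the small network. Extensive experiments show that the method maintains good detection accuracy while achieving a significant model compression rate.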
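Contribution 4 describes a weighted-mask distillation that combines distillation over the whole feature map with extra emphasis on regions containing objects. The following is a minimal sketch of one way such a loss could be written, not the dissertation's own formulation. The function names `box_mask` and `masked_distill_loss`, the parameters `global_weight` and `local_weight`, the squared-error form, and the single-channel H x W feature maps are all assumptions made for illustration.

```python
def box_mask(height, width, boxes):
    """Build a binary H x W mask that is 1.0 inside any box and 0.0 elsewhere.

    Boxes are hypothetical (x1, y1, x2, y2) tuples in feature-map
    coordinates, with x2 and y2 exclusive.
    """
    mask = [[0.0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = 1.0
    return mask


def masked_distill_loss(student, teacher, mask,
                        global_weight=1.0, local_weight=4.0):
    """Weighted mean squared error between two H x W feature maps.

    Every location contributes with `global_weight` (whole-image term);
    locations inside the mask receive an additional `local_weight`
    (object-region term). Both weights are illustrative assumptions.
    """
    total = 0.0
    norm = 0.0
    for s_row, t_row, m_row in zip(student, teacher, mask):
        for s, t, m in zip(s_row, t_row, m_row):
            w = global_weight + local_weight * m
            total += w * (s - t) ** 2
            norm += w
    return total / norm


# Usage: a 2 x 2 feature map with one object covering the top-left cell.
mask = box_mask(2, 2, [(0, 0, 1, 1)])
student = [[0.0, 0.0], [0.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]
loss = masked_distill_loss(student, teacher, mask)
```

Setting `local_weight=0.0` reduces the sketch to plain whole-map distillation, which makes it easy to compare against the masked variant.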
English abstract	The rapid development of multimedia and the Internet and the widespread adoption of mobile devices have led to explosive growth in digital images and videos. How to process redundant images efficiently and find useful information in the data has become increasingly important. Object detection based on images and videos provides technical support for the effective management of large-scale data. Therefore, it is of great practical value to design simple, efficient, and robust object detection algorithms that help users extract useful information from massive image data. Object detection can essentially be seen as a classification task over candidate object regions, so an efficient and accurate feature representation is a key factor in the final performance. Traditional methods usually require features to be designed manually, such as Haar-like features, Local Binary Patterns (LBP), and the Histogram of Oriented Gradients (HOG). To further enhance the representation ability of features, several handcrafted features are often combined to make the final prediction, as in Integral Channel Features (ICF) and Aggregated Channel Features (ACF). However, the generalization ability of handcrafted features is often insufficient, and these features are generally designed only for specific targets (such as faces and pedestrians). The representation ability of handcrafted features is also relatively weak: they lack a connection with high-level semantics, which makes it difficult to bridge the "semantic gap". In recent years, the success of Convolutional Neural Networks (CNNs) in image classification has fully demonstrated their powerful discriminative and feature representation abilities. Despite the great success of CNN-based object detection, it still faces a series of problems, such as scale variation, occlusion, deformation, pose diversity, and complex background interference. At the same time, deep networks often suffer from a large number of parameters and a high count of floating-point operations (FLOPs), which makes them difficult to deploy in practice. Small shallow networks have low capacity and high speed, but they usually lead to lower accuracy. Therefore, based on deep learning, this dissertation designs suitable network structures and strategies for specific problems to explicitly guide the feature learning of the network, so that the network learns a more accurate and robust semantic feature representation, thus greatly improving the performance of object detection. The main contributions of this dissertation are summarized as follows:
1. This dissertation proposes a scale-adaptive deconvolutional regression network to address the diversity of pedestrian scales in surveillance scenarios and the inadequacy of CNNs in describing objects of different scales. By introducing deconvolutional layers to upsample the deep feature maps, the network recovers their spatial resolution while retaining the high-level semantic information of the features, thus forming a feature pyramid structure within the network. According to the scale of each candidate pedestrian, the pedestrian features are adaptively extracted from the feature maps of the corresponding resolution, which effectively improves the detection performance for pedestrians of different scales. Furthermore, considering that a CNN abstracts features of different semantic levels from different layers, the network integrates multi-depth convolutional features to provide global semantics and local details of the pedestrian, thus further improving the discriminative power of pedestrian features and enhancing the final classification performance. Extensive experiments demonstrate the effectiveness of the proposed approach, which achieved the best detection results of its time on several public pedestrian datasets.
2. This dissertation proposes a deep coupling network, named CoupleNet, to handle the variety of object configurations in general scenarios, such as intra-class differences, inter-class interference, and the occlusion, truncation, and deformation of objects. By designing a simple and efficient coupling structure, the network can simultaneously learn information at different levels of the object, such as global structure, local parts, and context information, and effectively couple these three kinds of information to form a comprehensive feature representation, which greatly improves the performance of generic object detection. At the same time, by designing different coupling strategies and normalization methods, the network supports end-to-end training and testing. Moreover, the complementary advantages of different features are fully exploited, which improves the ability of the detector to model objects in various complex scenarios. The experimental results show that the proposed method is far superior to the classical CNN-based detection methods on several popular datasets and achieved the best single-model performance of its time.
3. This dissertation proposes a fully convolutional attention coupling network to handle complex backgrounds and crowded scenes in natural images. The network effectively integrates the attention mechanism with the structural information of the object, so it can automatically focus on the foreground and enhance the feature representation of foreground objects while suppressing the interference of the background or similar regions. Specifically, by designing a cascaded attention module, a series of category-independent attention maps are generated to gradually guide the network to focus on the target regions to be detected. The attention maps are then applied to the convolutional feature maps to enhance the features of those regions and guide the learning of the base network features, thus improving the network's perception of objects in complex scenes. Moreover, the cascaded attention module can be easily embedded into existing convolutional neural networks and optimized jointly with the detection loss. The experimental results show that the proposed method can effectively improve the performance of existing object detectors.
4. Currently, detectors with large backbones perform well, but they suffer from large parameter counts and high complexity, whereas using a small network leads to poor performance. In view of this problem, this dissertation proposes an object detection framework based on knowledge distillation. The framework makes full use of the strongly discriminative features learned by the large network to guide the learning of the small network, so it can greatly improve the detection performance of the small network and facilitate deployment in resource-constrained scenarios. Specifically, a mask-guided knowledge distillation is designed to integrate global feature distillation and local feature distillation, which helps the model pay more attention to the local regions containing objects in the training stage while also preserving the whole-image information. This makes the transfer of knowledge more targeted, thus speeding up the convergence of the small network. Extensive experiments demonstrate that the proposed approach maintains good detection accuracy while achieving a significant model compression rate.
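The step in contribution 3 where attention maps are applied to the convolutional feature maps can be illustrated with a small sketch. The abstract does not specify the operation, so this example assumes a sigmoid attention map multiplied element-wise into every channel; how the attention logits are produced, and how the stages are cascaded, is left out. The names `sigmoid` and `apply_attention` and the nested-list C x H x W layout are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def apply_attention(features, attention_logits):
    """Scale a C x H x W feature map by a class-agnostic H x W attention map.

    The same attention value is applied to every channel at a given
    location, so high-attention (foreground) locations keep their
    features and low-attention (background) locations are suppressed.
    Element-wise multiplication is an assumption for this sketch.
    """
    attention = [[sigmoid(v) for v in row] for row in attention_logits]
    return [
        [
            [f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(channel, attention)
        ]
        for channel in features
    ]


# Usage: one channel, one row, two locations; the second is suppressed.
enhanced = apply_attention([[[2.0, 2.0]]], [[0.0, -1000.0]])
```

Because the map is shared across channels, a module like this would be independent of the number of object categories, which matches the category-independent attention described in the abstract.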
Keywords	object detection; feature learning; convolutional neural network; deep learning
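Subject area	Computer science and technology; artificial intelligence; pattern recognition
Language	Chinese
Seven major directions: sub-direction classification	Object detection, tracking and recognition
Document type	Dissertation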
Item identifier	http://ir.ia.ac.cn/handle/173211/23783
Collection	Zidong Taichu Large Model Research Center (紫东太初大模型研究中心)_Image and Video Analysis
Recommended citation (GB/T 7714)	朱优松. 基于特征学习的目标检测技术研究[D]. 中国科学院自动化研究所. 中国科学院大学,2019.

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis-朱优松.pdf (8332 KB)	Dissertation		Open access	CC BY-NC-SA