Research on Deep Feature Learning Methods in Object Detection
Guo Chaoxu (郭超旭)
2020-05-29
Pages: 60
Degree type: Master's
Chinese Abstract (translated)

Visual object detection is a key research direction in computer vision and the cornerstone of object recognition, object tracking, and motion trajectory analysis. It plays a crucial role in many practical applications and is widely used in scenarios such as autonomous driving, biometric recognition, and service robots. In recent years, deep networks have substantially improved the accuracy and generalization ability of object detectors by extracting deep semantic features. Despite this remarkable progress, problems such as multi-scale variation, occlusion, and motion blur still weaken the semantic representation power of deep features, reducing detection accuracy and robustness. This thesis therefore studies feature learning for visual object detection: by designing new deep network models, it fully exploits the multi-level semantic information of image feature pyramids and the temporal information in videos to improve detection performance. The main research content and contributions are as follows:

      For image object detection, an augmented feature pyramid architecture is proposed. Specifically, this thesis first analyzes the design defects of the feature pyramid in three stages (before, during, and after feature fusion) and then makes three corresponding improvements. First, before features from different levels are fused, a consistent supervision module imposes the same supervision signal on the semantic features of each level, narrowing the semantic gap between them. Second, during feature fusion, a residual feature augmentation module reduces the information loss of the highest-level semantic features, retaining more of the original semantics. Third, for region feature pooling after fusion, an adaptive region feature fusion module fuses region features from different pyramid levels without relying on hand-crafted rules, producing more robust region features. The proposed augmented feature pyramid generalizes well: combined with multiple backbone networks and detection frameworks, it improves average precision by nearly 2% on the large-scale MS COCO object detection dataset.
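To make the third improvement concrete, the following is a minimal PyTorch sketch of adaptive region feature fusion under stated assumptions: the class name `SoftRoIFusion`, the two-layer weight head, and the 7x7 RoI size are illustrative choices, not the thesis implementation. The mechanism shown is the one described above: pool each RoI from every pyramid level and learn per-level fusion weights instead of a hand-crafted level-assignment rule.

```python
import torch
import torch.nn as nn

class SoftRoIFusion(nn.Module):
    """Adaptively fuse RoI features pooled from all pyramid levels.

    A minimal sketch: instead of assigning each RoI to a single pyramid
    level by a hand-crafted rule, pool it from every level and learn
    per-level, per-position fusion weights.
    """

    def __init__(self, num_levels: int = 4, channels: int = 256):
        super().__init__()
        # Small head predicting one spatial weight map per pyramid level.
        self.weight_head = nn.Sequential(
            nn.Conv2d(num_levels * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_levels, kernel_size=3, padding=1),
        )

    def forward(self, roi_feats: list) -> torch.Tensor:
        # roi_feats: one (N, C, 7, 7) tensor per pyramid level.
        stacked = torch.stack(roi_feats, dim=1)            # (N, L, C, 7, 7)
        n, l, c, h, w = stacked.shape
        weights = self.weight_head(stacked.flatten(1, 2))  # (N, L, 7, 7)
        weights = weights.softmax(dim=1).unsqueeze(2)      # (N, L, 1, 7, 7)
        return (stacked * weights).sum(dim=1)              # (N, C, 7, 7)

# Hypothetical usage with 4 pyramid levels and 8 RoIs:
fusion = SoftRoIFusion(num_levels=4, channels=256)
rois = [torch.randn(8, 256, 7, 7) for _ in range(4)]
fused = fusion(rois)  # (8, 256, 7, 7)
```

The softmax over the level dimension keeps the fused feature on the same scale as any single-level RoI feature.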

      For video object detection, a progressive sparse local attention module is designed; it extracts and propagates temporal information in videos to help the detector learn more robust deep features. Specifically, the module establishes spatial correspondences between the features of different frames within a local region, uses these correspondences to align and propagate features, and extracts temporal context from the video. On this basis, the video object detector built in this thesis needs no extra optical-flow model for feature alignment, so it avoids the accuracy loss caused by inaccurate flow estimation under illumination changes and large object motion. Moreover, without the flow model, the detector has only about 37 M parameters, which eases practical deployment. Experiments on the large-scale ImageNet VID video object detection dataset show that the proposed detector achieves leading accuracy (80% mAP) while maintaining a high speed (26 FPS).
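The attention mechanism described above can be illustrated with a simplified dense local-window variant; this is a hedged PyTorch sketch, assuming the function name `local_attention_align` and a dense 3x3 window, whereas the thesis module samples the neighborhood progressively and sparsely to cut computation.

```python
import torch
import torch.nn.functional as F

def local_attention_align(ref: torch.Tensor, sup: torch.Tensor,
                          window: int = 3) -> torch.Tensor:
    """Align support-frame features to the reference frame via local attention.

    Simplified sketch: for every reference position, compute similarity
    against all support positions inside a (window x window) neighborhood,
    then aggregate support features with softmax-normalized weights.
    (The thesis module samples this neighborhood progressively sparsely
    instead of densely, which lowers computation.)
    """
    n, c, h, w = ref.shape
    pad = window // 2
    k = window * window
    # Unfold support features into local neighborhoods: (N, C*K, H*W).
    sup_local = F.unfold(sup, kernel_size=window, padding=pad)
    sup_local = sup_local.view(n, c, k, h * w)              # (N, C, K, H*W)

    # Scaled dot-product similarity to the K support neighbors.
    ref_flat = ref.view(n, c, 1, h * w)
    sim = (ref_flat * sup_local).sum(dim=1) / c ** 0.5      # (N, K, H*W)
    attn = sim.softmax(dim=1).unsqueeze(1)                  # (N, 1, K, H*W)

    # Weighted aggregation of support features, back to (N, C, H, W).
    return (sup_local * attn).sum(dim=2).view(n, c, h, w)

# Hypothetical usage on two consecutive frame feature maps:
ref = torch.randn(1, 256, 38, 50)
sup = torch.randn(1, 256, 38, 50)
aligned = local_attention_align(ref, sup, window=3)  # (1, 256, 38, 50)
```

The scaled dot-product and softmax follow standard attention practice; restricting them to a local window is what keeps the cost far below global attention.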

English Abstract

Object detection is a popular topic in computer vision and the cornerstone of object tracking, human behavior analysis, person re-identification, and motion trajectory analysis. It plays an important role in many applications such as autonomous driving, biometrics, and service robots. In recent years, deep features extracted by deep neural networks have significantly improved the accuracy and robustness of object detection. Nevertheless, the representation ability of deep features can still be degraded by objects at different scales, occlusion, camera defocus, and motion blur. This thesis therefore focuses on deep feature learning: on the one hand, it designs a better feature pyramid architecture to improve the representation of multi-scale features; on the other hand, it models temporal information to improve feature robustness. These two lines of work improve image and video object detection respectively. Our main contributions are as follows:

    For the task of image object detection, a new feature pyramid architecture named AugFPN is proposed. Specifically, this thesis analyzes the design defects of the feature pyramid before, during, and after feature fusion, and designs three modules to alleviate them. First, before multi-scale feature fusion, a consistent supervision module applies the same supervision signal to features at different levels to narrow the semantic gap between them. Second, during feature fusion, a residual feature augmentation module reduces the information loss of the highest-level semantic features so that more of the original semantics are retained. Third, for RoI feature pooling after fusion, an adaptive RoI feature pooling module fuses RoI features from different levels without relying on hand-crafted strategies, producing better RoI features. Incorporating these modules into the original feature pyramid yields AugFPN, which further improves the multi-level feature representation. Combined with multiple backbone networks and detection frameworks, AugFPN consistently improves multiple detectors by nearly 2% AP on the large-scale MS COCO object detection dataset.
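As a rough companion illustration of the residual feature augmentation module mentioned above, the sketch below pools the backbone's top feature to several ratios, projects and upsamples each, and adds the averaged result to the top pyramid feature. The class name, the pooling ratios, and the plain averaging are assumptions made for brevity; a learned spatial fusion of the pooled contexts would be a natural refinement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFeatureAugmentation(nn.Module):
    """Inject residual context into the top pyramid level (rough sketch).

    Pools the highest-level backbone feature C5 to several smaller scales,
    projects each to the pyramid width, upsamples back, and averages them
    into a residual added to the top feature M5, keeping context that the
    plain top-down pathway would otherwise lose.
    """

    def __init__(self, in_channels: int = 2048, out_channels: int = 256,
                 ratios=(0.1, 0.2, 0.3)):
        super().__init__()
        self.ratios = ratios
        self.projs = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
            for _ in ratios
        )

    def forward(self, c5: torch.Tensor, m5: torch.Tensor) -> torch.Tensor:
        h, w = m5.shape[-2:]
        contexts = []
        for ratio, proj in zip(self.ratios, self.projs):
            size = (max(1, int(h * ratio)), max(1, int(w * ratio)))
            ctx = F.adaptive_avg_pool2d(c5, size)   # ratio-pooled context
            ctx = F.interpolate(proj(ctx), size=(h, w), mode='bilinear',
                                align_corners=False)
            contexts.append(ctx)
        residual = torch.stack(contexts, dim=0).mean(dim=0)
        return m5 + residual

# Hypothetical usage: C5 from the backbone, M5 after the lateral 1x1 conv.
rfa = ResidualFeatureAugmentation(in_channels=2048, out_channels=256)
c5 = torch.randn(1, 2048, 25, 25)
m5 = torch.randn(1, 256, 25, 25)
m5_aug = rfa(c5, m5)  # (1, 256, 25, 25)
```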

    For the task of video object detection, this thesis proposes a progressive sparse local attention module, which propagates temporal context between video frames at low computational cost to improve the representation of deep features. Specifically, the module establishes spatial correspondences between the features of different frames, aligns features based on these correspondences, and extracts temporal context from the video. A video object detector is then built on this module. Because it does not rely on an additional optical-flow model, it avoids inaccurate flow predictions under large object motion, greatly reduces the model size to about 37 M parameters, and eases practical deployment. Experiments on the large-scale ImageNet VID video object detection dataset show that the detector achieves leading accuracy (80% mAP) while maintaining a reasonable speed (26 FPS).
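One step the abstract leaves implicit is how an aligned support feature is merged into the current frame's feature. The sketch below uses position-wise cosine-similarity weighting, a common choice in flow-free video detectors; it is an assumption for illustration, not necessarily the thesis's exact aggregation.

```python
import torch
import torch.nn.functional as F

def aggregate(cur: torch.Tensor, aligned: torch.Tensor) -> torch.Tensor:
    """Merge an aligned support feature into the current frame feature.

    Position-wise weights come from cosine similarity between the two
    feature maps, so well-aligned positions contribute more. This weighting
    is a common aggregation choice, shown here only as an assumption.
    """
    sim = F.cosine_similarity(cur, aligned, dim=1, eps=1e-6)  # (N, H, W)
    w = torch.sigmoid(sim).unsqueeze(1)                       # (N, 1, H, W)
    return w * aligned + (1 - w) * cur

# Hypothetical usage on the current and aligned feature maps:
cur = torch.randn(1, 256, 38, 50)
aligned = torch.randn(1, 256, 38, 50)
fused = aggregate(cur, aligned)  # (1, 256, 38, 50)
```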

Keywords: image, video, object detection, feature pyramid, progressive sparse local attention
Subject area: Computer Science and Technology
Discipline category: Engineering
Language: Chinese
Sub-direction classification: Object Detection, Tracking and Recognition
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39102
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, Advanced Spatio-temporal Data Analysis and Learning
Recommended citation (GB/T 7714):
Guo Chaoxu. Research on Deep Feature Learning Methods in Object Detection [D]. Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences, 2020.
Files in this item:
Filename/Size: 毕业论文.pdf (5640 KB)
Document type: Thesis
Access: Open Access
License: CC BY-NC-SA