Research on Visual Object Detection Based on Multi-Scale Feature Extraction and Fusion


	Research on Visual Object Detection Based on Multi-Scale Feature Extraction and Fusion
	李泽坤 (Li Zekun)
	2022-05-22
Pages	140
Degree type	Doctoral
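Chinese abstract	Object detection based on deep learning has long been one of the most fundamental and important research topics in computer vision. Given an input image and a set of categories of interest, the task is to output confidence scores for the categories of the objects of interest in the image, together with bounding boxes that describe each object's position and size. In academic research, object detection can be regarded as a foundation for other high-level vision tasks, such as object tracking, instance segmentation, pedestrian detection, and image captioning, and progress in object detection has to some extent driven progress in those tasks. In industrial applications, object detection provides technical support for a large number of downstream tasks, such as intelligent security, face recognition, autonomous driving, and remote sensing monitoring. Research on deep-learning-based object detection therefore has significant academic and practical value. In both academic research and practical applications, the input data usually come from real scenes, which typically contain objects of different sizes, and scale variation is one of the most challenging problems in object detection. Traditional methods are not robust when recognizing multi-scale objects and struggle to balance detection performance across object scales. To alleviate scale variation, multi-scale feature fusion methods, represented by the feature pyramid, are widely used as a basic structure in detection models. This dissertation therefore studies the scale variation problem in object detection in depth from three aspects: multi-scale feature extraction and utilization, multi-scale feature fusion, and multi-scale feature learning. The main results and contributions are summarized as follows: • A multi-scale object detection framework based on an adaptive coarse-to-fine interactor is proposed. In multi-scale feature fusion methods represented by the feature pyramid, multi-scale feature information is insufficiently extracted and poorly utilized, which weakens the multi-scale representation ability of the resulting pyramid structure. To address this, the dissertation designs a multi-scale interactor based on adaptive coarse-to-fine feature extraction. Compared with previous feature-pyramid solutions, the proposed interactor makes full use of the coarse-grained features corresponding to the multi-scale features to supplement the scale information of features at different resolutions, and extracts fine-grained information from the multi-scale features to obtain precise spatial information about objects. This effective extraction and utilization of multi-scale features gives the constructed pyramid structure a stronger multi-scale representation ability. Experimental validation and analysis show that the interactor clearly improves performance on both detection and instance segmentation, effectively improves the multi-scale feature extraction process, fully explores the intrinsic multi-scale information, and alleviates scale variation in object detection. • An object detection framework based on sample-individualized dynamic multi-scale fusion is proposed. When the input contains objects of different scales, the rigidity of the multi-scale feature fusion process biases the learning of features at different levels. To address this, the dissertation designs a sample-individualized dynamic fusion multi-scale connector. Compared with the fixed feature fusion mechanism of traditional methods, the proposed dynamic connector adjusts the entire multi-scale fusion process according to the current input. First, it dynamically selects suitable multi-scale features for the fusion process according to the current input sample; second, during multi-scale feature interaction, it dynamically selects suitable interaction paths among the multi-scale features. The connector can thus flexibly adjust both the multi-scale features required by the fusion process and the interaction paths for each input sample, realizing a divide-and-conquer feature interaction process for objects of different scales and effectively improving the multi-scale fusion pattern. Experimental validation and analysis show that the connector clearly improves detection performance for objects of different scales. • A dense prediction framework based on a semantic-aware decoupled Transformer pyramid is proposed. To address the inability of multi-scale features to learn global information effectively, the dissertation designs a semantic-aware decoupled Transformer pyramid model. Because a Transformer helps features learn long-range global information, the designed semantic-aware mechanism can fully explore the diversity of global semantic information in the features. In addition, the proposed cross-scale decoupled interaction strategy promotes interactive learning among features at different levels, so that each level learns its own global information while also obtaining the global information of other levels, which strengthens the multi-scale representation ability and global perception ability of the current level. Experimental validation and analysis show that the model effectively alleviates scale variation in dense prediction tasks and achieves better results on several such tasks, including object detection and semantic segmentation.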
English abstract	Object detection based on deep learning has long been one of the most fundamental and important research topics in computer vision. Given input images and the categories of interest, the task aims to output classification and localization results in the form of confidence scores and bounding boxes. In academic research, object detection can be regarded as the basis of many high-level vision tasks, such as object tracking, instance segmentation, pedestrian detection, and image captioning, and its development has to some extent promoted these tasks. In industrial applications, object detection provides technical support for a large number of downstream tasks, such as intelligent security, face recognition, autonomous driving, and remote sensing monitoring. Research on deep-learning-based object detection therefore has significant theoretical and practical value. In both academic research and industrial applications, the input images usually come from real scenes, which contain objects of widely varying scales. Scale variation is one of the most difficult challenges in object detection. Traditional methods are not robust in multi-scale object detection and struggle to balance performance across objects of different scales. To alleviate scale variation, multi-scale feature fusion methods, represented by the feature pyramid, are widely used as a basic structure in object detection models. This dissertation therefore studies the scale variation problem in object detection from three aspects: multi-scale feature extraction and utilization, multi-scale feature fusion, and multi-scale feature learning.
The main contributions of this dissertation are summarized as follows: • We propose a multi-scale object detection framework based on an adaptive coarse-to-fine interactor. Multi-scale feature fusion methods represented by the feature pyramid extract multi-scale features insufficiently and utilize them poorly, which leads to weak multi-scale representation. To alleviate these problems, the adaptive coarse-to-fine interactor is proposed. Compared with previous feature pyramids, the proposed multi-scale interactor makes full use of the coarse-grained features corresponding to the multi-scale features to complement the scale information of features at different resolutions. It also extracts precise fine-grained spatial information from the multi-scale features. This effective extraction and utilization of multi-scale features gives the constructed pyramid structure a stronger multi-scale representation. Experimental validation and analysis show that the proposed interactor improves performance on both object detection and instance segmentation. It effectively improves multi-scale feature extraction and explores the intrinsic multi-scale information, which alleviates scale variation to some extent. • We propose a multi-scale object detection framework based on a dynamic sample-individualized connector. Hand-designed architectures and a fixed multi-scale interaction process are inflexible for feature fusion and may bias the learning of different layers, especially when the input samples vary widely. To address these problems, the dynamic sample-individualized connector is proposed.
Compared with the rigid feature fusion mechanism of traditional methods, the proposed connector dynamically adjusts the multi-scale fusion process according to the input sample. First, it dynamically selects suitable multi-scale features for the fusion process according to the input sample. Then, it automatically activates informative data flow paths based on the extracted multi-level features, yielding a flexible multi-scale fusion. By activating different data flow paths for the extraction and interaction of multi-scale features, the connector handles objects of different scales in a divide-and-conquer manner. Experimental validation and analysis show that the connector significantly improves detection performance for objects of various scales. • We propose a multi-scale dense image prediction framework based on a semantic-aware decoupled Transformer pyramid. The semantic-aware decoupled Transformer pyramid is proposed so that multi-scale features can learn global information effectively. A Transformer helps features learn long-range global information, so, with the designed semantic perception mechanism, the high-level features can explore the diversity of global semantic information. In addition, the proposed cross-level decoupled interaction strategy effectively and efficiently promotes interactive learning among features at different levels in the decoupled space, so that each level learns global information both from itself and from the other levels. This enhances both multi-scale representation and global perception. Experimental validation and analysis show that the proposed semantic-aware decoupled Transformer pyramid effectively and efficiently alleviates scale variation in dense image prediction, and achieves better performance on several dense prediction tasks, including object detection and semantic segmentation.
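The abstract treats feature-pyramid-style fusion as the baseline structure that the three contributions build on. As a rough illustration of that baseline only, the sketch below shows a generic top-down fusion pass over three feature levels. It is not the dissertation's interactor, connector, or Transformer pyramid; it assumes single-channel feature maps stored as nested lists, nearest-neighbour upsampling, and plain addition in place of the learned lateral and smoothing convolutions a real detector would use. All names (`upsample2x`, `top_down_fuse`, `c3`/`c4`/`c5`) are hypothetical.

```python
from typing import List

Grid = List[List[float]]  # one single-channel feature map (illustrative assumption)


def upsample2x(fm: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: repeat every value twice along each axis."""
    out: Grid = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two feature maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(levels: List[Grid]) -> List[Grid]:
    """Fuse levels ordered fine -> coarse, each half the resolution of the previous.

    The coarsest level is kept as is; every finer level receives the
    upsampled, already-fused level above it.
    """
    fused = [levels[-1]]
    for fm in reversed(levels[:-1]):
        fused.append(add(fm, upsample2x(fused[-1])))
    fused.reverse()
    return fused


# Toy inputs: three levels at 4x4, 2x2 and 1x1 resolution.
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[2.0] * 2 for _ in range(2)]
c5 = [[3.0]]
p3, p4, p5 = top_down_fuse([c3, c4, c5])
```

In this toy run, coarse information propagates downward: `p5` is unchanged, `p4` adds the upsampled `p5`, and `p3` adds the upsampled `p4`. The fusion order and paths here are fixed for every input, which is the kind of rigidity the abstract's dynamic connector is described as relaxing.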
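Keywords	object detection, scale variation, multi-scale feature extraction, multi-scale fusion, multi-scale global information fusion
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48828
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	李泽坤 (Li Zekun). Research on Visual Object Detection Based on Multi-Scale Feature Extraction and Fusion[D]. Institute of Automation, Chinese Academy of Sciences, 2022.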

Files in this item
File name/size	Document type	Version type	Access type	License
基于多尺度特征提取与融合的视觉目标检测研 (14911 KB)	Thesis		Restricted access	CC BY-NC-SA