With the support of massive data and high-performance hardware, artificial intelligence technology has made rapid progress. Object detection, one of the most fundamental tasks in computer vision, serves as the basis for downstream tasks such as instance segmentation and object tracking. Thanks to the rapid development of deep convolutional neural networks, deep-learning-based object detectors have become the mainstream approach. Compared with traditional algorithms, they show significant advantages in both efficiency and performance. As one of the most popular research topics in computer vision, object detection has been widely used in civil and military scenarios. However, as the demands of society grow, it faces a major contradiction. On the one hand, because the consequences of missed and false detections are unacceptable, practical applications place higher requirements on detector performance. On the other hand, the model structure must not be too complex if real-time frame rates are to be achieved on low-power devices. In fact, model simplicity and high detection performance constrain each other.
Mainstream object detectors can be roughly divided into two-stage and one-stage frameworks; the latter has attracted more attention owing to its simple structure and high efficiency. The pipeline of a one-stage framework mainly comprises four parts: the backbone network, the neck network, the detection head, and the label assignment strategy. Existing studies focus more on improving detector performance and pay little attention to algorithmic efficiency. Therefore, how to improve the performance of a model while maintaining its inference speed remains a valuable research problem. To address it, this thesis designs efficient, high-performance object detection algorithms, starting from the neck network, the detection head, and the label assignment strategy of the one-stage framework. The contributions are threefold:
(1) First, we present a feature pyramid network based on Deformable Cross-scale Interaction. Exploiting multi-scale features is one of the most effective ways to recognize objects of different scales. Because image pyramids are time-consuming, the Feature Pyramid Network (FPN) has become the most popular component for obtaining pyramidal features. Despite its effectiveness, its structural design still has intrinsic defects. In this thesis, we first analyze the key factors that prevent FPN from building representative multi-scale features. We then design an abstract operator called Deformable Cross-scale Interaction (DCI) to perform information transfer, which introduces a content-aware sampling strategy and dynamic aggregation weights. Building on DCI, we construct the Deformable Cross-scale Interaction Feature Pyramid Network (DCIFPN) to replace the original FPN. The experimental results show that the proposed structure significantly improves detection performance while maintaining high efficiency. Compared with other FPN variants, DCIFPN achieves superior or comparable results.
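To make the two ingredients named in contribution (1) concrete, the following is a minimal sketch of content-aware sampling with dynamic aggregation weights on a single-channel feature map. It is not the DCI operator of the thesis: the function names `bilinear_sample`, `softmax`, and `aggregate` are hypothetical, and the offsets and weight logits, which a real network would presumably predict from the features, are passed in as fixed inputs here.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional location (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp the location so that all four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(feat, ref_y, ref_x, offsets, weight_logits):
    """Combine samples taken at (ref + offset) with softmax-normalised weights.

    Assumption: `offsets` and `weight_logits` stand in for quantities that a
    learned module would predict per location; they are plain inputs here.
    """
    weights = softmax(weight_logits)
    samples = [bilinear_sample(feat, ref_y + oy, ref_x + ox) for oy, ox in offsets]
    return sum(w * s for w, s in zip(weights, samples))

# Toy usage: a 2x2 coarse map, two sampling points, equal weights.
coarse = [[0.0, 1.0], [2.0, 3.0]]
value = aggregate(coarse, 0.5, 0.5,
                  offsets=[(0.0, 0.0), (0.5, 0.5)],
                  weight_logits=[0.0, 0.0])
```

In contrast, the fixed upsampling used in a plain FPN corresponds to zero offsets and a single constant weight, which is the rigidity that learned offsets and weights are meant to relax.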
(2) Second, we introduce an object detection head based on a gating mechanism. In deep-learning-based methods, object detection is commonly formulated as a multi-task optimization problem. Because classification and regression focus on different aspects, modern one-stage detectors typically use two parallel branches as the detection head. However, this design is sub-optimal for multi-task learning. In this thesis, we introduce a new Gating Head (G-Head) to enhance the interaction between tasks and to promote the multi-task learning process. By introducing Multi-Scale Aggregation (MSA), Multi-Aspect Learning (MAL), and a Gating Selector (GS), the proposed method significantly boosts the performance of existing one-stage detectors with fewer parameters and lower computational cost. To validate the efficiency, effectiveness, and generalization ability of G-Head, extensive experiments are conducted on the challenging MS COCO dataset.
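As a generic illustration of what a gating mechanism does, the sketch below blends two feature vectors with element-wise sigmoid gates. It is only an assumption about the general idea and does not reproduce the MSA, MAL, or GS modules of G-Head; `gated_fusion` and its arguments are hypothetical names, and the gate logits would normally be produced by a learned layer.

```python
import math

def sigmoid(v):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def gated_fusion(feat_a, feat_b, gate_logits):
    """Blend two feature vectors element-wise as g * a + (1 - g) * b.

    Assumption: `gate_logits` stands in for the output of a learned layer.
    """
    if not (len(feat_a) == len(feat_b) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    gates = [sigmoid(z) for z in gate_logits]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gates, feat_a, feat_b)]

# Toy usage: a zero logit mixes both inputs equally; large positive or
# negative logits select almost entirely one input or the other.
fused = gated_fusion([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 50.0, -50.0])
```

A gate of this kind lets each task decide, per channel, how much of a shared or task-specific feature to use, rather than hard-wiring two fully separate branches.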
(3) Third, we propose a label assignment strategy based on a dynamic criterion. Label assignment, which separates positive from negative samples during training, is closely related to the detection performance of one-stage detectors. Previous works commonly rely on geometric information, such as Intersection over Union (IoU) or center distance, to determine positive samples. Although they have achieved considerable success, these heuristic strategies are rigid and may limit the upper bound of detection performance. By introducing extra semantic information, a prediction-aware geometric score, and a sample re-weighting mechanism, we propose a novel strategy called Dynamic Label Assignment (DLA). The experimental results demonstrate that DLA greatly boosts performance while leaving the model structure almost unchanged. Moreover, the method adds computational cost only during training and does not slow down inference.
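The difference between a geometric criterion and a prediction-aware one can be sketched as follows. The `iou` function is the standard Intersection over Union for axis-aligned boxes. `static_assign` thresholds the IoU between fixed anchors and the ground truth, while `dynamic_assign` is a hypothetical criterion that ranks samples by a mix of classification score and predicted-box IoU. The scoring formula, `alpha`, and `top_k` are illustrative assumptions and not the DLA formulation of the thesis.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def static_assign(anchors, gt_box, iou_thr=0.5):
    """Geometric rule: an anchor is positive if its IoU with the GT is high."""
    return [iou(a, gt_box) >= iou_thr for a in anchors]

def dynamic_assign(pred_boxes, cls_scores, gt_box, top_k=1, alpha=0.5):
    """Hypothetical prediction-aware rule (an assumption, not DLA itself).

    quality = cls_score**alpha * IoU(pred, gt)**(1 - alpha); the top_k
    samples by quality are marked positive.
    """
    quality = [(s ** alpha) * (iou(p, gt_box) ** (1.0 - alpha))
               for p, s in zip(pred_boxes, cls_scores)]
    ranked = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
    positives = set(ranked[:top_k])
    return [i in positives for i in range(len(quality))]

# Toy usage with one ground-truth box and three samples.
gt = (0, 0, 10, 10)
static_mask = static_assign([(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)], gt)
dynamic_mask = dynamic_assign([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
                              [0.1, 0.9, 0.8], gt)
```

In the toy example the static rule picks the best-overlapping anchor regardless of what the model predicts, whereas the dynamic rule prefers the sample whose classification score and localization are jointly strong. Because either rule only decides which samples are treated as positives during training, it adds no cost at inference time.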
|Keyword||Convolutional neural network; object detection; feature pyramid; object detection head; label assignment strategy|
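|江鹤 (Jiang He). Research on Performance Optimization Techniques for Object Detection Networks (目标检测网络的性能优化技术研究) [D]. Institute of Automation, Chinese Academy of Sciences, 2022.|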