Research on Object Detection Techniques for Large-Scale Imbalanced Data
王童
2022-05-18
Pages: 140
Degree type: Doctoral
Chinese abstract
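      With the rapid development of Internet technology and the widespread adoption of mobile devices, the volume of digital image and video data is growing exponentially, and humanity has entered the era of big data. How to process massive data intelligently, and thereby improve people's quality of life, has become an increasingly important topic. Object detection uses a computer to process an image and output the bounding boxes and category scores of objects belonging to categories of interest. It is an important and fundamental part of computer vision. In practice, object detection is also widely applied and plays a core role in many fields such as intelligent retail, autonomous driving, and intelligent security. Research on object detection therefore has significant practical importance and value.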
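      The rapid development of deep learning has greatly advanced object detection. Detectors based on deep neural networks automatically learn features suited to the detection task, removing the need for hand-designed features while achieving higher accuracy. Larger datasets can improve the generalization of deep neural networks, but they also bring new challenges. First, when facing a large-scale dataset, it matters how to make full use of computing resources to train quickly and effectively and shorten the model iteration cycle, which is important for rapid deployment. Second, the number of objects of each category in nature usually follows a long-tailed distribution, so the large-scale datasets collected from it are long-tailed as well. The severe imbalance in sample counts across categories brings new challenges to model training. In addition, real-world scenarios place strict requirements on model speed and size, so model compression techniques (such as model pruning, model quantization, and knowledge distillation) are needed to reduce computation and increase running speed. This dissertation proposes a systematic solution to these three problems in large-scale imbalanced object detection and greatly improves detector performance.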
      The main results and contributions of this dissertation are summarized as follows:
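      1. To address slow training on large-scale datasets, this dissertation proposes LargeDet, a large-batch training framework for object detection. Current detectors are usually trained with a relatively small batch size. On large-scale datasets, training typically takes weeks or even more than a month, yet directly enlarging the batch size causes the network to diverge. This dissertation therefore proposes a new optimization algorithm that resolves divergence under large-batch training and improves convergence speed. With this optimization algorithm at its core, combined with the linear scaling rule for the learning rate and synchronized batch normalization, the dissertation proposes the LargeDet framework. Extensive experiments show that the framework greatly shortens training time while substantially improving accuracy on large-scale datasets.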
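      2. To address the severe class imbalance in large-scale datasets, this dissertation proposes a dynamic class suppression loss. The loss uses the class scores output by the network to judge the degree of confusion between categories; it then retains a sample's suppression gradients for easily confused categories and ignores its suppression gradients for the other categories. By retaining the suppression gradients for confused categories, the network can learn more discriminative features and keep easily confused categories separable. Meanwhile, because the suppression gradients for the majority of categories, which are not easily confused, are ignored, over-suppression of tail categories is alleviated and their training is protected. Experimental results show that the method brings significant performance gains on several long-tailed object detection datasets, in particular greatly improving tail-category performance, and achieves the best single-model performance among contemporaneous methods.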
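      3. Existing long-tailed object detection methods implicitly reshape the decision boundaries between categories through sample re-sampling or loss re-weighting, and this indirect effect on the boundaries may weaken their effectiveness. This dissertation proposes a class-aware angular margin loss that explicitly adjusts the decision boundaries between categories by introducing an adaptive angular margin. The dissertation first analyzes and shows that, in the long-tailed setting, a linear classifier produces ill-conditioned decision boundaries for tail categories, which severely harms their accuracy. Built on a cosine classifier, the loss introduces a class-aware adaptive angular margin that dynamically adjusts the decision boundaries between categories and thereby yields better boundaries. Experimental results show that the method effectively improves detector performance on long-tailed datasets and improves tail-category performance while preserving head-category performance, achieving the best performance among contemporaneous methods.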
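      4. Detectors with large backbone networks are difficult to deploy in practical scenarios with strict latency and power constraints. To address this, this dissertation proposes a knowledge distillation algorithm for single-stage detectors based on a spatial attention mechanism. The method introduces spatial attention into the distillation process to help the student network focus more on learning hard samples. The classification loss value reflects how difficult a sample is for the network to learn. The dissertation therefore designs a function that maps each sample's classification loss to a weight, with larger losses receiving larger weights. Assigning each sample's weight to its corresponding spatial position yields a spatial attention map related to sample difficulty, which the method then uses to weight the per-pixel loss during distillation. Experiments on multiple datasets show that the method achieves significant model compression while maintaining good detection accuracy.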
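The class suppression loss in contribution 2 above can be illustrated with a minimal sketch. It assumes a per-class sigmoid classifier and a fixed score threshold above which a non-target class counts as "easily confused". The abstract does not give these details, so the function name, the `threshold` value, and the binary cross-entropy form are all assumptions made for illustration and not the dissertation's implementation.

```python
import math

def class_suppression_loss(logits, target, threshold=0.7):
    """Sigmoid cross-entropy that keeps negative terms only for confused classes.

    logits: list of raw class scores for one sample.
    target: index of the ground-truth class.
    threshold: assumed score above which a non-target class counts as confused.
    """
    loss = 0.0
    for i, z in enumerate(logits):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid score for class i
        if i == target:
            loss += -math.log(p)        # positive term is always kept
        elif p >= threshold:
            loss += -math.log(1.0 - p)  # suppress only confused classes
        # all other classes add nothing, so they receive no suppression gradient
    return loss
```

In this sketch a tail class that the network scores low for a given sample contributes no term to the loss, which is one simple way to realize the idea of sparing tail categories from suppression by the many head-class samples.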
English abstract
      With the rapid development of Internet technology and the widespread adoption of mobile devices, digital image and video data are growing exponentially, and mankind has entered the era of big data. How to process massive data intelligently to improve people's quality of life has become an increasingly important topic. Object detection uses a computer to process an image and output the bounding boxes and category scores of objects belonging to the categories of interest. It is an important and fundamental part of computer vision, and it has been widely applied in real-life scenarios, playing a core role in fields such as intelligent retail, autonomous driving, and intelligent security. Research on object detection therefore has important practical significance and value.
      The rapid development of deep learning has greatly advanced object detection. Deep learning based detectors automatically learn features without manual design and achieve higher accuracy. Increasing dataset size can improve the generalization ability of deep neural networks, but it also brings new challenges. First, on large-scale datasets, making full use of computing resources to train the model rapidly and effectively is of great importance for rapid deployment. Second, the quantity of objects in nature usually follows a long-tail distribution, so the collected datasets are long-tailed as well, and the extreme class imbalance brings new challenges to model training. In addition, real-world scenarios place strict requirements on running speed and model size, which calls for model compression techniques (such as network pruning, network quantization, and knowledge distillation) to reduce computation cost and improve running speed. This dissertation proposes a systematic solution to these problems in large-scale imbalanced object detection, which greatly improves the performance of object detectors.
      The main contributions of this dissertation are summarized as follows:
      1. This dissertation proposes a large-batch optimization framework (LargeDet) for object detection to address slow training on large-scale datasets. Current object detectors are usually trained with a relatively small batch size, so training on a large-scale dataset takes weeks or even months. However, directly enlarging the batch size leads to network divergence. This dissertation therefore proposes a novel optimization algorithm that handles divergence under large batch sizes and improves convergence speed. With this optimization algorithm at its core, LargeDet combines it with the learning rate Linear Scaling Rule (LSR) and Synchronized Batch Normalization (SyncBN). Extensive experiments demonstrate that the proposed framework greatly reduces training time and significantly improves accuracy on large-scale datasets.
      2. This dissertation proposes an Adaptive Class Suppression Loss (ACSL) to handle the extreme class imbalance in large-scale datasets. ACSL judges the degree of confusion among categories from the category scores output by the network, retaining the negative gradients for easily confused categories and ignoring the negative gradients for other categories. Because the negative gradients for easily confused categories are maintained, the network can learn a more discriminative feature representation. Because the negative gradients for categories that are not easily confused are ignored, tail categories are protected from over-suppression. Extensive experiments on several long-tail object detection datasets demonstrate that ACSL brings significant performance improvements, especially on tail categories, and achieves the best single-model performance among approaches of the same period.
      3. Existing long-tail object detection algorithms implicitly reshape the decision boundaries between categories through sample re-sampling or loss re-weighting. Such an indirect effect on the decision boundaries may weaken their effectiveness. This dissertation proposes a Class-Aware Angular Margin loss (CA$^2$M Loss) that explicitly adjusts the decision boundaries between categories by introducing an adaptive angular margin. It first shows through analysis that a linear classifier generates ill-conditioned decision boundaries for tail categories in the long-tail setting, which seriously damages their accuracy. Building on the cosine classifier, the proposed loss introduces a class-aware angular margin to adaptively adjust the decision boundaries between categories, leading to better decision boundaries. Experimental results demonstrate that CA$^2$M Loss effectively improves the performance of detectors trained on long-tail datasets, improving tail categories while maintaining head-category performance, and obtains the best accuracy among approaches of the same period.
      4. Detectors with large backbone networks are difficult to deploy in practical scenarios with strict latency and power consumption constraints. To address this problem, this dissertation proposes a spatial attention-guided knowledge distillation algorithm for single-stage detectors. The method introduces a spatial attention mechanism into the distillation process to help the student network pay more attention to learning difficult samples. The classification loss value represents how difficult a sample is for the network to learn, so the method designs a function that maps the classification loss of each sample to a weight, where a sample with a larger loss receives a larger weight. A spatial attention map related to sample difficulty is constructed by assigning the weight of each sample to its corresponding spatial position. The method then re-weights the distillation loss of each pixel with this attention map. Experiments on multiple datasets show that the proposed method achieves significant model compression while maintaining good detection accuracy.
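The weighting step in contribution 4 can be sketched as follows. The abstract only states that a larger classification loss yields a larger weight, so the mapping `1 - exp(-loss)` and the squared-error distance between student and teacher maps are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math

def attention_map(cls_loss):
    """Map per-location classification losses to weights in [0, 1).

    cls_loss: 2D list (H x W) of non-negative loss values.
    The mapping 1 - exp(-loss) is an assumed monotone function.
    """
    return [[1.0 - math.exp(-v) for v in row] for row in cls_loss]

def weighted_distill_loss(student, teacher, cls_loss):
    """Attention-weighted mean squared error between student and teacher maps."""
    weights = attention_map(cls_loss)
    total, count = 0.0, 0
    for s_row, t_row, w_row in zip(student, teacher, weights):
        for s, t, w in zip(s_row, t_row, w_row):
            total += w * (s - t) ** 2  # hard locations weigh more
            count += 1
    return total / count
```

With this mapping, a location whose classification loss is zero contributes nothing to the distillation loss, while locations with large losses approach full weight.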
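Keywords: deep learning, object detection, class imbalance, large-scale training, knowledge distillation
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/48628
Collections: Graduates_Doctoral Dissertations
Zidong Taichu Large Model Research Center_Image and Video Analysis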
Recommended citation
GB/T 7714
王童. 面向大规模不均衡数据的目标检测技术研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2022.
Files in this item
File name/size | Document type | Version type | Access type | License
王童-博士论文-修改-签名.pdf (8229KB) | Thesis | | Restricted access | CC BY-NC-SA