Research on Knowledge Distillation Methods for Object Detection Neural Networks (目标检测神经网络模型知识蒸馏方法研究)
曹巍瀚 (Cao Weihan)
2023-05-19
Pages: 76
Degree type: Master's
Chinese Abstract

Object detection is an important branch of computer vision research; its task is to automatically identify the multiple objects in an image and determine the location of each. Object detection has broad application prospects in areas such as face recognition, pedestrian detection, autonomous driving, and human-computer interaction.

With the development of deep learning, detectors based on convolutional neural networks have gradually replaced traditional hand-crafted detectors and come to dominate the field. However, as detector performance has improved, model structures have grown increasingly complex and their computational cost has grown accordingly. Practical applications such as intelligent surveillance and face recognition place high demands on the computational efficiency of object detection models. In particular, in scenarios with limited computing resources or strict real-time requirements, the high computational complexity and large memory footprint of object detection models have become the main obstacles to deployment. Research on lightweight object detection models therefore has significant practical value for advancing the application of object detection technology across domains.

In recent years, model compression techniques such as pruning, quantization, neural architecture search, and knowledge distillation have been widely adopted. This work focuses on applying knowledge distillation to object detection, because knowledge distillation can improve the performance of a target network without increasing inference time or modifying the model structure, which makes deployment convenient. The contributions and novelties of this thesis are summarized as follows:

(1) A feature-map knowledge distillation algorithm based on the Pearson correlation coefficient. The mean squared error (MSE) loss is widely used in feature-map knowledge distillation. However, experiments show that large differences in feature magnitude can exist between the output features of the student and teacher networks, between the output features at different distillation points within the student (or teacher) network, and between different channels of the output features at a single distillation point; the differences are especially pronounced when the two networks are heterogeneous. Distilling directly with MSE therefore leads to suboptimal solutions. To alleviate these three problems, this work proposes a new distillation loss, the PKD (Knowledge Distillation via Pearson Correlation Coefficient) loss, which measures the relative error between the teacher's and student's output features. Concretely, the Pearson correlation coefficient r between the student's and teacher's output features is computed first, and 1 - r is then used as the distillation loss, as formalized below. The PKD loss ensures that the feature-map regions the teacher network attends to most are also the regions the student attends to most, while relaxing the constraint on the absolute magnitude of the student's feature values; this eases the student's optimization and substantially improves both its convergence speed and its final performance. Extensive experiments on the MS COCO dataset demonstrate the effectiveness of the method.
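To make the loss concrete, a minimal formalization follows; the notation (flattened teacher feature u, student feature v) is ours rather than the thesis's:

    r(u, v) = \frac{\sum_i (u_i - \bar{u})(v_i - \bar{v})}
                   {\sqrt{\sum_i (u_i - \bar{u})^2}\,\sqrt{\sum_i (v_i - \bar{v})^2}},
    \qquad
    \mathcal{L}_{\mathrm{PKD}} = 1 - r(u, v)

Since r is invariant to positive affine rescaling of the student feature (v \to a\,v + b with a > 0), the loss constrains only the relative pattern of the student's feature map, not its absolute magnitude, which is exactly the relaxation described above.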

(2) A response-based knowledge distillation algorithm based on the Pearson correlation coefficient. KL divergence is widely used in response-based knowledge distillation. However, KL divergence imposes an exact-matching requirement during training: the loss reaches its minimum only when the student's output is identical to the teacher's. When the model capacities of the student and teacher differ substantially, this requirement is too strict. Moreover, although the uncertainty in the teacher's output can provide additional information for training the student, KL divergence does not sufficiently penalize the student when it overestimates that uncertainty or even produces wrong results. To alleviate these problems, this work proposes replacing the conventional KL divergence with the PKD loss as the measure of discrepancy between the student's and teacher's predictions. Because of its limited capacity, the student network can hardly reproduce the absolute values of the teacher's predictions; the PKD loss relaxes this constraint on absolute values and thereby lowers the optimization difficulty. At the same time, the PKD loss focuses on the correlation between the student's and teacher's predictions, guaranteeing that the student assigns higher confidence to the categories on which the teacher is more confident. Extensive experiments on MS COCO and ImageNet demonstrate the effectiveness of the method.
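To spell out the relaxation relative to KL divergence, compare the minimizers of the two losses; p^T and p^S denote the teacher's and student's predicted distributions and z^T, z^S their prediction vectors (our shorthand):

    \mathrm{KL}\left(p^{T} \,\middle\|\, p^{S}\right)
      = \sum_{c} p^{T}_{c} \log \frac{p^{T}_{c}}{p^{S}_{c}} = 0
      \iff p^{S} = p^{T},
    \qquad
    1 - r\left(z^{S}, z^{T}\right) = 0
      \iff z^{S} = a\, z^{T} + b,\ a > 0

KL divergence thus demands an exact copy of the teacher's output, while the PKD loss is already minimal once the student reproduces the teacher's relative pattern of confidences.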

English Abstract

Object detection is a significant research area in computer vision that enables computers to automatically identify and locate multiple objects in an image. It is of great importance in numerous application domains, such as face recognition, intelligent transportation, monitoring systems, and industrial inspection.

With the advancement of deep learning, convolutional neural network-based object detectors have gradually supplanted traditional hand-crafted detectors and now dominate the field. However, as detectors have achieved better performance, their network structures have become increasingly complex, resulting in higher computational and memory demands. Many practical applications, such as intelligent monitoring and face recognition, require computationally efficient detectors. High computational complexity and memory usage present major challenges in deploying detection models, especially on resource-constrained devices and in real-time applications. Research on lightweight detection models therefore holds great potential to advance the application of object detection in various domains and has significant theoretical and practical value.

In recent years, various approaches such as pruning, quantization, neural architecture search, and knowledge distillation have been proposed to address these issues. Among them, our study focuses on knowledge distillation, since it can improve the performance of a target network without any additional inference-time overhead or structural modification. The primary contributions of our study are outlined as follows:

(1) Feature imitation with the Pearson Correlation Coefficient. The mean square error (MSE) is commonly utilized in feature-based knowledge distillation. Nevertheless, experiments show that magnitude discrepancies, both between the teacher and student detectors and within a single detector (across distinct feature pyramid network (FPN) stages and channels), can adversely impact student training, particularly for heterogeneous detector pairs. To mitigate these issues, we introduce a new knowledge distillation approach that leverages the Pearson Correlation Coefficient to capture the linear correlation between student and teacher features. Specifically, we first normalize the feature maps to have zero mean and unit variance and subsequently minimize the MSE between the normalized features. By prioritizing the relational information between the teacher and student features, the Pearson Correlation Coefficient relaxes the constraints on the magnitude of the student's feature values, resulting in faster convergence and better final performance of the student network (a minimal implementation sketch follows).
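The normalize-then-MSE procedure described above fits in a few lines of PyTorch. The sketch below is ours; it assumes per-channel statistics computed over the batch and spatial dimensions and shape-matched features, neither of which is specified in the abstract:

    import torch
    import torch.nn.functional as F

    def pkd_feature_loss(feat_s: torch.Tensor, feat_t: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
        """MSE between standardized feature maps, which equals (up to a
        constant factor) 1 minus the Pearson correlation coefficient.
        Assumes feat_s has already been projected to feat_t's shape."""
        def standardize(feat: torch.Tensor) -> torch.Tensor:
            n, c, h, w = feat.shape
            # Gather all values of one channel (over batch and space) per row.
            x = feat.permute(1, 0, 2, 3).reshape(c, -1)
            # Zero mean, unit variance per channel.
            x = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + eps)
            return x.reshape(c, n, h, w).permute(1, 0, 2, 3)

        return F.mse_loss(standardize(feat_s), standardize(feat_t))

When the student's and teacher's FPN channels differ (the heterogeneous case), a 1x1 convolution would typically project the student's features to the teacher's shape before applying the loss.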

(2) Response mimicking with the Pearson Correlation Coefficient. KL divergence is a prevalent tool in response-based knowledge distillation algorithms. However, during training, KL divergence imposes an exact-matching requirement: the loss attains its minimum only when the outputs of the student and teacher are identical. This requirement is overly stringent when there is a substantial difference in model capacity between the student and teacher networks. Additionally, even though the uncertainty in the teacher's output may furnish supplementary information for training the student, KL divergence fails to fully penalize the student when it overestimates that uncertainty and produces erroneous results. To remedy these issues, we advocate utilizing the PKD loss instead of the conventional KL divergence to assess the disparity between the predictions of the student and teacher networks. Due to its limited model capacity, it is arduous for the student to learn the absolute values of the teacher's predictions; our PKD loss relaxes the constraints on the absolute values, thus mitigating the student's optimization challenge. Moreover, the PKD loss concentrates on the correlation between the predictions of the student and teacher networks, which guarantees that the student is more confident in the categories in which the teacher has higher confidence. The efficacy of this technique has been demonstrated through numerous experiments conducted on the MS COCO and ImageNet datasets.
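For the response-based variant, a per-sample Pearson loss over prediction vectors might look as follows; this is our sketch of the idea, and the function name and the choice to apply it to raw logits are our assumptions:

    import torch
    import torch.nn.functional as F

    def pkd_response_loss(logits_s: torch.Tensor, logits_t: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
        """1 - Pearson correlation between each (student, teacher) pair of
        prediction vectors, averaged over the batch; a drop-in replacement
        for the usual KL-divergence distillation term."""
        s = logits_s - logits_s.mean(dim=1, keepdim=True)  # center per sample
        t = logits_t - logits_t.mean(dim=1, keepdim=True)
        # Pearson r of two vectors equals the cosine similarity of the
        # centered vectors.
        r = F.cosine_similarity(s, t, dim=1, eps=eps)
        return (1.0 - r).mean()

Because the loss is invariant to shifting and positively scaling the student's logits, the student is rewarded for matching the teacher's ranking of class confidences rather than their exact values.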

Keywords: knowledge distillation; model compression; object detection; image classification; generality
Language: Chinese
Seven major directions, sub-direction classification: object detection, tracking, and recognition
State Key Laboratory planning direction classification: visual information processing
Associated dataset requiring deposit:
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/51989
Collection: Graduates_Master's Theses
Recommended citation:
GB/T 7714
曹巍瀚. 目标检测神经网络模型知识蒸馏方法研究[D], 2023.
Files in this item:
File name / Size | Document type | Version | Access | License
曹巍瀚-目标检测神经网络模型知识蒸馏方法 (9798 KB) | Thesis | | Restricted | CC BY-NC-SA