面向不同数据标注场景的图像目标检测研究 (Research on Image Object Detection for Different Data Annotation Scenarios)
王绍儒
2023-08-22
Pages: 144
Subtype: Doctoral
Abstract

Object detection is a central topic in computer vision and digital image processing. It aims to locate and identify target objects in an image and is widely applied in intelligent surveillance, autonomous driving, industrial inspection, and many other fields. In recent years, deploying object detection methods across these scenarios has posed new challenges for the design and development of the underlying algorithms. On the one hand, some application scenarios have grown more complex and their requirements more fine-grained: in autonomous driving, intelligent surveillance, and image editing, bounding-box localization may not satisfy the application's needs, so pixel-level instance segmentation annotations are additionally provided for the corresponding tasks. On the other hand, constrained by the high labor and material cost of data annotation, the data in some scenarios is incomplete and the labels are of low quality: in some cases only inaccurate bounding-box annotations, or no manual annotations at all, are available. Against this background, it is necessary to study object detection methods that can adapt to different annotation scenarios so as to meet the needs of practical applications. Accordingly, based on the dependence between detection methods and data, this thesis proposes object detection methods tailored to different data annotation scenarios. The main contributions are summarized as follows:

1. A reciprocal joint framework for object detection and instance segmentation, targeting scenarios with fine-grained pixel-level mask annotations. Object detection normally represents object locations with bounding boxes, but applications such as autonomous driving and image editing additionally require locations in the form of finer pixel-level segmentation masks, i.e., the instance segmentation task, for which pixel-level mask annotations are additionally provided. For this annotation scenario, this thesis proposes a reciprocal joint framework for object detection and instance segmentation. Building on a thorough analysis of how the two tasks relate, a two-stream network is designed, together with a novel correlation-filtering-based mask prediction method and a mask-based bounding-box refinement method. The framework effectively unifies the "bottom-up" and "top-down" paradigms of prior work and completes both tasks simultaneously, alleviating problems common in earlier methods such as inaccurate box localization, inconsistency between detection boxes and segmentation masks, and incomplete masks. On the detection task, it achieves better box localization accuracy at very low extra computational cost; on the instance segmentation task, it produces finer instance masks and a better speed-accuracy trade-off.
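The mask-based box refinement in contribution 1 can be illustrated with a minimal sketch. This is a toy simplification, not the thesis code: the hypothetical helper `mask_to_box` simply tightens a detection box to the extent of a predicted binary instance mask.

```python
def mask_to_box(mask):
    """Derive a tight bounding box (x0, y0, x1, y1) from a binary mask.

    Simplified stand-in for mask-based box refinement: the detector's
    coarse box is replaced by the tight box enclosing the instance mask.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None  # empty mask: nothing to refine
    return (min(xs), min(ys), max(xs), max(ys))

# A coarse detector box may over- or under-shoot; the mask gives a tighter fit.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_box(mask))  # (1, 1, 3, 2)
```

In the actual framework the mask itself comes from a learned correlation-filtering branch; this sketch only shows why a good mask directly yields a better-localized box.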

2. A detector training method robust to noisy bounding-box annotations, targeting scenarios with inaccurate annotations. Annotating object detection data is labor-intensive, and fine adjustment of bounding boxes consumes a large share of the labeling time. In many real-world scenarios, annotation errors are common due to limited manpower, tight schedules, or imperfect labeling processes. For this annotation scenario, this thesis is the first to study and verify, on real manually annotated data, that annotation noise severely disrupts the training of existing detectors and causes a marked drop in detection performance, and it analyzes the distribution of human annotation noise in detail. Furthermore, a noise-robust detector training method is proposed. It introduces a teacher-student learning mechanism: the teacher network first mines the valid information in inaccurate annotations via prediction ensembling and corrects the noisy labels, and the corrected supervision then guides the student network's learning. This method effectively suppresses the adverse effect of training-data noise, so that detector accuracy improves significantly under inaccurate annotations.
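The label-correction step in contribution 2 can be sketched in a few lines. This is a hypothetical simplification (helper `correct_box` and the linear `alpha` blend are illustrative assumptions, not the thesis algorithm): teacher predictions for one object are ensembled by averaging, then blended with the noisy annotation to form the corrected supervision for the student.

```python
def correct_box(noisy_box, teacher_preds, alpha=0.5):
    """Correct a noisy box annotation with an ensemble of teacher predictions.

    Toy version of the teacher-student scheme: average the teacher's
    predictions (e.g. from several augmented views), then blend with the
    noisy annotation; the result supervises the student network.
    """
    n = len(teacher_preds)
    ensemble = [sum(b[i] for b in teacher_preds) / n for i in range(4)]
    return [alpha * a + (1 - alpha) * e
            for a, e in zip(noisy_box, ensemble)]

noisy = [8, 12, 52, 48]                        # annotated box with localization noise
preds = [[10, 10, 50, 50], [12, 10, 48, 50]]   # teacher prediction ensemble
print(correct_box(noisy, preds))  # [9.5, 11.0, 50.5, 49.0]
```

Setting `alpha=1.0` falls back to trusting the raw annotation, while smaller values lean on the teacher; the real method adapts this trade-off from the ensemble rather than fixing it.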

3. A slimmable self-supervised pre-training method for detectors, targeting scenarios with unlabeled pre-training data. Existing detectors are usually initialized with weights pre-trained on upstream tasks; among the options, self-supervised pre-training on unlabeled data has lower data acquisition cost and carries less human bias, showing great potential for detection. For this annotation scenario, this thesis first proposes a slimmable self-supervised pre-training method. It integrates contrastive self-supervised pre-training with knowledge distillation, so that a single pre-training run yields multiple pre-trained models of different sizes. The method suits applications that must deploy models at multiple sizes: for each deployment platform, a model that fits the computational budget can be selected from the once-pretrained family and transferred to the desired downstream task, with no need to pre-train each model separately. Experiments show that the resulting models achieve transfer performance comparable to, or better than, individually pre-trained models on various downstream tasks including detection, while saving substantial pre-training cost.
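The "one run, many sizes" idea in contribution 3 can be shown with a toy weight-slicing sketch. This is an illustrative assumption about how slimmable networks share parameters, not the thesis implementation: smaller sub-networks reuse the leading channels of the once-pretrained full model, so deployment just slices out the width it can afford.

```python
def slim_weights(full_weights, width_ratio):
    """Slice a sub-network's weights out of the once-pretrained full model.

    Toy sketch: the largest network is pre-trained once, and smaller
    ones share its leading output channels, so a deployment target
    selects the rows that fit its compute budget.
    """
    k = max(1, int(len(full_weights) * width_ratio))
    return [row[:] for row in full_weights[:k]]

full = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # 4 output channels
half = slim_weights(full, 0.5)   # 2-channel sub-model for a small device
print(len(half))  # 2
```

The distillation component of the actual method is what keeps these sliced sub-models competitive with individually pre-trained ones; plain slicing alone would not.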

4. A pre-training method for detectors with lightweight vision transformer backbones, targeting scenarios with unlabeled pre-training data. Existing self-supervised pre-training research mostly focuses on large models; in particular, large vision transformers used as detector backbones have shown great performance potential. However, their high computational cost conflicts with the lightweight requirements of many practical applications. This thesis therefore focuses on self-supervised pre-training of lightweight vision transformers. It first thoroughly surveys and evaluates existing schemes on detection tasks, establishing the first experimental benchmark for this topic, and finds that their performance even falls below fully supervised pre-training, a behavior distinct from large models. To address this, the thesis further proposes, through in-depth analysis, a knowledge distillation method for generative self-supervised pre-training that adopts attention knowledge transfer and pre-training-target decoupling, effectively improving pre-trained model quality and achieving significant gains on downstream detection. The method also applies to vision tasks beyond detection and has broad academic and practical value.
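The attention knowledge transfer in contribution 4 can be reduced to a single-query, single-head sketch. This is a hypothetical simplification (the function names and the MSE choice are illustrative, not the thesis's exact loss): the lightweight student is trained to match the teacher's attention distribution, as a distillation term decoupled from the generative pre-training objective.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of attention logits."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attn_transfer_loss(student_logits, teacher_logits):
    """Mean squared error between student and teacher attention maps.

    Toy single-query, single-head version of attention knowledge
    transfer: both logit vectors are normalized with softmax, then the
    student is penalized for deviating from the teacher's distribution.
    """
    s, t = softmax(student_logits), softmax(teacher_logits)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

loss = attn_transfer_loss([0.2, 1.0, -0.5], [0.1, 1.2, -0.4])
print(loss)
```

In practice this term would be summed over heads, queries, and layers of the transformer; the sketch only conveys the matching target.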


Keywords: object detection; instance segmentation; noisy-label learning; self-supervised learning; lightweight networks; knowledge distillation
Indexed By: Other
Language: Chinese
Sub-direction classification: Object detection, tracking and recognition
Planning direction of the national key laboratory: Visual information processing
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/52413
Collection: 毕业生_博士学位论文 (Graduates / Doctoral Dissertations)
Recommended Citation (GB/T 7714):
王绍儒. 面向不同数据标注场景的图像目标检测研究[D], 2023.
Files in This Item:
博士毕业论文-最终版.pdf (17545 KB): Dissertation, restricted access, license CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.