Research on Real-Time Object Tracking Based on Deep Siamese Networks

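	Research on Real-Time Object Tracking Based on Deep Siamese Networks (基于深层孪生网络的实时目标跟踪研究)
	Zhang Zhipeng (张志鹏)
	2022-05-22
Pages	148
Degree type	Doctoral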
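Chinese abstract	Visual object tracking is a major component of video understanding and a cornerstone of research on understanding the relationships between objects and scenes. Motion tracking of generic objects (such as people, vehicles, and animals) is commonly used to establish relationships between instances and to provide input for downstream tasks such as scene understanding and behavior prediction. Visual object tracking techniques are widely used in intelligent security, video editing, AI coaching, autonomous driving, virtual reality, and other fields, and have significant application and research value. The challenges include the difficulty of distinguishing distractors in video, the loss of robustness caused by error accumulation during tracking, and the difficulty of maintaining accuracy at real-time speed. The core scientific problem of object tracking is therefore how to improve the robustness of feature learning and object localization while maintaining real-time speed. To address this problem, this thesis studies real-time object tracking from four aspects: strengthening the model's feature learning ability, improving its tolerance to error accumulation, improving the accuracy of object localization, and improving the stability of relation learning. The proposed theory and models are applied to typical Siamese tracking frameworks. The main contributions are summarized as follows. To strengthen feature learning, a deep Siamese tracking network based on cropping-inside-residual (CIR) units is proposed. Having observed that performance degrades when deep convolutional neural networks are used in Siamese trackers, the thesis concludes from extensive experiments that the root cause is the padding operation in convolutional layers, which makes the two branches of the Siamese network perceive their inputs inconsistently and breaks the network's translation invariance. To alleviate this, the thesis proposes the cropping-inside-residual module, which preserves the translation invariance of deep Siamese networks by cropping out the feature-map pixels affected by padding, and thereby effectively alleviates the problem that the network cannot be made deeper. In addition, by analyzing how the receptive field of neurons, the network stride, and the template feature size affect performance, the thesis proposes design criteria for deep Siamese tracking networks. Using these criteria and the CIR module, Siamese tracking networks of different depths are designed. Experiments show gains of 4 to 6 percentage points on the test benchmarks, significantly strengthening the robustness of representation learning in Siamese tracking networks. To improve tolerance to error accumulation, an anchor-free tracking framework based on asymmetric label assignment is proposed. An in-depth analysis of trackers that estimate target state with anchor boxes shows that one important cause of tracking drift is that the object-size regression network has low tolerance to errors of the foreground-background classification network. To alleviate this, the thesis optimizes how labels are assigned during tracker training and proposes an asymmetric label assignment strategy suited to training tracking models. For the regression network, the spatial range from which positive samples are drawn is enlarged to improve tolerance to classification errors; for the classification network, center sampling is retained to prevent ambiguous samples from driving optimization in the wrong direction. In addition, building on the anchor-free regression network, the thesis proposes an object-aware feature sampling scheme that jointly optimizes the (foreground-background) classification subtask and the (object-size) regression subtask, improving the accuracy of object localization. To improve localization accuracy, a pixel-level object tracking framework based on attention retrieval is proposed. The thesis first analyzes single-stage pixel-level trackers and finds that the main cause of their low accuracy is an excess of false-positive pixel predictions due to the lack of spatial constraints. Combining the strengths of two-stage pixel-level tracking and video object segmentation, the thesis proposes a label propagation mechanism based on attention retrieval. By introducing the annotated segmentation mask of the first frame and the predictions on test frames into network training, the algorithm can effectively distinguish the target from background distractors. Using the attention map generated by the network, it suppresses background interference while enhancing features in the target region, which reduces false-positive pixel predictions. The approach substantially improves the accuracy of single-stage pixel-level tracking while still running in real time. To improve the stability of relation learning, a Siamese tracking framework based on automatic matching-network search is proposed. To address the problem that the matching models in current tracking frameworks cannot adapt to varied scenes, the thesis proposes an algorithm that designs matching networks automatically. The method abandons feature matching by explicitly computing similarity scores with an operator and instead relaxes the matching problem into a feature fusion problem. Feature fusion is an implicit similarity learning process: rather than relying on explicit similarity computation, it fits matching patterns from large amounts of data, which makes better use of deep learning's adaptive fitting capacity in learning feature relations. By automatically combining newly defined operators, different matching networks can be searched for the classification and regression tasks of the tracking framework, improving the relation learning model's fit to each task. Using only half of the baseline's training data, the algorithm achieves gains of 5 to 10 percentage points on different benchmarks. With these designs, the proposed trackers achieved accuracy and speed that were leading at the time of publication on multiple public visual object tracking datasets and in several international object tracking challenges.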
English abstract	Visual object tracking (VOT), one of the essential components of visual understanding, is the cornerstone of higher-level object relation learning. General object tracking provides primary input for scene understanding, behavior prediction, and other tasks that require the model to establish relationships between instances (e.g., pedestrians, vehicles, animals). Visual object tracking technologies are widely used in intelligent security, video editing, AI coaching, autonomous driving, and virtual reality. The main challenges include distinguishing the target from distractors, mitigating error accumulation, and achieving strong accuracy at real-time speed. The key scientific problem is thus how to improve the robustness of representation learning and object localization while ensuring real-time tracking. To this end, this thesis studies real-time visual tracking from four aspects: enhancing representation learning, improving tolerance to error accumulation, improving the accuracy of object localization, and improving the stability of relation learning. The proposed models are applied to typical Siamese tracking frameworks. The main contributions are summarized as follows. The cropping-inside-residual (CIR) module is proposed to build deeper and wider Siamese tracking frameworks. This thesis explores the underlying reason for the performance degradation observed when a Siamese tracking model is equipped with a deep convolutional neural network. Comparative experiments show that the padding operation in convolutional layers leads to perceptual inconsistency between the two branches of the Siamese paradigm, which eventually breaks the translation invariance of the network. The CIR module alleviates this problem: it preserves the translation invariance of the deep Siamese network by cropping out the feature regions affected by padding.
In addition, network design criteria for Siamese tracking are proposed, which show how to choose the most suitable receptive field, network stride, and template feature size. With the proposed criteria and the CIR module, we build networks of different depths and apply them to typical Siamese tracking methods. The proposed method achieves improvements of 4 to 6 percentage points on the evaluation benchmarks, which significantly improves the robustness of the Siamese tracking framework. Asymmetric label assignment is proposed to build object-aware anchor-free Siamese networks. This work demonstrates that one essential reason for tracking drift is that the regression network in anchor-based Siamese methods is trained only on positive anchor boxes, which leaves it with poor tolerance to classification errors. This mechanism makes it difficult to refine anchors whose overlap with the target object is small. To mitigate this problem, this thesis proposes an asymmetric label assignment mechanism for training. In particular, for the regression sub-network, every pixel in the ground-truth box is treated as a positive sample in an anchor-free manner. For the classification sub-network, center sampling is retained to prevent faulty optimization caused by ambiguous samples. Furthermore, we propose an object-aware feature sampling strategy to optimize the classification and regression sub-networks jointly. Thanks to its lightweight network design, our method achieves better performance while running faster than existing state-of-the-art trackers. Attention retrieval is proposed to build an accurate pixelwise object tracking framework. This thesis finds that the lack of spatial constraints is one of the fundamental reasons for false positives in single-stage pixelwise tracking frameworks.
Inspired by two-stage pixelwise tracking methods and video object segmentation (VOS) algorithms, we propose an attention retrieval network based on a label propagation mechanism. By introducing the annotated segmentation mask of the first frame and the prediction results of intermediate frames into network training, the proposed method can effectively distinguish the target object from background interference. With the attention map generated by the network, it suppresses background interference and enhances the representation of the target itself, thereby reducing false positives. Our work improves the accuracy of single-stage pixelwise tracking while preserving its real-time running speed. An automatic matching network design algorithm is proposed for Siamese tracking. To improve the resilience of Siamese tracking across different scenes, this thesis proposes a search algorithm for automatically designing matching networks. For the first time, template matching via explicit similarity learning is abandoned. Instead, we solve the matching problem by training the model on fused visual features. Directly fusing visual features can be regarded as implicit similarity learning, and it can benefit from the diverse fusion operators used in other vision tasks. By automatically searching over and combining the newly defined matching operators, we obtain task-aware networks for the classification and regression sub-tasks, which increases the capacity of the two sub-networks. With only half of the baseline tracker's training data, we achieve gains of 9 percentage points on the evaluation benchmark. With the above designs, the tracking methods proposed in this thesis achieve leading accuracy and running speed on multiple public benchmarks and challenges of visual object tracking.
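The asymmetric label assignment described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the score-map grid, the inclusive box convention, the function names, and the `radius` parameter for center sampling are all assumptions made for illustration. The sketch only shows the asymmetry itself: the regression branch treats every cell inside the ground-truth box as positive, while the classification branch keeps a small central region.

```python
from typing import List, Tuple

# Hypothetical convention: (x1, y1, x2, y2), inclusive, in score-map cells.
Box = Tuple[int, int, int, int]


def regression_labels(size: int, box: Box) -> List[List[int]]:
    """Regression branch: every cell inside the ground-truth box is positive."""
    x1, y1, x2, y2 = box
    return [
        [1 if x1 <= x <= x2 and y1 <= y <= y2 else 0 for x in range(size)]
        for y in range(size)
    ]


def classification_labels(size: int, box: Box, radius: int = 1) -> List[List[int]]:
    """Classification branch: only cells near the box center are positive.

    `radius` is an assumed hyperparameter, not a value taken from the thesis.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    return [
        [
            1
            if abs(x - cx) <= radius and abs(y - cy) <= radius
            and x1 <= x <= x2 and y1 <= y <= y2
            else 0
            for x in range(size)
        ]
        for y in range(size)
    ]


def count_positives(labels: List[List[int]]) -> int:
    """Number of positive cells in a label map."""
    return sum(sum(row) for row in labels)


if __name__ == "__main__":
    box = (1, 2, 7, 6)  # a 7x5 target on a 9x9 score map
    reg = regression_labels(9, box)
    cls = classification_labels(9, box, radius=1)
    print("regression positives:", count_positives(reg))
    print("classification positives:", count_positives(cls))
```

With the example box, the regression map has many more positive cells than the classification map, so the regression branch still receives supervision at locations the classifier might select by mistake.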
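Keywords	visual object tracking, Siamese network, deep network, attention mechanism, neural architecture search
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48559
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Video Content Security
Recommended citation (GB/T 7714)	Zhang Zhipeng. Research on Real-Time Object Tracking Based on Deep Siamese Networks[D]. Institute of Automation, Chinese Academy of Sciences, 2022.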

Files in this item
File name/size	Document type	Version type	Access type	License
张志鹏_基于深层孪生网络的实时目标跟踪研 (12091KB)	Thesis		Open access	CC BY-NC-SA