Research on Multi-Target Tracking Methods Based on Deep Convolutional Networks
周宗伟
2021-03-05
Pages: 145
Degree type: Doctoral
Chinese Abstract

Multi-target tracking refers to obtaining the motion trajectories of multiple targets by locating and associating observations with the same identity across consecutive frames. It has broad application prospects in video surveillance, video content analysis, autonomous driving, and other fields, and has long been a popular research direction in pattern recognition and computer vision. Detection-based tracking is currently the mainstream framework for the multi-target tracking task; it comprises three basic stages: frame-by-frame object detection, target feature extraction, and data association. In recent years, as deep learning has achieved breakthroughs in many computer vision tasks such as object detection, segmentation, and recognition, it has also been widely introduced into every stage of the detection-based tracking framework. Although current deep-learning-based multi-target tracking methods have achieved considerable performance gains, many problems remain, such as unsatisfactory detection accuracy in dense scenes, poor discriminability of the appearance features of occluded targets, and tracking speeds that struggle to meet real-time requirements. This thesis thoroughly analyzes existing deep-network-based multi-target tracking methods, explores each stage of the detection-based tracking framework, and finally proposes an efficient and robust online multi-target tracking system.

The main work and contributions of this thesis are summarized as follows:

1. To address the locality problem of data association, an online multi-target tracking method based on a high-order graph matching model is proposed. In the detection-based tracking framework, traditional data association methods generally consider only associations between targets within local spatial neighborhoods across frames, and match poorly in dense regions. To better exploit the global information among the associations formed by trajectories and detections, this thesis proposes an online multi-target tracking method based on a High-Order Graph Matching (HOGM) model. In this model, the tracking task is formulated as a matching problem between two hypergraphs formed by trajectories and detections, respectively. We propose a tensor power iteration algorithm with bidirectional unit L1-norm constraints to solve this matching problem. During matching, the association energy simultaneously fuses the appearance similarity between trajectories and targets, motion consistency, and the spatial relationships among associations, yielding more robust association results. As the most important feature in the association, the deep appearance features of detections are extracted with a Siamese network trained by metric learning. To counter the loss of feature discriminability caused by overlapping bounding boxes in dense scenes, we propose a mask pooling operator that makes the extracted features focus on the visible area of each target, enhancing the distinguishability of overlapping targets. Motion consistency suppresses incorrect matches with high appearance similarity but large spatial distance by limiting the spatial distance of candidate associations, while the spatial relationships among associations exploit the stability of the spatial topology to ensure correct association at small spatial distances. Experiments on multi-target tracking benchmarks show that, compared with state-of-the-art trackers at the time, the proposed method obtains better tracking performance, and especially better trajectory stability, by exploiting more global spatio-temporal information.
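The power iteration with bidirectional unit L1-norm constraints can be illustrated with a toy sketch. A pairwise (matrix) affinity stands in for the thesis's higher-order tensor, and all names and values here are illustrative assumptions, not the actual HOGM implementation:

```python
import numpy as np

def match_by_power_iteration(A, m, n, iters=50):
    """Toy solver in the spirit of tensor power iteration with
    bidirectional unit L1-norm constraints. A is an (m*n, m*n)
    affinity between candidate trajectory-detection associations
    (the thesis uses a higher-order tensor; a pairwise matrix
    keeps this sketch short). Returns a soft (m, n) assignment."""
    x = np.full(m * n, 1.0 / (m * n))              # uniform start
    for _ in range(iters):
        X = (A @ x).reshape(m, n)                  # power-iteration step
        # bidirectional unit L1 normalization: rows, then columns
        X /= np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
        X /= np.maximum(X.sum(axis=0, keepdims=True), 1e-12)
        x = X.ravel()
    return x.reshape(m, n)

# 2 trajectories x 2 detections; associations are flattened as
# k = i*n + j, and the affinity favours the consistent pairing
# {0->0, 1->1} (associations k=0 and k=3 support each other)
A = np.full((4, 4), 0.1)
A[0, 3] = A[3, 0] = 1.0
X = match_by_power_iteration(A, 2, 2)
```

The iteration amplifies mutually supporting associations, so the soft assignment concentrates on the consistent pairing even though every pair starts with equal weight.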

2. To address detection failures in complex scenes, a distractor-aware discrimination learning network for online multi-target tracking is proposed. In the detection-based tracking framework, detection accuracy strongly affects tracker performance. To suppress detection failures such as missed and inaccurate detections and thereby improve tracking performance, this thesis proposes an online multi-target tracking method based on a Distractor-aware Discrimination Learning Network (DDLN). First, an independent DDLN is created for each target, containing a template matching module and a candidate classification module. The template matching module compensates for missed detections of the global detector through local detection, and the candidate classification module suppresses tracking drift by classifying candidate regions into foreground and background. Second, to obtain a more robust template, a relational attention mechanism based on the trajectory's historical appearance is embedded in the DDLN. Third, for end-to-end training, a distractor-sensitive loss function is proposed that improves the discriminability of target features in dense regions by fully mining hard samples. Finally, a multi-stage online multi-target tracking method is designed around the DDLN; it achieved the best tracking performance at the time on the two challenging benchmarks MOT16 and MOT17.
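The hard-sample mining idea behind the distractor-sensitive loss can be sketched with a focal-style weighting. The abstract does not specify the exact loss, so this is only an illustration of emphasizing hard (distractor-like) samples, not the thesis's formulation:

```python
import numpy as np

def distractor_sensitive_loss(scores, labels, gamma=2.0):
    """Hard-sample-weighted binary cross-entropy (illustrative only).
    Samples the classifier gets confidently wrong receive much larger
    weights, so dense-scene distractors dominate the gradient."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    p = 1.0 / (1.0 + np.exp(-scores))              # sigmoid score
    p_t = np.where(labels == 1, p, 1.0 - p)        # prob. of true class
    weight = (1.0 - p_t) ** gamma                  # hard sample -> big weight
    return float(np.mean(-weight * np.log(np.maximum(p_t, 1e-12))))

# an easy positive (score 5) contributes far less than a hard one
# (a distractor-like sample scored -5 despite label 1)
easy = distractor_sensitive_loss([5.0], [1])
hard = distractor_sensitive_loss([-5.0], [1])
```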

3. To address the high complexity of extracting target features independently, a long-short clue joint extraction network for online multi-target tracking is proposed. In the DDLN-based tracking method, processing each target independently leads to slow tracking and large storage costs, so this thesis proposes a Long-Short Clue Extraction (LSCE) network for online multi-target tracking. The network takes adjacent frames as input, obtains short-term clues such as position and size for local detection from inter-frame feature correlations, and obtains long-term deep appearance features for handling lost trajectories through a feature pyramid structure. When computing the appearance similarity between a lost trajectory and current detections, a robust feature representation of the trajectory is required in addition to the detections' deep appearance features. This thesis proposes to extract a low-rank representation from the trajectory's historical deep appearance features via principal component pursuit and use it as the trajectory feature. Both the feature pyramid design and the low-rank trajectory representation reduce the damage occlusion does to appearance-feature discriminability. Because the LSCE-based tracker takes adjacent frames rather than cropped target regions as input, its tracking speed is almost independent of the number of tracked targets, improving on the DDLN-based method in both performance and speed. The LSCE-based online tracker achieved the best tracking performance at the time on two widely used multi-target tracking benchmarks.
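Principal component pursuit decomposes a matrix of stacked historical appearance features into a low-rank part (the stable trajectory appearance) plus a sparse part (occlusion-like corruption). Below is a minimal inexact-ALM sketch with common default parameters, assumed for illustration rather than taken from the thesis:

```python
import numpy as np

def principal_component_pursuit(M, iters=100):
    """Minimal robust-PCA sketch (inexact augmented Lagrangian for
    min ||L||_* + lam*||S||_1  s.t.  M = L + S). The low-rank L can
    serve as a robust trajectory representation."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = m * n / (4.0 * np.abs(M).sum() + 1e-12)
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(iters):
        # singular-value thresholding gives the low-rank update
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt
        # soft thresholding gives the sparse update
        S = shrink(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - S)                      # dual ascent step
        mu *= 1.2                                  # gradually tighten
    return L, S

# rank-1 appearance history corrupted by one occlusion-like spike
rng = np.random.default_rng(0)
u, v = rng.standard_normal((8, 1)), rng.standard_normal((1, 6))
M = u @ v
M[3, 2] += 5.0
L, S = principal_component_pursuit(M)
```

The sparse term absorbs the isolated spike, so the low-rank term stays close to the uncorrupted appearance subspace.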

4. To address the separation of stages in the tracking framework, an improved one-stage anchor-free joint network for online real-time multi-target tracking is proposed. To further raise tracking speed, this thesis integrates the object detection and feature extraction modules of the detection-based tracking framework into an improved Anchor-Free One-Stage (AFOS) joint network, which performs detection and feature extraction simultaneously from a shared feature map. Specifically, based on the backbone output, the network regresses or classifies the features at target center points in different task heads to realize detection and feature extraction. Considering the strong correlation among features from different positions of the same target, we also propose a deformable local attention mechanism that aggregates local information and further improves both tasks simultaneously. In addition, considering the differences between the tasks, we propose a task-sensitive prediction module that lets each task use features from different locations, better fitting the task and improving its performance. Finally, we embed this model into a multi-level tracking strategy to obtain an efficient online multi-target tracking method, which runs in real time and achieves tracking performance comparable to the best current trackers on MOT16 and MOT17.

In summary, aiming to solve the practical difficulties of multi-target tracking in natural scenes, this thesis carefully analyzes each stage of the detection-based tracking framework, leverages popular deep learning methods, and studies both tracking performance and tracking speed. It finally proposes a robust online multi-target tracking system that achieves strong tracking performance at 30 frames per second on the widely used MOT17 benchmark, narrowing the gap between multi-target tracking research and practical applications.
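The center-point decoding common to anchor-free one-stage trackers can be sketched as picking local maxima of a class heatmap and reading the identity embedding at the same location of the shared feature map. Names, shapes, and the 3x3 peak test are illustrative assumptions, not the AFOS network itself:

```python
import numpy as np

def decode_centers(heatmap, embeddings, thresh=0.5):
    """Toy decoding for a one-stage anchor-free head: a 3x3
    local-maximum test (a cheap substitute for NMS) keeps heatmap
    peaks above `thresh`, and the identity embedding is read from
    the shared feature map at each surviving center."""
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    shifted = np.stack([padded[dy:dy + H, dx:dx + W]
                        for dy in range(3) for dx in range(3)])
    is_peak = heatmap >= shifted.max(axis=0)       # 3x3 neighbourhood max
    ys, xs = np.where(is_peak & (heatmap > thresh))
    return [(int(y), int(x), embeddings[:, y, x]) for y, x in zip(ys, xs)]

# one strong center at (2, 2); the weaker response beside it is
# suppressed because it is not a local maximum
heat = np.zeros((5, 5))
heat[2, 2], heat[2, 3] = 0.9, 0.6
emb = np.arange(4 * 25, dtype=float).reshape(4, 5, 5)
dets = decode_centers(heat, emb)
```

Reading detection and embedding from the same feature map is what removes the per-target feature extraction pass and makes the speed nearly independent of the number of targets.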

English Abstract

Multi-target tracking, which has been a hot topic in pattern recognition and computer vision for decades, has great potential in many applications such as video surveillance, video content analysis, and autonomous driving. The objective of multi-target tracking is to determine the trajectories of multiple targets simultaneously by locating targets in each frame and then associating observations with the same identity across consecutive frames. Detection-based tracking is the main paradigm for current multi-target tracking algorithms and usually includes three basic steps: object detection in each frame, feature extraction for each object, and data association across frames. Deep learning has been widely introduced into the various stages of this paradigm, as it has recently made breakthroughs in multiple computer vision tasks such as object detection, segmentation, and recognition. Although current deep-learning-based multi-target tracking methods have achieved large performance improvements, many issues remain, such as low detection performance in dense scenes, poor feature discrimination for occluded targets, and tracking speeds too slow to meet real-time requirements. After fully analyzing existing deep-learning-based multi-target tracking methods and exploring the various stages of the detection-based tracking paradigm, this thesis proposes a robust, high-performance online multi-target tracking system. The main points and contributions of this thesis are summarized as follows:

1. To address the locality problem of data association, an online multi-target tracking method based on a high-order graph matching model is proposed. In the detection-based tracking framework, traditional data association methods generally consider only associations between targets in local spatial neighborhoods across frames, leading to poor matching robustness in dense regions. To make full use of the global information among the associations formed by trajectories and detections, we propose an online multi-target tracking method based on a High-Order Graph Matching (HOGM) model. The tracking task is formulated as a matching problem between two hyper-graphs formed by trajectories and detections respectively, and solved by a tensor power iteration algorithm with bidirectional unit L1-norm constraints. The association energy takes three aspects into account simultaneously (appearance similarity between trajectory and detection, motion consistency, and the spatial relationships between associations) to obtain more robust associations. As the most important feature used in the association, the appearance feature is extracted by a Siamese network trained with metric learning. To extract more discriminative appearance features and reduce the performance loss caused by overlapping bounding boxes in dense scenes, we propose a mask pooling operator, which forces the features to focus on each object's visible area. The motion consistency term suppresses mismatches in which appearance similarity is high but spatial distance is large, by limiting the spatial distance of candidate associations. Last but not least, the spatial relationships between associations ensure correct association among dense objects with the help of the stability of the spatial topology.
Experimental results on the multi-target tracking benchmark demonstrate that the proposed method achieves better tracking performance, and especially higher trajectory stability, than other state-of-the-art tracking methods by benefiting from more global spatio-temporal information.

2. To address detection failures in complex scenes, a distractor-aware discrimination learning network for online multi-target tracking is proposed. Detection accuracy has a great influence on the tracking performance of the detection-based tracking paradigm. To suppress detection failures such as missed and inaccurate detections, and to improve tracking performance, we propose an online multi-target tracking method based on a Distractor-aware Discrimination Learning Network (DDLN). First, a network is created independently for each trajectory, with two modules: template matching and candidate classification. The template matching module makes up for missed detections of the global detector through local detection, and the candidate classification module suppresses tracking drift by classifying each local detection as a foreground moving object or background. Second, to obtain a more robust template, a relational attention module based on the trajectory's historical appearance features is embedded in the DDLN. Third, for end-to-end training of the DDLN, we propose a distractor-aware loss function, which improves the discrimination of detections in dense scenes by fully mining hard samples. Lastly, a multi-stage tracking pipeline is designed based on the DDLN; the resulting tracker achieved the best tracking performance at the time on the two challenging benchmarks MOT16 and MOT17.

3. To address the high complexity of extracting target features independently, a long- and short-term clue extraction network for online multi-target tracking is proposed. Given the shortcomings of the DDLN-based tracker, such as slow tracking speed and large storage cost, an online multi-target tracking method based on a Long-Short Clue Extraction (LSCE) network is proposed. Short-term clues, obtained from the correlation between adjacent frames, capture changes in position and size for short-term tracking, while long-term clues are the appearance features obtained through a feature pyramid structure for handling long-lost trajectories. When computing the appearance similarity between a lost track and the current detections, a robust feature representation of the track is required in addition to the detection features. We therefore propose to extract a low-rank representation from the trajectory's historical appearance features via principal component pursuit and use it as the trajectory feature. Both the feature pyramid structure and the low-rank representation reduce the damage occlusion does to the discriminability of appearance features. The LSCE-based tracker obtains a tracking speed almost independent of the number of targets, as it takes whole frames rather than cropped target regions as input. Experiments demonstrate that it improves both tracking performance and speed over the DDLN-based method and achieves the best tracking performance among state-of-the-art trackers on two challenging multi-target tracking benchmarks.

4. To address the problem of multi-stage separation in the tracking paradigm, an improved anchor-free one-stage network for online real-time multi-target tracking is proposed. To further improve tracking speed, we integrate the detection and feature extraction steps of the detection-based tracking paradigm into an improved Anchor-Free One-Stage (AFOS) network, which performs target detection and feature extraction simultaneously from shared features. Specifically, based on the output feature map of the shared backbone, the network detects and embeds targets in different heads using the features at target centers. Considering the strong correlation of features from different positions of the same target, we also propose a deformable local attention mechanism to integrate local information, which further improves the performance of both tasks simultaneously. In addition, considering the differences among tasks, we propose a task-aware prediction module, so that each task can use features from different locations, better fitting the task itself and improving its performance. Finally, by embedding the proposed modules into a multi-stage tracking pipeline, we obtain an online real-time multi-target tracking method, which achieves tracking performance comparable to state-of-the-art methods on MOT16 and MOT17 at a real-time tracking speed.

To summarize, this thesis aims to solve the practical difficulties of multi-target tracking in real scenes. After analyzing the various stages of the detection-based tracking paradigm, it conducts research on both tracking performance and tracking speed with current popular deep learning methods. A robust online multi-target tracking system is finally proposed that achieves high tracking performance at a speed of 30 FPS on the MOT17 benchmark, greatly narrowing the gap between multi-target tracking research and practical applications.

Keywords: Deep Learning; Convolutional Neural Networks; Online Tracking; Multi-Target Tracking; Real-Time Tracking
Language: Chinese
Sub-direction classification: Object Detection, Tracking and Recognition
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44326
Collection: Center for Research on Intelligent Perception and Computing
Corresponding author: 周宗伟
Recommended citation (GB/T 7714):
周宗伟. 基于深度卷积网络的多目标跟踪方法研究[D]. 中科院自动化研究所智能化大厦: 中科院自动化研究所, 2021.
Files in this item:
File name / size: 学位论文_全版.pdf (7507 KB); Document type: Doctoral thesis; Access: Open Access; License: CC BY-NC-SA

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.