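This dissertation divides pedestrian multi-object tracking methods into detection-based tracking methods and joint-detection tracking methods. The former can be decomposed into four steps: object detection, feature extraction, feature association, and trajectory generation. Detection and feature extraction are two independent processes, so the tracking algorithm can be designed and validated module by module. The latter performs feature extraction and outputs detection results simultaneously, so the two processes can jointly improve tracking performance. Based on the above difficulties and research methods, this dissertation studies feature extraction and association for pedestrian tracking in traffic video: it uses detection-based tracking to study feature extraction, prediction, and association methods module by module, and uses joint-detection multi-object tracking to study how to extract more accurate tracking features.
1. To address tracking failures in current traffic scenes caused by severe detection noise and frequent occlusion between pedestrians, this dissertation proposes an appearance and motion feature extraction method based on a bi-directional long short-term memory network (Bi-directional LSTM, hereafter Bi-LSTM), together with a two-stage association method. The core idea is to start from deep appearance features of pedestrians, use the Bi-LSTM to predict the target's appearance in the current frame as a short-term appearance feature, and select reliable detections from the target's historical trajectory as long-term stable appearance features. The two together form the target's appearance representation, which gives a more robust appearance representation of the trajectory. In addition, a motion feature extraction model based on Bi-LSTM prediction and the Kalman filter (KF) is designed, which improves trajectory smoothness while providing the predicted trajectory position. Finally, the two kinds of features are combined to build a two-stage association method based on stable trajectories: high-confidence trajectories are associated first, and the unassociated trajectories are then associated with detections in a second stage, yielding more stable tracking results. Experimental results show that the proposed method can improve the accuracy of multi-object tracking.
2. To address the problem in current tracking scenes where pedestrians that are close together or occluded produce more false detections and thereby cause tracking feature extraction to fail, this dissertation designs an attention-based appearance feature extraction method and an energy function minimization method based on group matching. First, a hard attention module based on pedestrian pose and a soft attention module based on regions are built and added to the acquisition of pedestrians' local and global information, which raises the model's focus on key locations and helps it obtain appearance features when the target is partially occluded. Then, a similarity matrix is constructed from this appearance feature and a linear motion feature to obtain the initial association results. Furthermore, considering the motion correlation between targets, the targets are divided into groups, and group matching across consecutive frames is established based on the reliable trajectories within each group. Within each group, unary and binary energy terms between target trajectories are constructed based on relative-orientation consistency and motion-trend consistency, respectively, and the initial association results are corrected by minimizing a network-flow energy function, producing more stable and longer-lasting trajectories. Experimental results show that the method can improve tracking accuracy and reduce fragmented trajectories.
3. Because detection-based multi-object tracking methods are limited by detection accuracy, this dissertation introduces the Transformer into the multi-object tracking task and improves detection and tracking accuracy simultaneously through joint-detection tracking. A semantic segmentation task is introduced to help the Transformer learn fine-grained image information and to increase the weight its attention places on foreground targets, which helps obtain more accurate positions of the targets of interest and association features between targets. In addition, a dynamic query generation method based on image features is proposed: it generates dynamic queries from image features for different data distributions, helping the Transformer decoder estimate potential target positions more accurately and improving the model's ability to capture newly appearing targets. Experiments show that the algorithm improves tracking accuracy and trajectory stability.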
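The two-stage association idea in contribution 1 (associate high-confidence tracks first, then match the remaining tracks against the remaining detections) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the dissertation: box IoU stands in for the combined appearance and motion similarity, greedy matching stands in for whatever assignment solver is actually used, and the names `two_stage_associate`, `conf_thresh`, and `min_iou` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], dets: Dict[int, Box],
                 min_iou: float) -> Dict[int, int]:
    """Pair tracks with detections in order of decreasing IoU (illustrative stand-in)."""
    pairs = sorted(
        ((iou(tb, db), t, d) for t, tb in tracks.items() for d, db in dets.items()),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_dets = set()
    for score, t, d in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap too little
        if t in matches or d in used_dets:
            continue
        matches[t] = d
        used_dets.add(d)
    return matches


def two_stage_associate(tracks: Dict[int, Box], confidences: Dict[int, float],
                        dets: Dict[int, Box], conf_thresh: float = 0.7,
                        min_iou: float = 0.3) -> Dict[int, int]:
    """Return {track_id: detection_id} using two association stages."""
    # Stage 1: only high-confidence (stable) tracks compete for detections.
    stable = {t: b for t, b in tracks.items() if confidences[t] >= conf_thresh}
    matches = greedy_match(stable, dets, min_iou)
    # Stage 2: every still-unmatched track competes for the leftover detections.
    rest_tracks = {t: b for t, b in tracks.items() if t not in matches}
    taken = set(matches.values())
    rest_dets = {d: b for d, b in dets.items() if d not in taken}
    matches.update(greedy_match(rest_tracks, rest_dets, min_iou))
    return matches
```

In this sketch a stable track keeps priority over a less reliable one even when the latter overlaps a detection more, which is the intended effect of handling high-confidence tracks first.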
As an important branch of computer vision, pedestrian tracking in traffic scenes not only yields trajectory information within the field of view but also provides an important basis for traffic behavior analysis and scene understanding, and it plays a crucial role in autonomous driving, intersection pedestrian flow detection, pedestrian trajectory prediction, and hazardous situation analysis. Multi-object tracking of pedestrians in complex traffic scenes remains very challenging, because pedestrian trajectories frequently occlude and overlap one another and because different shooting angles produce objects of very different sizes in the video. In addition, pedestrian detection in traffic scenes is inevitably noisy. Most current tracking algorithms are still not robust or accurate enough in feature extraction and feature association when objects are occluded and trajectories overlap, and under the interference of false detections they cannot maintain long-term, robust trajectories. In this dissertation, we classify pedestrian multi-object tracking methods into detection-based tracking methods and joint-detection tracking methods. The former can be split into four steps: object detection, feature extraction, feature association, and trajectory generation. Detection and feature extraction are two independent processes, so the tracking algorithm can be designed and verified module by module. The latter performs feature extraction and outputs detection results simultaneously, so the two processes can jointly improve tracking performance.
Based on the above difficulties and research methods, this dissertation focuses on feature extraction and association for pedestrian tracking in traffic video, using detection-based tracking methods to study feature extraction, prediction, and association in separate modules, and using joint-detection tracking methods to study how to extract more accurate tracking features. The main research contents and contributions are summarized as follows:
1. To address tracking failures caused by severe detection noise and frequent occlusion between pedestrians in current traffic scenes, we propose an appearance and motion feature extraction method based on a bi-directional long short-term memory (Bi-LSTM) network, together with a two-stage association method. We first obtain deep appearance features of pedestrians, use the Bi-LSTM to predict each object's appearance in the current frame, and take the prediction as the short-term appearance feature. We then select reliable detections from the object's historical track as long-term stable appearance features and combine the two to build the object's overall appearance representation, which gives a more robust appearance representation of the track. In addition, we design a motion feature extraction model based on Bi-LSTM prediction and the Kalman filter (KF), which improves the smoothness of the tracks while providing their predicted locations. Finally, we combine the two kinds of features in a two-stage association method based on stable tracks: tracks with high confidence are associated first, and the remaining unassociated tracks are then associated with the unassociated detections, yielding more stable tracking results. The experimental results show that our method can improve the accuracy of multi-object tracking.
2. To address feature extraction failures caused by the increase in false detections when pedestrians are close together or occluded, we design an attention-based appearance feature extraction method and an energy function minimization method based on group matching. First, we build a hard-attention module based on the pedestrian's pose and a soft-attention module based on regions. These two modules are added to the acquisition of local and global pedestrian information to raise the model's attention to key locations and to help it obtain appearance features when the object is partially occluded. Second, we construct an affinity matrix from the objects' appearance features and linear motion features to obtain initial association results. Considering how interactions between objects influence their trajectories, we then divide the objects into groups and establish group matching based on the reliable trajectories within each group. Within each group, we define a unary energy term for group members under the assumption of consistent relative orientation, and a binary energy term between pairs of occluded trajectories under the assumption of a consistent mutual movement trend. Minimizing the overall energy function corrects the initial association results and produces more robust and continuous tracks. The experimental results show that our method can improve tracking accuracy and reduce the number of fragmented tracks.
3. To address the fact that detection-based multi-object tracking methods are limited by detection accuracy, we introduce the Transformer model into multi-object tracking and improve both detection and tracking accuracy under the joint detection and tracking paradigm. By introducing a semantic segmentation task into the multi-object tracking problem, we help the Transformer learn fine-grained image information and increase the weight its attention places on foreground objects, which helps obtain more accurate locations of the target objects and association features between objects. In addition, we propose a method that generates dynamic object queries from image features, so that the queries adapt to different data distributions. This helps the Transformer decoder estimate potential target locations more accurately and improves its capability of capturing newly appearing targets. Experiments show that the algorithm improves tracking accuracy and trajectory stability.
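The initial association step in contribution 2 builds an affinity matrix from appearance features and linear motion features. A minimal sketch of such a matrix is given below under loudly stated assumptions: cosine similarity stands in for the appearance comparison, a constant-velocity extrapolation of the box center stands in for the linear motion feature, and the weight `alpha`, the distance scale `scale`, and all function names are hypothetical choices rather than details from the dissertation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
# A track is (appearance feature, previous center, current center).
Track = Tuple[Sequence[float], Point, Point]
# A detection is (appearance feature, center).
Detection = Tuple[Sequence[float], Point]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def predict_linear(prev: Point, curr: Point) -> Point:
    """Constant-velocity extrapolation of the center one frame ahead."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def motion_similarity(pred: Point, det: Point, scale: float) -> float:
    """Map the distance between predicted and detected centers to (0, 1]."""
    return math.exp(-math.hypot(pred[0] - det[0], pred[1] - det[1]) / scale)


def affinity_matrix(tracks: List[Track], dets: List[Detection],
                    alpha: float = 0.5, scale: float = 50.0) -> List[List[float]]:
    """Rows are tracks, columns are detections; each entry is a weighted sum of cues."""
    matrix = []
    for feat, prev, curr in tracks:
        pred = predict_linear(prev, curr)
        matrix.append([
            alpha * cosine_similarity(feat, d_feat)
            + (1 - alpha) * motion_similarity(pred, d_center, scale)
            for d_feat, d_center in dets
        ])
    return matrix
```

A matrix of this kind would typically be passed to an assignment step to obtain the initial track-to-detection pairs, which the group-based energy minimization then corrects.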
|Keyword||Traffic vision; pedestrian tracking; attention mechanism; deep learning|
|刘雅婷. 交通场景下行人多目标跟踪算法研究[D]. 中科院自动化所. 中科院自动化所,2022.|