Research on Online Visual Tracking Based on Correlation Filters


	Research on Online Visual Tracking Based on Correlation Filters
	Zhang Mengdan (张梦丹)1,2
	2018-05-25
Degree type	Doctor of Engineering
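Chinese abstract (translated)	Visual tracking is a key and highly challenging technology in computer vision, with wide application in video surveillance, navigation, military systems, human-computer interaction, virtual reality, intelligent robots, autonomous driving, and other fields. Compared with model-based visual tracking methods built on detectors for specific targets (such as people or vehicles), model-free online visual tracking methods have attracted more attention in both academia and industry. Given only the annotated initial position of the target, these methods build a flexible and robust model of the target appearance through online adaptation and can thereby track arbitrary targets accurately. As model-free online tracking is applied more widely, the challenges it faces grow more severe. Tracking scenes involve many sources of uncertainty, including illumination changes, diverse target poses, scale changes, motion blur, occlusion, and target disappearance, so designing accurate, robust, and efficient tracking algorithms remains a highly challenging research problem. Correlation filter based tracking algorithms have attracted many researchers because they balance accuracy and speed. This thesis studies correlation filter based visual tracking in depth. To address pose changes of the target in scale, aspect ratio, and rotation, as well as abrupt motion and occlusion, it improves correlation filter based tracking in four respects: feature learning, motion model design, appearance model enhancement, and tracking inference strategy design, while maintaining accuracy, robustness, and real-time performance. The main work and contributions are summarized as follows:
1) We propose a visual tracking algorithm based on comprehensive correlation analysis over the joint scale-displacement space, the rotation space, and the temporal domain. The algorithm is optimized in two respects: appearance modeling and tracking inference strategy. For appearance modeling, block-circulant matrices, the log-polar transform, and the discrete Fourier transform are introduced to model, at fine granularity, the correlations of the target appearance within the joint scale-displacement space and the rotation-angle space, which improves the accuracy of scale and angle estimation. For the inference strategy, temporal correlation analysis and a high-order Markov chain model allow the appearance model to retain target appearances that are robust, multi-modal, and valid, which addresses the uneven appearance distribution caused by redundant appearances and the interference of background noise. The target state is then inferred robustly and adaptively from the temporal correlation between the current candidate and the retained target appearances. We validated these innovations on several standard visual tracking benchmarks and substantially improved tracking accuracy and robustness.
2) We propose a correlation filter based visual tracking algorithm that introduces high-level semantics and top-down inference. The algorithm mainly improves the motion model of conventional correlation filter based trackers. Starting from the insufficient robustness of conventional correlation filter based online trackers, we introduce high-level, category-related semantic information into online visual tracking. Under weak supervision from the target category, the target is coarsely localized globally at the semantic level, which optimizes the motion model and compensates for the weaknesses of conventional correlation filter based trackers: limited capacity to represent target appearance, an emphasis on local fine-grained modeling, and a limited search range. This yields robust tracking. Because visual tracking provides no category label for the target, we use a generic convolutional neural network classifier trained on a large-scale dataset to determine the target's category distribution and perform category transfer, thereby obtaining category semantics for the tracked target. We ran comparison experiments, component analysis experiments, and qualitative evaluations on currently popular tracking benchmarks to verify the effectiveness of the improvements.
3) We propose a visual tracking algorithm based on a spatially aligned correlation filter network. The algorithm improves conventional correlation filter based trackers in two respects: the motion model and feature learning. First, the correlation filter operation is turned into a differentiable correlation filter layer and introduced into a convolutional neural network, so that deep features suited to correlation filter based tracking can be learned offline on a large-scale video dataset, strengthening the representational and discriminative power of the appearance model. Second, a spatial alignment network is introduced to estimate the motion transformation parameters of the target across consecutive frames, addressing the boundary effect of correlation filters and the fixed-aspect-ratio modeling problem, so that large displacements and aspect ratio changes are estimated accurately. End-to-end offline training lets the spatial alignment module and the correlation filter module learn complementarily, which strengthens the network's tracking performance. During online tracking, real-time tracking is achieved with a single forward pass of the network. We likewise validated the algorithm on several tracking benchmarks and obtained good results in robustness and real-time performance.
Based on the above methods and innovations, our tracking algorithms achieved the best or leading results at the time on several tracking benchmarks. These methods may also be of reference value for other computer vision problems and applications, such as video segmentation and video pose estimation.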
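As a point of reference for the correlation filter trackers the abstract builds on, the following is a minimal sketch of a single-channel linear correlation filter trained and applied in the Fourier domain. It is not the thesis's algorithm, which the abstract does not specify in detail. The function names (`gaussian_label`, `train_filter`, `detect`), the regularization weight `lam`, and the label width `sigma` are assumptions chosen for illustration, and the filter is a generic closed-form ridge regression over all cyclic shifts of one patch.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian peak at index (0, 0), wrapped circularly."""
    h, w = shape
    ys, xs = np.arange(h), np.arange(w)
    dy = np.minimum(ys, h - ys)[:, None]   # circular distance to row 0
    dx = np.minimum(xs, w - xs)[None, :]   # circular distance to column 0
    return np.exp(-(dy ** 2 + dx ** 2) / (2.0 * sigma ** 2))

def train_filter(patch, label, lam=1e-2):
    """Ridge regression over all cyclic shifts of `patch`, solved per frequency."""
    X = np.fft.fft2(patch)
    Y = np.fft.fft2(label)
    # Circulant structure makes the solution an element-wise division.
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def detect(filter_hat, patch):
    """Return the (dy, dx) cyclic displacement at the response peak."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(filter_hat * Z))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    h, w = response.shape
    # Map indices past the midpoint to negative displacements.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

For instance, training on a random 64x64 patch and calling `detect` on a copy shifted with `np.roll(patch, shift=(3, -5), axis=(0, 1))` recovers the displacement `(3, -5)`. Training and detection cost only a few FFTs and element-wise operations, which is the source of the speed advantage the abstract attributes to this family of trackers.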
English abstract	Visual tracking is a fundamental technique in computer vision, with applications ranging from video surveillance, automatic navigation, and military systems to human-computer interaction, virtual reality, intelligent robots, and autonomous vehicles. Compared with model-based tracking methods designed for specific targets such as pedestrians and vehicles, model-free online visual tracking methods are gaining popularity because they handle arbitrary targets without any prior knowledge about them. The aim of model-free visual tracking is to estimate the trajectory of a target in a video, given only its initial location. In complex tracking scenarios involving illumination variations, pose deformations, scale changes, motion blur, occlusions, and target disappearance, model-free tracking methods must adapt online to continuous changes in target appearance, remain robust to background disturbances, and run at high speed, all with very limited knowledge about the target. Model-free tracking therefore remains a challenging task. Correlation filter based tracking methods have attracted researchers' attention because of their promising performance and computational efficiency. This thesis studies correlation filter based model-free online visual tracking in depth. To tackle changes in object scale and aspect ratio, rotations, abrupt motions, and occlusions, we enhance conventional correlation filter based trackers in four aspects: feature learning, the motion model, the appearance model, and the inference model. As a result, the proposed tracking methods obtain substantial improvements in tracking accuracy, robustness, and speed.
The main contributions of this thesis are summarized as follows: 1) We present a fully-functional correlation filters (FCF) based tracker that, for the first time, fully exploits the correlations in three complementary spaces: the joint scale-displacement space, the orientation space, and the temporal space. These comprehensive correlation analyses across multiple spaces significantly enhance both the appearance model and the inference model of FCF. On the one hand, FCF improves the robustness and adaptability of the appearance model through joint correlation analyses. Specifically, using the block-circulant structure, the log-polar transform, and the discrete Fourier transform, object appearance modeling is extended from a single displacement space to the joint scale-displacement space and the orientation space, which leads to better estimation of object scale changes and rotations. With more accurate estimates, less background noise enters the appearance model, which further boosts tracking robustness. On the other hand, through temporal correlation analysis with an extended high-order Markov chain model, the appearance model preserves clean and representative object appearance modalities while discarding redundant appearances and background noise. Furthermore, by using temporal correlation analysis to evaluate the correlations between the current object candidate and the object modalities maintained in the enhanced appearance model, the inference procedure becomes more adaptive to abrupt changes in object appearance. Comprehensive experiments on three of the largest and most widely adopted benchmarks validate each novel component and show the gains in tracking accuracy and robustness.
2) We propose to exploit category-specific semantics to boost visual object tracking and develop a new tracking model that augments the appearance based tracker with a top-down reasoning component. This component provides consistent semantic clues across video frames, inferred from object category information, which helps in building the object motion model. Specifically, the bottlenecks of conventional correlation filter based trackers are the low quality and scarcity of training data and the low-level appearance modeling, which leave these trackers lacking in robustness. We develop a generic object recognition model and a category-specific semantic activation map method to provide effective top-down reasoning about object locations for the conventional correlation filter based tracker, which alleviates the boundary effects of correlation filters and enhances the motion model at a high level. In addition, we develop a voting based scheme for the reasoning component to infer the object semantics. Therefore, even without sufficient training data, the tracker can still obtain reliable top-down clues about the objects. Combined with the appearance clues, these let the tracker localize objects accurately even in the presence of major distracting factors. Extensive evaluations on two large-scale benchmark datasets demonstrate that the top-down reasoning substantially enhances the robustness of the tracker and provides state-of-the-art performance.
3) We propose a novel end-to-end learnable, spatially aligned correlation filter based network to handle complex motion patterns of the target. The whole network learns not only a generic relationship between object geometric transformations and object appearances, but also robust representations coupled to the correlation filters under various geometric transformations. Therefore, both feature learning and the motion model are enhanced. Specifically, a feature extraction network is combined with a differentiable correlation filter layer for end-to-end training on a large-scale video dataset, so that discriminative representations are learned explicitly for correlation filter based tracking. Moreover, a spatial alignment module is incorporated into the network to provide spatial alignment and to reduce the correlation filter's search space over object motion. As a result, boundary effects and aspect ratio variations, which challenged previous correlation filter based trackers, are well addressed. Thanks to off-line training of the whole network, spatial alignment and correlation filter based localization reinforce each other, which ensures accurate motion estimation from the jointly optimized network. Furthermore, during online tracking, the lightweight network architecture and the fast computation of the correlation filter layer allow tracking at real-time speed. Experiments on three large-scale benchmark datasets show that our algorithm performs competitively against existing state-of-the-art methods and achieves high robustness and efficiency.
These innovations yield leading tracking results on publicly available online tracking benchmarks, some of which were the best at the time. In addition, other computer vision applications, such as video semantic segmentation and video pose estimation, can take advantage of the proposed methods.
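The first contribution relies on the log-polar transform so that scale changes and rotations can be estimated in the same way as displacements. The sketch below illustrates only that general property, not the FCF tracker itself: resampling an image on a log-polar grid turns scaling and rotation about the center into shifts along the log-radius and angle axes, which a plain FFT cross-correlation can locate. The analytic test image `blobs`, the function names, and the grid sizes are all assumptions made for illustration.

```python
import numpy as np

def blobs(x, y):
    """Hypothetical test image: three Gaussian blobs around the origin."""
    spec = [(6.0, 0.0, 1.0), (0.0, 9.0, 0.7), (-5.0, -5.0, 0.5)]  # (cx, cy, amplitude)
    return sum(a * np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / 2.0) for cx, cy, a in spec)

def transformed(image, scale, angle):
    """Image function enlarged by `scale` and rotated by `angle` about the origin."""
    c, s = np.cos(angle), np.sin(angle)
    def g(x, y):
        # Map output coordinates back into the original image.
        return image((c * x + s * y) / scale, (-s * x + c * y) / scale)
    return g

def log_polar(image, n_rho=64, n_theta=128, r_min=1.0, r_max=64.0):
    """Sample `image` on a grid that is uniform in log-radius and angle."""
    d_rho = np.log(r_max / r_min) / n_rho
    r = np.exp(np.log(r_min) + d_rho * np.arange(n_rho))[:, None]
    theta = (2.0 * np.pi * np.arange(n_theta) / n_theta)[None, :]
    return image(r * np.cos(theta), r * np.sin(theta))

def estimate_shift(ref, cur):
    """Peak of the circular cross-correlation, as (log-radius, angle) bin shifts."""
    corr = np.real(np.fft.ifft2(np.fft.fft2(cur) * np.conj(np.fft.fft2(ref))))
    shift = list(np.unravel_index(np.argmax(corr), corr.shape))
    for axis, size in enumerate(corr.shape):
        if shift[axis] > size // 2:  # indices past the midpoint are negative shifts
            shift[axis] -= size
    return int(shift[0]), int(shift[1])
```

With the default grid, one log-radius bin corresponds to a scale factor of `exp(log(64) / 64)` and one angle bin to `2 * pi / 128` radians. Enlarging `blobs` by four log-radius bins and rotating it by ten angle bins therefore makes `estimate_shift` return `(4, 10)`, from which the scale and angle follow by multiplying back with the bin sizes. The circular correlation is only a good approximation along the log-radius axis because this test image is nearly zero at both radial ends of the grid.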
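Keywords	correlation filters; visual tracking; weakly supervised learning; deep learning; convolutional neural networks
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/20962
Collection	Graduates_Doctoral Dissertations
Author affiliations	1. Institute of Automation, Chinese Academy of Sciences 2. University of Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zhang Mengdan. Research on Online Visual Tracking Based on Correlation Filters[D]. Beijing: Graduate University of Chinese Academy of Sciences, 2018.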

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis.pdf (47557KB)	Thesis		Restricted access	CC BY-NC-SA