基于深度学习的视觉目标跟踪方法研究 (Research on Visual Object Tracking Methods Based on Deep Learning)
赵飞
Subtype: Doctoral (博士)
Thesis Advisor: 唐明
2019-05-22
Degree Grantor: Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Visual Object Tracking; Deep Learning; Reinforcement Learning; Adversarial Learning
Abstract

Visual object tracking is an important research direction in computer vision, with wide applications in scenarios such as intelligent video surveillance, human-computer interaction, and autonomous driving. Traditional visual tracking algorithms cannot reach a satisfactory level of accuracy and robustness. In recent years, benefiting from the great advantages of end-to-end optimization and easy parallelization, deep learning based tracking algorithms have achieved high tracking performance. However, many problems still constrain further improvement, such as imprecise regression of the response map by the neural network, template update strategies that are not robust enough, trackers that drift easily, the inability to model the target's appearance over the long term, and the weak discriminative power of the extracted target features. This dissertation conducts an in-depth study of these problems in single-object tracking and proposes several targeted visual tracking algorithms based on deep neural networks. The main work and contributions are summarized as follows:

1. To improve the regression accuracy of the target position during tracking, this dissertation proposes a visual tracking algorithm based on adversarial learning. The algorithm consists of a fully convolutional Siamese network that regresses the target position and a discriminative classification network, and the two networks are jointly optimized through adversarial learning. Under this framework, the regression and classification networks can be trained end-to-end as a whole on a large number of video sequences. At test time, the regression network produces response maps that reflect the position and size of the target template within each candidate search region, and the classification network judges which response map best reflects that information for its corresponding search region. In addition, an attention visualization algorithm is proposed for the tracker. Results on three large visual tracking benchmarks (OTB-100, TC-128, and VOT2016) show that the algorithm effectively improves the regression accuracy of the regression network and achieves high tracking performance.

2. Most tracking algorithms keep only the target template from the first frame or the previous frame, so as the target's appearance changes or errors accumulate, the tracker easily loses the target. Moreover, most CNN trackers that regress response maps perform no online updating during tracking, which makes them very sensitive to appearance changes of the target. To address these problems, this dissertation proposes a tracker based on reinforcement learning and supervised learning. The algorithm stores several target templates produced during tracking and learns a template update policy via reinforcement learning in a temporal-difference Actor-Critic framework. Based on the actor network's output, the stored templates are updated and one of them is selected to localize the target in the current frame. Meanwhile, online updating of the regression network improves the tracker's robustness. Results on visual tracking benchmarks show that the algorithm achieves high tracking performance.

3. To address the problems that most CNN-based trackers cannot recapture the target after losing it, and that the shortage of training samples during online updating makes them insufficiently robust to appearance changes, this dissertation proposes a CNN-based tracker consisting of a regression module and a classification module. The regression module contains two fully convolutional Siamese networks: a tracking network and a target recapture network. The tracking network uses multiple stored target templates to determine candidate target regions and simultaneously detects whether the tracker has drifted into the background. If drift is detected, the recapture network searches the whole image to recapture the target. In addition, a classification network with a noise injection layer is proposed to discriminate among the candidates and produce the final tracking result. The proposed noise injection layer effectively improves the classification network's robustness to appearance changes during online updating, where training samples are limited. Results on three large visual tracking benchmarks (OTB-100, TC-128, and VOT2016) show that the algorithm reaches a high level of tracking performance.

4. Most CNN-based tracking algorithms only consider appearance changes between two consecutive frames. Moreover, although some trackers use recurrent neural networks to model the target's appearance over the long term, decay of the target's features severely degrades tracking accuracy. To address these problems, this dissertation proposes a Siamese CNN tracking algorithm with an Anti-decay LSTM (AD-LSTM) module. The algorithm extends the conventional LSTM in two respects to better suit visual tracking: first, the fully connected layers in the LSTM are replaced with convolutional layers to capture the spatial information of 2D images; second, the structure of the LSTM cell unit is improved so that the target's original appearance features can propagate through the LSTM without decay for an arbitrary number of time steps. The parameters of the AD-LSTM are optimized through adversarial learning. The algorithm not only allows the tracker to regress response maps more accurately but also extracts more robust target features. Experiments show excellent tracking performance on the OTB-100, TC-128, VOT2016, and VOT2017 benchmarks.

5. Extracting robust and discriminative spatio-temporal features of the target is a challenging task. This dissertation extracts robust spatio-temporal features using multiple target templates and a Siamese CNN containing 3D convolutions. For more precise matching and localization of the target templates, a new correlation layer with an attention mechanism is proposed. Furthermore, to further improve tracking performance, an ordered quadruplet loss and a corresponding network model are proposed; the network extracts more discriminative target features and is used to judge which candidate is the best tracking result. Experiments show high tracking performance on the OTB-100, TC-128, and VOT2017 datasets.

Other Abstract

Visual object tracking is one of the important tasks in computer vision, with wide applications such as intelligent video surveillance, human-computer interaction, and autonomous driving. Traditional visual tracking algorithms cannot achieve satisfactory performance in either accuracy or robustness. In recent years, thanks to the great advantages of end-to-end optimization and parallel computing, deep learning based tracking algorithms have achieved high tracking performance. However, many problems constrain their further improvement: the neural networks cannot regress the response maps accurately enough, the template updating strategy is not robust, the tracker is prone to drift, the target's appearance features cannot be modeled over the long term, and the discriminative power of the extracted features is poor. This dissertation makes an in-depth study of these problems in single-target visual tracking and proposes several tracking algorithms based on deep neural networks. The main contributions are summarized as follows.

1. We propose an adversarial learning based tracker to improve the regression accuracy during tracking. The tracker is composed of a fully convolutional Siamese neural network (regression network) and a discriminative classification network, and we jointly optimize both networks by adversarial learning. In our framework, the regression network and classification network can be trained end-to-end as a whole on large amounts of video training data. During the testing phase, the regression network generates a response map that reflects the location and size of the target within each candidate search patch, and the classification network discriminates which response map is best with respect to the corresponding template patch and candidate search patch. In addition, we propose an attention visualization algorithm that reveals the area attracting the tracker's attention. The experimental results on three large-scale visual tracking benchmarks (OTB-100, TC-128, and VOT2016) demonstrate the effectiveness of the proposed tracking algorithm and show that our tracker performs comparably against state-of-the-art trackers.
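The regression step described above can be sketched as a plain cross-correlation of the template's feature map over the search region's feature map, with the response peak giving the target location. Everything here (the toy feature maps, their sizes, and the helper names `cross_correlate` and `peak`) is illustrative, not the thesis's actual network:

```python
# Minimal sketch of the Siamese regression step: slide the template's
# feature map over the search-region feature map as a correlation kernel,
# producing a response map whose peak indicates the target location.

def cross_correlate(template, search):
    """2D cross-correlation of a small template over a larger search map."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            score = sum(
                template[i][j] * search[y + i][x + j]
                for i in range(th) for j in range(tw)
            )
            row.append(score)
        response.append(row)
    return response

def peak(response):
    """Return the (row, col) offset of the maximum response."""
    return max(
        ((r, c) for r in range(len(response)) for c in range(len(response[0]))),
        key=lambda rc: response[rc[0]][rc[1]],
    )

if __name__ == "__main__":
    template = [[1, 1], [1, 1]]           # toy "target" feature
    search = [[0, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
    resp = cross_correlate(template, search)
    print(peak(resp))  # prints (1, 1): the target sits at offset (1, 1)
```

In the thesis, the classification network then scores competing response maps of this kind; here only the regression half is sketched.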

2. Most trackers use only one target template, cropped either from the first frame based on the ground truth or from the last frame based on the tracking result. In this way, the tracker is likely to miss the target once it drifts. Besides, most regression-based CNN trackers do not update the network parameters during tracking, which makes them sensitive to appearance variations of the target. We ameliorate these problems by resorting to reinforcement learning and supervised learning. Specifically, we preserve a pool of target templates and learn a template updating policy with an actor network trained by deep reinforcement learning in the actor-critic framework with the temporal-difference error. The actor network learns which template in the pool should be replaced based on the tracking results. Meanwhile, we update the regression CNN of our tracker online, which makes it robust to appearance changes of the target. The experimental results on three large benchmarks demonstrate that the proposed tracker performs favorably against state-of-the-art trackers.
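The template-update loop can be illustrated with a minimal sketch: a temporal-difference error drives the critic, and an actor picks which slot of the template pool to overwrite with the current tracking result. The discount factor, the pool contents, and the random-scoring stub actor are assumptions for illustration; in the thesis the actor is a trained network conditioned on the tracking result.

```python
import random

# Sketch of the actor-critic template update with a TD error.
GAMMA = 0.95  # discount factor (illustrative value)

def td_error(reward, value_now, value_next, gamma=GAMMA):
    """TD error r + gamma * V(s') - V(s), used to train the critic."""
    return reward + gamma * value_next - value_now

def actor_choose_slot(pool, rng):
    """Stub actor: score each template slot and return the best-scoring one.
    A trained actor network would produce these scores from the frame."""
    scores = [rng.random() for _ in pool]
    return scores.index(max(scores))

def update_pool(pool, new_template, rng):
    """Replace one stored template with the current tracking result."""
    slot = actor_choose_slot(pool, rng)
    pool[slot] = new_template
    return slot

if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["t0", "t1", "t2", "t3"]   # stored target templates
    delta = td_error(reward=1.0, value_now=0.4, value_next=0.5)
    slot = update_pool(pool, "frame_42_crop", rng)
    print(delta, slot, pool)
```

The sign and magnitude of the TD error would then scale the actor's policy-gradient update; that training step is omitted here.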

3. Most CNN-based trackers cannot recapture the target after drifting into the background. In addition, the number of training samples during online training is limited, which makes the tracker sensitive to appearance variations of the target. These are two critical issues for robust tracking. We propose a CNN-based tracker containing a regression module and a classification module to ameliorate these problems. Specifically, the regression module consists of a tracking network and a recapturing network, both fully convolutional Siamese networks. The tracking network uses multiple templates to generate target candidates and detects whether the tracker has drifted into the background. If drifting is detected, the recapturing network recaptures the target within the global area. Furthermore, we utilize the classification module with a proposed "Noise Injection" (NI) layer to distinguish the target from the candidates. The NI layer improves the robustness of the classification network to appearance changes of the target when training samples are limited. The experimental results show that our tracker performs favorably against state-of-the-art trackers on three popular benchmarks (OTB-100, TC-128, and VOT2016).
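A hedged sketch of how such a noise injection layer could behave: during online fine-tuning with few samples, Gaussian noise added to the features acts as augmentation, and at inference the layer is the identity. Treating features as a flat list and the sigma value are simplifying assumptions; the thesis applies the layer inside a CNN.

```python
import random

# Sketch of a "Noise Injection" (NI) layer: perturb features with
# zero-mean Gaussian noise during training, pass them through unchanged
# at test time.

def noise_injection(features, sigma=0.1, training=True, rng=None):
    """Add zero-mean Gaussian noise in training mode; identity otherwise."""
    if not training:
        return list(features)
    rng = rng or random.Random()
    return [f + rng.gauss(0.0, sigma) for f in features]
```

During the sample-starved online update, each forward pass then sees a slightly perturbed version of the same stored features, which is the regularizing effect the NI layer is meant to provide.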


4. Most CNN-based trackers only consider the appearance variations between two consecutive frames of a video sequence. Besides, although some trackers model the appearance of the target over the long term with an RNN, the decay of the target's features reduces tracking performance. We propose an Anti-Decay LSTM (AD-LSTM) for Siamese tracking. Specifically, we extend the architecture of the standard LSTM in two respects for the visual tracking task. First, we replace all fully connected layers with convolutional layers to extract features with spatial structure. Second, we improve the architecture of the cell unit so that the information of the target appearance can flow through the AD-LSTM without decay over any number of time steps. Meanwhile, because there is no ground truth for the feature maps generated by the AD-LSTM, we propose an adversarial learning algorithm to train it. With the help of unsupervised adversarial learning, the Siamese network generates the response maps more accurately, and the AD-LSTM generates more robust feature maps of the target. The experimental results show that our tracker performs favorably against state-of-the-art trackers on four popular benchmarks: OTB-100, TC-128, VOT2016, and VOT2017.
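One way to see the decay problem the AD-LSTM targets is a scalar toy model (a hedged sketch, not the thesis's actual cell equations or its convolutional layers): with a forget gate below one, a standard LSTM cell forgets the initial appearance feature geometrically, whereas a cell with an ungated path back to that initial feature does not. The `anti_decay_cell` below is a hypothetical illustration of that idea:

```python
# Scalar sketch contrasting a standard LSTM cell-state update with a
# hypothetical anti-decay variant that preserves the initial feature.

def standard_cell(c, f_gate, i_gate, g):
    """Standard LSTM cell state update: c_t = f * c_{t-1} + i * g."""
    return f_gate * c + i_gate * g

def anti_decay_cell(c, c0, f_gate, i_gate, g):
    """Hypothetical anti-decay variant: route an ungated copy of the
    initial appearance feature c0 back in, so it never decays away."""
    return f_gate * c + i_gate * g + (1.0 - f_gate) * c0

if __name__ == "__main__":
    c0 = 1.0                       # initial target appearance feature
    c_std = c_ad = c0
    for _ in range(50):            # 50 steps with no new input (i = 0)
        c_std = standard_cell(c_std, f_gate=0.9, i_gate=0.0, g=0.0)
        c_ad = anti_decay_cell(c_ad, c0, f_gate=0.9, i_gate=0.0, g=0.0)
    print(round(c_std, 4), round(c_ad, 4))
```

Iterating both cells for 50 steps with no new input leaves the standard state near zero while the anti-decay state stays at the initial value, which is the behavior the AD-LSTM's modified cell unit is designed to obtain.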

5. How to extract robust and discriminative features of the target with both temporal and spatial information is still a challenging task. To obtain the temporal information, we feed multiple template patches of the target into one branch of the Siamese network and use a 3D convolutional block to extract robust features. To obtain the spatial information and locate the target more accurately, we propose a novel dense correlation layer with an attention mechanism that outperforms the two widely used existing correlation layers. Meanwhile, to further improve the tracking performance, we propose a novel ordered-quadruplet loss and a corresponding network to distinguish the target from the candidates. The experimental results show that our tracker performs favorably against state-of-the-art trackers on three popular benchmarks: OTB-100, TC-128, and VOT2017.
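A hedged sketch of what an ordered quadruplet ranking loss could look like: given distances from the anchor to a best candidate, a weaker candidate, and a background sample, penalize each adjacent pair of the ordering that violates a margin. The thesis's exact formulation may differ; the margin value and the two-hinge form are illustrative assumptions.

```python
# Illustrative ordered quadruplet loss: enforce the distance ordering
# d_ap1 < d_ap2 < d_an (best candidate closest, background farthest)
# with a hinge per adjacent pair.

def ordered_quadruplet_loss(d_ap1, d_ap2, d_an, margin=0.2):
    """d_ap1: anchor-to-best distance, d_ap2: anchor-to-weaker distance,
    d_an: anchor-to-negative distance. Zero when both margins hold."""
    return (max(0.0, d_ap1 - d_ap2 + margin)
            + max(0.0, d_ap2 - d_an + margin))
```

The loss is zero for a well-ordered quadruplet and grows with each margin violation, so minimizing it pushes the network toward features that rank candidates correctly.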

Pages: 155
Language: Chinese
Document Type: Degree thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/23775
Collection: State Key Laboratory of Pattern Recognition / Image and Video Analysis
Recommended Citation
GB/T 7714
赵飞. 基于深度学习的视觉目标跟踪方法研究[D]. 中国科学院自动化研究所, 2019.
Files in This Item:
File Name/Size: 2015级-博士学位论文-赵飞.pdf (37107 KB)
DocType: Degree thesis; Access: Open Access; License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.