Object Tracking Based on Deep Discriminative Models


	Object Tracking Based on Deep Discriminative Models
	于斌
	2022-05-18
Pages	144
Degree type	Doctoral
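Chinese abstract (translated)	Visual object tracking is an important research direction in computer vision and a core component of video semantic understanding. With improved hardware performance and the growth of big data, object tracking is now applied in security surveillance, intelligent driving, human-computer interaction, video editing, and other fields. In recent years, the adoption and development of deep learning has considerably improved the performance of tracking algorithms. Among deep learning methods, those based on discriminative models can integrate background information online and have therefore become the mainstream approach. However, the feature fusion and construction methods and the discriminative models used by current trackers limit further gains in tracking performance: current feature fusion and construction methods cannot effectively handle the fusion of complementary features or the sample imbalance problem, existing discriminative models cannot effectively mitigate overfitting, and traditional discriminative models are also limited in solution efficiency and in the amount of information they contain. This dissertation takes deep discriminative models as its research object, studies these problems from the perspectives of feature fusion and construction and of discriminative model design, and proposes several visual object tracking methods based on deep discriminative models that improve the robustness, accuracy, and efficiency of object tracking. The main contributions are summarized as follows:
1. In terms of feature fusion, to address the adaptive fusion of complementary features, this dissertation derives in detail an upper-bound model for multi-kernel correlation filters and proposes a discriminative tracking algorithm based on multi-kernel correlation filters, achieving adaptive fusion of multiple complementary convolutional features. Earlier tracking methods had difficulty making effective use of complementary features and nonlinear kernel functions, which limited their accuracy and efficiency. This dissertation applies multi-kernel learning to the correlation filter model, takes the upper bound of the objective function of the original multi-kernel correlation filter model as the new optimization function, derives the joint optimization of the multi-kernel learning parameters and the correlation filters, and significantly improves computational efficiency through iterative solution, fast frequency-domain optimization, and feature dimensionality reduction. The proposed model builds separate kernel matrices for pre-trained shallow and deep convolutional features and achieves adaptive feature fusion through online optimization. Experiments show that, on multiple mainstream datasets, the proposed method achieves leading localization accuracy among contemporaneous methods together with a high running speed.
2. In terms of feature construction, to address the sample imbalance problem in online tracking, this dissertation proposes a deep discriminative tracking method based on target-aware features. Because foreground and background samples are extremely imbalanced in online tracking (foreground samples are far fewer than background samples), the trained model struggles to fit the foreground samples well. To mitigate the negative impact of this imbalance, the dissertation maps the learned target-unaware features into a target-aware feature space by projection. To make the convolutional feature extraction network better suited to the target-aware feature construction method and the discriminative tracking framework, a discriminative model solver is embedded in the network, and a more suitable feature space is learned through end-to-end offline training. With this target-aware feature construction, the learned model fits foreground samples better during online tracking. Experiments on multiple mainstream datasets show that the method significantly outperforms the baseline on all of them, verifying the effectiveness of the proposed feature construction method.
3. In terms of discriminative model design, to address overfitting, this dissertation proposes a discriminative tracking method with a dynamic orthogonal projection constraint. The features learned by current tracking methods are usually high-dimensional, so the discriminative model contains a large number of learnable parameters, which increases the risk of overfitting in online tracking. The dissertation therefore proposes feature dimensionality reduction with a convolutional neural network to reduce this risk. It first proposes an orthogonally constrained ridge regression model to lower the feature dimension, and then designs a dynamic sub-network that learns the dimensionality reduction. After offline training guided by an orthogonality loss and a regression loss, the sub-network can dynamically generate a set of orthogonal basis vectors and perform adaptive feature dimensionality reduction during online tracking. Based on this discriminative model and dimensionality reduction sub-network, the dissertation proposes an effective tracking method. Experiments show that the proposed method achieves state-of-the-art tracking accuracy on seven mainstream datasets while running in real time.
4. In terms of discriminative model design, and unlike the ridge-regression-based method of Contribution 3, this dissertation proposes a Transformer-based discriminative object tracking method. The tracking models used by current discriminative methods (such as ridge regression) are mostly hand-designed and suffer from low solution efficiency and limited information content. The dissertation proposes a Transformer-based method that effectively applies the attention mechanisms of the encoder and decoder within a discriminative tracking framework. The encoder makes effective use of the scene information in the training images and generates discriminative features; the prediction head network densely produces a foreground/background probability and a target bounding box estimate for every spatial location. During offline training, classification and regression losses guide the training of the network and drive the encoder to generate discriminative features. The proposed method abandons the traditional object tracking network architecture and does not require a hand-designed discriminative model solver. Experiments show that the method achieves leading tracking accuracy and a high running speed on multiple mainstream datasets.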
English abstract	In the field of computer vision, visual object tracking is an important research direction and a core component of video semantic understanding. With the improvement of hardware performance and the development of big data, object tracking is now applied in many fields, such as security monitoring, intelligent driving, human-computer interaction, and video editing. In recent years, with the adoption and development of deep learning, the performance of object tracking algorithms has improved greatly. Among deep learning methods, those based on discriminative models have become the mainstream tracking methods because they can integrate background information online. However, the feature fusion and construction methods of current trackers, and the discriminative models they apply, limit further improvement of tracking performance. Current feature fusion and construction methods cannot effectively deal with the complementary feature fusion problem or the sample imbalance issue. Existing discriminative models cannot effectively alleviate the overfitting problem, and these models also have limitations in solution efficiency and in the amount of information they contain. This dissertation takes deep discriminative models as its research object, conducts in-depth research on these problems from the aspects of feature fusion and construction and of discriminative model design, and proposes several visual tracking methods based on deep discriminative models. The proposed methods improve the robustness, accuracy, and efficiency of object tracking. The main contributions of this dissertation are summarized as follows:
1. From the aspect of feature fusion, aiming at the adaptive fusion of complementary features, this dissertation derives in detail the upper-bound model of multi-kernel correlation filters and proposes a tracking algorithm based on multi-kernel correlation filters that fuses multiple complementary convolutional features adaptively. Previous object tracking methods have difficulty making effective use of complementary features and nonlinear kernel functions, which limits the accuracy and efficiency of the algorithms. In this dissertation, multi-kernel learning is introduced into correlation filters. The upper bound of the objective function of the original multi-kernel correlation filters is used as the new optimization function, and the joint optimization process for the multi-kernel learning parameters and the correlation filters is derived. The use of iterative solution, fast frequency-domain optimization, and feature dimensionality reduction significantly improves efficiency. The proposed model constructs separate kernel matrices for the pre-trained shallow and deep convolutional features and implements adaptive feature fusion through online optimization. Experimental results show that, on multiple popular datasets, the proposed method achieves leading localization accuracy among contemporaneous methods together with a high running speed.
2. From the aspect of feature construction, aiming at the problem of sample imbalance in online tracking, this dissertation proposes a deep discriminative tracking model based on a target-aware feature construction method. Because foreground and background samples are extremely imbalanced in online tracking (foreground samples are far fewer than background samples), it is difficult for the trained model to fit the foreground samples well. Therefore, to alleviate the negative impact of the imbalance, this dissertation maps the learned target-unaware features to the target-aware feature space by projection. To make the convolutional feature extraction network more suitable for the target-aware feature construction method and the discriminative tracking framework, this dissertation integrates a discriminative model solver into the network and learns a more suitable feature space through end-to-end offline training. With this target-aware feature construction method, the learned model can better fit the foreground samples in online tracking. Experimental results show that the proposed method significantly outperforms the baseline method on multiple popular datasets, verifying the effectiveness of the feature construction method.
3. From the aspect of discriminative model design, aiming at the overfitting issue, this dissertation proposes a discriminative tracking method constrained by dynamic orthogonal projection. The features learned by current tracking methods are usually high-dimensional, which results in a large number of learnable parameters in the discriminative model and increases the risk of overfitting in online tracking. Therefore, this dissertation proposes to use convolutional neural networks for feature dimensionality reduction to alleviate this risk. It first proposes a ridge regression model with an orthogonal projection constraint to reduce feature dimensionality, and then designs a dynamic sub-network to learn the dimensionality reduction. After offline training with an orthogonal loss and a regression loss, the sub-network is able to dynamically generate a set of orthogonal bases that reduce feature dimensionality adaptively in online tracking. Based on this discriminative model and dimensionality reduction sub-network, this dissertation proposes an effective tracking method. Experimental results show that the proposed tracking method achieves state-of-the-art performance on seven popular datasets while running at real-time speed.
4. From the aspect of discriminative model design, and in contrast to the ridge-regression-based method in Contribution 3, this dissertation proposes a Transformer-based discriminative object tracking method. The tracking models (such as ridge regression) used by current discriminative methods are mostly hand-designed and suffer from problems such as low solution efficiency and limited information content. In this dissertation, a Transformer-based approach is proposed that effectively applies the attention mechanisms of the encoder and decoder to a discriminative tracking pipeline. The encoder effectively utilizes the scene information in the training images and generates discriminative features; the prediction head network densely produces foreground and background probabilities and target bounding box estimates for each spatial location. In offline training, the networks are trained with classification and regression losses, which drive the encoder to generate discriminative features. The proposed method dispenses with traditional object tracking network architectures and removes the need for hand-designed discriminative model solvers. Experimental results show that the method achieves leading tracking accuracy and a high running speed on multiple popular datasets.
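Keywords	object tracking; discriminative model; deep learning; feature dimensionality reduction; orthogonal projection
Language	Chinese
Document type	Thesis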
Item identifier	http://ir.ia.ac.cn/handle/173211/48667
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	于斌. 基于深度判别性模型的目标跟踪[D]. 中国科学院大学人工智能学院. 中国科学院大学人工智能学院,2022.

Files in this item
File name/size	Document type	Version type	Access type	License
学位论文终版.pdf (23966 KB)	Thesis		Restricted access	CC BY-NC-SA
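
Illustrative note on Contribution 3 of the abstract. The abstract describes ridge regression on features that are first projected onto a set of orthogonal basis vectors, so that fewer weights need to be learned online. The following is a minimal sketch of that general idea only, not the dissertation's method: the dissertation generates the basis with a trained sub-network, whereas this sketch substitutes a truncated SVD, and every name here (`orthonormal_basis`, `ridge_fit`, `lam`, `k`, the random data) is a hypothetical chosen for illustration.

```python
import numpy as np


def orthonormal_basis(features, k):
    """Return a (d, k) matrix with orthonormal columns.

    Assumption: the top-k right singular vectors of the centered features
    stand in for the dynamically generated basis described in the abstract.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def ridge_fit(features, labels, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    return np.linalg.solve(gram, features.T @ labels)


# Synthetic stand-in data: 200 samples with 64-dimensional features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))
y = rng.standard_normal(200)

P = orthonormal_basis(X, k=8)      # projection matrix with P^T P = I
w = ridge_fit(X @ P, y, lam=1.0)   # only 8 weights instead of 64
scores = (X @ P) @ w               # responses computed in the reduced space
```

Projecting from 64 to 8 dimensions shrinks the number of regression weights by the same factor, which is the overfitting argument the abstract makes for dimensionality reduction.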