Research on Real-Time Visual Object Tracking Based on Siamese Networks


	王强 (Wang Qiang)
	2020-05-20
Pages	140
Degree type	Doctoral
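Abstract (translated from Chinese)	Visual object tracking is a fundamental research topic in computer vision and is widely used in intelligent surveillance, vehicle navigation, human-computer interaction, virtual reality, and other fields. Generic visual object tracking establishes the association of a target object of arbitrary type across video frames and thereby determines its trajectory, providing motion perception of the object; it is an important component of intelligent perception. A tracking model for generic objects must adapt to complex changes such as illumination variation, deformation, and occlusion, which poses a great challenge to tracking algorithms. At the same time, unmanned autonomous intelligent platforms, typified by drones and autonomous driving, place high demands on the real-time performance of visual tracking algorithms, so balancing accuracy and speed is also one of the keys to applying visual tracking algorithms in real-world scenes. Addressing these requirements of real-world applications, this thesis takes the efficient siamese network tracking architecture as its research object and proposes several effective feature representation learning methods, achieving end-to-end cross-level feature fusion and brain-inspired attention mechanism modeling. We also introduce the idea of image segmentation into the description of the tracked object's state, extending the form in which tracking results are expressed, and for the first time build a unified framework for visual object tracking and video object segmentation, establishing a new paradigm for object tracking. The main contributions of this thesis are summarized as follows.
An efficient tracking algorithm based on an end-to-end learned discriminative correlation filter is proposed. By deriving the back-propagation of the discriminative correlation filter operation, this thesis achieves joint optimization of automatic deep feature extraction and the correlation filter discriminative model. The method effectively improves the learning ability of the deep feature representation, increases the freedom of feature network design, and significantly reduces computation and storage costs. During end-to-end learning, joint learning over the scale-translation space introduces scale-space samples, which yields more accurate target scale estimation. On this basis, the thesis further explores a semantic embedding model based on deep features and proposes an encoder-decoder self-supervised siamese network that effectively perceives the structural information of the target and its surroundings, improving the generalization and fine-grained representation ability of the features. Finally, by separately constructing a context-aware discriminative correlation filter and a self-supervised semantic embedding model, the thesis achieves complementary cross-level feature fusion representation and learning, significantly improving tracking performance.
An efficient siamese network tracking algorithm based on a residual attention mechanism is proposed. The thesis first reformulates the discriminative tracking framework, decoupling the overall tracking network into a representation network for target features and a discriminant network for discriminative analysis. It then proposes a weighted cross-correlation operator that assigns adaptively adjusted weights to the correlation at different spatial positions of the target. By jointly learning the discriminative correlation loss and the discriminant coefficients of the target region, the algorithm achieves strong adaptability to appearance changes. The discriminant weights are expressed through attention, which is decomposed into a prior attention mechanism that captures the overall sample distribution, a residual attention mechanism that adapts to individual targets, and a channel attention mechanism that adjusts the weights of different semantic layers of the network. Introducing these attention mechanisms reduces overfitting during training, and the lightweight network design ensures good tracking speed.
A unified and efficient siamese-network framework for visual object tracking and segmentation is proposed. The thesis analyzes in depth the current output representations of visual object tracking and, drawing on image segmentation, proposes for the first time a multi-task output representation for object tracking. By adding an independent segmentation branch to the fully convolutional siamese network framework, the architecture can simultaneously estimate the target's bounding box and output a fine-grained segmentation of the target. For the segmentation branch, the thesis adopts a vectorized segmentation representation to capture global target information and proposes a top-down stacked refinement module to enhance segmentation details. Offline training of the framework is optimized through joint multi-branch, multi-task learning. During online tracking, the algorithm needs only the bounding box annotated in the initial frame for initialization to perform both visual object tracking and video object segmentation. While handling multiple tasks, the framework retains high segmentation efficiency, running at close to 55 frames per second. Finally, the thesis extends the unified framework to multi-object tracking, achieving multi-object video instance segmentation without input label supervision.
Based on these methods and innovations, the proposed tracking algorithms achieved the best or leading accuracy at the time on several public datasets and challenges. The thesis also pays particular attention to computational efficiency, and all proposed algorithms run in real time. Finally, the methods and innovations of this thesis may also inform other related computer vision tasks and applications, such as behavior understanding.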
English abstract	Visual object tracking is a fundamental research problem in computer vision and is widely used in intelligent surveillance, vehicle navigation, human-computer interaction, virtual reality, and other fields. Generic object tracking establishes the association of a target object across video frames and then determines the target trajectory, providing motion perception of the object; it is an important part of intelligent perception. The main challenges for successful tracking lie in appearance changes caused by drastic illumination variation, non-rigid deformation, heavy occlusion, and similar factors. At the same time, unmanned autonomous intelligent platforms, typified by drones and autonomous driving, place high demands on the real-time performance of visual object tracking. A good trade-off between accuracy and speed is therefore also one of the keys to applying visual tracking methods in real-world scenes. To address these requirements, this thesis proposes several effective feature learning methods for the siamese network-based tracking architecture. In addition, we introduce the idea of image segmentation to extend the output of target tracking to a dense description of the tracking result, and for the first time build an integrated framework for visual object tracking and video object segmentation, leading to a new paradigm for visual tracking. The main contributions of this thesis are summarized as follows. We propose an end-to-end learnable correlation filter tracking algorithm. By deriving the back-propagation of the discriminative correlation filter operation, this thesis unifies feature representation learning and correlation filter-based appearance modeling within an end-to-end learnable framework. 
It effectively improves the learning ability of the deep feature representation, increases the freedom of feature network design, and significantly reduces both the computational cost and the memory demand of the algorithm. During end-to-end learning, joint scale-position learning introduces scale samples, which provides more accurate target scale estimation. This thesis then explores a semantic embedding model based on deep features and proposes an encoder-decoder network for structure-aware self-supervised learning, which improves the generalization and fine-grained representation ability of the model. Finally, two complementary cross-layer features are used to jointly learn the context-aware correlation filters and the semantic embedding, which significantly increases tracking accuracy. We propose a residual attentional siamese network for high-performance visual object tracking. This thesis first reformulates the discriminative visual tracking framework and decouples the overall tracking network into a feature representation network and a discriminant network for discriminative analysis. Then, a weighted cross-correlation operator is proposed to assign adaptively adjusted weights to different spatial positions of the target. By jointly learning the discriminative correlation loss and the discriminant coefficients of the target region, the model achieves strong adaptability to appearance changes. This thesis expresses the discriminant weights through attention and proposes to decompose the attention mechanism into a prior attention mechanism that captures the overall sample distribution, a residual attention mechanism that adapts to individual targets, and a channel attention mechanism that adjusts the weights of different semantic layers of the network. 
By introducing these attention mechanisms, the algorithm not only mitigates overfitting in deep network training but also enhances its discriminative capacity, thanks to the separation of representation learning and discriminator learning. In addition, benefiting from the lightweight network design, the proposed tracker runs far beyond real-time speed. We propose a unified framework for visual object tracking and video object segmentation based on siamese networks. This thesis analyzes in depth the current output representations of visual object tracking and proposes a more accurate output description for visual tracking. By introducing an independent segmentation branch into the fully convolutional siamese network framework, the model can simultaneously predict the rectangular bounding box of the target and a dense representation of the object as a binary segmentation mask. To further refine the segmentation, this thesis proposes a top-down refinement module that enhances segmentation details. Once offline training is completed, the algorithm relies solely on a single bounding box for initialization and can perform visual object tracking and segmentation simultaneously in real time. Despite handling multiple tasks jointly, it runs efficiently at around 55 frames per second. Finally, this thesis extends the above framework to multi-object tracking scenarios and achieves unsupervised video instance segmentation. These innovations lead to leading evaluation results on several publicly available tracking benchmarks, most of which were the best at the time. The proposed lightweight network design enables the algorithm to achieve leading accuracy while maintaining a real-time speed that benefits practical applications. In addition, other computer vision applications (e.g., action recognition) can take advantage of the proposed methods.
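The weighted cross-correlation operator described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the function names, the nested-list tensor layout, and the rule for combining the three attention terms (`channel[c] * (prior[i][j] + residual[i][j])`) are assumptions made only for illustration, and the attention values here are fixed numbers rather than learned network outputs.

```python
def combine_attention(prior, residual, channel):
    """Build a C x h x w weight tensor from three attention terms.

    Assumed combination rule (illustrative only):
    weight[c][i][j] = channel[c] * (prior[i][j] + residual[i][j])
    """
    return [[[ch * (p + r) for p, r in zip(prior_row, residual_row)]
             for prior_row, residual_row in zip(prior, residual)]
            for ch in channel]


def weighted_cross_correlation(template, search, weight):
    """Slide the template over the search features and sum weighted products.

    template, weight: C x h x w nested lists; search: C x H x W nested lists.
    Returns an (H - h + 1) x (W - w + 1) response map.
    """
    channels = len(template)
    h, w = len(template[0]), len(template[0][0])
    H, W = len(search[0]), len(search[0][0])
    response = []
    for top in range(H - h + 1):
        row = []
        for left in range(W - w + 1):
            score = 0.0
            for c in range(channels):
                for i in range(h):
                    for j in range(w):
                        score += (weight[c][i][j]
                                  * template[c][i][j]
                                  * search[c][top + i][left + j])
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response):
    """Return the (row, column) of the largest response value."""
    best = max((value, r, c)
               for r, row in enumerate(response)
               for c, value in enumerate(row))
    return best[1], best[2]


# Toy example: one channel, a 2 x 2 diagonal template, a 3 x 3 search region.
template = [[[1, 0], [0, 1]]]
search = [[[0, 0, 0], [0, 1, 0], [0, 0, 1]]]
uniform = combine_attention(prior=[[1.0, 1.0], [1.0, 1.0]],
                            residual=[[0.0, 0.0], [0.0, 0.0]],
                            channel=[1.0])
response = weighted_cross_correlation(template, search, uniform)
peak = argmax_2d(response)
```

With uniform weights the operator reduces to plain cross-correlation; a non-zero `residual` map changes how much each spatial position of the template contributes to the response.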
Keywords	visual object tracking; siamese network; end-to-end learning; attention mechanism; instance segmentation
Language	Chinese
Seven major directions: sub-direction classification	Object detection, tracking, and recognition
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/39069
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Video Content Security
Recommended citation (GB/T 7714)	王强. 基于孪生网络的实时视觉目标跟踪研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2020.

Files in this item
File name/size	Document type	Version type	Access type	License
王强_博士毕业论文终版_compress（8516KB）	Thesis		Open access	CC BY-NC-SA