|Advisors||胡卫明 ; 兴军亮|
|Keywords||correlation filters; visual tracking; weakly supervised learning; deep learning; convolutional neural networks|
Visual tracking is a fundamental technique in computer vision, with applications ranging from video surveillance, automatic navigation, and military systems to human-computer interaction, virtual reality, intelligent robots, and autonomous vehicles. Compared with model-based tracking methods, which are designed for specific targets such as pedestrians and vehicles, model-free online visual tracking methods are growing in popularity because they apply to arbitrary targets and use no prior knowledge about them. The aim of model-free visual tracking is to estimate the trajectory of a target in a video, given only its initial location. Tracking scenarios are complex: they involve illumination variations, pose deformations, scale changes, motion blur, occlusions, and target disappearance. A model-free tracker must therefore adapt online to continuous changes in target appearance, remain robust to background disturbances, and run at high speed, all with very limited knowledge about the target. For these reasons, model-free tracking remains a challenging task.
Correlation filters based tracking methods have attracted researchers' attention due to their promising performance and computational efficiency. This thesis presents an intensive study of correlation filters based model-free online visual tracking. In particular, to tackle object scale and aspect ratio changes, rotations, abrupt motions, and occlusions, we enhance conventional correlation filters based trackers in four aspects: feature learning, the motion model, the appearance model, and the inference model. As a result, the proposed tracking methods obtain substantial improvements in tracking accuracy, robustness, and speed. The main contributions of this thesis are summarized as follows:
1） We present a fully-functional correlation filters (FCF) based tracker that, for the first time, fully exploits the correlations in the following three complementary spaces: the joint scale-displacement space, the orientation space, and the temporal space. With these comprehensive correlation analyses across multiple spaces, FCF significantly enhances both its appearance model and its inference model. On the one hand, FCF improves the robustness and adaptability of the appearance model through joint correlation analyses. Specifically, by exploiting the block-circulant structure, the log-polar transform, and the discrete Fourier transform, object appearance modeling is extended from a single displacement space to the joint scale-displacement space and the orientation space, leading to better estimation of object scale changes and rotations. With more accurate estimates, less background noise is introduced into the appearance model, which further boosts tracking robustness. On the other hand, through temporal correlation analysis using an extended high-order Markov chain model, the appearance model preserves pure object appearance and representative appearance modalities while discarding appearance redundancy and background noise. Furthermore, by using temporal correlation analysis to evaluate the correlations between the current object candidate and the object modalities maintained in the enhanced appearance model, the inference procedure becomes more adaptive to abrupt changes in object appearance. Comprehensive experiments on three of the largest and most widely adopted benchmarks validate the functionality of each novel component and show the gains in tracking accuracy and robustness.
2） We propose to exploit category-specific semantics to boost visual object tracking, and we develop a new visual tracking model that augments the appearance based tracker with a top-down reasoning component. This component provides consistent semantic clues across video frames, inferred from object category information, which facilitates the acquisition of the object motion model. Specifically, conventional correlation filters based trackers are bottlenecked by the low quality and insufficiency of their training data and by their low-level appearance modeling, both of which reduce robustness. We develop a generic object recognition model and a category-specific semantic activation map method to provide effective top-down reasoning about object locations for the conventional correlation filters based tracker. This alleviates the negative boundary effects introduced by the correlation filters and enhances the motion model at a high level. In addition, we develop a voting based scheme for the reasoning component to infer the object semantics. As a result, even without sufficient training data, the tracker can still obtain reliable top-down clues about the objects. Combining these with the appearance clues, the tracker can localize objects accurately even in the presence of various major distracting factors. Extensive evaluations on two large-scale benchmark datasets demonstrate that the top-down reasoning substantially enhances the robustness of the tracker and yields state-of-the-art performance.
3） We propose a novel end-to-end learnable, spatially aligned correlation filters based network to handle complex motion patterns of the target. The network learns not only a generic relationship between object geometric transformations and object appearances, but also robust representations coupled to the correlation filters under various geometric transformations. Both feature learning and the motion model are thereby enhanced. Specifically, a feature extraction network is combined with a differentiable correlation filter layer for end-to-end training on a large-scale video dataset, so that discriminative representations are explicitly learned for correlation filters based tracking. Moreover, a spatial alignment module is incorporated into the network to provide spatial alignment capabilities and to reduce the correlation filter's search space over object motion. As a result, challenging issues in previous correlation filters based trackers, including boundary effects and aspect ratio variations, are well addressed. Benefiting from the off-line training of the whole network, the spatial alignment and the correlation filters based localization operate in a mutually reinforcing way, which ensures accurate motion estimation from the consistently optimized network. Furthermore, during online tracking, the light-weight network architecture and the fast computation of the correlation filter layer allow efficient tracking at real-time speed. Experiments on three large-scale benchmark datasets demonstrate that our algorithm performs competitively against existing state-of-the-art methods and achieves high robustness and efficiency.
These innovations yield favorable tracking results on publicly available online tracking benchmarks, some of which were the best at the time. In addition, other computer vision applications, such as semantic segmentation and pose estimation in videos, can take advantage of the proposed methods.
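For readers unfamiliar with the building block that all three contributions extend, the following is a minimal sketch of a conventional single-channel correlation filter computed with the discrete Fourier transform. It is a generic textbook-style formulation (closed-form ridge regression over circular shifts), not the FCF tracker or the network proposed in the thesis. The function names `gaussian_label`, `train_filter`, and `detect`, the regularization weight `lam`, and the label width `sigma` are illustrative assumptions.

```python
import numpy as np


def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian peak at index (0, 0), wrapped circularly."""
    h, w = shape
    ys = np.arange(h)
    xs = np.arange(w)
    ys = np.minimum(ys, h - ys)  # circular distance to row 0
    xs = np.minimum(xs, w - xs)  # circular distance to column 0
    d2 = ys[:, None] ** 2 + xs[None, :] ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def train_filter(patch, label, lam=1e-2):
    """Closed-form ridge regression over all circular shifts of `patch`.

    In the Fourier domain the solution is element-wise:
    H = Y * conj(X) / (|X|^2 + lam).
    """
    X = np.fft.fft2(patch)
    Y = np.fft.fft2(label)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)


def detect(H, patch):
    """Apply the filter to a new patch and return the peak of the response."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(H * Z))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return int(dy), int(dx), response


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.standard_normal((32, 32))      # stand-in for image features
    H = train_filter(patch, gaussian_label(patch.shape))
    moved = np.roll(patch, shift=(3, 5), axis=(0, 1))  # simulated displacement
    dy, dx, _ = detect(H, moved)
    print("estimated circular shift:", (dy, dx))
```

Because the filter is trained on circular shifts, the location of the response peak gives the displacement of the target relative to the training patch, modulo the patch size. This sketch estimates displacement only; it does not handle the scale, rotation, or aspect ratio changes that the thesis addresses, and the circular-shift assumption is the source of the boundary effects mentioned above.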
|张梦丹. 基于相关滤波的在线视觉跟踪研究 (Research on Online Visual Tracking Based on Correlation Filters) [D]. Beijing: 中国科学院研究生院, 2018.|