English Abstract | Visual object tracking (VOT), an essential component of visual understanding, is the cornerstone of higher-level object relation learning. General object tracking provides the primary input for scene understanding, behavior prediction, and other tasks that require a model to establish relationships between instances (e.g., pedestrians, vehicles, animals). Visual object tracking technologies are widely used in intelligent security, video editing, AI coaching, autonomous driving, and virtual reality. The main challenges in visual tracking include distinguishing the target from distractors, mitigating error accumulation, and achieving real-time speed together with high accuracy. The key scientific problem is thus how to improve the robustness of representation learning and object localization while ensuring real-time tracking. To this end, this paper studies real-time visual tracking from four aspects: enhancing representation learning, improving tolerance to error accumulation, improving the accuracy of object localization, and improving the stability of relation learning. The proposed models are applied to typical Siamese tracking frameworks. The main contributions of this paper are summarized as follows:
- The cropping-inside-residual (CIR) module is proposed to build deeper and wider Siamese tracking frameworks. This paper explores the underlying reason for the performance degradation observed when a Siamese tracking model is equipped with a deep convolutional neural network. Comparative experiments show that the padding operation in convolution layers leads to perceptual inconsistency between the two branches of the Siamese paradigm, which eventually breaks the translation invariance of the network. We alleviate this problem with the "cropping-inside-residual (CIR)" module, which preserves the translation invariance of the deep Siamese network by cropping out the feature regions affected by padding. In addition, network design criteria for Siamese tracking are proposed, which show how to choose the most suitable receptive field, network stride, and template feature size. With the proposed criteria and the CIR module, we build neural networks of different depths and apply them to typical Siamese tracking methods. The proposed method achieves improvements of 4 to 6 points on the evaluation benchmarks, which significantly improves the robustness of the Siamese tracking framework.
- An asymmetric label assignment is proposed to build object-aware anchor-free Siamese networks. This work demonstrates that one essential cause of tracking drift is that the regression network in anchor-based Siamese methods is trained only on positive anchor boxes, which leads to poor tolerance to classification errors. This mechanism makes it difficult to refine anchors whose overlap with the target object is small. To mitigate this problem, this paper proposes an asymmetric label assignment mechanism for training. In particular, for the regression sub-network, every pixel inside the ground-truth box is treated as a positive sample in an anchor-free learning manner. For the classification sub-network, the central sampling mode is retained to prevent faulty optimization caused by ambiguous samples. Furthermore, we propose an object-aware feature sampling strategy to optimize the classification and regression sub-networks jointly. Thanks to its lightweight network design, our method achieves better performance while running faster than existing state-of-the-art trackers.
- Attention retrieval is proposed to build an accurate pixelwise object tracking framework. This paper finds that the lack of spatial constraints is one of the fundamental causes of false positives in single-stage pixelwise tracking frameworks. Inspired by two-stage pixelwise tracking methods and video object segmentation (VOS) algorithms, we propose an attention retrieval network based on a label propagation mechanism. By introducing the annotated segmentation mask of the first frame and the prediction results of intermediate frames into network training, the proposed method can effectively perceive both the target object and background interference. The attention map generated by the network suppresses background interference and enhances the representation of the target itself, thereby reducing false positives. Our work improves the accuracy of single-stage pixelwise tracking while preserving its real-time running speed.
- An automatic matching network design algorithm is proposed for Siamese tracking. To improve the resilience of Siamese tracking across different scenes, this paper proposes a search algorithm that automatically designs matching networks. For the first time, template matching via explicit similarity learning is abandoned. Instead, we solve the matching problem by training the model on fused visual features. Directly fusing visual features can be regarded as implicit similarity learning, and it can benefit from the diverse fusion operators used in other vision tasks. By automatically searching over and combining the newly defined matching operators, we obtain task-aware networks for the classification and regression sub-tasks, which increases the capacity of the two sub-networks. With only half of the training data used by the baseline tracker, we achieve gains of 9 percentage points on the evaluation benchmark.
With the above designs, the tracking methods proposed in this paper achieve leading accuracy and running speed on multiple public visual object tracking benchmarks and challenges.
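The cropping idea behind the CIR module can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`conv3x3`, `crop`, `cir_block`), the single 3x3 convolution, the ReLU after the residual sum, and the one-pixel crop are all assumptions chosen for illustration. The sketch shows that cropping the border after a zero-padded residual block leaves only features that never saw padded zeros.

```python
# Hypothetical illustration of cropping-inside-residual; names and block layout are assumptions.

def conv3x3(x, kernel, padding):
    """3x3 cross-correlation on a 2D list; padding is 0 (valid) or 1 (zero padding)."""
    h, w = len(x), len(x[0])
    if padding:
        zero_row = [0.0] * (w + 2)
        x = [zero_row] + [[0.0] + list(row) + [0.0] for row in x] + [zero_row]
        h, w = h + 2, w + 2
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            row.append(sum(kernel[a][b] * x[i + a][j + b]
                           for a in range(3) for b in range(3)))
        out.append(row)
    return out


def crop(x, border=1):
    """Drop `border` rows and columns from every side of a 2D list."""
    return [row[border:-border] for row in x[border:-border]]


def cir_block(x, kernel):
    """Padded conv + identity shortcut + ReLU, then crop the padding-affected border."""
    y = conv3x3(x, kernel, padding=1)          # same spatial size as x
    summed = [[max(0.0, a + b) for a, b in zip(ry, rx)]
              for ry, rx in zip(y, x)]          # residual addition followed by ReLU
    return crop(summed, 1)                      # outermost ring depended on padded zeros
```

For a single 3x3 convolution, the cropped output matches what an unpadded convolution plus a cropped shortcut would produce, so in this sketch the retained features are independent of the padding.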
|
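The asymmetric label assignment described above can be sketched as follows. This is an illustrative assumption rather than the thesis code: the function name `assign_labels`, the grid layout, the `center_radius` parameter, and the choice of distances to the four box sides as regression targets are all hypothetical. The sketch marks every grid location inside the ground-truth box as positive for regression, while only locations near the box center are positive for classification.

```python
# Hypothetical sketch of asymmetric label assignment; all names and parameters are assumptions.

def assign_labels(grid_size, stride, box, center_radius):
    """Return (reg_mask, reg_targets, cls_labels) for a square grid of locations.

    box is (x1, y1, x2, y2) in image pixels. Regression treats every location
    inside the box as positive; classification keeps only locations within
    center_radius of the box center (central sampling).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    reg_mask, reg_targets, cls_labels = [], [], []
    for gy in range(grid_size):
        mask_row, target_row, cls_row = [], [], []
        for gx in range(grid_size):
            # Map the grid cell to the image coordinate of its center.
            px = gx * stride + stride // 2
            py = gy * stride + stride // 2
            inside = x1 <= px <= x2 and y1 <= py <= y2
            central = abs(px - cx) <= center_radius and abs(py - cy) <= center_radius
            mask_row.append(1 if inside else 0)
            # Assumed target: distances to the left, top, right, and bottom sides.
            target_row.append((px - x1, py - y1, x2 - px, y2 - py) if inside else None)
            cls_row.append(1 if inside and central else 0)
        reg_mask.append(mask_row)
        reg_targets.append(target_row)
        cls_labels.append(cls_row)
    return reg_mask, reg_targets, cls_labels
```

With a 5x5 grid, a stride of 8, and the box (8, 8, 32, 32), the regression branch receives many more positive locations than the classification branch, which is the asymmetry the text describes.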