Single object tracking is the task of continuously locating a moving target in subsequent frames after it has been arbitrarily designated in the first frame. As one of the fundamental research directions in computer vision, it gives machines human-like tracking capabilities in complex environments and plays an important role in applications such as intelligent surveillance, autonomous driving, and intelligent transportation. Despite significant advances over the years, tracking algorithms still lack robustness when faced with the challenging factors of real-world environments, leaving a noticeable gap with human visual tracking abilities. This shortcoming arises because the field has emphasized algorithm design while paying comparatively little attention to evaluation. Consequently, evaluation techniques have lagged behind, which makes it harder to explore what the algorithms are actually capable of. Furthermore, although some studies have constructed evaluation environments and designed evaluation metrics, they predominantly measure the performance of tracking algorithms and neglect the evaluation of intelligence. As a result, the same algorithm may perform very differently in experimental environments and in real-world scenarios. In light of these issues, this thesis addresses the limitations of existing evaluation techniques and investigates several key aspects of a comprehensive intelligent evaluation system for single object tracking: evaluation tasks, experimental environments, task executors, and evaluation mechanisms. It aims to propose improved solutions for the intelligent evaluation of single object tracking, and conducts the following research:

(1)   Large-scale multi-dimensional global instance tracking intelligent evaluation platform. In complex environments, including adversarial scenarios, many single object tracking algorithms that perform well in experimental environments still lag significantly behind human dynamic visual abilities. This discrepancy highlights the need for an evaluation method that goes beyond machine-machine comparison and incorporates the human factor. To address these issues, this thesis expands the traditional single object tracking task into the global instance tracking task, with the goal of accurately modeling human dynamic visual abilities. Building upon this concept, the thesis introduces a video content decoupling framework and develops a large-scale, multi-dimensional global instance tracking intelligent evaluation platform comprising 7.46 million frames. The platform contains the most extensive single object tracking dataset to date and is designed to provide a high-quality environment that approximates real-world application scenarios for algorithm research. Finally, the thesis adopts a human-machine comparison approach, introducing human subjects through a visual Turing test. This evaluation assesses the dynamic visual tracking abilities of human subjects and of 20 representative tracking algorithms, measuring the algorithms' intelligence against human baselines. As of November 2023, the platform has received more than 236,000 visits from over 130 countries and regions. Research institutions and enterprise users in China and abroad, including the Chinese Academy of Sciences, Peking University, Stanford University, and Alibaba, have downloaded, registered, or submitted evaluations a total of 1,073 times. In addition, 374 single object tracking algorithms have been evaluated on it, making it a representative platform.

(2)   User-defined single object tracking task space. The research goal of single object tracking is to achieve dynamic visual capabilities similar to those of humans. Although researchers have constructed several datasets and evaluation platforms, these datasets are isolated from one another, which makes it difficult to use highly challenging data effectively in robustness studies. Moreover, evaluation metrics based on mean values tend to overlook challenging frames that make up only a small proportion of the data. To address these issues, this thesis introduces the 3E paradigm as a task analysis framework and integrates representative datasets, transforming the original static, closed datasets into a dynamic, open data space comprising 12.56 million frames. In addition, a subspace construction method is proposed to facilitate user-defined task studies. To further overcome the limitations of existing evaluation techniques, this thesis proposes new evaluation systems and metrics that can adapt to different evaluation needs. Extensive experiments provide a fine-grained analysis of 23 representative single object tracking algorithms and explore the performance bottlenecks that trackers with different architectures encounter under various challenging factors.
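The point about mean-based metrics can be illustrated with a minimal sketch. It is not the thesis's metric or data: the per-frame overlap scores, the 5% share of challenging frames, and the helper name `subset_mean` are all assumptions chosen for illustration.

```python
from statistics import mean


def subset_mean(scores, is_challenging):
    """Mean score over only the frames flagged as challenging (hypothetical helper)."""
    hard = [s for s, flag in zip(scores, is_challenging) if flag]
    return mean(hard) if hard else None


# Hypothetical per-frame overlap scores for one tracker:
# 95 easy frames tracked well, 5 challenging frames tracked poorly.
scores = [0.8] * 95 + [0.1] * 5
is_challenging = [False] * 95 + [True] * 5

overall = mean(scores)                            # dominated by the easy frames
hard_only = subset_mean(scores, is_challenging)   # exposes the failure

print(f"mean over all frames:         {overall:.3f}")
print(f"mean over challenging frames: {hard_only:.3f}")
```

Under these assumed numbers the overall mean stays near 0.765 even though the tracker scores only about 0.1 on the challenging frames, which is the kind of effect that motivates evaluating on user-defined subspaces rather than on a single averaged score.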

(3)   Visual Turing-based intelligent evaluation technology for single object tracking. Computer scientists typically use large-scale datasets to evaluate machine models, while neuroscientists typically use simple experimental environments to evaluate human subjects. This discrepancy makes it challenging to integrate human and machine evaluation into a unified framework for comparison and analysis. To address this issue, this thesis extends the visual Turing evaluation approach proposed in research (1) and builds on the user-defined single object tracking task space proposed in research (2). First, an experimental environment is constructed to enable a fair comparison between human and machine dynamic visual capabilities. The environment comprises 87 highly representative sequences that comprehensively cover the sub-tasks of single object tracking and thoroughly examine the perceptual abilities, cognitive abilities, and robust tracking capabilities of the task executors. On this foundation, 20 algorithms with different architectures and 15 human subjects are selected for testing, and a human-machine dynamic visual capability evaluation framework is designed. Finally, a fine-grained experimental analysis is carried out from the perspectives of human-machine comparison and human-machine collaboration. The experimental results show that representative tracking algorithms have gradually narrowed the gap with human subjects. Furthermore, humans and machines each exhibit unique strengths in dynamic visual tasks, suggesting significant potential for human-machine collaboration.

(4)   Benchmark for multi-modal global instance tracking based on a hierarchical semantic framework. Single object tracking algorithms often struggle to track robustly in long videos that contain complex causal and spatio-temporal relationships. While some studies attempt to incorporate the language modality to aid tracking, they typically use short videos with simple narrative content and provide semantic labels that suffer from semantic ambiguity and incomplete grammatical structure. To address these issues, this thesis first constructs a benchmark for multi-modal global instance tracking that includes videos with complex narrative content. The average length of its video sequences is 5 to 22 times that of existing multi-modal single object tracking benchmarks. In addition, drawing on the hierarchical structure of human cognition, this thesis proposes a multi-granularity semantic annotation strategy aimed at providing high-quality semantic information. Through comparison with other multi-modal single object tracking benchmarks, the thesis then validates the feasibility of the multi-granularity annotation strategy as a means of coupling human understanding of long videos with semantic labels. Finally, this thesis introduces a multi-modal evaluation mechanism and uses it to conduct a detailed experimental analysis of trackers with different architectures, thereby exploring the performance bottlenecks of multi-modal trackers.


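Keywords: single object tracking; intelligent evaluation technology; visual tracking ability; experimental environment; evaluation platform
Citation (GB/T 7714):
Hu Shiyu (胡世宇). Research on Intelligent Evaluation Technology for Single Object Tracking (单目标跟踪中的智能评估技术研究) [D], 2023.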
File name / size: 单目标跟踪中的智能评估技术研究-签字版. (33423 KB)
Document type: degree thesis; Access type: restricted access; License: CC BY-NC-SA
