Research on Intelligent Evaluation Techniques for Single Object Tracking (单目标跟踪中的智能评估技术研究)

	Hu Shiyu (胡世宇)
	2023-11-23
Pages	192
Degree type	Doctoral
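Chinese abstract (translated)	The single object tracking task consists of designating an arbitrary moving target in the first frame of a video sequence and continuously locating it in subsequent frames. As one of the fundamental research directions in computer vision, single object tracking models the human dynamic visual system so that machines can acquire human-like tracking capabilities in complex environments, and it plays an important role in applications such as intelligent surveillance, autonomous driving, and intelligent transportation. After decades of research, tracking algorithms perform well in representative experimental environments, but they still lack robustness against the diverse challenging factors of real environments and remain far behind human visual tracking ability. The reason is that the field currently focuses on algorithm design, whereas research on evaluation techniques is comparatively scarce, has developed slowly, and struggles to probe the capability bottlenecks of current algorithms. Although some studies have worked on building evaluation environments and designing evaluation metrics, they mainly address how to measure the performance of tracking algorithms and overlook the evaluation of their level of intelligence, so that the same algorithm ultimately performs very differently in experimental environments and in real scenarios. To address these problems, this thesis starts from the limitations of existing evaluation techniques and studies each key component of a complete intelligent evaluation system (evaluation task, evaluation environment, task executors, and evaluation mechanism), aiming to explore better solutions for the intelligent evaluation of single object tracking. The research is summarized as follows:
(1) A large-scale, multi-dimensional intelligent evaluation platform for global instance tracking. In complex scenes, and especially in adversarial environments, many single object tracking algorithms that excel in experimental environments still fall well short of human dynamic visual ability. This indicates that a sound evaluation approach should not be limited to "machine versus machine" performance comparison but should bring the "human" factor into the evaluation loop. To this end, this work first extends the single object tracking task to the global instance tracking task, aiming to model human dynamic visual ability precisely. On this basis, it proposes the 6D principle as a framework for decoupling video content and builds a large-scale, multi-dimensional global instance tracking intelligent evaluation platform containing 7.46 million frames. The platform includes the largest single object tracking dataset to date, rich in shot cuts and scene transitions, and is intended to give algorithm research a high-quality environment close to real application scenarios. Finally, from the perspective of "human-machine confrontation", this work introduces human subjects following the idea of the visual Turing test and evaluates the dynamic visual ability of the human subjects and of 20 representative tracking algorithms, so that algorithm intelligence is measured against human dynamic visual ability. As of November 2023, the platform has received more than 236,000 visits from over 130 countries and regions and a total of 1,073 downloads by research institutions and enterprise users in China and abroad, including the Chinese Academy of Sciences, Peking University, Stanford University, and Alibaba; 374 single object tracking algorithms have been registered and evaluated, which makes the platform reasonably representative.
(2) A user-defined task space for single object tracking. Achieving human-like dynamic visual ability is the research goal of single object tracking. Although researchers have built a number of datasets and evaluation platforms, the datasets are isolated from one another, which makes it hard to use highly challenging data effectively for robustness research, and evaluation metrics designed as averages cause difficult frames to be ignored because they make up a small share of the data. To this end, this work first proposes the 3E paradigm as a task analysis framework and integrates representative datasets, including that of work (1), merging the originally static, closed datasets into a dynamic, open data space of 12.56 million frames; on this basis it proposes a subspace construction algorithm to help researchers carry out user-defined task studies. In addition, to address the limitations of existing evaluation techniques, this work proposes a new evaluation system and new evaluation metrics that accommodate different evaluation needs and task executors, and through extensive experiments it performs a fine-grained analysis of 23 representative algorithms, effectively uncovering the performance bottlenecks of algorithms with different architectures when they face each challenging factor.
(3) Intelligent evaluation technology for single object tracking based on the visual Turing test. Computer scientists usually evaluate machine models on large-scale datasets, whereas neuroscientists usually evaluate human subjects in simple experimental environments, so humans and machines are hard to bring into one evaluation framework for comparison and analysis. To this end, this work extends the visual Turing evaluation idea proposed in work (1) and builds on the user-defined single object tracking task space proposed in work (2). It first constructs an experimental environment in which human and machine dynamic visual ability can be compared fairly; the environment consists of 87 highly representative video sequences, covers the sub-tasks of single object tracking comprehensively, and thoroughly examines the perceptual ability, cognitive ability, and robust tracking ability of the task executors. On this basis, 20 algorithms with different architectures and 15 human subjects are selected as task executors for dynamic visual ability tests, and an evaluation framework for dynamic visual ability that applies to both humans and machines is designed. Finally, fine-grained experimental analyses are carried out from the perspectives of "human-machine confrontation" and "human-machine collaboration". The results show that representative tracking algorithms are gradually narrowing the gap with human subjects, and that humans and machines each have their own strengths in dynamic visual tasks, leaving considerable potential for cooperation.
(4) A multi-modal global instance tracking benchmark based on a hierarchical semantic framework. Single object tracking algorithms usually struggle to remain robust on long videos that contain complex causal and spatio-temporal relationships. Although some studies have tried to introduce the language modality to assist tracking, they typically use short videos with simple narrative content, and the semantic labels they provide suffer from semantic ambiguity and incomplete grammatical structure. To this end, this work first builds a multi-modal global instance tracking benchmark with complex video narratives, in which the average sequence duration is 5-22 times that of existing multi-modal single object tracking benchmarks. In addition, drawing on the hierarchical structure of human cognition, it proposes a novel multi-granularity semantic annotation strategy intended to provide high-quality semantic information. Comparative experiments against other multi-modal single object tracking benchmarks show that this multi-granularity annotation strategy is a feasible way to couple human long-video understanding with semantic labels. Finally, this work proposes an evaluation mechanism suited to the multi-modal single object tracking task and uses it to analyze single object tracking methods with different structures in detail, thereby uncovering the performance bottlenecks of multi-modal single object tracking algorithms.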
English abstract	Single object tracking is defined as arbitrarily designating a moving target in the first frame of a video sequence and continuously locating it in subsequent frames. As one of the fundamental research directions in computer vision, single object tracking models the human dynamic visual system, allowing machines to possess human-like tracking capabilities in complex environments, and plays an important role in application scenarios such as intelligent surveillance, autonomous driving, and intelligent transportation. Despite significant advancements in tracking algorithms over the years, they still lack robustness when faced with various challenging factors in real-world environments, leaving a noticeable gap with human visual tracking abilities. This drawback arises because the field emphasizes algorithm design, while research on evaluation techniques remains relatively limited. Consequently, the development of evaluation techniques lags behind, and it is difficult to probe the capability bottlenecks of current algorithms. Furthermore, although some studies have constructed evaluation environments and designed evaluation metrics, they predominantly focus on measuring the performance of tracking algorithms, neglecting the evaluation of intelligence. As a result, the same algorithm may exhibit significant performance disparities between experimental environments and real-world scenarios. In light of these issues, this thesis starts from the limitations of existing evaluation techniques and investigates the key components of a complete intelligent evaluation system for single object tracking: evaluation tasks, evaluation environments, task executors, and evaluation mechanisms. This thesis aims to propose improved solutions for the intelligent evaluation of single object tracking, and its research is summarized as follows: (1) Large-scale multi-dimensional global instance tracking intelligent evaluation platform.
In complex environments, especially adversarial ones, numerous single object tracking algorithms that perform well in experimental environments still lag significantly behind human dynamic visual abilities. This discrepancy highlights the need for an evaluation method that goes beyond machine-machine comparisons and brings the human factor into the evaluation loop. To address these issues, this thesis first extends the traditional single object tracking task to the global instance tracking task, with the goal of accurately modeling human dynamic visual abilities. Building upon this task, the thesis introduces the 6D principle as a video content decoupling framework and develops a large-scale, multi-dimensional global instance tracking intelligent evaluation platform containing 7.46 million frames. This platform includes the largest single object tracking dataset to date, with abundant shot cuts and scene transitions, and is designed to provide a high-quality environment that approximates real-world application scenarios for algorithm research. Finally, the thesis adopts a human-machine comparison approach, introducing human subjects through a visual Turing test. This evaluation method assesses the dynamic visual tracking abilities of both human subjects and 20 representative tracking algorithms, measuring the algorithms' intelligence against human baselines. As of November 2023, this platform has received more than 236,000 visits from over 130 countries and regions. It has been downloaded a total of 1,073 times by research institutions and enterprise users at home and abroad, including the Chinese Academy of Sciences, Peking University, Stanford University, and Alibaba. In addition, 374 single object tracking algorithms have been registered and evaluated on it, making it a reasonably representative platform. (2) User-defined single object tracking task space. The research goal of single object tracking is to achieve human-like dynamic visual capabilities.
Although researchers have constructed several datasets and evaluation platforms, these datasets are isolated from one another, which makes it difficult to use high-challenge data effectively for robustness studies. Moreover, evaluation metrics computed as mean values tend to overlook challenging frames because such frames make up only a small proportion of the data. To address these issues, this thesis first introduces the 3E paradigm as a task analysis framework and integrates representative datasets, including the one from research (1), transforming the originally static, closed datasets into a dynamic, open data space comprising 12.56 million frames. On this basis, a subspace construction method is proposed to facilitate user-defined task studies. To further overcome the limitations of existing evaluation techniques, this thesis proposes a new evaluation system and new metrics that adapt to different evaluation needs and task executors. Extensive experiments provide a fine-grained analysis of 23 representative single object tracking algorithms and effectively reveal the performance bottlenecks of trackers with different architectures when they encounter various challenging factors. (3) Visual Turing-based intelligent evaluation technology for single object tracking. Computer scientists typically use large-scale datasets to evaluate machine models, while neuroscientists typically employ simple experimental environments to evaluate human subjects. This discrepancy makes it difficult to bring humans and machines into a unified evaluation framework for comparison and analysis. To address these issues, this thesis extends the visual Turing evaluation approach proposed in research (1) and builds on the user-defined single object tracking task space proposed in research (2). First, an experimental environment is constructed to enable a fair comparison between human and machine dynamic visual capabilities.
The environment comprises 87 highly representative video sequences that comprehensively cover the sub-tasks of single object tracking. These sequences provide a thorough examination of the perceptual abilities, cognitive abilities, and robust tracking capabilities of the task executors. On this foundation, 20 algorithms with different architectures and 15 human subjects are selected as task executors for dynamic visual ability tests, and an evaluation framework for dynamic visual capability that applies to both humans and machines is designed. Finally, a fine-grained experimental analysis is carried out from the perspectives of human-machine comparison and human-machine collaboration. The experimental results show that representative tracking algorithms are gradually narrowing the gap with human subjects. Furthermore, humans and machines each exhibit their own strengths in dynamic visual tasks, suggesting significant potential for human-machine collaboration. (4) Benchmark for multi-modal global instance tracking based on a hierarchical semantic framework. Single object tracking algorithms often struggle to remain robust when dealing with long videos that contain complex causal and spatio-temporal relationships. While some studies attempt to incorporate the language modality to aid tracking, they typically use short videos with simple narrative content and provide semantic labels that suffer from semantic ambiguity and incomplete grammatical structures. To address these issues, this thesis first constructs a benchmark for multi-modal global instance tracking that includes videos with complex narrative content. The average length of the video sequences in this benchmark is 5-22 times that of existing multi-modal single object tracking benchmarks. Additionally, by incorporating the hierarchical structure of human cognition, this thesis proposes a multi-granularity semantic annotation strategy aimed at providing high-quality semantic information.
Comparative experiments against other multi-modal single object tracking benchmarks show that the multi-granularity annotation strategy is a feasible way to couple human long-video understanding with semantic labels. Finally, this thesis introduces an evaluation mechanism suited to multi-modal single object tracking and uses it to conduct a detailed experimental analysis of trackers with different structures, thereby revealing the performance bottlenecks of multi-modal trackers.
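Keywords	single object tracking; intelligent evaluation technology; visual tracking ability; experimental environment; evaluation platform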
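Subject area	Computer perception
Discipline category	Engineering::Computer Science and Technology (degrees in engineering or science may be conferred)
Indexed by	Other
Language	Chinese
Seven major directions: sub-direction classification	Object detection, tracking, and recognition
State Key Laboratory planned direction classification	Intelligent capability evaluation
Associated dataset requiring deposit	No
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/54535
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	胡世宇. 单目标跟踪中的智能评估技术研究[D],2023.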

Files in this item
File name/size	Document type	Version type	Access type	License
单目标跟踪中的智能评估技术研究-签字版. (33423KB)	Thesis		Restricted access	CC BY-NC-SA