Object Detection and Tracking Based on Information Mining from Sequential Images


	Object Detection and Tracking Based on Information Mining from Sequential Images
	何嘉伟 (He Jiawei)
	2024-05
Pages	124
Degree type	Doctoral
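Chinese abstract (translated)	Object perception is a key technology in computer vision, and object detection and tracking are two of its fundamental tasks. Image-based object detection and tracking are widely used in autonomous vehicles, intelligent robots, drones, augmented reality, industrial quality inspection, video surveillance, motion analysis, and other real-world scenarios. However, some scenarios remain difficult, such as 3D detection of small or distant objects, and tracking of objects that are occluded or overlapping, similar in appearance, or invisible for long periods. These difficulties limit the practical application and further development of object detection and tracking. Exploiting the temporal information in sequential images is one effective way to address them. This dissertation therefore studies object detection and tracking, two perception tasks linked through time, with a particular focus on mining the temporal information of objects in image sequences or videos. The main contributions are as follows. 1) A multi-object tracking paradigm based on learnable graph matching is proposed, which emphasizes the importance of intra-frame relationships. The method represents intra-frame relationships as undirected graphs and formulates data association as a general graph matching problem. Because the original quadratic assignment formulation of graph matching is NP-hard, a continuous relaxation based on quadratic programming is proposed. Using the implicit function theorem and the KKT conditions, the relaxation is embedded into a deep neural network as a differentiable graph matching module. A gated search tree algorithm is also designed, which substantially speeds up solving the graph matching problem. 2) An online multi-object tracking method with joint 2D and 3D tracking that requires only 2D labels is proposed. A 3D point cloud of the scene is obtained by Structure-from-Motion reconstruction. Combining the learnable graph matching paradigm, this work proposes a new method for matching keypoints between images, which reconstructs the whole scene better. In the reconstructed point cloud, objects are clustered into point clusters, from which the 3D object centers are obtained. This work designs a reconstruction-based pseudo 3D object label generation module and a 3D object representation learning module. The method learns 3D object representations from monocular video alone, supervised by 2D tracking labels, without additional annotation from LiDAR or pre-trained depth estimators. 3) A temporal 3D object detection and tracking method based on temporal global optimization of objects is proposed. Object-centric temporal 3D reconstruction is realized, and a two-stage temporal 3D object detector is designed on top of it. In particular, this work designs an object-centric temporal correspondence learning module and proposes a featuremetric object bundle adjustment loss; with these designs, temporal feature learning becomes the second stage of the 3D object detector and is trained jointly with it. The proposed temporal method detects 3D objects more accurately and substantially improves 3D detection of distant objects. 4) A weakly supervised monocular 3D object detection method based on multi-stage generalization is proposed. Building on the second work, it further studies how to learn 3D bounding boxes from 2D supervision. By exploiting the generalization ability of neural networks, it offers the first practical solution to this problem. Starting from 3D bounding box pseudo-labels obtained by 3D reconstruction, this work designs three stages of generalization: from complete objects to partially visible objects, from static objects to moving objects, and from near to far, bringing weakly supervised 3D object detection close to fully supervised performance. Overall, for object detection and tracking in sequential images, this dissertation first studies fully supervised methods, including graph-matching-based object association applied to multi-object tracking, and object-centric temporal 3D object detection. Using 3D reconstruction as a means of lifting from 2D to 3D, 3D object representations can be obtained without a depth model. Following this idea, the dissertation explores joint 2D-3D tracking and weakly supervised 3D object detection. Compared with concurrent work, the proposed methods all show significant performance gains and reach leading results on the evaluation datasets commonly used in the field. They effectively address object localization and temporal association in complex scenarios such as occlusion and long distance, and have both academic novelty and practical value.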
English abstract	Object perception is a key technology in computer vision, and object detection and tracking are two of its fundamental tasks. Image-based object detection and tracking are widely used in practical scenarios such as autonomous vehicles, intelligent robots, drones, augmented reality, industrial quality inspection, video surveillance, and motion analysis. However, some scenarios remain challenging, such as 3D detection of small or distant objects, and tracking of objects that are occluded or overlapping, similar in appearance, or invisible for a long time. These challenges limit the practical application and further development of object detection and tracking technologies. Exploiting temporal information in sequential images is one effective way to address them. Therefore, this dissertation studies object detection and tracking, two perception tasks that are linked through time, and emphasizes the mining of temporal information in sequential images or videos. The main contributions presented in this dissertation include: 1) Proposing a multi-object tracking paradigm based on learnable graph matching that emphasizes the importance of intra-frame relationships. This method represents intra-frame relationships as undirected graphs and models data association as a general graph matching problem. Because the original quadratic assignment formulation of graph matching is NP-hard, a continuous relaxation based on quadratic programming is proposed. Using the implicit function theorem and the KKT conditions, the relaxation is embedded into deep neural networks as a differentiable graph matching module. 
To accelerate solving the graph matching problem, a gated search tree algorithm is designed, which significantly speeds up the solver. 2) Proposing an online multi-object tracking method for joint 2D-3D tracking that requires only 2D labels. 3D scene point clouds are reconstructed with Structure-from-Motion, and the learnable graph matching paradigm is used to build a new inter-image keypoint matching method that reconstructs the entire scene better. In the reconstructed 3D point clouds, objects are clustered into point clusters, from which the 3D object centers are obtained. This work designs a reconstruction-based pseudo-3D object label generation module and a 3D object representation learning module. The method learns 3D object representations solely from monocular videos, supervised by 2D tracking labels, so no additional annotations from LiDAR or pre-trained depth estimators are needed. 3) Proposing a temporal 3D object detection and tracking method based on global optimization of object temporal information. This work achieves object-centric temporal 3D reconstruction and designs a two-stage temporal 3D object detector accordingly. In particular, this work designs an object-centric temporal correspondence learning module and proposes a featuremetric object-centric bundle adjustment loss function, so that temporal feature learning is trained jointly as the second stage of 3D object detection. The proposed temporal method detects 3D objects more accurately and especially improves long-distance 3D object detection. 4) Proposing a weakly supervised monocular 3D object detection method based on multi-stage generalization. Building on the second contribution, this work further investigates how to learn 3D bounding boxes using only 2D supervision. 
By leveraging the generalization ability of neural networks, a practical solution to this problem is proposed for the first time. Starting from 3D bounding box pseudo-labels obtained from 3D reconstruction, this work designs three stages of generalization: from complete objects to partially visible objects, from static objects to moving objects, and from close range to long range, bringing the weakly supervised 3D object detection method close to fully supervised performance. In summary, for object detection and tracking in sequential images, this dissertation first studies fully supervised methods, including graph-matching-based data association applied to multi-object tracking, and object-centric temporal 3D object detection. By using 3D reconstruction to lift 2D observations into 3D, object representations in 3D can be obtained without relying on depth estimation models. On this basis, joint 2D-3D tracking and weakly supervised 3D object detection methods are explored. The proposed methods show significant performance improvements over concurrent works and achieve leading results on the evaluation datasets commonly used in the field. They effectively address object localization and temporal association in complex scenarios such as occlusion and long distances, demonstrating both academic innovation and practical application value.
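As a rough illustration of the first contribution's problem statement, the sketch below treats data association between two frames as a graph matching problem with a unary (node) term and a pairwise (edge) term, and solves it exactly by exhaustive search. This is not the dissertation's method: the abstract describes a quadratic-programming relaxation embedded in a network, whereas exhaustive search only works for a handful of objects. The function names, the distance-based affinities, and the `unary_weight` value are all hypothetical choices made for this example.

```python
import itertools
import math


def match_score(perm, pts_a, pts_b, unary_weight=0.1):
    """Score an assignment where detection i in frame A maps to perm[i] in frame B.

    Hypothetical affinities: the unary term penalizes how far each detection
    moved, and the pairwise term penalizes changes in the distance between
    every pair of detections (the intra-frame relationship).
    """
    n = len(pts_a)
    unary = -sum(math.dist(pts_a[i], pts_b[perm[i]]) for i in range(n))
    pairwise = -sum(
        abs(math.dist(pts_a[i], pts_a[j])
            - math.dist(pts_b[perm[i]], pts_b[perm[j]]))
        for i in range(n)
        for j in range(i + 1, n)
    )
    return unary_weight * unary + pairwise


def brute_force_graph_matching(pts_a, pts_b):
    """Try every permutation (factorial cost) and return the best-scoring one.

    Assumes both frames contain the same number of detections.
    """
    n = len(pts_a)
    return max(
        itertools.permutations(range(n)),
        key=lambda perm: match_score(perm, pts_a, pts_b),
    )


# Three detections in frame A; frame B holds the same objects shifted by
# (1, 1) and listed in a different order.
frame_a = [(0, 0), (4, 0), (0, 3)]
frame_b = [(1, 4), (1, 1), (5, 1)]
print(brute_force_graph_matching(frame_a, frame_b))  # (1, 2, 0)
```

Because the pairwise term compares edge lengths, the correct assignment is recovered even though every object has moved. The factorial cost of the search is the practical reason a relaxed, differentiable formulation such as the one described in the abstract is attractive.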
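Keywords	3D object detection; multi-object tracking; temporal information mining; graph matching; 3D reconstruction
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/57422
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	何嘉伟. 基于序列图像信息挖掘的物体检测与跟踪[D],2024.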

Files in this item
File name/size	Document type	Version type	Access type	License
基于序列图像信息挖掘的物体检测与跟踪_何（22944KB）	Thesis		Restricted access	CC BY-NC-SA