Research on Cross-Dimensional Visual Perception Methods for Complex Scenes (面向复杂场景的跨维度视觉感知方法研究)

	Research on Cross-Dimensional Visual Perception Methods for Complex Scenes
	Pan Cong (潘聪)
	2024-05
Pages	113
Degree type	Doctoral
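Abstract (translated from Chinese)	Driven by the rapid development of artificial intelligence, visual scene perception has shown great application potential in fields such as autonomous driving, intelligent surveillance, and robot navigation. Visual perception methods for complex scenes aim to accurately capture and process visual information in order to recognize objects, understand scenes, and guide behavior. With the rapid progress of deep learning, the field of visual scene perception has advanced significantly and perception performance has steadily improved. In real-world applications, however, the diversity of object scales and the complexity of scenes pose new challenges for traditional two-dimensional visual perception methods. In complex autonomous driving scenes in particular, efficient environment perception and accurate object recognition are essential to safe driving. Cross-dimensional visual perception that combines two-dimensional images with camera calibration parameters has therefore become an important research direction in visual perception. Following a progressively deepening strategy, from two-dimensional to three-dimensional and from monocular to multi-view, this dissertation studies cross-dimensional visual perception methods for complex scenes. The main contributions are as follows: 1. A deployable two-dimensional object detection method based on scale learning is proposed. One of the greatest challenges in general two-dimensional object detection is scale variation: in practice, objects come in many categories and sizes, and objects of the same category may appear at different scales. Existing methods remain limited in learning object scales, in training efficiency, and in inference speed, and they struggle to meet hardware deployment requirements. To address this challenge, the method aims to preserve the detection network's ability to perceive objects at different scales while remaining deployable on hardware. By analyzing the receptive field distribution in two-dimensional object detection frameworks for general visual scenes, the method designs an automatically searched global multi-scale perception network and proposes a scale decomposition method that converts learned fractional scales into combinations of fixed integer scales. A fast deployment network is also designed, which accelerates inference and supports hardware optimization. In addition, the proposed model is optimized with an inference engine to achieve faster inference. Experimental results show that the method achieves some performance improvement over existing methods on object detection and is better suited to hardware deployment. 2. A monocular three-dimensional object detection method based on normalizing-flow-enhanced depth is proposed. In monocular vision systems for autonomous driving, the lack of direct depth information makes accurately predicting an object's position, shape, and orientation in three-dimensional space highly challenging. The accuracy of depth information is critical to three-dimensional object detection performance. To overcome this challenge, the method uses normalizing flows to introduce the prior depth distribution of the ground-truth labels and fuses it into the distribution of the raw depth map produced by a monocular depth estimation model, yielding a more accurate depth input. The method also builds a hierarchical Transformer network with a cross-attention mechanism to fuse image features and depth map features effectively. In addition, a depth-guided relative position encoding is proposed to further promote the fusion of image and depth map features, strengthening the model's perception of three-dimensional scenes. Experimental results on multiple public datasets demonstrate the effectiveness of the method for monocular three-dimensional object detection. 3. A multi-view three-dimensional semantic segmentation method based on interaction between bird's-eye-view and image features is proposed. In autonomous driving scenes, information from a single viewpoint is often insufficient to support driving decisions and path planning, whereas a multi-view vision system provides more comprehensive environmental information, which is essential to safe driving. With the release of several large-scale multi-view autonomous driving datasets, research on surround-view multi-camera perception has attracted growing attention. However, multiple monocular images are still inherently two-dimensional signals that lack depth information, and the multiplied computational cost of multi-view images poses a further challenge. To overcome these challenges, this work designs a Transformer framework with bidirectional early interaction, which uses a bidirectional cross-attention mechanism to implicitly constrain image feature extraction while promoting alignment between the image feature space and the bird's-eye-view feature space, integrating multi-view image features into a unified bird's-eye-view feature representation. In addition, the method enlarges the input image resolution and downsamples the multi-scale image features before feature interaction, keeping the model's parameters and computation under control while achieving better semantic segmentation performance. Experimental results show that the method improves semantic segmentation performance while maintaining real-time inference, enabling cross-view, cross-dimensional scene perception.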
Abstract (English)	Propelled by the rapid development of artificial intelligence, visual scene perception has demonstrated immense application potential in areas such as autonomous driving, intelligent surveillance, and robot navigation. Visual perception methods for complex scenes aim to capture and process visual information precisely, thereby supporting object recognition, scene comprehension, and behavior guidance. With the swift progress of deep learning, the field of visual scene perception has advanced significantly, with continual gains in perception performance. However, in practical application scenarios, the diversity of object scales and the complexity of scenes present new challenges to traditional two-dimensional visual perception methods. Particularly in intricate autonomous driving scenarios, efficient environmental perception and accurate object recognition are crucial for ensuring the safe navigation of vehicles. Consequently, cross-dimensional visual perception, which integrates two-dimensional images and camera calibration parameters, has emerged as an important research direction in the field of visual perception. This dissertation adopts a progressively deepening strategy, from two-dimensional to three-dimensional and from monocular to multi-view, to explore cross-dimensional visual perception methods for complex scenes. The main contributions of this dissertation include: 1. A deployable two-dimensional object detection method based on scale learning is proposed. Scale variation, in which objects of the same class appear at different scales and objects of different classes differ in size, poses a major challenge for general two-dimensional object detection. Existing methods have limitations in learning object scales, training efficiency, and inference speed, and they struggle to meet the requirements for hardware deployment. This dissertation aims to ensure the detection network's ability to perceive objects of different scales while achieving hardware deployability. By analyzing the distribution of receptive fields in two-dimensional object detection frameworks for general visual scenes, an automatically searched global multi-scale perception network is designed, along with a scale decomposition method that converts learned fractional scales into combinations of fixed integer scales. A fast deployment network is also designed to accelerate inference and support hardware optimization. The proposed model is further optimized using an inference engine for faster inference. Experimental results demonstrate performance improvements over existing methods in object detection tasks and better suitability for hardware deployment. 2. A monocular three-dimensional object detection method based on normalizing-flow-enhanced depth is proposed. In monocular vision systems for autonomous driving, accurately predicting the position, shape, and orientation of objects in three-dimensional space is challenging due to the lack of direct depth information. The accuracy of depth information is crucial for three-dimensional object detection performance. To address this challenge, this dissertation uses normalizing flows to introduce the depth prior distribution of the ground truth and integrates it into the original depth map distribution generated by a monocular depth estimation model, obtaining a more accurate depth input. A hierarchical Transformer network with a cross-attention mechanism is constructed to fuse image features and depth map features effectively. Additionally, a depth-guided relative position encoding is proposed to further promote the fusion of image features and depth map features, thereby enhancing the model's perception of three-dimensional scenes. Experimental results on multiple public datasets demonstrate the effectiveness of this method in monocular three-dimensional object detection. 3. A multi-view three-dimensional semantic segmentation method based on the interaction between bird's-eye-view and image features is proposed. In autonomous driving, a single viewpoint often fails to provide sufficient information for driving decisions and path planning, whereas a multi-view vision system can offer a more comprehensive view of the environment, which is crucial for safe vehicle operation. Research on surround-view multi-camera perception has gained attention with the release of large-scale multi-view autonomous driving datasets. However, multiple monocular images are still inherently two-dimensional and lack depth information, and the multiplied computational load of multi-view input poses a further challenge. To address these challenges, this dissertation designs a bidirectional early-interaction Transformer framework whose bidirectional cross-attention mechanism implicitly constrains image feature extraction while promoting alignment between the image and bird's-eye-view feature spaces, so that the multi-view image features are integrated into a unified bird's-eye-view representation. In addition, the input image resolution is increased, and multi-scale image features are downsampled before feature interaction to keep the model's parameters and computational load under control while improving semantic segmentation performance. Experimental results demonstrate that the method improves semantic segmentation performance while maintaining real-time inference, enabling cross-view and cross-dimensional scene perception.
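Keywords	Visual scene perception; two-dimensional object detection; monocular three-dimensional object detection; bird's-eye-view semantic segmentation; vision Transformer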
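Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/57595
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	Pan Cong (潘聪). Research on Cross-Dimensional Visual Perception Methods for Complex Scenes (面向复杂场景的跨维度视觉感知方法研究)[D], 2024.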

Files in this item
File name/size	Document type	Version type	Access type	License
潘聪博士毕业论文_2024_最终打印提交（28980KB）	Thesis		Restricted access	CC BY-NC-SA