Research on 3D Object Detection Methods Based on Image and Point Cloud Fusion (基于图像和点云融合的3D目标检测方法研究)
Author: 张永昌 (Zhang Yongchang)
Date: 2023-05-22
Pages: 104
Degree Type: Master's
Abstract

In fields such as autonomous driving, robot vision, and industrial manufacturing, accurate and robust 3D object detection is key to safe, efficient, and precise control. 3D object detection methods based on either images or point clouds are now widely used, but methods that rely on a single data source still have limitations: RGB images lack spatial depth information, while point clouds are sparse and lack rich texture. To address these issues, this thesis builds on a thorough study of existing 3D object detection algorithms and, by combining the characteristics of image and point cloud data, proposes several new 3D object detection algorithms; experiments fully demonstrate the feasibility of the designed network structures. The main research content and contributions of this thesis are as follows:

(1) For visual 3D object detection, a network architecture based on depth probability distributions is proposed. An embedded network branch predicts a per-pixel depth probability density, improving the description of an object's 3D information. The branch presets a depth-value range for each pixel and outputs a probability distribution over it, from which reliable 3D features are generated; embedding the branch in the detector keeps the network efficient. During training, point cloud data are used to generate depth-distribution labels that supervise the branch and improve the accuracy of its output. Finally, the 3D features are transformed into bird's-eye-view (BEV) space for detection, further improving efficiency. Experimental results demonstrate the effectiveness of the designed network; a minimal sketch of the feature-lifting step is given below.
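
The lifting step at the heart of this design can be sketched as follows. This is a minimal illustration, assuming a categorical distribution over preset depth bins and PyTorch-style tensors; the class name and shapes are hypothetical, not the thesis implementation.

```python
import torch
import torch.nn as nn

class DepthDistributionLift(nn.Module):
    """Lift 2D image features into a 3D frustum using a predicted
    per-pixel depth probability distribution (illustrative sketch)."""

    def __init__(self, in_channels: int, num_depth_bins: int):
        super().__init__()
        # Embedded branch: a categorical distribution over preset depth bins.
        self.depth_head = nn.Conv2d(in_channels, num_depth_bins, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features
        depth_prob = self.depth_head(feat).softmax(dim=1)      # (B, D, H, W)
        # Outer product: each pixel's features are spread along the depth
        # axis, weighted by the probability of each depth bin.
        frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(2)  # (B, C, D, H, W)
        return frustum  # subsequently warped into bird's-eye view

# During training, LiDAR points projected into the image would give the
# ground-truth depth bin per pixel, supervising depth_prob with, e.g.,
# a cross-entropy loss.
```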

(2) For point cloud 3D object detection, an algorithm based on multi-level feature fusion is proposed, which tackles point cloud sparsity through multi-view and multi-scale feature fusion. Across views, perspective-view features provide rich scene semantics while bird's-eye-view features keep object sizes consistent; fusing the two enriches the point cloud features. Across scales, voxel-level features supply fine object detail while object-level features supply global context, enriching each object's center feature and improving the localization and regression accuracy of the predicted boxes. Experimental results verify the effectiveness of multi-level feature fusion for point cloud 3D object detection; both fusion levels are sketched below.
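
A minimal sketch of the two fusion levels, assuming features have already been extracted for each view and scale; the function names, shapes, and nearest-neighbour indexing are illustrative assumptions.

```python
import torch

def fuse_multi_view(pv_feat, bev_feat, pv_idx, bev_idx):
    """Concatenate perspective-view (PV) and bird's-eye-view (BEV) features
    for N points. pv_feat: (C1, H, W); bev_feat: (C2, X, Y);
    pv_idx / bev_idx: (N, 2) integer coordinates of each point in each view.
    Real code would typically use bilinear sampling and batching."""
    f_pv = pv_feat[:, pv_idx[:, 0], pv_idx[:, 1]]      # (C1, N) semantics-rich
    f_bev = bev_feat[:, bev_idx[:, 0], bev_idx[:, 1]]  # (C2, N) size-consistent
    return torch.cat([f_pv, f_bev], dim=0)             # (C1 + C2, N)

def fuse_multi_scale(voxel_feat, object_feat):
    """Enrich an object's center feature: fine voxel-level detail plus the
    global object-level context pooled over the whole proposal.
    voxel_feat: (N, C3); object_feat: (N, C4) -> (N, C3 + C4)."""
    return torch.cat([voxel_feat, object_feat], dim=-1)
```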

(3) For 3D object detection that fuses images and point clouds, a region-attention fusion algorithm is proposed, which uses a cross-attention mechanism to achieve deep fusion of multi-modal features. Both modalities are projected into bird's-eye view for fusion, resolving the dimensional mismatch between images and point clouds. During fusion, cross-attention adaptively selects the image features of interest for dynamic fusion, achieving precise fusion even when the two feature maps are not strictly aligned spatially. Experimental results verify the effectiveness of region-attention fusion for 3D object detection; the fusion step is sketched below.
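
The fusion step could look roughly like the sketch below, with LiDAR BEV cells as queries attending over image features; the module name, shapes, and the residual connection are assumptions for illustration, not the thesis code.

```python
import torch
import torch.nn as nn

class RegionAttentionFusion(nn.Module):
    """Cross-attention fusion in a shared bird's-eye-view space
    (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lidar_bev, img_feat):
        # lidar_bev: (B, N_bev, C) flattened BEV cells -> queries
        # img_feat:  (B, N_img, C) flattened image features -> keys/values
        # Attention weights let each BEV cell pick the image features of
        # interest, so fusion tolerates imperfect spatial alignment.
        fused, _ = self.attn(query=lidar_bev, key=img_feat, value=img_feat)
        return lidar_bev + fused  # residual keeps the LiDAR signal intact
```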

(4) To localize and predict moving objects, a 3D object tracking method based on spatio-temporal data association is proposed. It combines a temporal motion model with 3D object detection to detect an object's spatial position in real time and to infer its position and state at future moments. Confidence matching and Kalman fusion then match and associate the motion-predicted positions with the detected positions, yielding stable and efficient tracking. Experimental results verify the effectiveness of spatio-temporal data association; the predict-match-update loop is sketched below.
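
The loop can be sketched with a constant-velocity Kalman filter and Hungarian matching on predicted positions; the thesis's confidence matching is replaced here by a plain distance cost, and all names and matrices are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class Track:
    """Constant-velocity Kalman track over state [px, py, vx, vy]."""

    def __init__(self, x, P, F, H, Q, R):
        self.x, self.P = x, P                       # state and covariance
        self.F, self.H, self.Q, self.R = F, H, Q, R

    def predict(self):
        self.x = self.F @ self.x                    # propagate motion model
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x                      # predicted position

    def update(self, z):
        y = z - self.H @ self.x                     # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P

def associate(tracks, detections):
    """Match detected positions (D, 2) to motion-predicted positions and
    fuse them through the Kalman update."""
    preds = np.stack([t.predict() for t in tracks])                    # (T, 2)
    cost = np.linalg.norm(preds[:, None] - detections[None], axis=-1)  # (T, D)
    for r, c in zip(*linear_sum_assignment(cost)):
        tracks[r].update(detections[c])
```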


Keywords: 3D object detection; image and point cloud; attention fusion; spatio-temporal data association; object tracking
Language: Chinese
Sub-direction Classification (of the seven major directions): 3D Vision
State Key Laboratory Planning Direction Classification: Other
Associated Dataset Requiring Deposit:
Document Type: Thesis/Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/51697
Collection: 毕业生_硕士学位论文 (Graduates / Master's Theses)
Recommended Citation (GB/T 7714):
张永昌. 基于图像和点云融合的3D目标检测方法研究[D], 2023.
Files in This Item:
File Name/Size: 毕业论文_张永昌.pdf (6879 KB)
Document Type: Thesis/Dissertation
Open Access: Restricted
License: CC BY-NC-SA