Research on 3D Object Detection Methods Based on Multimodal Fusion
陶满礼
2024-05
Pages: 111
Subtype: Doctoral
Abstract

  With the development of internet technology and the growing adoption of sensors such as cameras and LiDAR, 2D image and 3D point cloud data have increased rapidly, strongly accelerating the arrival of the multimodal big data era. Understanding three-dimensional environments is closely bound up with everyday life, so exploiting such multimodal data to improve 3D environmental perception is of critical importance. As one of the fundamental tasks in 3D environment perception and computer vision, 3D object detection has a wide range of applications, including autonomous driving, robot navigation, security surveillance, and industrial automation. Research on 3D object detection therefore carries significant academic value and practical relevance.

  In recent years, 3D object detection has developed rapidly. According to the data they process, 3D object detection algorithms fall into two main categories: 1) single-modal 3D object detection and 2) multimodal-fusion 3D object detection. Overall, the field has moved from early breakthroughs in single-modal algorithms to the present dominance of multimodal approaches. Among single-modal methods, image-based 3D detectors suffer from inaccurate depth estimation, and their spatial-scale regression falls far short of point-cloud-based detectors. Point-cloud-based detectors, in turn, classify 3D objects poorly because of the inherent sparsity and disorder of 3D data; for difficult targets with occlusions or incomplete reflected points, their accuracy in both classification and 3D bounding-box regression is low. Meanwhile, current multimodal-fusion detectors are constrained by spatially aligned fusion mechanisms: they are highly sensitive to fluctuations in calibration parameters and adapt poorly to real-world deployment. Focusing on the 3D object detection task, this thesis proposes effective solutions to these problems and substantially improves detector performance.

  Multimodal and single-modal methods are not isolated, parallel tracks: the 3D point cloud network, as the main branch of multimodal methods, has a decisive influence on the overall fusion result. This thesis therefore follows a research path from subtask to overall goal (classification to detection), from single modality to multiple modalities, and from shallow to deep. Its main achievements and contributions are summarized as follows:

1. To address the weak object classification caused by point cloud sparsity, this thesis introduces the concept of Global Patch Point Clouds and designs GPCAN, an object classification network built on a cross-attention mechanism. Existing point cloud classification networks focus mainly on extracting local geometric features, model global relationships poorly, and struggle to distinguish objects with similar spatial structure. The proposed model, based on global patches and attention, improves performance on point cloud classification and part segmentation. It solves the object classification subproblem of 3D object detection and lays the groundwork for the subsequent point-cloud-based detection research.
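
To make the idea concrete, below is a minimal, hypothetical sketch of cross-attention between global patch tokens and a pooled whole-cloud token; the class name PatchCrossAttentionHead, the tensor shapes, and the plain use of PyTorch's nn.MultiheadAttention are illustrative assumptions, not the thesis's actual GPCAN implementation.

```python
# Hypothetical sketch: cross-attention between global patch tokens and a
# pooled whole-cloud token for shape classification (not the thesis code).
import torch
import torch.nn as nn

class PatchCrossAttentionHead(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens, cloud_token):
        # patch_tokens: (B, P, C), one feature per global patch
        # cloud_token:  (B, 1, C), pooled feature of the full point cloud
        fused, _ = self.attn(query=cloud_token, key=patch_tokens,
                             value=patch_tokens)          # (B, 1, C)
        return self.cls(fused.squeeze(1))                 # (B, num_classes)

# Usage with random features standing in for a point cloud backbone's output:
head = PatchCrossAttentionHead()
logits = head(torch.randn(2, 16, 256), torch.randn(2, 1, 256))
```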

2. In 3D scenes, geometric points are often missing because of spatial occlusion or uneven reflectance across different surface materials, and point-cloud-based 3D detectors perform poorly on such difficult targets. To address this, this thesis proposes Objformer, a 3D object detection model based on point cloud feature enhancement. A geometric interaction module and a semantic interaction module propagate geometric and semantic information globally across the instances in a scene, improving the size regression and classification performance of point-cloud-based detection. The method strengthens single-modal point cloud models on difficult 3D targets and provides an important reference point for the multimodal fusion research that follows.
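
A rough sketch of the interaction idea follows, under the assumption that each candidate instance is encoded as one geometric token and one semantic token; the module name and the plain self-attention formulation are illustrative, not Objformer's exact design.

```python
# Hypothetical sketch: parallel self-attention over per-instance geometric and
# semantic tokens, so incomplete objects can borrow cues from complete ones.
import torch
import torch.nn as nn

class InstanceInteraction(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.geo_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_norm = nn.LayerNorm(dim)
        self.sem_norm = nn.LayerNorm(dim)

    def forward(self, geo_tokens, sem_tokens):
        # geo_tokens / sem_tokens: (B, N, C), one token per candidate instance
        geo, _ = self.geo_attn(geo_tokens, geo_tokens, geo_tokens)
        sem, _ = self.sem_attn(sem_tokens, sem_tokens, sem_tokens)
        # Residual updates: each instance aggregates scene-wide geometry for
        # box regression and scene-wide semantics for classification.
        return self.geo_norm(geo_tokens + geo), self.sem_norm(sem_tokens + sem)
```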

3. Current multimodal fusion methods mostly take a 3D point cloud network as the main branch, and the recall of this 3D branch decisively affects the subsequent fusion stage. Constrained by the expressiveness of point cloud features, the 3D branch easily misses difficult 3D targets, and existing fusion methods cannot recover those targets during fusion. To address this, this thesis proposes ImFusion, a two-stage multimodal object fusion and complementation method. Unlike existing fusion schemes, it starts from the 2D and 3D candidates produced by the different input modalities and generates pseudo-3D objects from the 2D image detections. Fusing these pseudo-3D objects with the original 3D candidates from the point cloud network at the instance level markedly improves first-stage recall of difficult targets. By fusing image data the method resolves the low recall of difficult 3D samples, and by exploiting the semantic strength of images it also improves their classification.
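
The sketch below illustrates one plausible way to lift a 2D detection into a pseudo-3D proposal, assuming the LiDAR points have already been projected into the image; the function name, the median heuristic, and the size prior are assumptions for illustration rather than ImFusion's exact procedure.

```python
# Hypothetical sketch: lift a 2D detection to a pseudo-3D proposal by
# collecting the LiDAR points that project inside the 2D box and placing a
# class-prior-sized box at their median position.
import numpy as np

def pseudo_3d_from_2d(box2d, pts_uv, pts_xyz, size_prior=(3.9, 1.6, 1.56)):
    """box2d: (x1, y1, x2, y2); pts_uv: (N, 2) image projections of the
    (N, 3) LiDAR points pts_xyz. Returns (x, y, z, l, w, h) or None."""
    x1, y1, x2, y2 = box2d
    inside = ((pts_uv[:, 0] >= x1) & (pts_uv[:, 0] <= x2) &
              (pts_uv[:, 1] >= y1) & (pts_uv[:, 1] <= y2))
    if not inside.any():
        return None
    center = np.median(pts_xyz[inside], axis=0)  # robust to stray background points
    return np.concatenate([center, np.asarray(size_prior)])
```

In such a scheme, the returned pseudo-3D boxes would simply be appended to the point cloud branch's first-stage proposal set before second-stage refinement, which is what restores the otherwise-missed difficult targets.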

4. The fusion mechanisms of existing multimodal 3D detectors mostly rely on high-precision calibration to enforce spatial consistency and then align features through spatial correspondence. In practice, calibration drift between sensors, caused by vibration and similar factors, degrades or even breaks these methods. This thesis therefore proposes an alignment-agnostic feature fusion method that spatially decouples the modalities and fully exploits each modality's strengths through cross-instance, cross-modal information interaction. The method removes the dependence on calibration parameters, achieves state-of-the-art 3D detection performance on multiple datasets, and has substantial practical value.
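
As a sketch of how alignment-free fusion can work, the module below lets every 3D candidate attend to all 2D instance features, so the network learns the cross-modal association itself rather than relying on a point-to-pixel projection; the class name and dimensions are assumptions, not the thesis's concrete architecture.

```python
# Hypothetical sketch: calibration-free fusion via cross-attention between
# 3D instance tokens and 2D instance tokens (no spatial correspondence used).
import torch
import torch.nn as nn

class AlignmentFreeFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens3d, tokens2d):
        # tokens3d: (B, N, C) 3D instance features from the point cloud branch
        # tokens2d: (B, M, C) 2D instance features from the image branch
        fused, _ = self.cross(query=tokens3d, key=tokens2d, value=tokens2d)
        return self.norm(tokens3d + fused)   # enhanced 3D instance features
```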

Keywords: Deep Learning; Point Cloud; Multimodal Fusion; Feature Alignment; 3D Object Detection
Subject Area: Pattern Recognition
MOST Discipline Catalogue: Engineering :: Control Science and Engineering
Language: Chinese
IS Representative Paper
Sub-direction Classification: 3D Vision
Planning Direction of the National Key Laboratory: Multidimensional Environmental Perception
Paper Associated Data
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/57233
Collection: 毕业生_博士学位论文 (Graduates: Doctoral Dissertations)
紫东太初大模型研究中心
Recommended Citation
GB/T 7714
陶满礼. 基于多模态融合的3D目标检测方法研究[D], 2024.
Files in This Item:
File: 毕业论文提交IR.pdf (21006 KB)
DocType: Dissertation (学位论文)
Access: Restricted (限制开放)
License: CC BY-NC-SA