Active Learning-Based 3D Semantic Segmentation of Large-Scale Complex Scenes (基于主动学习的大规模复杂场景三维语义分割)
Author | 荣梦琪 (Rong Mengqi)
Date | 2023-08-21
Pages | 144
Degree type | Doctoral
Abstract (Chinese) | In the fields of computer vision, photogrammetry, and unmanned systems, the rapid development of image-based 3D reconstruction and the widespread adoption of laser sensing devices have made 3D scene data increasingly easy to acquire, shifting computer perception of the surrounding environment from 2D toward 3D. 1. Proposed an active-learning 3D semantic segmentation method based on original images. 2. Proposed an active-learning 3D semantic segmentation method based on rendered images. 3. Proposed an active-learning 3D semantic segmentation method based on orthographic images. 4. Efficient 3D semantic segmentation practice in city-scale large scenes.
Abstract (English) | In the fields of computer vision, photogrammetry, and unmanned systems, with the rapid development of image-based 3D reconstruction techniques and the popularization of laser sensing devices, the acquisition of 3D scene data has become increasingly convenient, enabling computers to perceive their surrounding environment in 3D, moving beyond traditional 2D. In this context, 3D semantic segmentation has emerged as a fundamental and important task, aiming to accurately classify objects in 3D space into different semantic categories. While deep learning has made significant progress in image semantic segmentation in recent years, many challenges remain for complex large-scale 3D scenes. Firstly, annotating 3D data is labor-intensive and costly, resulting in a scarcity of large-scale 3D semantic segmentation datasets available for supervised training. Secondly, large-scale 3D scenes typically contain numerous, widely distributed object categories, making it difficult to develop a general 3D segmentation model that adapts to various types of scenes. Moreover, when fine-tuning pre-trained models for specific 3D scenes, the selection strategy for fine-tuning samples is often more complex than in 2D tasks. To address these issues, this dissertation adopts two key ideas. The first is to construct the correspondence between 3D models and multi-view 2D images through original images, rendered images, and orthographic images. By employing a two-step strategy of first performing 2D segmentation and then integrating the results into 3D, the dissertation achieves 3D segmentation of large-scale complex scenes.
Secondly, the dissertation introduces the idea of active learning, automatically selecting challenging 2D samples based on metrics such as 3D segmentation uncertainty and feature diversity, thereby achieving the adaptation of segmentation models across different domains and scenes, even with limited annotated data. Specifically, the main contributions and innovations of this dissertation are summarized as follows:
1. Proposed an active learning-based 3D semantic segmentation method using original images. Considering the strict correspondence between point clouds and pixels in large-scale 3D scenes based on image reconstruction, this dissertation first performs semantic segmentation on the images and then projects the segmentation results onto the 3D models for fusion. Meanwhile, during the fusion process, neighborhood semantic consistency constraints are applied to improve the robustness of global 3D fusion. Subsequently, the fused 3D segmentation results are measured for observation uncertainty and observation disparity, followed by the application of an active learning strategy to automatically select a limited number of challenging image samples for annotation, thereby fine-tuning the image semantic segmentation network. The experimental results on three outdoor large-scale 3D scenes acquired through different acquisition methods demonstrate that this method achieves accurate 3D semantic segmentation of large-scale 3D scenes with minimal image annotation requirements.
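The project-then-fuse step described in this contribution can be sketched as a per-point voting scheme, with image-level uncertainty driving the active selection. The sketch below is an illustrative simplification: the function names, the entropy-based uncertainty, and the mean-score image ranking are assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def fuse_votes(point_ids, labels, n_points, n_classes):
    """Accumulate per-view 2D labels into per-point class histograms.

    point_ids: 1-D array, the 3D point index each labeled pixel projects to
    labels:    1-D array of the same length, predicted class per pixel
    """
    votes = np.zeros((n_points, n_classes), dtype=np.int64)
    np.add.at(votes, (point_ids, labels), 1)  # unbuffered scatter-add of votes
    return votes

def point_uncertainty(votes):
    """Normalized entropy of each point's vote histogram (0 = unanimous)."""
    p = votes / np.clip(votes.sum(axis=1, keepdims=True), 1, None)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(p > 0, -p * np.log(p), 0.0).sum(axis=1)
    return ent / np.log(votes.shape[1])

def rank_images(image_point_ids, uncertainty):
    """Score each image by the mean uncertainty of the 3D points it observes;
    the most disputed images are proposed for annotation first."""
    scores = [uncertainty[ids].mean() for ids in image_point_ids]
    return np.argsort(scores)[::-1]
```

A point observed with conflicting labels across views gets high entropy, so the images covering it rise to the top of the annotation queue; a neighborhood-consistency constraint, as in the dissertation, would additionally smooth the vote histograms before ranking.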
2. Proposed an active learning-based 3D semantic segmentation method using rendered images. In 3D models based on image reconstruction, inconsistencies between the semantic labels of the 3D model and the original images may arise due to factors such as lighting variations, dynamic object interference, and inaccurate camera pose estimation. These inconsistencies cause errors during global 3D fusion in the aforementioned method. To address this issue, this dissertation proposes a method based on rendered images, which can select appropriate rendering techniques based on scene characteristics and generate virtual viewpoint images from any location. Additionally, to tackle the common problem of imbalanced small-class samples in 3D semantic segmentation, this dissertation further proposes two strategies, namely region complexity and category diversity, in addition to segmentation uncertainty measurement. These strategies enhance the data selection capability of the active learning process, achieving a more balanced selection of samples. The experimental results demonstrate that this method achieves outstanding segmentation performance in large-scale aerial urban scenes and complex indoor scenes, particularly improving the segmentation accuracy of small-class objects and exhibiting the ability to discover unknown classes.
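The three selection cues named in this contribution (segmentation uncertainty, region complexity, category diversity) can be combined into one greedy view-selection loop. The sketch below is a hedged illustration: the weighting scheme, the edge-density proxy for region complexity, and the inverse-frequency diversity bonus are assumptions standing in for the dissertation's actual measures.

```python
import numpy as np

def select_rendered_views(probs, n_select, w=(1.0, 1.0, 1.0)):
    """Greedily pick rendered views for annotation from softmax maps.

    probs: (V, H, W, C) per-view class-probability maps. Three cues, mixed
    with weights `w` (all illustrative, not the thesis' exact formulation):
      - uncertainty: mean normalized per-pixel entropy,
      - region complexity: density of label changes between neighbor pixels,
      - category diversity: bonus for classes rarely seen in chosen views.
    """
    V, H, W, C = probs.shape
    labels = probs.argmax(axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(probs > 0, -probs * np.log(probs), 0.0).sum(-1)
    unc = ent.reshape(V, -1).mean(1) / np.log(C)
    # Fraction of horizontally/vertically adjacent pixel pairs that disagree.
    edges = (np.diff(labels, axis=1) != 0).reshape(V, -1).mean(1) \
          + (np.diff(labels, axis=2) != 0).reshape(V, -1).mean(1)
    seen = np.zeros(C)  # per-class pixel counts over already-chosen views
    chosen = []
    for _ in range(n_select):
        div = np.array([np.mean(1.0 / (1.0 + seen[labels[v]])) for v in range(V)])
        score = w[0] * unc + w[1] * edges + w[2] * div
        score[chosen] = -np.inf  # never re-select a view
        v = int(score.argmax())
        chosen.append(v)
        seen += np.bincount(labels[v].ravel(), minlength=C)
    return chosen
```

Updating `seen` inside the loop is what makes the selection class-balanced: once a frequent class dominates the chosen set, views showing rarer classes score higher, which matches the small-class motivation stated above.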
3. Proposed an active learning-based 3D semantic segmentation method using orthographic images. Although multi-view images can capture 3D scene information comprehensively, redundant images can impose a significant computational burden during the semantic fusion and active learning stages. To address this issue, this dissertation proposes a method based on orthographic images, which can effectively represent the global scene with fewer image data. Additionally, high-resolution images do not require precise annotation for all pixels during the active learning process. Therefore, this dissertation introduces an adaptive connected region computation method, which selects irregular pixel regions with lower segmentation quality for annotation, further reducing the scale of annotated data. The experimental results demonstrate that this method significantly improves the efficiency of large-scale 3D scene semantic segmentation and outperforms the method based on multi-view images in terms of accuracy.
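The adaptive connected-region idea in this contribution, where only irregular low-quality pixel regions of the orthographic map are sent for annotation, can be sketched as a flood fill over a low-confidence mask. The threshold, the 4-connectivity, and the minimum region size below are illustrative knobs, not the dissertation's values.

```python
import numpy as np
from collections import deque

def low_confidence_regions(conf, thresh=0.5, min_size=2):
    """Group low-confidence pixels of an orthographic segmentation map into
    4-connected regions; only these regions would be proposed for annotation.

    conf: (H, W) per-pixel confidence of the current segmentation.
    Returns a list of regions, each a list of (y, x) pixel coordinates.
    """
    H, W = conf.shape
    low = conf < thresh
    seen = np.zeros_like(low, dtype=bool)
    regions = []
    for y in range(H):
        for x in range(W):
            if low[y, x] and not seen[y, x]:
                comp, queue = [], deque([(y, x)])
                seen[y, x] = True
                while queue:  # BFS flood fill over the low-confidence mask
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < H and 0 <= nx < W and low[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(comp) >= min_size:  # drop isolated noisy pixels
                    regions.append(comp)
    return regions
```

Because annotation cost then scales with the area of the uncertain regions rather than with the full high-resolution image, this directly supports the data-reduction claim made above.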
4. Efficient 3D semantic segmentation practice in urban-scale large scenes. Applying the proposed methods and key technologies to real-world production, and thereby addressing practical problems in actual scenes, is of significant importance and value. This dissertation presents a real case study in Zhengzhou, Henan, to validate the feasibility and practicality of the proposed methods on large-scale urban real-world 3D models. The experimental results demonstrate fast and accurate semantic segmentation of large-scale real-world scenes, even with a small number of annotated images. Furthermore, the building information obtained through semantic segmentation can effectively support subsequent tasks such as vectorized modeling, providing valuable support for the construction of 3D digital models and the development of geographic information systems.
Keywords | Large-scale; Complex 3D scenes; 3D semantic segmentation; Active learning
Language | Chinese
Sub-direction classification (of the seven major directions) | 3D Vision
State Key Laboratory planning direction | Other
Associated dataset to be deposited | No
Document type | Doctoral dissertation
Identifier | http://ir.ia.ac.cn/handle/173211/52390
Collection | Doctoral dissertations of graduates
Recommended citation (GB/T 7714) | 荣梦琪. 基于主动学习的大规模复杂场景三维语义分割[D], 2023.
Files in this item |
File name / size | Document type | Version | Access | License
毕业论文-答辩后修改-签名版.pdf (22974 KB) | Doctoral dissertation | | Restricted access | CC BY-NC-SA