Research on Image Semantic Segmentation Based on Visual Information Representation and Fusion
许镕涛 (Xu Rongtao)
2024-05
Pages: 110
Degree type: Doctoral
Chinese Abstract

As a fundamental task in computer vision, image semantic segmentation aims to assign a precise semantic label to every pixel in an image, and it underpins applications such as environment perception for autonomous driving, medical image analysis, robot perception and interaction, and high-resolution remote sensing image analysis. Although deep learning has substantially improved segmentation accuracy, high-precision image semantic segmentation still faces many challenges, such as handling the complex foreground-background relationships in high-resolution images, the insufficient treatment of foreground-irrelevant regions by class activation maps in weakly supervised segmentation, and the highly variable object shapes and highly similar visual features in low-contrast image segmentation. This thesis conducts an in-depth study of image semantic segmentation, focusing on visual-representation fusion algorithms to address these core challenges. To that end, it proposes a series of image semantic segmentation algorithms based on visual-representation fusion, specifically:

A high-resolution image segmentation method based on saliency representation fusion.

To address the large scale variations, complex background samples, and imbalanced foreground-background distributions in high-resolution remote sensing image segmentation, this thesis proposes RSSFormer, a segmentation framework for high-resolution images that fuses saliency representations. The framework combines visual saliency representations with other visual information to improve the semantic segmentation of high-resolution images. Its key components are an adaptive Transformer fusion module, a detail-aware attention layer, and a foreground saliency-guided loss function. The adaptive Transformer fusion module combines multi-scale features and adapts dynamically to object saliency, effectively suppressing background interference and enhancing object saliency; the detail-aware attention layer exploits the interplay of spatial and channel attention to precisely capture fine foreground-related details, further strengthening the saliency representation; and the foreground saliency-guided loss function focuses optimization on hard samples with low saliency responses, promoting more balanced training. Experiments show that RSSFormer performs strongly on the LoveDA, Vaihingen, Potsdam, and iSAID datasets, outperforming existing general-purpose semantic segmentation methods and achieving breakthrough results in high-resolution remote sensing image segmentation.
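The abstract does not specify the form of the foreground saliency-guided loss, so the following is only a hypothetical NumPy sketch of the stated idea: weighting each pixel's cross-entropy by how low its foreground saliency response is, so that hard, low-saliency samples receive more of the optimization effort. The function name, the `(1 - saliency) ** gamma` weighting, and all shapes are assumptions, not the thesis's actual definition.

```python
import numpy as np

def saliency_guided_loss(probs, labels, saliency, gamma=2.0):
    """Per-pixel cross-entropy re-weighted by (1 - saliency)**gamma.

    probs:    (H, W, C) softmax probabilities
    labels:   (H, W)    integer class labels
    saliency: (H, W)    foreground saliency responses in [0, 1]
    Pixels with a low saliency response (hard samples) receive a
    larger weight, pushing training toward them.
    """
    h, w = labels.shape
    # Gather the probability of the true class at every pixel.
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    ce = -np.log(np.clip(p_true, 1e-8, None))   # per-pixel cross-entropy
    weight = (1.0 - saliency) ** gamma          # emphasize low-saliency pixels
    return float((weight * ce).mean())

# Tiny example: a 2x2 image with 2 classes.
probs = np.array([[[0.9, 0.1], [0.6, 0.4]],
                  [[0.2, 0.8], [0.5, 0.5]]])
labels = np.array([[0, 0], [1, 1]])
saliency = np.array([[0.9, 0.2], [0.8, 0.1]])
loss = saliency_guided_loss(probs, labels, saliency)
```

With saliency fixed at zero the weight is 1 everywhere and the loss reduces to plain mean cross-entropy, which makes the weighting easy to sanity-check.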

A weakly supervised image semantic segmentation method based on foreground-background representation fusion.

The saliency-based high-resolution segmentation method above is fully supervised and requires extensive, labor-intensive pixel-level annotation, so improving the performance of weakly supervised semantic segmentation (WSSS), which depends on far less annotation, is of great practical value. To address the false positives caused by class activation maps (CAMs) inadequately handling foreground-irrelevant regions, this work proposes WaveCAM, a wave-like class activation map method that fuses foreground and background representations to improve the quality of generated CAMs. WaveCAM efficiently integrates foreground and background information to raise CAM quality and, in turn, segmentation accuracy. It introduces two key representation-modeling strategies: foreground-aware representation modeling, which strengthens the model's recognition of foreground regions, and foreground-irrelevant representation modeling, which specifically handles background regions to reduce misclassification. Both representations are encoded as wave functions with amplitude and phase, and a dynamic aggregation mechanism extracts deep semantic features from them. An adaptive fusion module then integrates the foreground and background representations to produce outputs with rich semantic information. WaveCAM plugs flexibly into both multi-stage and end-to-end WSSS frameworks; ablation studies on the PASCAL VOC 2012 dataset validate its architectural choices, and experiments on PASCAL VOC 2012 and MS COCO 2014 show that WaveCAM significantly improves WSSS, achieving state-of-the-art segmentation performance.
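The abstract describes the foreground and background representations as wave functions with amplitude and phase that are dynamically aggregated, but does not give the exact operators, so the sketch below only illustrates that general mechanism in plain NumPy. The function name, shapes, and the choice of a weighted complex superposition are assumptions.

```python
import numpy as np

def wave_aggregate(amplitude, phase, weights):
    """Aggregate token features represented as waves.

    Each token k carries an amplitude vector a_k and a phase vector
    theta_k, forming the complex wave a_k * exp(i * theta_k).
    Tokens whose phases align reinforce each other, while tokens with
    opposing phases cancel; this is what makes the aggregation
    dynamic, since a token's effective contribution depends on its
    phase relative to the others, not only on its weight.

    amplitude: (N, D) non-negative amplitudes for N tokens
    phase:     (N, D) phases in radians
    weights:   (N,)   aggregation weights (e.g. from attention)
    Returns the (D,) magnitude of the weighted complex sum.
    """
    waves = amplitude * np.exp(1j * phase)          # (N, D) complex waves
    mixed = (weights[:, None] * waves).sum(axis=0)  # weighted superposition
    return np.abs(mixed)                            # back to a real feature

# Two in-phase tokens reinforce; a third, out-of-phase token cancels one.
a = np.ones((3, 4))
theta = np.stack([np.zeros(4), np.zeros(4), np.full(4, np.pi)])
w = np.ones(3) / 3
out = wave_aggregate(a, theta, w)
```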

An image semantic segmentation method based on dual-stream representation fusion.

The two methods above focus on single-task learning, which limits the sources of visual representations and hinders the exploration of fusing powerful, robust representations. To overcome this limitation, this thesis couples the segmentation task with a super-resolution task and explores a general low-resolution image segmentation method that fuses high-resolution representations under multi-task learning.

This thesis proposes an image semantic segmentation method based on dual-stream representation fusion to address the challenges of low-contrast image segmentation: highly variable object shapes, rich detail, and highly similar visual features. It designs a Dual-Representation Fusion Learning (DRFL) paradigm that effectively exploits high-resolution deep representations and fine-grained structural representations. DRFL comprises three key modules: a Representation Fusion Transformer Module (RFTM), a Dual-Stream Fusion Module (DFM), and a Peakiness Fusion Attention Module (PFAM). The DFM combines a segmentation stream and a super-resolution stream that share one feature extractor, integrating their two outputs to refine the segmentation. The RFTM uses a Transformer to fuse high-resolution representations and applies dilated convolutions with different dilation rates to enlarge the receptive field and integrate spatial-structure information. PFAM mines and fuses key texture features to compensate for the spatial information lost during down- and upsampling, yielding more precise medical image segmentation. Experiments show that the proposed DRFL paradigm delivers significant gains in segmentation quality across medical image segmentation tasks, including lung nodule, lung, skin lesion, cell contour, and prostate segmentation.
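As a rough illustration of the shared-extractor design described for the Dual-Stream Fusion Module, the toy sketch below runs a segmentation head and a super-resolution head on the same features; the real DFM's layers, fusion step, and shapes are not specified in the abstract, so every name and dimension here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(image, w_enc):
    """Toy shared feature extractor: one linear layer + ReLU."""
    return np.maximum(image @ w_enc, 0.0)

def dual_stream(image, w_enc, w_seg, w_sr):
    """Run segmentation and super-resolution streams on the same
    features, mimicking the shared-extractor design: the SR stream
    forces the shared features to preserve high-resolution detail,
    which the segmentation stream then benefits from.

    image: (P, C)  P pixels with C channels (flattened for simplicity)
    Returns (seg_logits, sr_pixels).
    """
    feats = shared_encoder(image, w_enc)   # shared representation
    seg = feats @ w_seg                    # segmentation logits per pixel
    sr = feats @ w_sr                      # reconstructed high-res pixels
    return seg, sr

# Hypothetical shapes: 8 pixels, 3 channels, 16-d features,
# 2 segmentation classes, 3-channel SR output.
img = rng.normal(size=(8, 3))
w_enc = rng.normal(size=(3, 16))
w_seg = rng.normal(size=(16, 2))
w_sr = rng.normal(size=(16, 3))
seg, sr = dual_stream(img, w_enc, w_seg, w_sr)
```

In a trained network both heads would backpropagate into the shared extractor, so the super-resolution objective acts as an auxiliary signal that regularizes the segmentation features.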

In summary, this thesis studies the key challenges of image semantic segmentation in depth and develops several efficient segmentation algorithms based on visual-representation fusion: the RSSFormer framework, which tackles the imbalanced foreground-background distributions, large scale variations, and complex background samples of high-resolution image segmentation; the WaveCAM method, which efficiently fuses foreground and background representations to improve CAM quality in weakly supervised segmentation; and the Dual-Representation Fusion Learning paradigm, which fuses high-resolution representations to overcome the segmentation-quality problems caused by variable object shapes and highly similar visual features in medical images. Together, these visual-representation fusion and segmentation methods effectively improve the accuracy and robustness of current image semantic segmentation.

English Abstract

As a fundamental task in computer vision, image semantic segmentation aims to assign precise semantic labels to every pixel in an image, which is crucial for applications such as autonomous driving perception, medical image analysis, robot perception and interaction, and high-resolution remote sensing image analysis. Despite the significant advancements in image semantic segmentation accuracy brought by deep learning, achieving high-precision image semantic segmentation still faces numerous challenges, including handling complex foreground-background relationships in high-resolution images, the insufficient handling of foreground-irrelevant regions by class activation maps (CAMs) in weakly supervised segmentation, and the variability of target shapes and high similarity of visual features in low-contrast image segmentation. This paper conducts in-depth research on image semantic segmentation techniques, focusing on visual information representation fusion algorithms to address the core challenges in image semantic segmentation. In response to the aforementioned challenges, this paper proposes a series of image semantic segmentation algorithms based on visual information representation fusion, including:

 

A high-resolution image segmentation method based on saliency representation fusion. This paper proposes the RSSFormer framework to address challenges in high-resolution remote sensing image segmentation, such as large-scale variations, complex background samples, and imbalanced foreground-background distributions. The framework integrates visual saliency representations and other visual information to optimize semantic segmentation results. Key components of the RSSFormer framework include an adaptive Transformer fusion module, detail-aware attention layers, and a foreground saliency-guided loss function. Experimental results demonstrate the superior performance of RSSFormer on datasets including LoveDA, Vaihingen, Potsdam, and iSAID, outperforming existing generic image semantic segmentation methods and achieving breakthroughs in high-resolution remote sensing image segmentation.

 

A weakly supervised image semantic segmentation method based on foreground-background representation fusion. The aforementioned high-resolution image segmentation method based on saliency representation fusion is a fully supervised segmentation method, requiring extensive and time-consuming pixel-level annotations. Therefore, investigating how to improve the performance of weakly supervised semantic segmentation (WSSS), which relies less on annotations, is of great practical significance. This study proposes a foreground-background representation fusion method called WaveCAM to enhance the quality of CAM generation, addressing the inadequate handling of foreground-irrelevant regions in weakly supervised image semantic segmentation. WaveCAM efficiently integrates foreground and background information to improve the quality of class activation maps and further enhance semantic segmentation accuracy. WaveCAM employs two key representation modeling strategies: foreground-aware representation modeling to enhance the model's ability to recognize foreground regions, and foreground-irrelevant representation modeling to selectively handle background regions and reduce false positives. These representations are encoded as wave functions with amplitude and phase attributes, and a dynamic aggregation mechanism is utilized to extract deep semantic features. Finally, through an adaptive fusion module, WaveCAM integrates foreground-background representations to generate output results with rich semantic information. WaveCAM can be flexibly embedded into multi-stage and end-to-end weakly supervised semantic segmentation frameworks, and ablation experiments on the PASCAL VOC 2012 dataset validate the effectiveness of its architectural decisions. Experimental results on the PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate that WaveCAM significantly improves weakly supervised semantic segmentation, achieving state-of-the-art performance.

 

An image semantic segmentation method based on dual-stream representation fusion. The above two methods focus on single-task learning, which limits the sources of visual information representation and hinders the exploration of integrating powerful and robust representations. To overcome this limitation, this paper combines the segmentation task with a super-resolution task to explore a universal low-resolution image segmentation method under multi-task learning, aiming to merge high-resolution representations. This paper proposes an image semantic segmentation method based on dual-stream representation fusion to address challenges in low-contrast image segmentation such as shape variability, rich details, and high visual feature similarity. The paper designs a Dual-Representation Fusion Learning (DRFL) paradigm to effectively utilize high-resolution deep representations and fine-grained structural representations. The DRFL paradigm includes three key modules: a Representation Fusion Transformer Module (RFTM), a Dual-Stream Fusion Module (DFM), and a Peakiness Fusion Attention Module (PFAM). The DFM combines the segmentation stream and super-resolution stream under the same feature extractor, integrating the two outputs to optimize segmentation results. The RFTM uses a Transformer to fuse high-resolution representations and employs dilated convolutions with different dilation rates to expand the receptive field and integrate spatial structural information. PFAM mines and fuses critical texture features to compensate for spatial information loss during the upsampling and downsampling processes, thereby achieving more accurate medical image segmentation. Experimental results show that the proposed DRFL paradigm significantly improves segmentation quality in various medical image segmentation tasks, including lung nodule segmentation, lung segmentation, skin lesion segmentation, cell contour segmentation, and prostate segmentation.

 

In summary, this paper conducts in-depth research on the key challenges of image semantic segmentation technology, and proposes and develops several efficient image semantic segmentation algorithms based on visual information representation fusion. These include the RSSFormer framework, which addresses challenges in high-resolution image segmentation; the WaveCAM method, which improves the quality of class activation maps in weakly supervised image semantic segmentation; and the DRFL paradigm, which enhances segmentation quality in medical images by merging high-resolution representations. The proposed visual representation fusion and image semantic segmentation methods effectively improve the accuracy and robustness of current image semantic segmentation.

Keywords: image semantic segmentation, representation fusion, Transformer, weakly supervised learning
Language: Chinese
Sub-direction classification (seven major directions): Image and Video Processing and Analysis
State Key Laboratory planned research direction: Multi-dimensional Environment Perception
Associated dataset requiring deposit:
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/57541
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, 3D Visual Computing
Graduates: Doctoral Dissertations
Recommended citation:
GB/T 7714
许镕涛. 基于视觉信息表征与融合的图像语义分割研究[D], 2024.
Files in this item:
File name/size | Document type | Version | Access | License
许镕涛毕业论文最终版.pdf (19342KB) | Dissertation | Open access | CC BY-NC-SA