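Research on Image Semantic Segmentation Methods Based on Relation Modeling (基于关系建模的图像语义分割方法研究)
He Xingjian (何兴建)
2022-05-18
Pages: 122
Degree type: Doctoral
Chinese Abstract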

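With the rapid development of the mobile Internet, image data, one of the most common information carriers, is growing explosively. Processing massive image data quickly and intelligently has become an urgent problem. Image semantic segmentation parses the category of every pixel in an image and understands image content at a fine granularity; it is widely applied in image editing, video surveillance, terrain analysis, medical diagnosis, autonomous driving, and other fields. Image semantic segmentation therefore has significant research importance and practical value.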

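Image semantic segmentation requires a mapping from low-level image features to high-level semantics that yields more separable high-level semantic features. Pixels in an image and objects in a scene do not exist independently; they are related to one another, and these relationships provide rich semantic context for image parsing. In recent years, segmentation methods based on fully convolutional networks have made considerable progress and laid a solid foundation for the field. However, a fully convolutional network is limited by its local receptive field: when extracting features it can only refer to information around each pixel and cannot fully exploit the relational information in the scene, which limits the expressiveness of its semantic features. Building on deep neural networks, this dissertation studies bottom-up and top-down relation modeling methods that mine and exploit the relationships among objects or pixels in an image to obtain more discriminative semantic features and thereby improve segmentation performance.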

The main results and contributions of the dissertation are summarized as follows:

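1. Image semantic segmentation based on a pyramid relation network. Considering that objects in a scene tend to co-occur, the dissertation proposes to model the relationships between pixels in an image and to use these relationships directly as relational features that assist segmentation. A spatial relation module models pixel-to-pixel relationships in the high-level features, and pixel-to-region relationships are further introduced to address the heavy cost of computing relationships between every pair of pixels. A channel relation module additionally encodes relational features between pixels along the channel dimension. The spatial and channel relational features jointly strengthen the semantic feature representation, yielding more discriminative high-level semantic features. Experimental results show that the method effectively improves the encoding ability of the model and produces more accurate segmentation results.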

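2. Image semantic segmentation based on a sampling attention network. To address the high computational complexity of self-attention in image segmentation, the dissertation proposes a self-attention mechanism based on stochastic sampling that efficiently models the relationship between each pixel and a sparse set of representative sampling points, greatly reducing computational complexity while preserving segmentation accuracy. In addition, because different pixels depend on global context to different degrees, the dissertation designs an attention mechanism based on deterministic sampling that retrieves local detail from low-level features under the guidance of high-level features, further improving segmentation performance. Experimental results show that the method achieves efficient and accurate image semantic segmentation on commonly used datasets.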

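3. Image semantic segmentation based on a dynamic fully connected network. To address the inability of the spatial fully connected layer to model content-dependent context, the dissertation designs a dynamic fully connected layer that adaptively models global information according to image content. A skip connection decouples local and global information modeling, so that the skip-connected features focus on local context while the attention weights in the spatial fully connected layer focus on global context. In addition, a dynamic parameter mechanism models the context related to image content and adaptively captures long-range dependencies. Experimental results show that the method copes with images of complex scenes and produces more accurate segmentation results.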

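4. Image semantic segmentation based on a consistent-separable feature representation network. Current segmentation methods usually optimize each pixel independently and ignore the structural relationships inherent in the feature space. To address this, the dissertation proposes a consistent-separable feature learning module that builds a semantic center for each category and models the relationship between pixels and the category semantic centers to obtain more discriminative semantic features. A category-aware semantic consistency loss uses a top-down supervision paradigm to make features of the same category more consistent and features of different categories more separable, producing a well-structured feature space. Experimental results show that the method can be embedded easily into other segmentation networks and effectively improves their segmentation performance with almost no increase in computational complexity.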
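
The idea behind contribution 2, attending from every pixel to a small set of sampled points instead of to all pixels, can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the uniform random sampling rule, the reuse of the sampled features as both keys and values, and the toy input are all assumptions made for illustration. With N pixels and S samples, the cost drops from roughly N * N to N * S similarity computations.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sampled_attention(features, num_samples, seed=0):
    """Attend from every pixel to a random subset of pixels.

    features: list of N feature vectors (one per pixel), each of length C.
    Returns N context vectors, each a weighted average of the sampled features.
    """
    rng = random.Random(seed)
    n = len(features)
    # Assumption: uniform sampling without replacement. The abstract only
    # says the sampled points are sparse and representative.
    sampled = rng.sample(range(n), min(num_samples, n))
    keys = [features[i] for i in sampled]
    scale = math.sqrt(len(features[0]))

    outputs = []
    for query in features:
        # Scaled dot-product similarity between this pixel and each sample.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted sum of the sampled features (keys double as values here).
        outputs.append([
            sum(w * key[c] for w, key in zip(weights, keys))
            for c in range(len(query))
        ])
    return outputs


# Toy input: six "pixels" with two-channel features.
pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [2.0, 1.0]]
context = sampled_attention(pixels, num_samples=3)
```

Because each output is a convex combination of the sampled features, every output value stays within the range of the input features; a learned variant would add query, key, and value projections.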

English Abstract

With the rapid development of the mobile Internet, image data, as one of the most common information carriers, is growing explosively. How to process massive image data quickly and intelligently has become an urgent problem. Image semantic segmentation parses the category of each pixel in an image and understands the image content in a fine-grained manner, and it is widely used in fields such as image editing, video surveillance, terrain analysis, medical diagnosis, and autonomous driving. Therefore, research on image semantic segmentation is of great significance and practical value.

For semantic segmentation, it is important to establish a mapping from low-level image features to high-level semantic features in order to obtain discriminative semantic features. Pixels in an image and objects in a scene do not exist independently but are related to one another, and these relationships provide rich semantic contextual information for image parsing. In recent years, great progress has been made in image semantic segmentation methods based on fully convolutional networks, laying a solid foundation for the field. However, a fully convolutional network has a limited receptive field: it can only extract features from a local region and cannot fully utilize the relational information in the scene, which limits its semantic feature representation. Based on deep neural networks, this dissertation studies bottom-up and top-down relation modeling methods that fully exploit the relational information in the image to obtain more discriminative semantic features and improve semantic segmentation performance.

The main contributions of this dissertation are summarized as follows:

1. Pyramid relational network for semantic segmentation. Considering that the objects in a scene tend to co-occur, this dissertation proposes to model the relationships between the pixels in the image and to use these relationships directly as relational features that assist the model in semantic prediction. A spatial relation module is designed to model the pixel-to-pixel (one-to-others) relationship in high-level features, and a pixel-to-region relationship is further introduced to reduce the heavy computational cost of modeling every pixel pair. In addition, a channel relation module is designed to encode the relational features in the channel dimension. The spatial and channel relational features are jointly used to enhance the semantic feature representation and obtain more discriminative high-level semantic features. The experimental results demonstrate that this method effectively improves the feature encoding ability of the model and yields more accurate semantic segmentation.

2. Sampling-based attention network for semantic segmentation. The self-attention mechanism suffers from high computational complexity in image semantic segmentation. To address this, the dissertation proposes a stochastic sampling-based attention module that efficiently models the relationship between query points and representative sparse sampling points. This module achieves comparable segmentation performance while significantly reducing computational complexity. In addition, based on the observation that pixels depend on global contextual information to different degrees, the dissertation proposes a deterministic sampling-based attention module that samples low-level features from a local region under the guidance of high-level features to obtain detailed information, which further improves segmentation performance. Experimental results demonstrate that this method achieves efficient and accurate semantic segmentation on commonly used datasets.

3. Dynamic fully connected network for semantic segmentation. The spatial fully connected layer cannot model content-aware contextual information. This dissertation designs a dynamic fully connected layer that adaptively models global information according to the image content. A skip connection is introduced to decouple local and global context modeling: the original features from the skip connection focus on the local context, while the attention weights from the fully connected layer focus on the global context. In addition, a dynamic parameter mechanism is adopted to model the contextual information related to the image content and adaptively obtain long-range dependencies. Extensive experiments demonstrate that this method can handle complex scenes and achieve more accurate semantic segmentation.

4. Consistent-Separable feature representation network for semantic segmentation. Most segmentation models use a pixel-wise loss as their optimization criterion, ignoring the inherent structural relationship in the feature space. This dissertation proposes a consistent-separable feature learning module, which builds a semantic center for each category and models the relationship between pixels and semantic centers. A category-aware semantic consistency loss is designed to learn intra-class consistent and inter-class separable features in a top-down supervised scheme, obtaining a better-structured feature space. Extensive experiments demonstrate that this method can be easily embedded into existing segmentation networks with negligible cost and achieve significant improvements compared to these models.
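
The intuition behind contribution 4 (per-category semantic centers, with features pulled toward their own center and centers pushed apart) can be sketched as a toy loss. The abstract does not give the actual loss, so everything below is an assumption made for illustration: the function names, the use of per-class mean features as centers, the squared Euclidean distance, and the hinge with a `margin` parameter.

```python
import math


def class_centers(features, labels):
    """Mean feature vector of each class present in `labels`."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for c, v in enumerate(vec):
            acc[c] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def consistent_separable_loss(features, labels, margin=1.0):
    """Hypothetical loss: intra-class pull plus inter-class push."""
    centers = class_centers(features, labels)
    # Consistency: pull each pixel feature toward the center of its class.
    pull = sum(
        distance(f, centers[lab]) ** 2 for f, lab in zip(features, labels)
    ) / len(features)
    # Separability: hinge penalty on pairs of centers closer than `margin`.
    labs = sorted(centers)
    pairs = [(a, b) for i, a in enumerate(labs) for b in labs[i + 1:]]
    push = sum(
        max(0.0, margin - distance(centers[a], centers[b])) ** 2
        for a, b in pairs
    ) / max(len(pairs), 1)
    return pull + push


# Four toy "pixel" features forming two tight clusters.
feats = [[0.0, 0.0], [0.0, 0.2], [3.0, 3.0], [3.0, 3.2]]
well_structured = consistent_separable_loss(feats, [0, 0, 1, 1])
poorly_structured = consistent_separable_loss(feats, [0, 1, 0, 1])
```

When the labels match the clusters, the loss is small because each feature sits near its center and the centers are far apart; when the labels cut across the clusters, both terms grow. A term of this kind would typically be added to the usual pixel-wise loss rather than replace it.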

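Keywords: image semantic segmentation; relation modeling; self-attention mechanism; sparse sampling; structured constraints
Language: Chinese
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/48886
Collection: Graduates_Doctoral Dissertations
Recommended citation
GB/T 7714
He Xingjian. Research on Image Semantic Segmentation Methods Based on Relation Modeling [D]. School of Artificial Intelligence, University of Chinese Academy of Sciences, 2022.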
Files in this item
File name/size: 博士毕业论文_何兴建_电子签名.pdf (7064KB)
Document type: Dissertation
Access type: Restricted access
License: CC BY-NC-SA