Research on Image Semantic Segmentation Methods Based on Context Modeling

	Research on Image Semantic Segmentation Methods Based on Context Modeling
	Fu Jun (付君)
	2020-05-27
Pages	134
Degree type	Doctoral
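Abstract (translated from Chinese)	With the rapid development of the Internet and multimedia and the continued improvement of information infrastructure, image data is growing explosively. Efficient semantic parsing of images to support image management and applications is therefore of great practical significance. By understanding image content at the pixel level, image semantic segmentation provides the finest-grained form of semantic image parsing. As a core and fundamental task in computer vision, it has broad application prospects in mobile phone photography, medical diagnosis, autonomous driving, remote sensing segmentation, and other fields. Image semantic segmentation requires a model to extract high-level semantic information for accurate object recognition and, at the same time, to capture spatial detail for recovering fine object boundaries. The fully convolutional network was a milestone for research on image semantic segmentation: it uses a pretrained image classification network to extract high-level semantic information and bilinear-interpolation upsampling to recover the spatial detail of the segmentation result, thereby obtaining a pixel-level semantic understanding of the image. Although this approach has been very successful, problems and challenges remain. On the one hand, the successive downsampling operations of the classification network cause its output features to lose a large amount of spatial detail, so that small objects are lost and object boundaries are coarse. On the other hand, the effective receptive field of the classification network's output features is limited and their ability to capture high-level semantic information is insufficient, which leads to inaccurate semantic discrimination of objects. The key to solving these problems is to improve the representational power of pixel-level features. When judging the semantics of a pixel, people often rely on contextual information that includes the object to which the pixel belongs, so accurately capturing and effectively using this context is essential for pixel recognition. Building on fully convolutional networks, this dissertation designs network structures and strategies that explicitly model context, so that features learn high-level semantic information and spatial detail at the same time, greatly improving image semantic segmentation performance. The main results and contributions of this dissertation are as follows: 1. To address the limited recognition ability of the encoder and the difficulty of training the decoder in deconvolutional networks, a deep network model based on stacked deconvolutional networks is proposed. Multiple shallow deconvolutional units are stacked to model multi-scale context and strengthen both the high-level semantic information and the spatial detail of the features. Several skip connections are introduced within and between deconvolutional units to improve information flow and gradient backpropagation, which enables effective fusion of multi-scale features. Hierarchical supervision is introduced to constrain feature learning in the decoder and improve feature discrimination. Experimental results show that the method improves the accurate segmentation of small objects and object boundaries and achieves the best results among contemporaneous methods on several datasets. 2. To address the problem that deconvolutional networks focus on recovering low-level detail in the decoding stage while neglecting the high-level semantic representation of features, a contextual deconvolutional network is proposed. Global and local information are extracted along the channel dimension and the spatial dimension of the high-level features, respectively; an attention mechanism converts them into weight masks that are applied to the high-level features, guiding the network to strengthen their semantic representation. In the decoding stage, features with different levels of semantics are fused to improve the deconvolutional network's semantic awareness of objects. Experimental results show that the method effectively reduces object misclassification and achieves good performance on several datasets. 3. To address insufficient feature context modeling in image semantic segmentation methods based on fully convolutional networks, a relation-aware dual attention network is proposed to model the relational context of features effectively and improve their representational power. Specifically, self-attention is used to build the relationships between pixels and between channels, and features are adaptively weighted according to the semantic relationships in each dimension, which strengthens their semantic discrimination. Furthermore, features are fused along the spatial and channel dimensions, which simplifies relationship modeling and significantly reduces its computational and GPU memory cost. Experiments show that the method handles semantic parsing of complex scenes effectively, significantly improves segmentation accuracy for small objects, and achieves higher accuracy than other contemporaneous methods on several scene datasets. 4. The different roles that global context and local context in the network play in object recognition and detail recovery are analyzed, and an image semantic segmentation method that adaptively fuses global and local context is proposed on this basis. The relationship between global context information and pixel features is used to build a pixel-aware context preference; a suitable coupling strategy is then learned to exploit the compatibility and complementarity of global and local context, helping the network achieve both object recognition and detail recovery. Experiments show that the method effectively alleviates misclassification of large objects and omission of small objects, and achieves the best results among contemporaneous methods on several datasets.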
English abstract	With the rapid development of the Internet and the improvement of information infrastructure, images are being generated at a massive scale. This necessitates efficient semantic analysis for image management and applications. As one of the core and fundamental tasks in computer vision, semantic image segmentation aims to predict the category of each pixel in an image, thereby providing fine-grained semantic understanding. Because it has extensive applications in mobile phone photography, medical diagnosis, autonomous driving, and other fields, semantic image segmentation has attracted wide attention from the academic community. As a pixel-wise labeling task, semantic image segmentation calls for feature representations that contain both high-level semantic information and spatial details, in order to recognize objects precisely while preserving their detailed structure. Fully Convolutional Networks (FCNs) have brought great progress in semantic segmentation. An FCN extracts the semantic information of the image with a pretrained classification network and uses bilinear interpolation to recover the spatial information. Despite the great success of this method, many difficulties and challenges remain. First, the successive downsampling operations in the image classification network cause a loss of spatial details, so that small objects are lost and object edges are rough. Second, the limited receptive field and insufficient semantic information of the network cause inaccurate semantic discrimination of objects. To address these problems, this dissertation proposes several effective network designs and strategies that explore suitable context information, in order to learn features with high-level semantic information and spatial details simultaneously and thereby improve the performance of semantic segmentation. The main contributions of this dissertation are summarized as follows: 1. Several works on image segmentation adopt an encoder to capture high-level semantic information and a decoder to recover spatial details. However, the insufficient discrimination of encoders and the difficult training of decoders hinder the performance of deconvolutional networks. To alleviate this problem, this dissertation proposes a stacked deconvolutional network, which stacks multiple shallow deconvolutional units to integrate contextual information and recover fine localization. Inter-unit and intra-unit connections are designed to assist network training and enhance feature fusion, since these connections improve the flow of information and gradient propagation throughout the network. In addition, hierarchical supervision is applied during the upsampling process of each deconvolutional unit, which enhances the discrimination of feature representations and benefits network optimization. Extensive experiments demonstrate the effectiveness of the proposed approach, which achieves the best results among contemporaneous methods on several semantic segmentation datasets. 2. Existing deconvolutional networks focus on recovering local details but neglect the representation of semantic information. To alleviate this problem, a contextual deconvolutional network is proposed. Specifically, a channel contextual module is built to capture image-level semantic information by aggregating the feature maps across spatial dimensions, which clarifies the global ambiguity of features. A spatial contextual module is designed to obtain patch-level semantic context by learning a spatial weight map, which enhances feature discrimination. Embedding the two contextual modules into individual components of the decoder network improves its representation power and yields more precise segmentation results. The experimental results demonstrate that this method effectively decreases the misclassification of objects and obtains stable performance improvements on several semantic segmentation datasets. 3. This dissertation proposes a dual relation-aware attention network that captures contextual information with a relation-aware attention mechanism. Specifically, a self-attention mechanism is introduced to model the semantic associations between any two pixels or channels, so that each pixel or channel can adaptively aggregate context from all pixels or channels according to their correlations. To reduce the high computation and memory cost of this pair-wise association computation, two types of compact attention modules are built. In the compact attention modules, pixels or channels are associated with only a small number of gathering centers and obtain their context by aggregating over these gathering centers. In addition, a cross-level gating decoder is designed to selectively enhance spatial details, which boosts the performance of the network. Extensive experiments on several scene parsing datasets demonstrate that the method achieves the best single-model performance among contemporaneous methods. 4. This dissertation proposes an adaptive context network that captures pixel-aware context through a competitive fusion of global context and local context according to the demands of each pixel. Specifically, for a given pixel, the global context demand is measured by the similarity between the global feature and the pixel's local feature, and the reverse of this value is used to measure the local context demand. The adaptive contextual features are obtained by combining the proposed global context module and local context module. Furthermore, using these modules, several adaptive context blocks are constructed at different levels of the network to obtain a coarse-to-fine result. The experimental results demonstrate that the proposed method effectively improves object recognition and detail restoration simultaneously.
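The pair-wise attention described in contribution 3 can be illustrated with a small sketch. This is not the thesis's implementation: it assumes plain dot-product similarity with no learned projections, it omits the compact "gathering center" variant and the decoder, and the function names and the residual weight `gamma` are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_self_attention(features, gamma=1.0):
    """Let every pixel aggregate context from all pixels.

    features: list of N pixel vectors, each a list of C floats.
    gamma: hypothetical residual weight (an assumption, not from the abstract).
    """
    outputs = []
    for query in features:
        # Correlation of this pixel with every pixel (dot-product similarity).
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted sum of all pixel features: the aggregated context.
        context = [
            sum(w * feat[c] for w, feat in zip(weights, features))
            for c in range(len(query))
        ]
        # Residual combination keeps the original feature next to its context.
        outputs.append([gamma * ctx + q for ctx, q in zip(context, query)])
    return outputs


def channel_self_attention(features, gamma=1.0):
    """Same aggregation, but between channels instead of pixels."""
    # Transpose so that each row holds one channel across all pixels.
    channels = [list(col) for col in zip(*features)]
    attended = pixel_self_attention(channels, gamma)
    # Transpose back to the original pixel-major layout.
    return [list(row) for row in zip(*attended)]
```

Calling `pixel_self_attention([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])` draws the first two pixels mostly toward each other, because their similarity is higher than their similarity to the third pixel. The cost grows with the square of the number of pixels, which is the overhead the compact attention modules are described as reducing.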
Keywords	image semantic segmentation; relational context; attention mechanism; global context; local context
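Language	Chinese
Seven major directions: sub-direction classification	Object detection, tracking, and recognition
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/39224
Collection	Zidong Taichu Large Model Research Center (紫东太初大模型研究中心)_Image and Video Analysis
Recommended citation (GB/T 7714)	Fu Jun (付君). 基于上下文建模的图像语义分割方法研究 (Research on Image Semantic Segmentation Methods Based on Context Modeling)[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.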

Files in this item
File name/size	Document type	Version type	Access type	License
毕业论文最终版-fujun.pdf (5497KB)	Thesis		Open access	CC BY-NC-SA