Research on Semantic Image Segmentation Methods Based on Deep Learning
王宇航 1,2
2017-05-23
Degree type: Doctor of Engineering
Chinese Abstract
As human society enters the intelligent era of big data, the massive growth of image resources and the increasing popularity of intelligent devices demand more effective semantic parsing of images, that is, fast and accurate understanding of image content. As a fine-grained semantic parsing task, semantic image segmentation not only discriminates the semantic categories of objects in an image, but also localizes them accurately and delineates their boundaries. It has attracted wide attention from the academic community and has shown great application prospects in many fields, such as video content analysis, autonomous driving, and intelligent healthcare.
 
Deep learning has greatly advanced semantic image segmentation, allowing us to learn more discriminative models for recognizing image content; nevertheless, many difficulties and challenges remain in building more accurate and effective segmentation algorithms. First, the visual diversity that objects exhibit under different illumination, poses, scales, and occlusions requires segmentation models with more robust and accurate discriminative ability. Second, as a fine-grained semantic parsing task, semantic segmentation needs feature representations that balance global and local information: the global information ensures correct semantic judgments about objects, while the local information preserves their detailed contours. Finally, most current semantic segmentation methods are fully supervised and require costly pixel-level annotations; how to reduce or weaken the model's dependence on annotated samples is a research question worth exploring. This dissertation investigates the above difficulties and challenges in depth, and its main results and contributions are as follows:
 
1. A deep learning framework that jointly performs object proposal determination and semantic segmentation is proposed. Objects are first coarsely localized through objectness analysis, converting whole-image semantic segmentation into the segmentation of several candidate object regions. A lightweight deconvolutional neural network is then constructed to carry out a finer feature-map upsampling process and thus obtain more detailed segmentation results. Experimental comparisons on public benchmarks show that, relative to classic segmentation models such as fully convolutional networks, this model is smaller, converges faster, and performs better.
 
2. A deep learning framework that mines spatio-temporal correlations between video frames is proposed to address semantic video segmentation. The framework treats each frame as an independent image whose feature maps are learned by a deconvolutional network; then, based on the semantic correlations between pixels within spatio-temporal neighborhoods of the video sequence, it learns a set of state transition matrices to fuse inter-frame information and obtain more accurate pixel-level semantic predictions. The state transition matrices are implemented as a set of convolutional layers, so they are integrated with the deconvolutional network into a unified architecture for end-to-end joint learning. The method achieved the best contemporaneous segmentation results on several semantic video segmentation datasets.
 
3. A deep learning framework that jointly performs image depth estimation and semantic segmentation is proposed, fully exploiting the complementarity of depth and semantic information during model learning. A collaborative deconvolutional neural network extracts depth and semantic features simultaneously and fuses their feature maps via an outer product, so that the two tasks are learned jointly within a unified network and reinforce each other. For depth estimation, a more reasonable label mapping strategy is proposed that converts it from a regression problem into a classification problem, improving training effectiveness. In addition, an improved fully connected conditional random field is used as post-processing, exploiting the predicted pixel depths to further refine the segmentation results. The method achieved the best contemporaneous results on several indoor scene datasets for both depth estimation and semantic segmentation.
 
4. A semi-supervised semantic image segmentation method is proposed to reduce the algorithm's dependence on pixel-level annotated samples. Within a standard fully convolutional network framework, a small number of pixel-level annotated samples are first used to learn the structural information of objects, and a large number of image-level annotated samples are then used to optimize the model further. Through the collaborative use of image-level and pixel-level multi-grained supervision, the model's ability to recognize objects in images is constrained from both global and local perspectives. The method reduces the annotation burden while maintaining good segmentation performance.
 
English Abstract
With the explosive growth of web images and the increasing popularization of intelligent devices, techniques for semantic image understanding are becoming increasingly essential. As a fine-grained task in this area, semantic image segmentation aims not only to discriminate the semantic meanings of objects in images, but also to locate object boundaries. This more exact understanding capability gives semantic segmentation great potential in industrial applications, including video content analysis, autonomous driving, intelligent healthcare, etc.
 
Deep learning has contributed greatly to the recent progress of semantic image segmentation by enabling more capable models, but many difficulties and challenges remain in achieving more accurate and effective segmentation. First, objects in the real world are complicated and diverse, varying in scale, pose, illumination, and occlusion, so more robust and discriminative recognition models are required. Second, semantic segmentation, as a pixel-wise labeling task, calls for feature representations that describe both the global and local information of objects, in order to make precise semantic judgments while preserving the detailed structure of objects. Third, most recent semantic segmentation methods are fully supervised and rely heavily on costly pixel-wise annotated images; how to reduce this dependence on expensive annotation during modeling is an important and challenging problem. To address these problems, this dissertation presents several effective solutions, whose main contributions are listed as follows:
 
1. An objectness-aware semantic segmentation framework is proposed, which jointly learns an object proposal network and a lightweight deconvolutional neural network (see Sketch 1 after this list). The object proposal network preliminarily locates object bounding boxes, while the lightweight deconvolutional neural network performs finer upsampling of feature maps to implement more detailed segmentation. Compared with previous methods such as FCN, our approach markedly reduces model size and convergence time while achieving better semantic segmentation performance.
 
2. A jointly trained deep learning framework is proposed to make the best use of spatial and temporal information for semantic video segmentation (see Sketch 2 after this list). Along the spatial dimension, the lightweight deconvolutional neural network conducts pixel-wise semantic interpretation for each single video frame. Along the temporal dimension, a set of state transition matrices is learned to make the pixel-wise label predictions consistent with those of adjacent pixels across the space and time domains. The state transition matrices can be implemented as a set of convolutional operations connected with the lightweight deconvolutional neural network, so the two parts are jointly trained as a unified deep network. Our approach achieved the best contemporaneous performance on several video datasets.
 
3. A collaborative deconvolutional neural network is proposed to jointly model single-view depth estimation and semantic image segmentation, considering their complementarity in scene understanding (see Sketch 3 after this list). The network extracts semantic and depth features simultaneously and integrates them via a bilinear layer, enabling joint training of the two tasks for mutual promotion. Specifically, for depth estimation, a reasonable label mapping strategy is proposed to transform it from a regression problem into a classification problem. Besides, an improved fully connected CRF is introduced as post-processing to further improve semantic segmentation with the predicted pixel depths. Extensive experiments on several indoor scene datasets demonstrate the validity of our approach, which achieved the best contemporaneous performance on both the depth estimation and semantic segmentation tasks.
 
4. A semi-supervised learning framework is proposed for semantic image segmentation to alleviate its dependence on pixel-level labeled images (see Sketch 4 after this list). The model is first trained with a few pixel-level annotated images for preliminary recognition of objects, and then refined with a large number of image-level labeled images to further improve its discrimination capability; multi-grained supervision is employed to facilitate collaborative learning of the global and local information of objects. Our approach achieves promising semantic segmentation performance on several commonly used datasets with a markedly lower demand for manual annotation.
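
The four sketches below are minimal, hypothetical Python illustrations of the mechanisms summarized above. They are not the thesis code: the libraries (PyTorch, NumPy), module names, layer sizes, and hyperparameters are all assumptions made for illustration.

Sketch 1 (contribution 1): features cropped from each object proposal are upsampled by a small stack of transposed convolutions into a per-proposal segmentation mask.

import torch
import torch.nn as nn

class LightweightDeconvHead(nn.Module):
    # Hypothetical lightweight deconvolutional head: three 2x upsampling
    # stages (8x total), applied to features cropped from object proposals.
    def __init__(self, in_channels=512, num_classes=21):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, num_classes, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, proposal_features):
        # proposal_features: (N, C, h, w) features cropped from N proposals.
        # Returns per-proposal class score maps of shape (N, num_classes, 8h, 8w).
        return self.decode(proposal_features)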
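
Sketch 2 (contribution 2): the temporal "state transition" is realized as a learned convolution over the concatenated class score maps of the previous and current frames, applied recurrently along the sequence.

import torch
import torch.nn as nn

class TemporalTransition(nn.Module):
    # Hypothetical conv-based transition: a KxK convolution over 2*C score
    # channels acts as a spatially local, learnable class-to-class transition.
    def __init__(self, num_classes=21, kernel_size=3):
        super().__init__()
        self.transition = nn.Conv2d(2 * num_classes, num_classes,
                                    kernel_size, padding=kernel_size // 2)

    def forward(self, prev_scores, cur_scores):
        # prev_scores, cur_scores: (1, C, H, W) class score maps.
        return self.transition(torch.cat([prev_scores, cur_scores], dim=1))

def fuse_video_scores(frame_scores, model):
    # frame_scores: list of (1, C, H, W) per-frame score maps in temporal order.
    fused, prev = [], frame_scores[0]
    for cur in frame_scores:
        prev = model(prev, cur)  # recurrently propagate the fused state
        fused.append(prev)
    return fused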
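
Sketch 3 (contribution 3): the label mapping that turns depth regression into classification, here implemented as uniform binning in log-depth space; the depth range and bin count are illustrative, not the thesis's actual mapping.

import numpy as np

def depth_to_class(depth, d_min=0.5, d_max=10.0, num_bins=50):
    # Map metric depth (meters) to a discrete class index via log-space bins.
    log_d = np.log(np.clip(depth, d_min, d_max))
    edges = np.linspace(np.log(d_min), np.log(d_max), num_bins + 1)
    return np.clip(np.digitize(log_d, edges) - 1, 0, num_bins - 1)

def class_to_depth(cls, d_min=0.5, d_max=10.0, num_bins=50):
    # Inverse mapping: recover depth as the log-space center of the predicted bin.
    edges = np.linspace(np.log(d_min), np.log(d_max), num_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return np.exp(centers[cls])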
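
Sketch 4 (contribution 4): multi-grained supervision combining a per-pixel cross-entropy loss on the few strongly labeled images with an image-level multi-label loss, computed on globally pooled class scores, for the many weakly labeled ones; the weighting alpha is an assumption.

import torch
import torch.nn.functional as F

def multi_grained_loss(pixel_scores, pixel_labels, weak_scores, image_labels, alpha=0.5):
    # pixel_scores: (B1, C, H, W); pixel_labels: (B1, H, W) with class ids.
    # weak_scores: (B2, C, H, W); image_labels: (B2, C) multi-hot class presence.
    loss_pixel = F.cross_entropy(pixel_scores, pixel_labels)      # local supervision
    pooled = F.adaptive_max_pool2d(weak_scores, 1).flatten(1)     # (B2, C) global scores
    loss_image = F.binary_cross_entropy_with_logits(pooled, image_labels.float())
    return loss_pixel + alpha * loss_image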
 
Keywords: semantic segmentation; depth estimation; deep learning; semi-supervised learning; deconvolutional neural networks
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/15041
Collection: Graduates / Doctoral Dissertations
Author affiliations: 1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
First author's affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
王宇航. 基于深度学习的图像语义分割方法研究[D]. 北京: 中国科学院大学, 2017.
Files in this item
File name / size | Document type | Version | Access | License
基于深度学习的图像语义分割方法研究.pdf (6021 KB) | Doctoral dissertation | - | Restricted access | CC BY-NC-SA