Research on Techniques for Semantic Image Parsing


	Research on Techniques for Semantic Image Parsing (图像语义解析的相关技术研究)
	Li Yong (李勇)
	2016-05
Degree type	Doctor of Engineering
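Abstract (translated from Chinese)	With the rapid development of the Internet and smart devices, users can easily produce high-quality image and video data and spread it quickly over the Internet, so the volume of image and video data has grown explosively. This rapid growth brings great challenges and opportunities for image and video retrieval and analysis, and the intelligent analysis and processing of image and video data has become an active research topic. Semantic image parsing analyzes the content of an image at a high semantic level: it not only identifies the semantic labels that apply to the image but also localizes them, providing pixel-level annotation. Compared with traditional image classification and annotation tasks, semantic image parsing provides finer-grained, region-level semantic information; compared with traditional segmentation based on low-level features, it attaches high-level semantic information to image regions. Semantic image parsing is a key technique for addressing the "semantic gap" problem. Depending on the parsing granularity, it can be divided into three categories: object detection, object segmentation, and semantic image segmentation. This thesis focuses on the latter two categories and studies feature representation learning, object cosegmentation, weakly supervised semantic image segmentation, and video semantic segmentation. The main research content and contributions are as follows: Feature representation learning based on structural constraints. Building on the basic framework of dictionary reconstruction, the thesis proposes a feature learning method in which the feature matrix keeps a block-diagonal structure. This block-diagonal constraint makes the learned representation preserve the similarity of samples from the same class and improves the separability of samples from different classes. The sparsity, low rank, and block-diagonal structure of the feature matrix are optimized jointly, and the learned representation is robust, compact, and strongly discriminative. Object cosegmentation combining saliency detection and discriminative learning. Given a set of images containing objects of the same class, object cosegmentation segments the foreground objects shared across the set. By introducing a saliency detection algorithm, the thesis addresses the problem of consistent backgrounds in cosegmentation. By introducing discriminative learning, it extracts the salient regions shared across the image set. Saliency detection and discriminative learning are optimized in a unified framework, and the shared salient regions are taken as the object foreground. Semantic image segmentation based on a weakly supervised Restricted Boltzmann Machine. Weakly supervised semantic segmentation produces pixel-level semantic annotation when only image-level labels are given. Starting from the Restricted Boltzmann Machine, the thesis divides the hidden nodes into blocks that correspond one-to-one to the semantic labels, and it suppresses the responses of hidden nodes that correspond to labels absent from the image. In addition, a consistency constraint is introduced so that image regions with similar appearance have similar hidden representations. Learning thereby builds a mapping from low-level visual features to high-level semantics. Video semantic segmentation based on a deconvolutional network. The thesis proposes a video semantic segmentation model based on a deconvolutional network, which better preserves object edge information and delineates object boundaries finely. An inter-frame fusion layer is added to the deconvolutional network to model the relationship between video frames, so that information from neighboring frames assists the semantic segmentation of the current frame and yields better video segmentation results. In addition, a sample augmentation method based on object regions is introduced, and the learned deconvolutional network achieves better segmentation of object regions. Commodity image retrieval based on object semantic parsing. The thesis proposes a commodity image retrieval method based on object semantic parsing. The method determines the semantics of the objects in an image and localizes them, which overcomes problems such as complex backgrounds and overlapping commodities in commodity images. It indexes the object regions in an image and tags them with object semantics. Experimental results show that the retrieved results are not only visually similar but also semantically consistent.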
English abstract	With the spread of the Internet and mobile devices, users can easily capture high-quality images and videos and share them online, and the volume of image and video data has grown rapidly. This volume of data poses great challenges for image analysis and retrieval, and the intelligent processing of image and video data has become an active research topic. Semantic image parsing analyzes image content in terms of high-level concepts by assigning concept labels to local regions of an image. Compared with image classification, semantic image parsing localizes high-level concepts to regions; compared with image segmentation, it attaches concept information to those regions. It is an important technique for addressing the semantic gap. By parsing granularity, semantic image parsing can be divided into three categories: object detection, object segmentation, and semantic segmentation. This thesis focuses on representation learning, object segmentation, semantic segmentation, and object-aware image retrieval. The main contributions are as follows: Representation learning based on structural regularization. We propose a representation learning method in which the learned representation matrix has a block-diagonal structure. A block-diagonal representation highlights inter-class differences and enhances intra-class similarities. The proposed method learns robust, compact, and discriminative representations that are also low-rank and sparse. Object cosegmentation based on salient and common region discovery. The goal of object cosegmentation is to simultaneously segment the object regions in a set of images that contain the same object class. 
Unlike typical methods, which simply assume that the regions common to the images are the object regions, we also account for interference from consistent backgrounds and take the object regions to be those that are both common and salient across the images. To this end, we propose a unified algorithm that combines saliency detection and discriminative learning to extract salient, common regions as object regions. Semantic image segmentation based on a weakly supervised Restricted Boltzmann Machine. We address semantic segmentation when only image-level labels are available, using a weakly supervised Restricted Boltzmann Machine. The hidden nodes of the Restricted Boltzmann Machine are divided into blocks, and each block corresponds to a specific label. The hidden response of each superpixel is suppressed for labels outside the label set of its parent image; this suppression term implicitly introduces the image-level labels as weak supervision. In addition, a consistency constraint encourages visually similar regions to have similar hidden responses. A mapping between image labels and the low-level features of local regions is learned during optimization. Video semantic segmentation based on a deconvolutional network. A deconvolutional network preserves more boundary information and yields finer object boundaries. We introduce fusion layers to incorporate temporal information, so that adjacent frames help with the semantic segmentation of the current frame. Furthermore, we augment the training data with object-based proposals so that training pays more attention to object classes, and the learned network gives better results on object classes. Commodity image retrieval based on an object-aware retrieval framework. 
We use a classification network to predict commodity categories and a class-specific detection network to locate object regions; deep representations of the object regions are then extracted for matching and ranking. The proposed method is robust to complex backgrounds and recurring instances, and it returns visually similar retrieval results of the same category.
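Keywords	semantic parsing; semantic segmentation; cosegmentation; object detection; image retrieval
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/11788
Collection	Graduates_Doctoral Dissertations
Author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	李勇. 图像语义解析的相关技术研究[D]. 北京: 中国科学院大学, 2016.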
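
The label-suppression idea in the weakly supervised Restricted Boltzmann Machine part of the abstract can be illustrated with a small NumPy sketch. It is not the thesis's formulation: the function names, the equal-sized hidden blocks, the squared penalty, and the block-mean scoring rule are all assumptions made for illustration. The only details taken from the abstract are that hidden units are split into one block per label and that responses in blocks for labels absent from the parent image are suppressed.

```python
import numpy as np


def absent_label_mask(image_labels, num_labels, block_size):
    """Boolean mask over hidden units: True where the unit's label is absent from the image."""
    absent = np.ones(num_labels, dtype=bool)
    absent[np.array(sorted(image_labels), dtype=int)] = False
    # Each label owns one contiguous block of `block_size` hidden units (assumed layout).
    return np.repeat(absent, block_size)


def suppression_penalty(hidden, image_labels, num_labels):
    """Assumed squared penalty on hidden responses in blocks of labels not in the image.

    hidden: array of shape (num_superpixels, num_hidden), one row per superpixel.
    """
    num_hidden = hidden.shape[1]
    if num_hidden % num_labels != 0:
        raise ValueError("num_hidden must be a multiple of num_labels")
    mask = absent_label_mask(image_labels, num_labels, num_hidden // num_labels)
    return float(np.sum(hidden[:, mask] ** 2))


def assign_labels(hidden, image_labels, num_labels):
    """Give each superpixel the image-level label whose block has the highest mean response."""
    num_superpixels, num_hidden = hidden.shape
    block_size = num_hidden // num_labels
    scores = hidden.reshape(num_superpixels, num_labels, block_size).mean(axis=2)
    present = ~absent_label_mask(image_labels, num_labels, 1)
    # Labels outside the image-level label set can never be selected.
    scores = np.where(present, scores, -np.inf)
    return scores.argmax(axis=1)


# Two superpixels, three labels, two hidden units per label; the image carries labels 0 and 1.
hidden = np.array([[0.9, 0.8, 0.1, 0.2, 0.7, 0.6],
                   [0.1, 0.1, 0.9, 0.9, 0.95, 0.95]])
penalty = suppression_penalty(hidden, {0, 1}, num_labels=3)
labels = assign_labels(hidden, {0, 1}, num_labels=3)
```

In this toy input the second superpixel responds most strongly in the block for label 2, but because label 2 is not among the image-level labels it is assigned label 1 instead, and the responses in the label-2 block are what the penalty accumulates.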

Files in this item
File name/size	Document type	Version type	Access type	License
图像语义解析的相关技术研究-李勇.pdf (12268KB)	Dissertation		Restricted access	CC BY-NC-SA