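Research on Fine-Grained Visual Parsing Methods for Human Images (面向人像的精细化视觉解析方法研究)
Zhu Bingke (朱炳科)
2021-05-28
Pages: 136
Degree type: Doctoral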
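Chinese Abstract (English translation)

With the rapid development of the mobile internet and the rapid spread of smart terminals, image and video data have grown explosively and have become an important carrier of information. Accurate and efficient semantic understanding and analysis of massive visual data is one of the core technical means of advancing computational intelligence. Human images are a core element of multimedia data, and fine-grained visual parsing of human images achieves the finest-grained expression of human-image semantics, effectively advances machine vision intelligence, and has broad application prospects in autonomous driving, human-computer interaction, virtual try-on, artistic creation, targeted advertising, and other fields. Research on fine-grained visual parsing methods for human images therefore has significant academic value and practical importance.

This dissertation uses fine-grained visual parsing to achieve thorough understanding and fine-grained perception of human-image content. It mainly covers edge-refined portrait soft segmentation, category-refined human parsing, and instance-refined human parsing. Although deep learning has greatly advanced fine-grained visual parsing, the field still faces difficulties and challenges such as demanding edge-refinement requirements, noisy-label interference, widely varying scales and poses, complex background interference, and diverse scene information. This dissertation therefore takes deep learning as its basic tool and improves the computational efficiency and accuracy of fine-grained visual parsing by designing appropriate deep network models and model optimization methods.

The main research results and contributions of this dissertation are summarized as follows:

1. Portrait soft segmentation based on edge refinement. Current portrait soft segmentation methods are computationally inefficient and cannot meet real-time requirements on mobile devices. This dissertation therefore proposes a light-weight network that performs portrait soft segmentation in real time on mobile devices. On the one hand, a light-weight segmentation network produces a coarse portrait segmentation, which improves computational efficiency while keeping the loss in segmentation accuracy small. On the other hand, to compensate for the accuracy loss of the light-weight segmentation network, an adaptively learnable feathering module is proposed that performs edge-refining feathering at a small computational cost and improves the accuracy of portrait soft segmentation. Experimental results show that combining the proposed light-weight segmentation network with the feathering module not only enables real-time portrait soft segmentation on mobile devices but also achieves accuracy comparable to other portrait soft segmentation methods of the same period.

2. Human parsing based on progressive learning of human structure. To address the interference of background features with foreground features in human parsing, this dissertation proposes a progressive segmentation network that decouples the human parsing task along the human structure, thereby parsing progressively from coarse to fine granularity. To realize the progressive segmentation network, a convolutional module based on region feature learning is introduced. By imitating the biological attention mechanism, this module extracts regions of interest and filters out irrelevant regions, significantly reducing their interference with the segmentation target. Experimental results show that the method achieves consistent accuracy gains across human parsing datasets and higher accuracy than other methods of the same period on several datasets.

3. Instance-level human parsing based on part decoupling learning. To address the dependence of instance-level human parsing results on human detection, this dissertation proposes a part decoupling network that decouples each human part from the whole human during feature learning and treats each human part as an instance target for instance segmentation. To associate each human part instance with a human instance, a bipartite graph model is built and the Hungarian algorithm is used for instance association. In addition, a graph convolutional model based on human structure is added for feature learning to improve the accuracy of associating part instances with human instances. Experimental results show that the method effectively improves the accuracy of instance-level human parsing and achieves higher accuracy than other methods of the same period on several datasets.

4. Regularized human parsing based on synthetic noisy labels. Human parsing datasets often contain a large number of noisy labels, and current deep-learning-based optimization methods overfit the data labels, which leads to poor model generalization. This dissertation therefore proposes a regularized human parsing method. Using synthetic noisy labels, the method adds a regularizing offset to the original gradient descent direction, which alleviates overfitting to the data labels and strengthens the generalization of the model. The performance and generalization ability of the model are demonstrated across human parsing datasets. In addition, applying the method to the three fine-grained visual parsing tasks above further improves the performance of the light-weight soft segmentation network, the progressive segmentation network, and the part decoupling network.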

English Abstract

With the rapid development of the mobile internet and the spread of smart terminals, the volume of image and video data has grown explosively, and such data has become an important medium for information transmission. Accurate and efficient semantic understanding of massive visual data is one of the core techniques for advancing machine intelligence. Human images are a core element of multimedia data. Fine-grained visual parsing of human images yields the finest-grained representation of human-image semantics and effectively promotes the development of machine vision intelligence. It also has a wide range of applications in autonomous driving, human-computer interaction, virtual fitting, artistic creation, targeted advertising, and other fields. Research on fine-grained visual parsing methods for human images is therefore of great academic value and practical significance.

This dissertation uses fine-grained visual parsing to achieve a thorough understanding and fine-grained perception of human images. The main content covers edge-level fine-grained portrait soft segmentation, category-level fine-grained human parsing, and instance-level fine-grained human parsing. Although deep learning has greatly advanced fine-grained visual parsing, the task still faces many difficulties and challenges, such as strict edge-refinement requirements, noisy-label interference, large variations in scale and pose, complex background interference, and diverse scene information. This dissertation therefore adopts deep learning as its basic tool and designs suitable deep network models and optimization methods to improve the efficiency and accuracy of fine-grained visual parsing.

The main contributions of this dissertation are summarized as follows:

1. This dissertation proposes a portrait matting method based on edge refinement. Existing portrait matting techniques are computationally inefficient and cannot run in real time on mobile devices. This dissertation therefore proposes a light-weight network for real-time portrait matting on mobile devices. On the one hand, a light-weight segmentation network produces a coarse portrait segmentation, which improves efficiency while keeping the loss in accuracy small. On the other hand, to compensate for the accuracy loss of the light-weight segmentation network, a learnable feathering module is proposed, which performs feathering for edge refinement at a small computational cost and improves the accuracy of portrait matting. The experimental results show that the combination of the light-weight segmentation network and the feathering module not only performs real-time portrait matting on mobile devices but also achieves accuracy comparable to other portrait matting methods of the same period.

2. This dissertation proposes a human parsing method based on progressive learning of human structure. To address the interference of background features with foreground features, this dissertation proposes a Progressive Segmentation Network for human parsing, which decouples the task along the human structure so that parsing proceeds progressively from a coarse level to a fine-grained level. To realize the Progressive Segmentation Network, a convolutional module based on region feature learning is introduced. By imitating the biological attention mechanism, the module extracts regions of interest and filters out irrelevant regions, which significantly reduces their interference with the parsing target. The experimental results show that the proposed method achieves consistent improvements on various human parsing datasets and higher accuracy than other contemporary methods on multiple datasets.

3. This dissertation proposes an instance-level human parsing method based on component decoupling. To address the dependence of instance-level human parsing on human detection, this dissertation proposes a Component Decoupling Network, which decouples each human part from the whole human when learning features and treats each human part as an instance object for instance segmentation. To associate each human part instance with a human instance, this dissertation builds a bipartite graph model and adopts the Hungarian algorithm for instance association. In addition, a graph convolutional model based on human structure is added to improve the accuracy of the association between human part instances and human instances. The experimental results show that the proposed method effectively improves the accuracy of instance-level human parsing and achieves higher accuracy than other contemporary methods on multiple datasets.

4. This dissertation proposes a regularized human parsing method based on synthetic noisy labels. Human parsing datasets often contain a large number of noisy labels, and current deep-learning-based optimization methods overfit these labels, which results in poor generalization of the model. This dissertation therefore proposes a regularized human parsing method. Using synthetic noisy labels, the method adds a regularizing offset to the original gradient descent direction, which alleviates overfitting to the labels and enhances the generalization of the model. Experiments on various human parsing datasets demonstrate the performance and generalization ability of the model. In addition, applying the proposed method to the three fine-grained visual parsing tasks above further improves the performance of the Light-weight Soft Segmentation Network, the Progressive Segmentation Network, and the Component Decoupling Network.
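The instance association step in contribution 3 (a bipartite graph solved with the Hungarian algorithm) can be illustrated with a minimal sketch. The abstract does not say how edge costs are defined or whether matching is run separately per part category, so the sketch assumes a precomputed cost matrix in which rows are part instances of one category, columns are human instances, and a lower cost means a better fit. The function name `hungarian` and the example costs are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List


def hungarian(cost: List[List[float]]) -> Dict[int, int]:
    """Minimum-cost one-to-one matching on a bipartite graph.

    cost[i][j] is the (assumed) cost of assigning part instance i to
    human instance j. Returns {part_index: human_index}. If there are
    more parts than humans, the surplus parts stay unmatched.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        # The core routine needs rows <= columns: transpose, then invert.
        transposed = [list(col) for col in zip(*cost)]
        return {part: human for human, part in hungarian(transposed).items()}

    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 if free
    way = [0] * (m + 1)   # previous column on the augmenting path

    for i in range(1, n + 1):
        p[0] = i          # column 0 is a virtual start holding row i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            # Find the unused column with the smallest reduced cost.
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:    # reached a free column: path is complete
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}
```

For example, `hungarian([[4, 1, 3], [2, 0, 5], [3, 2, 2]])` pairs part 0 with human 1, part 1 with human 0, and part 2 with human 2, for a total cost of 5. In practice the cost could be derived from spatial overlap or learned features; that choice is outside what the abstract specifies.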

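Keywords: fine-grained visual parsing; portrait soft segmentation; human parsing; instance-level human parsing
Language: Chinese
Seven major directions, sub-direction classification: Image and video processing and analysis
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/44997
Collection: Zidong Taichu Large Model Research Center_Image and Video Analysis
Corresponding author: Zhu Bingke (朱炳科)
Recommended citation
GB/T 7714
朱炳科. 面向人像的精细化视觉解析方法研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2021.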
Files in this item
File name/size: Thesis-bkzhu-Final-电 (11853KB)
Document type: Dissertation
Version type: not specified
Access type: Open access
License: CC BY-NC-SA