Research on Human Parsing Based on Structured Modeling

	Research on Human Parsing Based on Structured Modeling
	张小梅
	2021
Pages	138
Degree type	Doctoral
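Chinese abstract	With the rapid development of the Internet and multimedia technology and the continuous improvement of information infrastructure, image data is growing explosively. How to put image data to use in production and daily life has become an increasingly important research topic. Parsing the human bodies in images is a basic and indispensable step in data-intelligence applications, with broad application value and prospects in fields such as virtual try-on, pose recognition, person re-identification, and action recognition. By classifying the human body in an image at the pixel level, human parsing achieves the most fine-grained expression of human body semantics. Human parsing algorithms based on fully convolutional neural networks are of great importance to research on this task. Such algorithms obtain high-level semantic information from an image through a pre-trained image classification network and recover the spatial details of the target with upsampling methods such as bilinear interpolation, thereby producing a classification result for the pixels of human body parts. Despite their considerable success, these algorithms still face challenges. On the one hand, interference from cluttered scenes and similar backgrounds makes it difficult to extract a complete and accurate foreground, which leads to inaccurate category discrimination. To address this problem, this dissertation builds deep models according to the inherent hierarchical structure of the human body, so that they focus on the human foreground as much as possible and suppress interference from cluttered scenes and similar backgrounds. On the other hand, because human parts vary in scale, occlusion, deformation, and pose, their appearance features change considerably, and different parts are easily confused with one another. The key to solving this problem is to improve the robustness of the feature representation. Usually, the semantics of a pixel or region are judged with the help of the contextual information of the target it belongs to. Accurately capturing and using this contextual information is therefore crucial for recognizing pixels or regions. This dissertation performs structured modeling by designing suitable fully convolutional network architectures and strategies, so that the features acquire rich context and the accuracy of human parsing improves. The main results and contributions are summarized as follows: 1. To address cluttered scenes in natural images and the low parsing accuracy that results from mainstream methods using a single classifier, a deep model based on a tree-structured hierarchical network is proposed. Drawing on the idea of a binary tree, the network segments the parts of the human body step by step; each step uses a part-specific feature fusion module to generate accurate parsing results and passes them to the next level. This structure makes the parsing process attend more to regions of interest and ignore irrelevant information, which reduces background interference. To reduce errors accumulated during information passing, the algorithm also fuses the original features to correct erroneous information. Experimental results show that the method copes effectively with human parsing in cluttered scenes, improves on the parsing accuracy of single-classifier networks, and achieved good results for its time on multiple datasets. 2. To address the problem that similar or complex backgrounds impair foreground extraction, a blended grammar network is proposed to explore the inherent hierarchical structure of the human body and the relationships between different human parts. In each grammar rule, easily distinguished parts are used to improve the extraction accuracy of parts that are hard to distinguish, which raises the accuracy of the whole foreground. A rule module is used to pass the grammar rule information. To train the rule module effectively, a grammar loss is introduced to supervise its training and thereby strengthen its feature discrimination. Experimental results show that the method copes effectively with human parsing against complex backgrounds and achieves higher accuracy than other methods of the same period on multiple datasets. 3. For human parts of different sizes and shapes, a part-aware context network is proposed that adaptively generates context for each part. The network obtains original features through a feature extractor and then uses graph convolution to learn the high-order semantic associations among human parts, yielding the global context of each part. At the same time, the features of different human parts are pushed as far apart as possible while the features of the same part are kept as compact as possible, yielding the local context of each part. Finally, the original features and the global and local contexts are fused to obtain part-adaptive context. Experimental results show that the method effectively alleviates the misclassification of large objects and the missed segmentation of small objects, and achieved the best results of its period on multiple datasets. 4. To address the semantic gap that arises when fusing high-level and low-level features, a network for fusing high- and low-level features is proposed that effectively narrows this gap. The network introduces more semantic information into low-level features and more spatial detail into high-level features, which makes their fusion more effective. It also generates multi-scale features by fusing features from different levels, which enlarges the receptive field. Experimental results show that the method effectively narrows the semantic gap between features at different levels, obtains multi-scale context, and achieves accuracy far exceeding other human parsing methods on multiple general human parsing databases, attaining the best single-model performance of its period.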
English abstract	The rapid development of the Internet and multimedia, together with improvements in information infrastructure, has led to explosive growth in digital images. How to use these images to serve human production and daily life has become an important research topic. Human parsing is a basic task in data-intelligence applications and has broad prospects in fields such as virtual fitting, human pose estimation, person re-identification, and action recognition. Human parsing is the pixel-level classification of the human body in images, and it aims at the most fine-grained semantic expression of the human body. Human parsing algorithms based on fully convolutional networks are significant for this task. These algorithms obtain the high-level semantic information of an image from a pre-trained image classification network and use upsampling methods such as bilinear interpolation to recover the spatial details of the target, yielding a classification result for the pixels of human body parts. Although these algorithms have achieved good results, they still face some challenges. First, because of interference from complex scenes and backgrounds similar to the human targets, it is difficult for these algorithms to extract a complete and accurate foreground, which leads to inaccurate semantic discrimination of human parts. To address this problem, this dissertation models the inherent structure of the human body with deep models that focus on the foreground information of the human body and suppress interference from complex scenes and similar backgrounds. Second, because of variation in size, occlusion, deformation, and posture, the appearance of human parts varies greatly, and different human parts may be confused with one another. The key to solving this problem is to improve the robustness of the feature representation.
Usually, the semantic discrimination of pixels or regions in an image depends on the contextual information of the target. It is therefore very important to capture context information accurately for the recognition of pixels or regions. To this end, this dissertation designs suitable fully convolutional networks and strategies for structural modeling, obtaining features with rich context and thereby improving the performance of human parsing. The main contributions of this dissertation are summarized as follows: 1. This dissertation proposes a tree hierarchical network to suppress the interference of clutter in natural scenes and to improve on the accuracy of a single classifier. The network employs the idea of a binary tree and partitions the human parts step by step. In each step, the network uses part-aware fusion to generate accurate parsing results and passes them to the next step. The network automatically focuses on the areas of interest and ignores irrelevant information during parsing, thereby reducing background interference. To reduce accumulated errors, the network corrects them by merging in the original features. Experimental results show that the proposed approach can effectively parse the human body in cluttered scenes and improve on the parsing results of a single classifier, achieving competitive results for its time on several public object parsing datasets. 2. This dissertation designs a blended grammar network to address the problem of extracting the whole foreground from similar or complex backgrounds. The network exploits the inherent hierarchical structure of the human body and the relationships between different human parts by designing grammar rules over human parts. In each grammar rule, conspicuous parts, which are easily distinguished from the background, amend the segmentation of inconspicuous ones, improving foreground extraction.
This dissertation also designs a rule module to pass the messages generated by the grammar rules. To train the rule modules effectively, a blended grammar loss is introduced to supervise their training. Extensive experiments on several human parsing datasets demonstrate that the method achieves the best single-model performance of its period. 3. This dissertation proposes a part-aware context network to address the problem of generating adaptive contextual features for human parts of various sizes and shapes. The proposed network uses a feature extractor to obtain the original features of whole human bodies. A graph convolutional network then mines the associated semantics of human parts to obtain part-aware global context. Meanwhile, the features of different human parts are kept as far apart as possible and the features of the same part are kept as compact as possible, which generates part-aware local context. Finally, the original features, the part-aware global context, and the part-aware local context are fused to obtain adaptive part-aware context. The experimental results demonstrate that the proposed method improves object recognition and detail restoration simultaneously, and it achieves the best contemporaneous parsing results on several public human parsing datasets. 4. This dissertation proposes a high- and low-level feature fusion network to address the semantic-spatial gap between low-level and high-level features. The proposed network introduces semantic information into low-level features and high-resolution details into high-level features, achieving more effective fusion. The network also expands the receptive field and generates multi-scale context by fusing features from different levels.
The experimental results demonstrate that this method shrinks the gap between features at different levels and that its accuracy is far higher than that of other human parsing methods on several human parsing datasets. It achieves the best contemporaneous single-model performance.
Keywords	human parsing; structured modeling; multi-scale context; fully convolutional neural network
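Language	Chinese
Seven major directions: sub-direction classification	Image and video processing and analysis
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/44890
Collection	Zidong Taichu Large Model Research Center (紫东太初大模型研究中心)_Image and Video Analysis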
Recommended citation (GB/T 7714)	张小梅. 基于结构化建模的人体解析研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2021.

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis-zxm-电子签字版.pdf（9315KB）	Thesis		Open access	CC BY-NC-SA