Visual Structure Analysis of Face and Human Body
刘智威
2020-08-20
Pages: 126
Degree type: Doctoral
Chinese Abstract

Humans are a central element in images, videos, and other multimedia data. Visual understanding and analysis of the humans in an image is usually essential for parsing the content of the whole scene, and has broad application prospects in virtual reality, augmented reality, human-computer interaction, visual surveillance, and related fields. Within the human structure, the face and the body typically carry the most valuable information for human-centric visual understanding, such as identity and expression information in the face, and action and behavior information in the limbs. They are therefore usually the key targets when analyzing human structure.
Visual structure analysis of the face and human body studies how to use computer vision techniques to understand and analyze the structural form of faces and bodies in images. Taking the face and the human pose in images as the analysis targets, this thesis investigates the problem in depth. Specifically, the thesis first studies facial landmark detection, a task that focuses on facial structure. Facing the challenging requirements of light weight, high accuracy, and high stability, it analyzes the drawbacks and bottlenecks of existing methods and proposes corresponding solutions. The thesis then studies how to select and crop the facial sub-regions that are most discriminative for identity, and validates the effectiveness of the proposed method on the face recognition task. Finally, the analysis target shifts from the face, which expresses local information about a person, to the human pose, which expresses global information, and targeted solutions are proposed for the challenges that complex, highly variable scenes and human appearances pose to human pose estimation. The main work and contributions of this thesis are:
To address the insufficient accuracy of existing lightweight landmark detection networks, which prevents their deployment on low-power embedded devices, we improve existing algorithms in terms of both loss design and architecture design. First, we propose a loss function based on nonrigid curve fitting, which treats the fitting of weakly semantic points as local curve fitting, thereby resolving the invalid training errors caused by the weak semantics of facial contour points. Second, based on the idea that decoupling different nuisance factors and handling them in cascade improves efficiency, we propose a two-stage landmark detection framework built on a fast normalization method, which effectively improves the efficiency of the algorithm.

Facial landmark detection algorithms based on heatmap regression with fully convolutional networks achieve high accuracy but still lack temporal stability. Analyzing this phenomenon, we find that many of the dense landmarks distributed along edges lie in weakly textured regions, so their positions are inherently uncertain, and manual annotation of such landmarks inevitably suffers from annotation randomness. As a result, training introduces a large amount of invalid error that is not guided by texture information. To overcome this problem, for every landmark in the facial structure, this work introduces for each sample in the dataset a ground-truth position that is free of random error and semantically consistent across samples, and treats it as a latent variable solved jointly with the network parameters during training. Using the resulting semantically consistent positions as the new regression targets overcomes the drawback of directly regressing manual annotations that contain random errors, concentrates the fitting capacity of the network where it is really needed, and ultimately improves the accuracy of the landmark detection network.

Traditional ensemble algorithms for face recognition usually apply a fixed cropping scheme, guided by structural information such as landmarks, to all face images in order to obtain a combination of facial sub-regions; this combination has to be found by greedy search on a specific dataset, and the features of the individual sub-regions are fused offline. Such a framework suffers from excessive algorithmic complexity and limited generalization. To design a better algorithm for effective visual analysis of the face and thus improve face recognition, this work studies the adaptive selection and cropping of the facial sub-regions that are most discriminative for identity, and proposes an end-to-end multi-model ensemble learning framework for face recognition. The method adaptively selects a different combination of facial sub-regions for identification according to the characteristics of each face sample. Compared with manually selected, fixed sub-region combinations, the members of the adaptive combination are more discriminative and complementary to each other; together with end-to-end feature fusion, this effectively improves face recognition performance.

In human pose estimation for unconstrained scenes, irregularly varying clothing, complex backgrounds, and highly flexible poses make the sample distribution extremely complicated. Mainstream heatmap regression methods are optimized to establish, for each training sample, a mapping between human appearance and the corresponding joint coordinates, so an uneven data distribution leads to insufficient generalization. To strengthen the robustness of the network for pose estimation with limited training data, this work proposes a human pose estimation method based on sample relation mining. The method introduces discriminative learning into the regression network and, by mining sample relations, optimizes the discriminability of high-level features with respect to human pose. The proposed local sample relation model effectively improves the performance and generalization ability of the pose estimation algorithm.

In summary, this thesis conducts an in-depth study of visual structure analysis of the face and human body, centered on faces and human poses in images. It proposes several innovations and improvements over existing algorithms from both theoretical and practical perspectives, ultimately improving the accuracy and robustness of the algorithms on the related tasks.

English Abstract

Humans are often a central element in images and videos. Understanding their posture, the social cues they communicate, and their interactions with the world is critical for holistic scene understanding and the development of intelligent technology. Moreover, within the human structure, the face and the body usually contain the most valuable information for visual understanding of humans: facial appearance carries identity and expression information, while human posture carries action information. Both are therefore regarded as important targets of visual human understanding.


Visual structure analysis of the face and human body is a technology that enables computers to understand the structure of both the face and the human body. It is an important research area in computer vision and pattern recognition, and has been widely applied in virtual reality, human-computer interaction, and intelligent surveillance. In this thesis, we focus on the major issues of this topic and take the face and human posture as our analysis targets. For the visual analysis of the face, we first study the facial landmark detection task, which focuses on the topological structure of the face. Aiming at high speed, high accuracy, and high stability in a facial landmark detection system, we propose several ways to overcome the problems of existing methods. We then analyse the high-level semantic information in face images and propose a method to find discriminative and complementary face patches for the face recognition task. For the visual analysis of human posture, we propose a method that deals with the challenges posed by complex human appearances and environments in human pose estimation. The main contributions are as follows.

Facial landmark detection is a key component of numerous face analysis tasks. Most existing methods rely on a heavy network to handle the complicated pose, illumination, and expression variations in unconstrained environments, so they cannot achieve real-time speed on low-cost handheld devices such as mobile phones. In order to design a CNN-based framework with satisfactory performance and high efficiency, we improve the existing facial landmark detection framework in two aspects. First, inspired by the ICP algorithm used in surface registration, we propose a novel nonrigid contour fitting loss to reduce the meaningless loss during training. Second, we decouple the complex sample variations in the face alignment task and propose a Fast Normalization Module (FNM) to efficiently normalize the considerable variations that can be described by a geometric transformation. FNM and the nonrigid contour fitting loss can be integrated into a common two-stage facial landmark detection framework, which improves the overall performance with almost no extra computational cost.
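
The contour-fitting idea can be illustrated with a minimal PyTorch sketch: for weakly defined contour landmarks, the error is measured against the annotated contour polyline rather than against a single annotated point, so displacement along the contour is not penalized. The function names and the polyline approximation below are illustrative assumptions, not the exact loss proposed in the thesis.

```python
import torch

def point_to_segment_dist(p, a, b, eps=1e-8):
    """Distance from points p to segments (a, b); all tensors have shape (K, 2)."""
    ab = b - a
    t = ((p - a) * ab).sum(-1) / (ab * ab).sum(-1).clamp_min(eps)
    t = t.clamp(0.0, 1.0)                                   # clamp the projection onto the segment
    proj = a + t.unsqueeze(-1) * ab
    return (p - proj).norm(dim=-1)

def contour_fitting_loss(pred, gt):
    """
    pred, gt: (N, 2) ordered contour landmarks of one face.
    Each predicted point is matched to the closest point on the annotated contour
    polyline, so sliding along the contour (the weakly semantic direction)
    contributes no training error.
    """
    a, b = gt[:-1], gt[1:]                                  # consecutive polyline segments
    n, m = pred.size(0), a.size(0)
    d = point_to_segment_dist(
        pred.unsqueeze(1).expand(-1, m, -1).reshape(-1, 2), # every prediction ...
        a.repeat(n, 1), b.repeat(n, 1)                      # ... against every segment
    ).view(n, m)
    return d.min(dim=1).values.mean()                       # nearest-segment distance per point
```

In practice such a term would only replace the point-to-point loss for contour points; landmarks with clear semantics (eye corners, mouth corners) would keep the standard regression loss.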

Recently, deep learning based facial landmark detection has achieved great success, especially with heatmap regression based methods. Despite this, most of these methods cannot stably detect landmarks in videos. We observe that landmarks evenly distributed along the face contour do not have clear and accurate definitions, so it is inevitable that annotators introduce random noise during annotation. These inconsistent and imprecise annotations can mislead CNN training and degrade performance. To overcome this problem, we propose a semantic alignment method that introduces a 'real' ground truth as a latent variable and simultaneously optimizes this latent variable and the CNN weights in an end-to-end way. Experiments show that our semantic alignment significantly improves the performance of the facial landmark detection network.
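
A schematic way to realize the latent 'real' ground truth is to attach a learnable correction to each sample's annotation and update it together with the network weights. The sketch below is a simplified illustration, assuming a plain L2 regularizer that keeps the corrected target near the human annotation; the constraint used in the thesis may differ.

```python
import torch
import torch.nn as nn

class LatentGroundTruth(nn.Module):
    """Per-sample learnable correction added to the raw landmark annotation."""
    def __init__(self, num_samples, num_landmarks):
        super().__init__()
        self.offsets = nn.Parameter(torch.zeros(num_samples, num_landmarks, 2))

    def forward(self, sample_idx, raw_annotation):
        return raw_annotation + self.offsets[sample_idx]

def training_step(model, latent_gt, images, raw_annotations, sample_idx,
                  optimizer, reg_weight=0.1):
    """One joint update of the network weights and the latent ground truth."""
    pred = model(images)                                    # (B, L, 2) predicted landmarks
    target = latent_gt(sample_idx, raw_annotations)         # corrected regression target
    fit_loss = (pred - target).pow(2).sum(-1).mean()        # fit the latent target
    reg_loss = latent_gt.offsets[sample_idx].pow(2).sum(-1).mean()  # stay near annotation
    loss = fit_loss + reg_weight * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer is assumed to hold both model.parameters() and latent_gt.parameters(), so the regression targets and the CNN weights are solved jointly.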

The facial structure information provided by facial landmarks can serve as prior information for high-level facial semantic analysis tasks such as face recognition. The traditional face recognition framework trains multiple CNNs separately on many face patches selected by a fixed strategy. Although this selection strategy keeps the selected facial regions semantically consistent, the same structural region in different face images can still have different levels of discriminative ability because of factors such as illumination, pose, and occlusion. The traditional framework may therefore lack generalization capability in cross-database applications. In addition, the offline feature aggregation used in the existing framework is suboptimal. To overcome these problems and improve the performance of face recognition, we propose a novel end-to-end CNN ensemble architecture that automatically learns complementary and discriminative patches for face recognition. Extensive experiments on the LFW and YTF datasets show that our framework outperforms the traditional face recognition framework.
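
The end-to-end ensemble can be sketched as a set of patch branches whose features are fused with sample-dependent attention weights, so the most discriminative patches dominate the final representation. The backbone factory, attention head, and classifier below are placeholders under that assumption, not the architecture evaluated in the thesis.

```python
import torch
import torch.nn as nn

class AdaptivePatchEnsemble(nn.Module):
    def __init__(self, make_backbone, num_patches, feat_dim, num_ids):
        super().__init__()
        # one embedding branch per facial patch
        self.branches = nn.ModuleList([make_backbone() for _ in range(num_patches)])
        # sample-adaptive weights over the patches
        self.attn = nn.Sequential(nn.Linear(num_patches * feat_dim, num_patches),
                                  nn.Softmax(dim=-1))
        self.classifier = nn.Linear(feat_dim, num_ids)       # identity logits for training

    def forward(self, patches):
        # patches: (B, P, C, H, W) crops taken around facial landmarks
        feats = torch.stack([branch(patches[:, i])
                             for i, branch in enumerate(self.branches)], dim=1)  # (B, P, D)
        weights = self.attn(feats.flatten(1))                # (B, P)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)   # weighted feature fusion
        return self.classifier(fused), fused                 # logits and face embedding
```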

Human pose estimation is a challenging problem. Popular heatmap regression based methods tend to establish a mapping between the body appearance and the joint locations of each human sample when training the network. However, the training set cannot cover all the complex human appearances and environments, so the network tends to memorize pose-irrelevant appearance cues that are not discriminative for pose estimation. To improve the robustness of the CNN with limited training data, we explore improving pose estimation by learning pose-discriminative features. Specifically, we assume that if two persons share a similar skeleton, their top-level features should also be close. Based on this assumption, we investigate possible ways to build discriminative learning in the feature space and finally propose the Local Sample Relation Module (L-SRM) to optimize the joint-wise feature distribution. Extensive experiments on commonly used challenging benchmarks support this assumption, and our L-SRM significantly improves pose estimation performance and outperforms other state-of-the-art methods.
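
The sample-relation idea can be illustrated with a toy loss that, within a mini-batch, pulls together the top-level features of samples whose normalized skeletons are similar. The cosine similarity and the fixed threshold are illustrative assumptions; the L-SRM proposed in the thesis operates joint-wise and is more elaborate.

```python
import torch
import torch.nn.functional as F

def sample_relation_loss(features, poses, sim_threshold=0.9):
    """
    features: (B, D) top-level features of the pose network.
    poses:    (B, J, 2) normalized joint coordinates (e.g. centered, unit scale).
    Pose-similar pairs are encouraged to be close in feature space.
    """
    pose_vec = F.normalize(poses.flatten(1), dim=1)
    pose_sim = pose_vec @ pose_vec.t()            # pairwise skeleton similarity
    feat = F.normalize(features, dim=1)
    feat_sim = feat @ feat.t()                    # pairwise feature similarity
    mask = (pose_sim > sim_threshold).float()
    mask.fill_diagonal_(0)                        # ignore self-pairs
    if mask.sum() == 0:
        return features.sum() * 0.0               # no pose-similar pairs in this batch
    return ((1.0 - feat_sim) * mask).sum() / mask.sum()
```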

Keywords: Facial Landmark Detection, Human Pose Estimation, Face Recognition
Language: Chinese
Sub-direction classification: Object Detection, Tracking and Recognition
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/40390
Collection: 紫东太初大模型研究中心_图像与视频分析
Recommended citation (GB/T 7714):
刘智威. 人脸与人体结构化视觉分析[D]. 中科院自动化所. 中国科学院大学, 2020.
Files in this item:
Thesis.pdf (6223 KB) · Doctoral thesis · Open Access · CC BY-NC-SA