Shape and Pose Estimation of Faces and Human Bodies Based on Structural Information Utilization
Zhang Hongwen (张鸿文)
2021-05
Pages: 158
Degree type: Doctoral
Chinese Abstract

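With the rise of artificial intelligence, application scenarios for human-centric shape and pose estimation algorithms keep emerging, such as smart homes, holographic communication, and driver assistance. These emerging applications place new demands on the accuracy and robustness of shape and pose estimation algorithms. In these human-centric perception and understanding applications, faces and human bodies have become objects of particular interest. However, faces and human bodies are highly deformable, flexible three-dimensional objects, and factors encountered in practice, such as pose variation and occlusion, severely degrade algorithm performance, so estimating the shape and pose of faces and human bodies in real-world scenes remains a challenging problem. In addition, faces and human bodies are highly structured deformable objects, and effectively exploiting their intrinsic structural information is of great significance for improving the robustness, accuracy, and interpretability of algorithms. To improve shape and pose estimation of faces and human bodies in real-world scenes, this thesis studies two aspects, the representation of shape and pose and the utilization of structural information, and proposes several shape and pose estimation algorithms based on structural model constraints, adversarial structural prior learning, structural feature learning, and alignment feedback.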

The main research results of this thesis are summarized as follows:

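1. Facial landmark localization based on structural model constraints.
For landmark localization in face images under complex conditions such as severe occlusion, this thesis proposes a hybrid data- and model-driven facial landmark localization method that aims to fully exploit both the representational power of a data-driven deep network and the inference ability of a model-driven point distribution model. The deep network extracts the texture information in the face image, while the point distribution model stores the shape structure information. To make the two complement each other, this thesis proposes a weighted constrained mean-shift algorithm that iteratively refines the landmark positions. Experimental results show that the proposed method copes well with variations in face images caused by expression, pose, and occlusion, and greatly improves the robustness of landmark localization.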

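2. 3D facial landmark localization based on an adversarial semantic structure prior.
For 3D facial landmark localization in the wild, this thesis proposes a semantic voxel representation for the shape of 3D landmarks. Compared with conventional approaches, this voxel representation effectively reduces the dimensionality of the representation while preserving the semantic information of the landmarks, and thus effectively assists the 3D facial landmark localization task. On this basis, this thesis proposes a joint voxel and coordinate regression framework for unified 2D and 3D facial landmark localization, whose end-to-end training makes the localization results more accurate. In addition, this thesis proposes an auxiliary regression adversarial learning strategy that transfers the facial geometric structure in 3D-annotated datasets to 2D-annotated datasets of real-world scenes, further improving 3D facial landmark shape estimation in real-world scenes.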

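3. 3D human model reconstruction based on structural feature learning on dense body parts.
To address the challenge posed by the highly nonlinear mapping in human body reconstruction and the positional deviation caused by rotation-based pose representations, this thesis proposes a 3D human model reconstruction method based on dense part information aggregation. The proposed method uses dense part correspondence maps as the intermediate representation of the network, and the network design accounts for both the global and the fine-grained perception required by shape and pose estimation. To better exploit the structural prior knowledge of the body parts, this thesis designs a graph convolution module that follows the kinematic chains of the human body to aggregate part information, effectively improving the reconstruction accuracy of the positions and rotational poses of the body parts. Experimental results show that the proposed method copes effectively with complex conditions in real-world human images, such as occlusion and changes in illumination and background.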

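4. 3D human model reconstruction based on structural feature alignment feedback.
In human model reconstruction, even a small parameter deviation can cause a noticeable misalignment between the reprojection of the predicted mesh model and the image. To further address this problem, this thesis proposes a deep regression network based on mesh-aligned feature feedback, which enables the reconstruction network to explicitly correct the human model parameters according to the current alignment between the predicted mesh and the image. The core of the proposed network is its ability to extract mesh-aligned features from high-resolution features as feedback information in a closed loop, so that deviated body part positions can be effectively corrected. In addition, pixel-level auxiliary supervision is applied to the high-resolution features to enhance the relevance and reliability of the spatial structural features. Experimental results show that the proposed method significantly improves the alignment between the reconstructed human model and the image.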

English Abstract

Artificial intelligence techniques have revolutionized the applications of human-centric shape and pose estimation. As these techniques evolve, applications aimed at everyday life are emerging, including intelligent home systems, holographic telepresence, and driver-assistance systems. These emerging applications place new demands on the accuracy and robustness of shape and pose estimation algorithms. In these human-centric applications, the perception and understanding of faces and human bodies have received particular attention from both academia and industry. Faces and human bodies are highly flexible three-dimensional objects, which makes their shape and pose estimation challenging in complex scenes with extreme postures and occlusions. At the same time, faces and human bodies are deformable objects with structural constraints, and leveraging this internal structural information is key to enhancing the robustness, accuracy, and interpretability of algorithms. This thesis explores new representations of shapes and poses and introduces several strategies for utilizing structural information to improve human-centric shape and pose estimation in real-world scenarios. It shows how structural models, adversarial learning of structural priors, structural feature learning, and alignment feedback help algorithms perform more robustly and accurately in challenging cases.

The main contributions of this thesis are summarized as follows.

(1) Facial landmark detection with structural model constraints.
An effective and robust approach is proposed for facial landmark detection in challenging conditions by combining data- and model-driven methods. In the proposed method, a deep neural network holistically captures the appearance information in a data-driven manner, while a pre-trained point distribution model explicitly utilizes the structural constraint in a model-driven manner. Moreover, a weighted version of regularized landmark mean-shift is proposed to selectively balance the influence of the partial likelihood and the global prior. In this way, the proposed method combines the global robustness of the data-driven method, the outlier correction capability of the model-driven method, and the non-parametric optimization of regularized landmark mean-shift. The proposed method produces satisfactory detection results on face images with exaggerated expressions, large head poses, and partial occlusions.
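
As a rough illustration of the mean-shift idea mentioned above, the following minimal sketch performs a plain confidence-weighted mean-shift update for a single 2D landmark. It is not the thesis's weighted regularized landmark mean-shift: the point distribution model and its prior are omitted, and the function names, the Gaussian kernel, the `bandwidth` parameter, and the example weights are all assumptions made for illustration.

```python
import math

def weighted_mean_shift_step(current, candidates, weights, bandwidth=1.0):
    """One mean-shift update for a single 2D landmark.

    candidates: candidate (x, y) positions, e.g. pixels of a local response map.
    weights:    non-negative confidence of each candidate (hypothetical values).
    """
    num_x = num_y = denom = 0.0
    for (cx, cy), w in zip(candidates, weights):
        dist2 = (cx - current[0]) ** 2 + (cy - current[1]) ** 2
        # Gaussian kernel scaled by the candidate's confidence.
        k = w * math.exp(-0.5 * dist2 / bandwidth ** 2)
        num_x += k * cx
        num_y += k * cy
        denom += k
    if denom == 0.0:
        return current  # no support at all: leave the landmark where it is
    return (num_x / denom, num_y / denom)

def refine_landmark(start, candidates, weights, bandwidth=1.0, iterations=10):
    """Repeat the update a fixed number of times (no convergence check)."""
    position = start
    for _ in range(iterations):
        position = weighted_mean_shift_step(position, candidates, weights, bandwidth)
    return position
```

In this sketch a candidate with a higher weight pulls the landmark more strongly, which is the sense in which per-landmark confidences can down-weight unreliable (for example, occluded) evidence.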

(2) 3D facial landmark localization with adversarial learning of semantic structure prior.
An adversarial voxel and coordinate regression framework is proposed for 3D face shape estimation in the wild. In the proposed framework, a semantic volumetric representation is introduced to encode the semantic information of landmarks. Compared with the conventional volumetric representation, the proposed representation is more compact while still preserving the semantic information of landmarks. Based on the semantic volumetric representation, a joint voxel and coordinate regression pipeline is proposed to combine the merits of heatmap-regression-based and coordinate-regression-based methods and to unify the 2D and 3D landmark localization tasks in the same framework. Moreover, an auxiliary regression adversarial learning strategy is proposed to distill the 3D geometric structures learned from synthetic datasets to in-the-wild datasets, enabling the method to produce plausible 3D face shape results for both synthetic and in-the-wild images.
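
One common way to connect a voxel (heatmap-like) output to coordinate regression is a soft-argmax, which turns a score volume into a differentiable expected coordinate. The sketch below shows that generic technique only; whether the thesis uses this exact operator is an assumption, and the function name and the `volume[z][y][x]` layout are hypothetical.

```python
import math

def soft_argmax_3d(volume):
    """Return the expected (x, y, z) index under a softmax over all voxels.

    volume[z][y][x] holds the raw voxel scores for one landmark
    (layout chosen for this illustration only).
    """
    peak = max(v for plane in volume for row in plane for v in row)
    total = ex = ey = ez = 0.0
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, score in enumerate(row):
                p = math.exp(score - peak)  # subtract the peak for numerical stability
                total += p
                ex += p * x
                ey += p * y
                ez += p * z
    return (ex / total, ey / total, ez / total)
```

A sharply peaked volume yields a coordinate close to the peak voxel, while a flat volume yields the volume centre, so the output degrades smoothly as the voxel scores become less certain.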

(3) 3D human model reconstruction with structural feature learning on dense body parts.
A Decompose-and-aggregate Network (DaNet) is introduced to reconstruct 3D human models in real-world scenarios. DaNet adopts dense correspondence maps as intermediate representations to facilitate the learning of the 2D-to-3D mapping. The prediction modules of DaNet are decomposed to enable global and fine-grained perception for the shape and pose predictions, respectively. Moreover, the structural position constraint of the skeleton is utilized to make the prediction of the rotation-based joint poses more robust, where the kinematic chains of the human body are exploited to aggregate messages from different body parts. The proposed approach achieves state-of-the-art performance and can produce reasonable results even in cases with extreme postures, heavy occlusions, and incomplete human bodies.
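
To make the idea of aggregating messages along the kinematic chain more concrete, here is a minimal sketch of one neighbourhood-averaging step over a skeleton tree. It is not DaNet's aggregation module: a real graph convolution would also apply learned weights and a nonlinearity, and the function name, the `parents` encoding, and the toy features are assumptions.

```python
def aggregate_along_chain(features, parents):
    """Average each joint's feature vector with those of its parent and children.

    features: list of per-joint feature vectors (lists of floats).
    parents:  parents[j] is the parent joint index, or -1 for the root.
    """
    neighbours = [[j] for j in range(len(features))]  # start with self-loops
    for child, parent in enumerate(parents):
        if parent >= 0:
            neighbours[child].append(parent)
            neighbours[parent].append(child)
    aggregated = []
    for group in neighbours:
        dim = len(features[group[0]])
        aggregated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dim)]
        )
    return aggregated
```

For a three-joint chain with `parents = [-1, 0, 1]`, the middle joint mixes information from both ends, while each end joint mixes only with the middle one; stacking several such steps would let information travel further along the chain.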

(4) 3D human model reconstruction with structural feature alignment feedback.
Minor deviations in the estimated parameters can lead to noticeable misalignment between the resulting meshes and the image evidence. To address this issue, a Mesh Alignment Feature Feedback (MAFF) loop is introduced for regression-based human mesh recovery. At the core of MAFF, the model parameter deviation is corrected explicitly in a feedback loop based on the mesh-aligned features extracted from spatial feature maps. Moreover, an auxiliary dense correspondence task is imposed on the spatial feature maps, providing guidance to enhance the relevance and reliability of the mesh-aligned features. The efficacy of MAFF is validated on both indoor and in-the-wild datasets, where the proposed approach significantly improves the mesh-image alignment over previous approaches.
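
Extracting features at the image locations where mesh vertices project is typically done with bilinear interpolation of a spatial feature map. The sketch below illustrates that sampling step on a single-channel map; it is an assumption about how such features could be gathered, not the MAFF implementation, and the function names and the border-clamping behaviour are hypothetical choices.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate feature_map[row][col] at continuous (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so points projected off the map reuse border values.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature_map[y0][x0] + ax * feature_map[y0][x1]
    bottom = (1 - ax) * feature_map[y1][x0] + ax * feature_map[y1][x1]
    return (1 - ay) * top + ay * bottom

def mesh_aligned_features(feature_map, projected_vertices):
    """Sample one feature value per projected mesh vertex (x, y)."""
    return [bilinear_sample(feature_map, x, y) for x, y in projected_vertices]
```

In a feedback loop of the kind described above, the sampled values would be fed to a regressor that outputs a parameter correction, the mesh would be re-projected with the corrected parameters, and the sampling would be repeated.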

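Keywords: shape and pose estimation; facial landmark localization; human model reconstruction; structural information utilization
Language: Chinese
Seven major directions, sub-direction classification: Image and video processing and analysis
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/44864
Collection: Pattern Recognition Laboratory
Corresponding author: Zhang Hongwen (张鸿文)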
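Recommended citation
GB/T 7714
Zhang Hongwen. Shape and Pose Estimation of Faces and Human Bodies Based on Structural Information Utilization [D]. Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences, 2021.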
Files in this item
File name/size: 张鸿文-基于结构信息利用的人脸及人体形状 (44386KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA