Semantic Disentanglement and Facial Attribute Editing Methods (语义解耦和人脸属性编辑方法)
刘云帆
2023-05-19
Pages: 124
Subtype: Doctoral

Abstract

As one of the classic research directions in computer vision, facial attribute editing aims to modify specific attributes of a face image while generating high-quality results. Facial attribute editing has long played a crucial role in industries such as image processing and film production, and it also assists the practical deployment of face recognition systems. With the popularity of social media and the rise of concepts such as digital humans and the metaverse, facial attribute editing has entered the public eye, playing an indispensable role in applications closely tied to daily life, such as digital content creation and information security, and attracting widespread attention from both industry and academia.

In recent years, with the rapid development of Generative Adversarial Networks (GANs), facial attribute editing methods have steadily improved in every respect, and the quality of generated images has reached a level at which the naked eye can hardly distinguish real from fake. However, results produced by existing methods often exhibit changes in irrelevant image semantics, such as non-target attributes, identity features, and spatial information. Such entangled changes severely reduce the precision of the edit and the usability of the result. It is therefore of great theoretical and practical value to develop facial attribute editing methods that achieve disentangled semantic changes. To resolve the entanglement between the target attribute and, respectively, the other attributes, the identity features, and the spatial information of the image, this thesis follows a specific-to-general research route and conducts a comprehensive study of GAN-based facial attribute editing with disentangled semantics.

The main contributions of this thesis are summarized as follows:

1. To address entangled changes between target and non-target attributes, this thesis proposes an attribute-aware face age editing method. The 'age' attribute of a face image comprises comprehensive visual semantics, including texture cues such as wrinkles and spots and color information such as skin and hair color, which makes it a popular editing target. To improve the disentanglement between age and the other attributes, the method introduces non-target attributes (such as gender and race) as conditional inputs to the training of both the generator and the discriminator, enabling the model to learn the semantic information of source images more comprehensively. The method also extends the adversarial objective so that it evaluates both the authenticity of the output image and its consistency with the given attributes, constraining unwanted attribute changes without an additional attribute estimation network. Since facial aging is local and sparse in the spatial domain, most content of the editing result should be inherited directly from the source image. The generator therefore models only the change to the image, and a spatial attention mechanism imposes adaptive constraints on pixel changes, reducing unnecessary modifications and improving the overall quality of the output. In addition, the method applies wavelet packet decomposition to extract texture features related to aging, such as wrinkles, nasolabial folds, and beards, which improves the discriminator's ability to judge the age of its input. Experiments on multiple face age datasets show that the method improves the accuracy of age editing while keeping attribute and identity information stable.
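The residual-plus-attention formulation above can be sketched in a few lines: the generator predicts only a change map, and a per-pixel attention mask decides how much of that change is applied. The function below is an illustrative sketch with plain Python lists standing in for image tensors; the names `source`, `delta`, and `mask` are assumptions, not the thesis's actual implementation.

```python
def attention_edit(source, delta, mask):
    """Blend a predicted change map into the source image.

    Per pixel: output = source + mask * delta, so regions where the
    attention mask is 0 are copied unchanged from the source, which
    enforces the locality and sparsity of age-related changes.
    """
    return [
        [s + m * d for s, m, d in zip(s_row, m_row, d_row)]
        for s_row, m_row, d_row in zip(source, mask, delta)
    ]


# A pixel with mask 0 keeps its source value; a pixel with mask 1
# receives the full predicted change.
source = [[0.25, 0.75]]
delta = [[0.5, -0.25]]
mask = [[0.0, 1.0]]
print(attention_edit(source, delta, mask))  # [[0.25, 0.5]]
```

Modeling only the change, rather than regenerating every pixel, lets the attention mask act as a learned prior that most of the face should pass through untouched.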

2. To address entangled changes between the target attribute and identity features, this thesis proposes a pose and expression editing method based on facial semantic decomposition with a 3D Morphable Model (3DMM). Introducing exemplar images as driving information improves the controllability of pose and expression editing, but it also increases the risk of entangled identity changes. To solve this problem, the method uses a 3DMM to decompose image semantics, including pose, expression, lighting, and texture, into disentangled parametric representations; the parameters of the editing result are then estimated by recombining the parameters of the source and exemplar images. Based on these 3DMM parameters, a neural rendering network reconstructs both the source image and the editing result, and the pixel correspondence between the two reconstructions describes the motion pattern of the pose and expression change. Under this formulation, pose and expression editing can be approximated by warping the source image, which preserves its texture and color information and improves the visual quality of the output. Experimental results verify the effectiveness of the method in keeping identity features stable while achieving precise attribute editing.
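The parameter recombination step can be illustrated schematically. Assuming, hypothetically, that the 3DMM fit of each image is stored as a dict of coefficient groups, the editing result keeps identity-related coefficients from the source and takes pose, expression, and lighting from the exemplar; the key names below are illustrative, not the thesis's actual representation.

```python
def recombine_3dmm_params(source, exemplar):
    """Estimate the 3DMM parameters of the editing result.

    Identity-related coefficients (shape identity and texture) come from
    the source image, so the subject's identity is preserved; pose,
    expression, and lighting come from the exemplar, which drives the edit.
    """
    return {
        "identity": source["identity"],
        "texture": source["texture"],
        "pose": exemplar["pose"],
        "expression": exemplar["expression"],
        "lighting": exemplar["lighting"],
    }
```

Because the 3DMM factors these semantics into separate coefficient groups, swapping groups wholesale is what makes the edit disentangled by construction: identity cannot drift, since its coefficients are never touched.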

3. To address entangled changes between attributes and the spatial information of the image, this thesis proposes a general-purpose facial attribute editing method built on a pre-trained style-based generator and implemented via latent-code interpolation. Each semantic object in the generated image is controlled by a corresponding feature map in the generator, and each feature map is associated with one element of the style code; to improve the disentanglement between facial attributes and local texture, the method therefore manipulates latent codes in a novel latent space termed the 'style space'. Guided by a theoretical analysis of this space, the method trains a linear support vector machine under sparsity constraints for attribute classification and uses the normal vector of the classification hyperplane as the interpolation direction for the latent code. Although traversing the style space alone yields spatially disentangled changes, it inevitably degrades the visual fidelity of the output. The method therefore adaptively fuses the translation vectors from the input space and the style space into the final displacement vector, with the weighting coefficients obtained by solving an intuitive multi-objective optimization problem. Experimental results show that the method combines the advantages of the two latent spaces, producing editing results that are both realistic and accurate.
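The final fusion step reduces to a convex combination of the two per-space translation vectors. The sketch below is a hedged illustration: `alpha` stands in for the weighting produced by the multi-objective optimization, which is not reproduced here, and the vector names are assumptions.

```python
def fused_displacement(delta_input, delta_style, alpha):
    """Weighted fusion of input-space and style-space translation vectors.

    alpha = 1.0 uses only the input-space direction (higher visual
    fidelity); alpha = 0.0 uses only the style-space direction (better
    spatial disentanglement); intermediate values trade the two
    objectives off against each other.
    """
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(delta_input, delta_style)]


# Halfway between the two directions.
print(fused_displacement([1.0, 0.0], [0.0, 1.0], 0.5))  # [0.5, 0.5]
```

Framing the weighting as a multi-objective problem makes the fidelity/disentanglement trade-off explicit rather than leaving it to a hand-tuned constant.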

Keywords: Generative Adversarial Networks; Facial Attribute Editing; Face Age Editing; Facial Pose and Expression Editing
Language: Chinese
Sub-direction Classification: Image and Video Processing and Analysis
State Key Laboratory Planning Direction: Visual Information Processing
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/52067
Collection: Graduates, Doctoral Dissertations (毕业生_博士学位论文)
Corresponding Author: 刘云帆
Recommended Citation (GB/T 7714):
刘云帆. 语义解耦和人脸属性编辑方法[D], 2023.
Files in This Item:
201918020629008刘云帆.p (17522 KB), DocType: Dissertation, Access: Restricted, License: CC BY-NC-SA
