Face Image Editing for Multi-Granularity Semantic Targets
Author: Deng Qiyao (邓琪瑶)
Year: 2022
Pages: 128
Degree Type: Doctoral (PhD)
Chinese Abstract

Face images represent the biological characteristics of human beings and have long been a research focus in computer vision and image processing. With the popularization of mobile smart devices and the spread of social media, face image editing has become a popular research direction in both academia and industry. In recent years, the emergence and rapid development of deep learning have injected new momentum into face image editing and opened up new forms of application in fields such as virtual digital humans, social entertainment, and film production. Face image editing is characterized by complex semantic targets and diverse representation forms, so editing models need to be designed and constructed according to the form of the semantics. Studying face image editing from the perspective of multiple semantic targets therefore has important theoretical significance and application value. This thesis studies face image editing for semantic targets at multiple granularities, taking Generative Adversarial Networks as the research foundation and analyzing the characteristics of semantic targets at different granularities from coarse to fine. The main contributions and innovations of this thesis are summarized as follows:

1. For face attribute editing based on category labels, this thesis proposes a controllable high-resolution face attribute editing method. High-resolution images contain more fine-grained texture and content information than low-resolution images, which places higher demands on model size and training stability. To edit high-resolution face images, this thesis proposes decomposing face images in the wavelet domain and introduces a wavelet-aware loss to reconstruct the original image, using the different coefficients of the wavelet decomposition to capture and preserve the global topology and texture information of high-resolution images. Compared with a reconstruction loss in the image domain, the wavelet-aware loss better restores the fine-grained textures and high-frequency details of the original image. To alleviate the problem of uncontrollable non-target attributes, this thesis argues that imposing the same attribute category constraint on all attributes (target and non-target) is an important cause of changes in non-target attribute regions, and accordingly proposes a weighting strategy that weights the classification losses of target and non-target attributes. Specifically, a weighted binary cross-entropy loss is introduced into the training objective, strengthening the category constraint on target attributes while reducing the attention paid to non-target attributes. Experiments show that, without increasing the model size, this method achieves accurate face attribute editing on high-resolution (512×512) images and is robust in scenarios such as multi-attribute editing, partial occlusion, and continuous attribute changes.
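
As a rough illustration of this kind of wavelet-domain supervision, the sketch below implements a single-level Haar decomposition and an L1 loss over its subbands in PyTorch. The decomposition level, subband weights, and function names are illustrative assumptions, not the exact formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level 2D Haar wavelet transform of a batch of images.

    x: tensor of shape (N, C, H, W) with even H and W.
    Returns the low-frequency subband LL (global structure) and the
    high-frequency subbands LH, HL, HH (edges and fine texture)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def wavelet_aware_loss(reconstructed, source, hf_weight=1.0):
    """L1 distance between the wavelet subbands of the reconstruction and
    the source image; hf_weight scales the high-frequency terms."""
    loss = 0.0
    for w, (r_band, s_band) in zip(
        (1.0, hf_weight, hf_weight, hf_weight),
        zip(haar_dwt(reconstructed), haar_dwt(source)),
    ):
        loss = loss + w * F.l1_loss(r_band, s_band)
    return loss
```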

2. For face component editing based on local regions, this thesis proposes a reference-guided face component editing method. Compared with other face attributes, the editing of face components (e.g., eyes, nose, mouth) focuses more on shape changes. Because the label information they rely on is limited, face attribute editing methods based on category labels find it difficult to flexibly manipulate the shapes of face components. To give face components a larger space of shape variation, this thesis proposes the idea of "replacing editing with synthesis". Specifically, this approach takes a face with the target component region removed as input and edits the shape of the face component by synthesizing the content of the missing region. Since the shapes of face components are difficult to describe accurately with semantic labels, this thesis proposes using a reference image to provide the target component shape, which better matches practical application needs and offers diverse choices of shape styles. To guide the network in learning the target information of the reference image, an example-guided attention module is introduced into the network, which fuses the target face component features of the reference image into the original image. To supervise the proposed model, a contextual loss is adopted to constrain the shape similarity between the generated image and the reference image, while a style loss and a perceptual loss are employed to maintain the appearance and texture consistency between the generated image and the original image. The idea of "replacing editing with synthesis" steps outside the conventional framework of face image editing. Experiments show that the method achieves diverse, high-quality, and controllable face component editing given a reference face, without depending on precise intermediate representations.
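
To make the fusion step concrete, the sketch below shows one common way such a cross-attention block can be written in PyTorch, letting each source location attend to reference-component features. The layer sizes and the class name ExampleGuidedAttention are assumptions for illustration, not the exact module from the thesis.

```python
import torch
import torch.nn as nn

class ExampleGuidedAttention(nn.Module):
    """Cross-attention that injects reference-component features into the
    source feature map (an illustrative sketch, not the thesis architecture).
    Assumes src_feat and ref_feat have the same shape (N, C, H, W)."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned fusion strength

    def forward(self, src_feat, ref_feat):
        n, c, h, w = src_feat.shape
        q = self.query(src_feat).flatten(2).transpose(1, 2)   # (N, HW, C//8)
        k = self.key(ref_feat).flatten(2)                      # (N, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)                    # (N, HW, HW)
        v = self.value(ref_feat).flatten(2).transpose(1, 2)    # (N, HW, C)
        fused = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return src_feat + self.gamma * fused
```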

3. For face portrait editing based on pixel-level semantics, this thesis proposes a face portrait editing method driven by semantic-aware noise. Existing face portrait editing and synthesis methods struggle to achieve semantic controllability and style diversity at the same time. This thesis proposes combining semantic information with random noise to design a semantic-aware noise. Because it is sampled independently per semantic region, semantic-aware noise can control each region individually, and resampling it generates diverse styles, achieving face portrait synthesis that is both semantically controllable and stylistically diverse. To further extend semantic-aware noise to the manipulation of real faces, this thesis proposes a ternary network structure that handles both face semantic synthesis and real face editing within a single framework. The ternary network structure consists of a generation network, a reconstruction network, and a manipulation network. The generation network generates high-quality face images from semantic-aware noise, the reconstruction network preserves the content of the input image that does not need to be edited, and the manipulation network performs semantic manipulation by replacing the semantics to be edited in the real face features with semantic-aware noise. Experimental results show that the method can synthesize high-quality and highly diverse face images, accurately edit the pixel-level semantics of faces, and achieves good results across performance evaluations.
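
As a rough sketch of the sampling idea (under assumed shapes and names, not the thesis implementation), the code below draws one latent code per semantic class and broadcasts it over that class's region of the segmentation map, so resampling a single class's code changes only that region's style.

```python
import torch
import torch.nn.functional as F

def semantic_aware_noise(seg, num_classes, code_dim, codes=None, resample=()):
    """Build a spatial code map where every pixel of a semantic class shares
    one randomly sampled latent code.

    seg: (N, H, W) integer (long) segmentation map with values in [0, num_classes).
    codes: optional (N, num_classes, code_dim) codes to reuse; classes listed
    in `resample` get a fresh code, which changes only those regions."""
    n, h, w = seg.shape
    if codes is None:
        codes = torch.randn(n, num_classes, code_dim)
    codes = codes.clone()
    for cls in resample:
        codes[:, cls] = torch.randn(n, code_dim)
    one_hot = F.one_hot(seg, num_classes).float()            # (N, H, W, K)
    noise_map = torch.einsum("nhwk,nkd->ndhw", one_hot, codes)
    return noise_map, codes
```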

English Abstract

Face images represent the biological characteristics of human beings and have long been an important research topic in computer vision and image processing. With the popularization of mobile smart devices and the spread of social media, face image editing has become a popular research direction in both academia and industry. In recent years, the rapid development of deep learning has injected new impetus into face image editing and has created new forms of application, such as virtual digital humans, social entertainment, and film production. Due to the complexity of semantic targets and the diversity of data representations, face image editing models need to be designed and constructed according to semantic characteristics. Therefore, there is remarkable theoretical significance and application value in studying face image editing from the perspective of various semantic targets. In this thesis, we study face image editing with semantic targets at different levels of granularity. Based on Generative Adversarial Networks, the characteristics of semantic targets at different granularities are analyzed from coarse to fine. The major innovations and contributions are summarized as follows:

1. For face attribute editing based on category labels, this thesis proposes a controllable face attribute editing method for high-resolution images. High-resolution images contain more fine-grained texture and content information than low-resolution images, which imposes higher requirements on model size and training stability. To edit high-resolution face images, this thesis proposes decomposing face images in the wavelet domain. A wavelet-aware loss is introduced to reconstruct source images, in which the different coefficients of the wavelet decomposition are used to capture and preserve the global topology and texture information of high-resolution images. Compared with a reconstruction loss in the image domain, the wavelet-aware loss better restores the fine-grained textures and high-frequency details of source images. To alleviate the problem of uncontrollable non-target attributes, this thesis argues that imposing the same attribute category constraint on all attributes (target and non-target) is an important cause of unwanted changes in non-target attribute regions. A weighting strategy is therefore proposed to weight the classification losses of target and non-target attributes. Specifically, a weighted binary cross-entropy loss is introduced into the training objective, which strengthens the category constraint on target attributes and reduces the attention paid to non-target attributes. Experiments show that this method achieves accurate face attribute editing on high-resolution (512×512) images without increasing the model size, and is robust in scenarios such as multi-attribute editing, partial occlusion, and continuous attribute changes.
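
A minimal sketch of such a weighting scheme is given below, assuming multi-label attribute logits and a boolean mask marking which attributes are being edited; the weights and names are illustrative, not the thesis's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def weighted_attribute_bce(logits, labels, target_mask, w_target=1.0, w_nontarget=0.1):
    """Binary cross-entropy over all attributes, re-weighted so that the
    classification constraint focuses on the attributes being edited.

    logits, labels: (N, num_attrs) tensors; target_mask: (N, num_attrs) bool,
    True where the attribute is a target of the current edit."""
    per_attr = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    weights = torch.where(target_mask,
                          torch.full_like(per_attr, w_target),
                          torch.full_like(per_attr, w_nontarget))
    return (weights * per_attr).mean()

# Example: edit only the first attribute of a 5-attribute classifier.
logits = torch.randn(4, 5)
labels = torch.randint(0, 2, (4, 5)).float()
target_mask = torch.zeros(4, 5, dtype=torch.bool)
target_mask[:, 0] = True
loss = weighted_attribute_bce(logits, labels, target_mask)
```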

2. For face component editing based on local regions, this thesis proposes a reference-guided face component editing method. Compared with other face attributes, the editing of face components (e.g., eyes, nose, mouth) focuses more on shape changes. Due to the limited geometric information contained in labels, it is difficult for face attribute editing methods based on category labels to flexibly manipulate the shapes of face components. To achieve controllable and flexible manipulation of component shapes, this thesis proposes the research idea of "replacing editing with synthesis". Specifically, the proposed method takes a source image with the target face component region removed as input, and then synthesizes the content of the missing area. As the shapes of face components are difficult to describe accurately with semantic labels, this thesis proposes using a reference image to provide the target component shape, which better matches practical application needs and offers a variety of choices for shape styles. To guide the network in learning the target information of reference images, an example-guided attention module is designed to incorporate the target face component features of the reference image into the source image. To supervise the proposed model, a contextual loss is adopted to constrain the shape similarity between generated images and reference images, while a style loss and a perceptual loss are employed to maintain the appearance and texture consistency between generated images and source images. The research idea of "replacing editing with synthesis" breaks away from the conventional framework of face image editing. Experiments show that this method achieves diverse, high-quality, and controllable face component editing given a reference face, without relying on precise intermediate representations.
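
As a hedged illustration of the appearance constraints mentioned above (the contextual loss is omitted here), the snippet below shows a standard Gram-matrix style loss and an L1 perceptual loss computed over a list of feature maps, e.g., from a pretrained VGG; the exact layers and weighting used in the thesis may differ.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-wise correlation of a feature map, (N, C, H, W) -> (N, C, C)."""
    n, c, h, w = feat.shape
    f = feat.flatten(2)                                  # (N, C, HW)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(gen_feats, src_feats):
    """Match texture statistics between generated and source feature maps."""
    return sum(F.l1_loss(gram_matrix(g), gram_matrix(s))
               for g, s in zip(gen_feats, src_feats))

def perceptual_loss(gen_feats, src_feats):
    """Match the feature maps themselves to keep the overall appearance close."""
    return sum(F.l1_loss(g, s) for g, s in zip(gen_feats, src_feats))
```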

3. For face portrait editing based on pixel-level semantics, this thesis proposes a face portrait editing method driven by semantic-aware noise. It is difficult for existing face portrait editing and synthesis methods to achieve both semantic controllability and style diversity. To fill this gap, this thesis proposes semantic-aware noise, which combines semantic information with random noise. Semantic-aware noise can not only control each semantic region through independent per-region sampling, but also generate diverse styles through resampling, realizing face portrait synthesis with controllable semantics and diverse styles. To further extend the method to real image manipulation, a ternary network structure is proposed that supports both diverse semantic image synthesis and real image manipulation in a unified framework. The ternary network structure consists of a generation network, a reconstruction network, and a manipulation network. The generation network leverages semantic-aware noise to synthesize photorealistic images, the reconstruction network retains the content of the input image that does not need to be edited, and the manipulation network achieves semantic manipulation by replacing the target semantic regions of the feature representation with semantic-aware noise. Experimental results show that the method can synthesize photorealistic face images with high diversity, edit the pixel-level semantics of faces accurately, and achieves good results across performance evaluations.
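
A minimal sketch of the replacement step is given below, assuming the real-image features, the semantic-aware noise map, and the segmentation map share the same spatial resolution; the function name and interface are illustrative, not the thesis's exact design.

```python
import torch

def replace_semantics(real_feat, noise_map, seg, edit_classes):
    """Keep real-image features outside the edited regions and substitute
    semantic-aware noise inside them.

    real_feat, noise_map: (N, C, H, W); seg: (N, H, W) integer labels."""
    edit_mask = torch.zeros_like(seg, dtype=torch.bool)
    for cls in edit_classes:
        edit_mask |= seg == cls
    edit_mask = edit_mask.unsqueeze(1).to(real_feat.dtype)   # (N, 1, H, W)
    return real_feat * (1.0 - edit_mask) + noise_map * edit_mask
```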

Keywords: face image editing, semantic image synthesis, face attribute editing, Generative Adversarial Networks
Language: Chinese
Document Type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/48696
Collections: Pattern Recognition Laboratory; Doctoral Dissertations of Graduates
Recommended Citation (GB/T 7714):
Deng Qiyao. Face Image Editing for Multi-Granularity Semantic Targets [D]. Institute of Automation, Chinese Academy of Sciences, 2022.
Files in This Item:
File Name / Size: 博士论文-最终版.pdf (72573 KB)
Document Type: Doctoral dissertation
Access: Open Access
License: CC BY-NC-SA