Face images contain various biological characteristics of human faces and have long been an important research topic in computer vision and image processing. With the spread of mobile smart devices and the growth of social media, face image editing has become a popular research direction in both academia and industry. In recent years, the rapid development of deep learning has given new impetus to face image editing and has created new forms of application, such as virtual digital humans, social entertainment, and film production. Because semantic targets are complex and data representations are diverse, face image editing calls for deep learning models designed around the characteristics of each semantic target. Studying face image editing from the perspective of different semantic targets therefore has both theoretical significance and application value. This thesis studies face image editing with semantic targets at different levels of granularity. Building on Generative Adversarial Networks (GANs), it analyzes the characteristics of semantic targets at each granularity, from coarse to fine. The major innovations and contributions are summarized as follows:

1. For face attribute editing based on category labels, this thesis proposes a controllable face attribute editing method for high-resolution images. High-resolution images contain more fine-grained texture and content information than low-resolution images, which places higher demands on model parameters and training stability. To edit high-resolution face images, this thesis proposes decomposing face images in the wavelet domain. A wavelet-aware loss is introduced for reconstructing source images, in which the different coefficients of the wavelet decomposition are used to capture and preserve the global topology and texture information of high-resolution images. Compared with a reconstruction loss in the image domain, the wavelet-aware loss better restores the fine-grained texture and high-frequency details of source images. To address unintended changes to non-target attributes, this thesis identifies imposing the same category constraint on all attributes (target and non-target alike) as an important cause of changes in non-target attribute regions. A weighting strategy is therefore proposed for the classification loss of target and non-target attributes. Specifically, a weighted binary cross-entropy loss is introduced into the training objective, which strengthens the category constraints on target attributes and reduces the emphasis on non-target attributes. Experiments show that the method achieves accurate face attribute editing for high-resolution (512×512) images without enlarging the model, and is robust in scenarios such as multi-attribute editing, partial occlusion, and continuous attribute changes.
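The two losses described in contribution 1 can be illustrated with a minimal sketch in plain Python. It is not the thesis implementation: the one-level Haar transform, the L1 distance, the function names (`haar_dwt2`, `wavelet_aware_loss`, `weighted_bce`), and the default weights are all assumptions chosen for illustration, since the abstract does not specify the wavelet, the distance, or the weight values.

```python
import math


def haar_dwt2(img):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-size sub-bands: the low-frequency approximation
    followed by three high-frequency detail bands.
    """
    bands = ([], [], [], [])
    for i in range(0, len(img), 2):
        rows = ([], [], [], [])
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2.0)  # scaled local sum (coarse structure)
            rows[1].append((a - b + c - d) / 2.0)  # difference between columns
            rows[2].append((a + b - c - d) / 2.0)  # difference between rows
            rows[3].append((a - b - c + d) / 2.0)  # diagonal difference
        for band, row in zip(bands, rows):
            band.append(row)
    return bands


def mean_abs_diff(x, y):
    """Mean absolute difference between two equally sized 2D lists."""
    total = sum(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))
    return total / (len(x) * len(x[0]))


def wavelet_aware_loss(source, reconstructed, low_weight=1.0, high_weight=1.0):
    """L1 distance between wavelet sub-bands (weights are assumed, not from the thesis)."""
    src = haar_dwt2(source)
    rec = haar_dwt2(reconstructed)
    loss = low_weight * mean_abs_diff(src[0], rec[0])  # global structure term
    for s, r in zip(src[1:], rec[1:]):
        loss += high_weight * mean_abs_diff(s, r)      # texture / detail terms
    return loss


def weighted_bce(probs, labels, is_target, target_weight=1.0, other_weight=0.1, eps=1e-7):
    """Binary cross-entropy where target attributes count more than non-target ones."""
    total = 0.0
    for p, y, t in zip(probs, labels, is_target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += (target_weight if t else other_weight) * bce
    return total / len(probs)
```

With `other_weight` below `target_weight`, a misclassified target attribute contributes more to the loss than an equally misclassified non-target attribute, which is the effect the weighting strategy is described as aiming for.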

2. For face component editing based on local regions, this thesis proposes a reference-guided face component editing method. Compared with other face attributes, editing face components (eyes, nose, mouth) is more concerned with shape changes. Because category labels carry limited geometric information, label-based face attribute editing methods struggle to manipulate the shape of face components flexibly. To achieve controllable and flexible manipulation of component shape, this thesis proposes the idea of "replacing editing with synthesis". Specifically, the proposed method takes as input the source image with the target face components corrupted, and then synthesizes the content within the missing area. Since semantic labels cannot accurately describe the shape of human faces, this thesis proposes using a reference image to provide the target face shape, which better matches the needs of practical applications and also offers a variety of shape styles to choose from. To guide the network to learn the target information from reference images, an example-guided attention module is designed to incorporate the target face component features of reference images into source images. To supervise the proposed model, a contextual loss constrains the shape similarity between generated images and reference images, while a style loss and a perceptual loss maintain the consistency of appearance and texture between generated images and source images. The idea of "replacing editing with synthesis" breaks the limitations of conventional face image editing frameworks. Experiments show that the method achieves diverse, high-quality, and controllable face component editing based on given reference faces, and removes the dependence on precise intermediate representations.
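As a rough illustration of contribution 2, the sketch below shows the two ingredients named in the text: corrupting the target component region of the source image, and letting source features attend to reference features. The abstract does not describe the internals of the example-guided attention module, so the scaled dot-product attention with a residual sum used here is a generic stand-in, and the names `mask_component` and `reference_attention` are hypothetical.

```python
import math


def mask_component(image, mask):
    """Zero out pixels where mask is 1, giving the corrupted source input."""
    return [[0 if m else px for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(source_feats, reference_feats):
    """Generic cross-attention (an assumed stand-in, not the thesis module).

    Each source feature vector attends over all reference feature vectors
    with scaled dot-product scores, then adds the attended reference
    feature to itself.
    """
    dim = len(source_feats[0])
    fused = []
    for q in source_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in reference_feats]
        weights = softmax(scores)
        attended = [sum(w * k[d] for w, k in zip(weights, reference_feats))
                    for d in range(dim)]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

In a full model the fused features would feed a generator that fills in the masked region; that part, along with the contextual, style, and perceptual losses, is outside the scope of this sketch.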

3. For face portrait editing based on pixel-level semantics, this thesis proposes a face portrait editing method driven by semantic-aware noise. Existing face portrait editing and synthesis methods struggle to achieve both semantic controllability and style diversity. To fill this gap, this thesis proposes semantic-aware noise, which combines semantic information with the noise input. Semantic-aware noise can control each semantic region through individual sampling and can generate diverse styles through resampling, enabling face portrait synthesis with controllable semantics and diverse styles. To extend the method to real image manipulation, a novel ternary network structure is proposed that supports diverse semantic image synthesis and real image manipulation in a unified framework. The ternary network structure consists of a generation network, a reconstruction network, and a manipulation network. The generation network uses semantic-aware noise to synthesize photorealistic images; the reconstruction network retains the content of input images that does not need to be edited; and the manipulation network performs semantic manipulation by replacing the target semantic regions of the feature representations with semantic-aware noise. Experimental results show that the method synthesizes photorealistic face images with high diversity, edits face semantics with high accuracy, and achieves good results across various performance evaluations.
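The idea of sampling noise per semantic region, and resampling one region to change only its style, can be sketched as follows. This is a simplified illustration under assumptions: one Gaussian code per region label, copied to every pixel carrying that label. The function names and the code dimension are hypothetical, and the thesis may construct its semantic-aware noise differently.

```python
import random


def sample_region_codes(labels, dim, rng):
    """Draw one independent Gaussian code per semantic label."""
    return {lab: [rng.gauss(0.0, 1.0) for _ in range(dim)] for lab in sorted(labels)}


def semantic_aware_noise(seg_map, codes):
    """Build a per-pixel noise map: each pixel gets the code of its region."""
    return [[list(codes[lab]) for lab in row] for row in seg_map]


def resample_region(codes, label, rng):
    """Return new codes in which only the given region's code is redrawn."""
    new_codes = dict(codes)
    new_codes[label] = [rng.gauss(0.0, 1.0) for _ in range(len(codes[label]))]
    return new_codes


# Toy 2x2 segmentation map with three regions (labels 0, 1, 2).
rng = random.Random(0)
seg_map = [[0, 0], [1, 2]]
codes = sample_region_codes({0, 1, 2}, dim=4, rng=rng)
noise = semantic_aware_noise(seg_map, codes)

# Redraw the code for region 1 only; regions 0 and 2 keep their noise.
edited = semantic_aware_noise(seg_map, resample_region(codes, 1, rng))
```

Because each region has its own code, resampling one label leaves the noise of every other region unchanged, which mirrors the per-region control and style diversity described above.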

Keywords: face image editing; semantic image synthesis; face attribute editing; generative adversarial networks
GB/T 7714
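Deng Qiyao (邓琪瑶). Face Image Editing for Multi-Granularity Semantic Targets [D]. Institute of Automation, Chinese Academy of Sciences. Institute of Automation, Chinese Academy of Sciences, 2022.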
File name / size: 博士论文-最终版.pdf (72573 KB)
Document type: Dissertation
Access type: Open access
License: CC BY-NC-SA