基于多层次解耦的人像内容生成研究 (Research on Portrait Content Generation Based on Multi-Level Decoupling)
马天翔
2024-05-11
Pages: 119
Subtype: Doctoral
Abstract

Portrait content generation refers to digital content generation technology covering both face and human-body images. It has broad application prospects in fields such as social entertainment, film and television production, virtual reality, and privacy and security. With the progress of deep learning, and deep generative models in particular, portrait content generation has advanced rapidly, yet many difficulties and challenges remain. First, because building comprehensive portrait datasets is expensive, the supervision data available for some portrait generation tasks is limited. Second, owing to the differences in type and structure between face images and human-body images, a single generative model generalizes poorly across face image generation and human image generation tasks. Third, in high-quality portrait generation it is difficult to simultaneously achieve fine-grained semantic control. This dissertation therefore studies several concrete portrait content generation tasks, namely human pose transfer, facial expression translation, face and human image translation, and 3D face generation, and addresses the three technical difficulties of limited supervision data, limited model generality, and limited control precision through multi-level decoupling of generative models. The main contributions are as follows:

1. For human pose transfer with limited supervision data, this dissertation proposes a human image pose transfer method based on pose-appearance decoupling at the spatial level. The method fully decouples the appearance information in a human image and prevents any pose information from leaking to the output, so that the network can be trained in a self-supervised manner under limited supervision (no paired human images). First, the human image is spatially decomposed into independent limb components to achieve a coarse decoupling. Then, an appearance encoder learns multi-scale latent features for each body component, while a separate pose encoder learns latent pose features from a pre-extracted pose map. Finally, a generator fuses the features from the two encoders to reconstruct the human image. In this process, a multi-level feature statistics transfer network is proposed: it decouples the appearance information of each body component from the multi-scale latent features of the appearance encoder and blocks any pose information from leaking out of that encoder, which guarantees that the model can learn pose transfer through self-supervised reconstruction without paired human images. The method outperforms methods of the same type and is even competitive with fully supervised approaches, and test results on in-the-wild datasets demonstrate its good generalization.
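To make the statistics-transfer idea concrete, below is a minimal PyTorch-style sketch assuming an AdaIN-style formulation: only the channel-wise mean and standard deviation of each body component's appearance features cross into the pose branch. The module name, shapes, and masking scheme are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch, assuming an AdaIN-style formulation: only channel-wise
# statistics (mean/std) of each body component's appearance features are
# transferred to the pose branch. Shapes and names are illustrative.
import torch
import torch.nn as nn

def masked_stats(feat, mask, eps: float = 1e-5):
    """Channel-wise mean/std of feat over spatial positions where mask is 1."""
    area = mask.sum(dim=(2, 3), keepdim=True) + eps
    mean = (feat * mask).sum(dim=(2, 3), keepdim=True) / area
    var = (((feat - mean) * mask) ** 2).sum(dim=(2, 3), keepdim=True) / area
    return mean, (var + eps).sqrt()

class StatisticsTransfer(nn.Module):
    """Re-style pose features with one component's appearance statistics."""
    def forward(self, pose_feat, appearance_feat, component_mask):
        full = torch.ones_like(pose_feat[:, :1])       # whole-image "mask"
        a_mean, a_std = masked_stats(appearance_feat, component_mask)
        p_mean, p_std = masked_stats(pose_feat, full)
        normalized = (pose_feat - p_mean) / p_std      # strip pose-branch style
        return normalized * a_std + a_mean             # inject appearance style

# Toy usage at one feature level; a limb mask selects one body component.
pose_feat = torch.randn(1, 256, 32, 32)    # from the pose encoder
app_feat = torch.randn(1, 256, 32, 32)     # from the appearance encoder
limb_mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
out = StatisticsTransfer()(pose_feat, app_feat, limb_mask)
print(out.shape)  # torch.Size([1, 256, 32, 32])
```

Because only per-channel 1-D statistics survive the transfer, spatially arranged information (i.e., pose) cannot leak out of the appearance branch, which is what permits self-supervised reconstruction training.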

2. For facial expression translation with limited supervision data, this dissertation proposes a facial expression translation method based on geometry-texture decoupling at the physical level. The method decouples the 3D geometry from a face image and injects expression information into it to build a new facial expression representation, so that a high-performance expression translation model can be trained under limited supervision (no paired face images). First, a pre-trained face capture model is introduced to extract expression features and expression-independent features; the input face's expression features are then replaced with those of the reference face, and the 3D geometry of the input face carrying the reference expression is rendered. This serves as a novel facial expression representation for guiding expression generation. Second, a multi-level feature alignment Transformer network is designed to extract detail features from the reference face image, align them with the 3D geometry, and feed them into the generator network to help synthesize fine-grained expressions. In addition, a pre-trained de-expression model is proposed to construct pseudo-paired data, which assists the training of the expression translation model when paired face images are unavailable. Extensive qualitative and quantitative comparisons, together with tests on video data, demonstrate the superiority of the method.
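As an illustration of the expression-swap step, the sketch below uses a stand-in for the pre-trained face capture model and exchanges only the expression code between two faces. DummyFaceCapture, the coefficient names, and the dimensions are hypothetical; a real capture model would regress 3DMM-style coefficients from images.

```python
# Illustrative sketch of the expression swap; not the thesis code.
import torch
import torch.nn as nn

class DummyFaceCapture(nn.Module):
    """Stand-in for a pre-trained face capture model (e.g. a 3DMM regressor).
    A real model regresses these codes from the image; here they are random."""
    def forward(self, img: torch.Tensor) -> dict:
        b = img.shape[0]
        return {
            "identity": torch.randn(b, 80),    # expression-independent
            "expression": torch.randn(b, 64),  # the code we swap
            "pose": torch.randn(b, 6),         # expression-independent
        }

def swap_expression(capture, input_img, reference_img):
    inp, ref = capture(input_img), capture(reference_img)
    swapped = dict(inp)                        # keep identity/pose of the input
    swapped["expression"] = ref["expression"]  # inject the reference expression
    return swapped

capture = DummyFaceCapture()
input_img = torch.randn(1, 3, 256, 256)
reference_img = torch.randn(1, 3, 256, 256)
coeffs = swap_expression(capture, input_img, reference_img)
print(coeffs["expression"].shape)  # torch.Size([1, 64])
# In the described pipeline, these coefficients would be decoded to a mesh
# and rendered into a 3D geometry map that guides the expression generator.
```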

3. For face and human image translation, where single models suffer from limited generality, this dissertation proposes a universal image translation framework based on cross-domain information decoupling and fusion at the feature level. The framework maps information from different image domains into a unified feature space, merges it into a single whole, and then learns to decouple and fuse this holistic feature, achieving universal and robust feature learning and image translation. First, the input image and the condition image are encoded into a unified feature space, accommodating different image types such as faces or human bodies together with their condition images. The two encoded features are then concatenated into one whole, and a cross-domain feature fusion Transformer network is proposed to learn it; in this process the network simultaneously learns feature correspondences both within and across image domains, decoupling the useful information and fusing it organically. Within this network, a Hiformer structure is further proposed to recursively decompose the holistic feature into multiple local features, learn the interactions within each local feature, and then fuse the local features into the target feature for image translation. Finally, a spatially adaptive generator takes the condition image and this target feature as input to produce the final translation result. Extensive experiments on five image datasets of different types, including face and human-body datasets, verify the high performance and strong generality of the method.
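The recursive decomposition attributed to the Hiformer structure can be illustrated with a small PyTorch sketch: a holistic token sequence is repeatedly split into local groups, self-attention runs inside each group, and the groups are merged back before the next, finer split. Group sizes, depth, and dimensions below are invented for the example and do not reflect the thesis architecture.

```python
# Illustrative sketch of recursive local decomposition; not the thesis model.
import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, N, C)
        h, _ = self.attn(x, x, x)
        return self.norm(x + h)

class HiformerSketch(nn.Module):
    """Attend within ever-smaller local groups, merging after each level."""
    def __init__(self, dim: int, group_sizes=(64, 16)):
        super().__init__()
        self.group_sizes = group_sizes
        self.blocks = nn.ModuleList(LocalBlock(dim) for _ in group_sizes)

    def forward(self, x):                      # x: (B, N, C); N divisible by groups
        b, n, c = x.shape
        for g, block in zip(self.group_sizes, self.blocks):
            x = x.reshape(b * (n // g), g, c)  # split into local groups
            x = block(x)                       # interactions inside each group
            x = x.reshape(b, n, c)             # merge the groups back
        return x

# Toy usage: 256 tokens standing in for concatenated input- and
# condition-image features in the unified feature space.
tokens = torch.randn(2, 256, 128)
print(HiformerSketch(dim=128)(tokens).shape)  # torch.Size([2, 256, 128])
```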

4. For 3D face generation with limited control precision, this dissertation proposes a 3D face generation method based on neural radiance field decoupling at the model level. The method explicitly decouples the neural radiance field network that originally models the whole face into multiple sub-networks, each of which independently learns to model one semantic region of the 3D face, enabling independent and fine-grained semantic generation and control of 3D faces. First, a compositional neural radiance field is proposed, consisting of multiple local semantic 3D generators; each generator learns a sub-radiance field and outputs the 3D feature values, 3D color values, 3D mask values, and a residual SDF representation of one 3D facial semantic region. The complete 2D face image and 2D features are then rendered via 3D-mask-based semantic fusion and volume aggregation. Next, a high-resolution 2D generator is proposed to super-resolve the rendered 2D features into a high-definition face image. In addition, two discriminators, a global-level discriminator and a semantic-level discriminator, are proposed to constrain the 3D consistency of the generated faces and the accuracy of the local semantic decoupling, respectively. Ample experiments on widely used face datasets and a self-built cartoon face dataset verify the effectiveness and application potential of the method.
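The 3D-mask-based semantic fusion and volume aggregation can be sketched as follows, assuming a standard NeRF-style alpha-compositing formulation: each sub-network predicts per-point color, density, and a 3D mask logit for one facial region; points are fused by a softmax over the mask logits, and each ray is then volume-rendered. The two-region split and all network sizes are invented, and the residual SDF output and the 2D super-resolution stage are omitted.

```python
# Illustrative sketch of mask-based fusion + volume rendering; not the thesis model.
import torch
import torch.nn as nn

class RegionField(nn.Module):
    """Sub-NeRF for one semantic region: point -> (color, density, mask logit)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 5))  # 3 color + 1 density + 1 mask

    def forward(self, pts):                             # pts: (R, S, 3)
        out = self.mlp(pts)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3]), out[..., 4]

def render(fields, pts, deltas):
    """Fuse per-region predictions by 3D masks, then volume-render each ray."""
    colors, densities, logits = zip(*(f(pts) for f in fields))
    w = torch.softmax(torch.stack(logits, -1), dim=-1)          # (R, S, K)
    color = (torch.stack(colors, -2) * w.unsqueeze(-1)).sum(-2) # fused color
    density = (torch.stack(densities, -1) * w).sum(-1)          # fused density
    alpha = 1 - torch.exp(-density * deltas)                    # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1 - alpha[:, :-1]], 1), 1) # transmittance
    weights = alpha * trans
    return (weights.unsqueeze(-1) * color).sum(1)               # (R, 3) pixels

fields = [RegionField(), RegionField()]   # e.g. "eyes" and "rest of face"
pts = torch.randn(1024, 32, 3)            # 1024 rays, 32 samples per ray
deltas = torch.full((1024, 32), 0.05)     # sample spacing along each ray
print(render(fields, pts, deltas).shape)  # torch.Size([1024, 3])
```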

Keyword: Face image generation; Human image generation; Generative adversarial network; Multi-level decoupling
Language: Chinese
IS Representative Paper
Sub direction classification: Image and Video Processing and Analysis
Paper associated data
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56642
Collection: 毕业生_博士学位论文
Recommended Citation (GB/T 7714):
马天翔. 基于多层次解耦的人像内容生成研究[D], 2024.
Files in This Item:
File Name/Size: 博士学位论文最终版(签字)_马天翔.pd (50535 KB)
DocType: Thesis
Access: Restricted
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.