Research on Image-to-Image Translation Methods with Limited Samples (基于受限样本的图像转换方法研究)
曹杰 (Cao Jie)
Date: 2021-05-22
Pages: 162
Degree type: Doctoral
Chinese Abstract

        Image-to-image translation is a classic research direction in computer vision. With the development of deep learning theory and the arrival of the big-data era, image-to-image translation techniques have achieved major breakthroughs, and current methods can synthesize results that are indistinguishable from real images in specific scenarios. Commercial software built on these techniques already offers features such as one-click face swapping and automatic retouching, which have been widely praised. Beyond that, image-to-image translation has also produced notable results in fields such as deepfakes and data augmentation. However, these methods still face many theoretical and practical challenges. For example, current methods rely on massive training data and manual annotation: every specific task requires large-scale collection, curation, and labeling of task-specific data. When the number of training samples or the amount of annotation is limited, models trained by these methods overfit severely, and their performance degrades significantly, failing to meet practical needs. Building on generative models, this thesis studies image-to-image translation under data-limited conditions. To address these challenges, we proceed from the specific to the general: we first study two concrete tasks, face frontalization and cross-spectral face synthesis, and then study the general multi-domain image-to-image translation task. The main contributions of this thesis are as follows:

        1.  We propose a high-fidelity pose normalization model that frontalizes profile faces at arbitrary angles, together with a new form of geometric supervision called the dense facial correspondence field. We design a complete pipeline for extracting the dense correspondence field from monocular images, and show that the facial geometric information it provides is dense and complete, and that the correspondence field of any two-dimensional face image depends only on identity and is independent of pose. Using a differentiable face-warping operation, the model learns the pixel-level correspondence between the dense correspondence field and the two-dimensional face image through an end-to-end neural network. During training, the model optimizes a reconstruction loss on the dense correspondence field, guaranteeing that the synthesized results are pose-invariant. A fusion warping network synthesizes the image background, ensuring the completeness of the frontalized results, and a texture extraction network based on dictionary learning expresses fine-grained facial features, ensuring the visual realism of high-resolution results. Experiments show that, without relying on three-dimensional face data, the method synthesizes clear and realistic frontalization results and effectively recovers identity information from extreme-pose profile images in unconstrained scenarios.

        2.  We propose an adversarial cross-spectral face completion model for cross-spectral face image synthesis. The model decouples the cross-spectral synthesis task into two independent subtasks, texture inpainting and pose correction, handled by two deep neural networks. The texture inpainting network restores detail missing from near-infrared images and maps facial texture from the near-infrared spectrum to the visible spectrum. The pose correction network transforms an input image of arbitrary pose into a frontal view, effectively resolving the attribute misalignment in the training data so that subsequent training can proceed in a supervised manner. The outputs of the two networks are fused end-to-end by a texture warping network, without any form of image post-processing. The model includes two discriminators, a multi-scale discriminator and a fine-grained discriminator, which provide multi-level supervision. The multi-scale discriminator realizes multi-scale supervision through wavelet decomposition, focusing on the high-frequency wavelet coefficients of generated images and significantly improving the visual quality of the results; the fine-grained discriminator supervises the decoupling process at the feature level, guaranteeing that the model learns mutually independent pose and texture representations. The model is trained in a multi-task fashion, combining the pose correction loss, the adversarial losses from the two discriminators, and a pixel-level loss to achieve high-quality cross-spectral face synthesis. Experiments show that even when pose, illumination, expression, and other attributes in the training data are misaligned, the model synthesizes realistic spectrum-translation results and effectively preserves the identity of the input image, thereby improving the accuracy of cross-spectral face recognition.

        3.  We propose an informative sample mining model for the general multi-domain image-to-image translation task. Its core idea is to select samples according to how informative they are and to assign larger weights to information-rich samples in the loss computation. The model performs translation with a conditional generative adversarial network and realizes sample selection through adversarial sample weighting. Theoretically, we prove that in adversarial training the informativeness of a sample is determined by the output of the globally optimal discriminator, and, combining this with importance sampling theory, we derive an analytical relation between the optimal discriminator and the sample weights. In practice, the method uses the dynamically changing discriminator during training as an approximation of the optimal discriminator, and proposes a weight rescaling method based on the perceptual similarity of samples. In addition, the method adopts a multi-hop sample training strategy (see the sketch below) that guides the model to synthesize the target result through multiple translations. This strategy allows the model to decompose the translation task, effectively reducing training difficulty while preserving sample informativeness. The model does not depend on any domain-specific prior knowledge, so it applies to translation tasks for general objects, and it can adaptively adjust sample weights during training, performing well even under sample-limited conditions. Experimental results show that the model preserves the original semantic information of images, synthesizes realistic and convincing translation results, and achieves significant improvements on various performance metrics.
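        The abstract describes the multi-hop strategy only at a high level. The following minimal PyTorch sketch shows the control flow it implies: a conditional generator is applied repeatedly along a path of intermediate domains instead of translating in one step. The DummyGenerator, the domain-embedding scheme, and all names here are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class DummyGenerator(nn.Module):
    """Stand-in for a conditional generator; real models are far deeper."""
    def __init__(self, channels: int = 3, n_domains: int = 4):
        super().__init__()
        self.embed = nn.Embedding(n_domains, channels)   # one code per domain
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, domain: int) -> torch.Tensor:
        code = self.embed(torch.tensor([domain], device=x.device))
        return torch.tanh(self.conv(x + code.view(1, -1, 1, 1)))

def multi_hop_translate(g: nn.Module, x: torch.Tensor, domain_path):
    """Route the translation through intermediate domains, one hop at a time."""
    for d in domain_path:   # each hop is an easier sub-translation of the full task
        x = g(x, d)
    return x

# Hypothetical usage: source -> domain 1 -> domain 3 instead of a direct jump.
y = multi_hop_translate(DummyGenerator(), torch.randn(1, 3, 64, 64), [1, 3])
```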

English Abstract

        Image-to-image translation is a classic research direction in the field of computer vision. With the development of deep learning and the arrival of the big-data era, it has made significant progress: current image-to-image translation methods can synthesize photorealistic results in particular settings. Commercial software based on these methods offers features such as automatic face swapping and selfie retouching, and has been well received. In addition, image-to-image translation has achieved remarkable results in fields such as digital image forgery and data augmentation. However, many challenges remain in both theory and application. For instance, existing methods rely on huge amounts of training data and manual annotation: for a specific task in practice, large-scale data collection, cleaning, and labeling are required. When the amount of data or labels is limited, models trained by these methods suffer severely from overfitting, which degrades performance and makes them unsuitable for practical applications. Building on generative models, we study image-to-image translation with limited data. To deal with the existing challenges, we proceed from specific cases to general ones: we first study two specific translation tasks, i.e., face frontalization and cross-spectral face synthesis, and then study general multi-domain image-to-image translation. The main contributions of this thesis are summarized as follows:

        1. We propose a high-fidelity pose-invariant model that frontalizes profile faces at arbitrary angles, together with a novel dense correspondence field that serves as geometric supervision. We also present a complete pipeline for extracting the correspondence field from monocular images, and we verify that this geometric supervision is dense and complete, and that the correspondence field of a face image is determined solely by identity and is irrelevant to pose. Our model learns the pixel-level correspondence between the dense correspondence field and the face image through an end-to-end neural network with a differentiable warping operation. During training, the model optimizes a reconstruction loss on the correspondence field to ensure that the synthesized results are pose-invariant. A fusion warping network synthesizes the background to ensure the integrity of the results, and a texture extraction network based on dictionary learning expresses fine-grained features, which renders photorealistic high-resolution results. Experimental results show that, without three-dimensional face data, the proposed method can synthesize high-quality frontalization results and recover identity information from profile faces with extreme poses in unconstrained environments.
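        The abstract mentions a differentiable warping operation but gives no implementation details. As a minimal sketch, assuming the network predicts a per-pixel 2-D offset field, such a warp can be realized with PyTorch's grid_sample; the function name and the pixel-offset convention below are assumptions for illustration, not the thesis code.

```python
import torch
import torch.nn.functional as F

def warp_by_correspondence(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Differentiably warp `image` with a dense per-pixel offset field.

    image: (N, C, H, W) faces; flow: (N, 2, H, W) offsets in pixels, e.g.
    predicted by a network. Gradients flow through both arguments.
    """
    n, _, h, w = image.shape
    # Identity sampling grid in the normalized [-1, 1] coords grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=image.device),
        torch.linspace(-1.0, 1.0, w, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)   # (N, H, W, 2)
    # Convert pixel offsets to normalized offsets and displace the grid.
    norm = torch.stack(
        (flow[:, 0] * 2.0 / (w - 1), flow[:, 1] * 2.0 / (h - 1)), dim=-1
    )
    # Bilinear sampling is differentiable w.r.t. both the image and the grid.
    return F.grid_sample(image, base + norm, mode="bilinear", align_corners=True)

# Hypothetical usage: frontalized = warp_by_correspondence(profile, predicted_flow)
```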

        2. We propose an adversarial cross-spectral face completion model for cross-spectral face synthesis. The model decouples the task into two independent subtasks, i.e., texture inpainting and pose correction, each handled by a deep neural network. The texture inpainting network restores the missing details in the near-infrared image and maps the facial texture from the near-infrared spectrum to the visible spectrum. The pose correction network transforms the input face into the frontal view, which addresses the attribute misalignment in the training data and makes the subsequent training process feasible in a supervised manner. An end-to-end texture warping network then fuses the outputs of the two networks without any post-processing. Two discriminators, namely a multi-scale discriminator and a fine-grained discriminator, are designed to provide multi-level supervision. The multi-scale discriminator realizes multi-scale supervision through wavelet decomposition and focuses on the high-frequency wavelet coefficients, which significantly improves the visual quality of the synthesized results; the fine-grained discriminator supervises the decoupling process at the feature level to ensure that the model learns independent pose and texture representations. We train the proposed model in a multi-task manner, combining the pose correction loss, the adversarial losses provided by the two discriminators, and a pixel-level loss to achieve high-quality cross-spectral face synthesis. Experiments show that despite the misalignment of pose, lighting, expression, and other attributes in the training data, the model still synthesizes realistic cross-spectral results while maintaining the identity information of the input image, which boosts the performance of cross-spectral face recognition.
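        The abstract states that the multi-scale discriminator obtains its supervision through wavelet decomposition but does not specify the wavelet family. Below is a minimal sketch of one level of 2-D decomposition, assuming Haar filters; a discriminator can then be applied to the high-frequency subbands lh, hl, and hh at each level.

```python
import torch
import torch.nn.functional as F

def haar_decompose(x: torch.Tensor):
    """One level of 2-D Haar wavelet decomposition for (N, C, H, W) images.

    Returns (ll, lh, hl, hh); the three high-frequency subbands are the
    coefficients a wavelet-based discriminator would emphasize.
    """
    c = x.shape[1]
    k = 0.5 * torch.tensor(
        [[[1.0, 1.0], [1.0, 1.0]],     # LL: local average (low-pass)
         [[1.0, 1.0], [-1.0, -1.0]],   # LH: difference across rows
         [[1.0, -1.0], [1.0, -1.0]],   # HL: difference across columns
         [[1.0, -1.0], [-1.0, 1.0]]],  # HH: diagonal difference
        device=x.device,
    ).unsqueeze(1)                      # -> (4, 1, 2, 2)
    k = k.repeat(c, 1, 1, 1)            # the same 4 filters for every channel
    out = F.conv2d(x, k, stride=2, groups=c)   # (N, 4*C, H/2, W/2)
    ll, lh, hl, hh = (out[:, i::4] for i in range(4))
    return ll, lh, hl, hh

# Hypothetical use in training: score only the detail coefficients, and recurse
# on ll for coarser scales.
# ll, lh, hl, hh = haar_decompose(fake_images)
# d_loss = discriminator(torch.cat([lh, hl, hh], dim=1))
```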

        3. We present an informative sample mining model for general multi-domain image-to-image translation. The core idea is to select samples that carry rich information and assign them larger weights in the loss computation. We build the model on a conditional generative adversarial network and propose adversarial importance weighting to realize the sample selection. In theory, we prove that in adversarial training the informativeness of a sample is determined by the output of the globally optimal discriminator and, applying importance sampling theory, we derive the analytical relation between the optimal discriminator and the sample weights. In practice, we replace the optimal discriminator with the dynamically changing discriminator during training, and we propose a weight rescaling method based on the perceptual similarity of samples. We also propose a multi-hop sample training scheme, which guides the model to synthesize the target result through multiple translations; this scheme enables the model to decompose the translation task, reducing training difficulty while maintaining the informativeness of samples. The model does not rely on any domain-specific prior knowledge, so it is suitable for translating images of general objects; furthermore, it adaptively adjusts the sample weights and performs well even with limited training data. Experimental results show that the model preserves the original semantic information, generates high-quality translation results, and achieves significant improvements on multiple metrics.
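        The abstract reports an analytical relation between the optimal discriminator and the sample weights but does not state the formula. The sketch below is one plausible instantiation, not the thesis's derivation: it uses the standard GAN identity D*(x) = p_data(x) / (p_data(x) + p_g(x)), under which the density ratio p_data / p_g equals exp(logit), to weight per-sample generator losses.

```python
import torch

def adversarial_importance_weights(d_logits: torch.Tensor,
                                   temperature: float = 1.0) -> torch.Tensor:
    """Per-sample loss weights from discriminator logits (hypothetical form).

    For the globally optimal discriminator D*(x) = p_data / (p_data + p_g),
    the density ratio p_data / p_g equals D / (1 - D) = exp(logit), so samples
    the discriminator cannot confidently reject receive larger weights.
    """
    with torch.no_grad():                            # weights carry no gradient
        ratio = torch.exp(d_logits / temperature)    # D / (1 - D) from the logit
        return ratio / ratio.sum()                   # normalize to sum to one

# Hypothetical generator step (all names illustrative):
# d_logits = discriminator(fake_images).view(-1)      # (N,) raw logits
# w = adversarial_importance_weights(d_logits)
# g_loss = (w * per_sample_generator_loss).sum()
```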

Keywords: image-to-image translation; image synthesis; generative adversarial networks; face frontalization; heterogeneous face image generation
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44328
Collection: Pattern Recognition Laboratory (模式识别实验室)
Recommended citation (GB/T 7714):
曹杰. 基于受限样本的图像转换方法研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2021.
Files in this item:
博士学位论文.pdf (12921 KB); Document type: Doctoral thesis; Access: Open Access; License: CC BY-NC-SA