Heterogeneous Image Representation and Synthesis Driven Synergistically by Knowledge and Data

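	Heterogeneous Image Representation and Synthesis Driven Synergistically by Knowledge and Data (知识与数据协同驱动的异质图像表示与合成)
	Luo Mandi (骆曼迪)
	2022-05-20
Pages	158
Degree type	Doctoral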
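Chinese abstract	The representation and synthesis of heterogeneous images has been one of the active research directions in computer vision in recent years. Thanks to the rapid development of deep learning and generative models, researchers have made major progress on image representation and synthesis in a single environment, yet heterogeneous image representation and synthesis in changing environments still faces many severe challenges. For example, the relevant datasets are small and class-imbalanced, so models are prone to overfitting, spurious correlations, and poor generalization; large gaps between data domains lead to large appearance differences between images, making it hard for neural networks to select effective information; and some specific tasks lack paired data, which rules out supervised methods. In addition, ensuring high fidelity, controllability, and diversity in heterogeneous image synthesis is equally challenging. Human-centric heterogeneous images, which include biometric data such as faces and human bodies, are typical representatives of heterogeneous image data. Effectively managing, analyzing, and integrating human-centric data makes it possible to serve people better and meet a variety of social needs. At present, most human-centric heterogeneous image representation and synthesis methods rely on data-driven approaches or on simple prior knowledge, and struggle to extract enough effective information to meet the above challenges. This thesis therefore proposes heterogeneous image representation and synthesis methods driven jointly by knowledge and data, and studies specific human-centric heterogeneous image representation and synthesis tasks by making simultaneous use of four elements: knowledge, data, algorithms, and computing power. The main contributions are as follows: 1. Two information-bottleneck-based heterogeneous image representation models are proposed, namely a saliency search model and a cross-modality consistency model. The first model defines a pixel-level saliency measure for face images and performs saliency selection by assigning each pixel a weight between 0 and 1. Furthermore, an automatic feature search algorithm is proposed that adaptively adjusts the selection parameters according to the results of model validation. The entire selection process is constrained by a global information bottleneck network; by balancing the information bottleneck loss, redundant information is compressed as much as possible without affecting identity information. This model achieves adaptive, weighted, and automated extraction of effective features and removal of redundant information in heterogeneous face image representation, and significantly improves the performance of the downstream heterogeneous face recognition task. The second model aligns human body image features across modalities under the constraint of a cross-modality information bottleneck network, achieving information complementarity between modalities and information selection within each modality. In addition, a modality contrastive loss is introduced to further strengthen the learning of cross-modality consistent information. The model thus extracts human body image features effectively and removes redundant information, significantly improving the performance of the downstream heterogeneous person re-identification task. 2. Two structural-prior-based heterogeneous image synthesis algorithms are proposed, namely a face image augmentation generative adversarial model and a depth-aware human interaction editing model. The first algorithm targets the self-occlusion problem in synthesizing face images under different deformations. It proposes a geometry-preserving module that uses a graph neural network to learn the spatial and semantic relations among different face regions, yielding a normalized face parsing map and fully capturing the geometric structure of the face. Furthermore, using the face structure information as a prior, a hierarchical disentangled representation learning scheme disentangles identity information from deformation-related attribute information. At test time, given an arbitrarily deformed face and the corresponding attribute labels, the model performs identity-preserving, controllable synthesis of faces under different deformations. The second algorithm targets inter-instance occlusion in synthesizing human images under different poses and defines the relative depth relations between instances in 3D space. The instances' X-Y coordinates, the target human pose, and the relative depth relations between instances jointly describe the interaction between a person and other instances. An unsupervised imitative contrastive learning strategy is proposed that learns relative depth relations from a single image by adding artificial occlusions, and human pose synthesis that is aware of spatial ordering is achieved under the guidance of this 3D structural relation. 3. A memory-modulated heterogeneous image translation algorithm is proposed, namely a memory-module-modulated Transformer model. Considering the information missing from the input image, heterogeneous image translation is defined as a "one-to-many" rather than a "one-to-one" generation problem. To address the low controllability and lack of diversity in heterogeneous face image translation, exemplar images are introduced to guide the stylization process, and a memory module is proposed to learn the prototypical style patterns of the exemplar domain, increasing the diversity and controllability of the translated styles. To address the large perceptual and pose differences in heterogeneous face image translation, a stylized Transformer module is proposed that splits content and style information into patches and uses the Transformer structure to explore long-range dependencies between patches, learning the style information of the exemplar domain from both global and local perspectives. At test time, style information can be learned either from an exemplar image or directly from the updated prototypical style module. Experiments show that the model achieves clear, controllable, diverse, and high-fidelity bidirectional face image translation on multiple heterogeneous face image translation tasks, including near-infrared to visible, thermal infrared to visible, sketch to photo, and grayscale to color. The generated results can also be used to further improve the performance of heterogeneous face recognition.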
English abstract	Heterogeneous image representation and synthesis has been one of the active research directions in computer vision for years. Benefiting from the rapid development of deep learning techniques and generative models, researchers have made significant progress on image representation and synthesis tasks in constrained environments. However, heterogeneous images in unconstrained environments still pose serious challenges. For example, the datasets are usually small in scale with unbalanced categories, leading to over-fitting, spurious correlations, and poor generalization. Large domain gaps result in large differences in image appearance, making it hard for neural networks to extract effective information. The lack of paired data for some specific tasks prevents the use of supervised methods. In addition, high-fidelity, controllable, and diverse image synthesis is challenging. Human-centric heterogeneous images, such as face and human body images, are typical representatives of heterogeneous image data. Effective management and analysis of human-centric data can better serve people and meet real needs. Existing heterogeneous image representation and synthesis methods are usually driven by data or by simple priors, which makes it difficult for them to extract enough effective information. To address these problems, we propose heterogeneous image representation and synthesis methods driven synergistically by knowledge and data. We tackle human-centric heterogeneous image tasks, including representation, synthesis, and translation, by leveraging knowledge, data, algorithms, and computing power at the same time. The contributions of this thesis are summarized as follows: 1. We propose two heterogeneous image representation models based on information bottleneck theory, namely the saliency search network (SSN) and the mutual information and modality consensus network (CMInfoNet). For SSN, we propose a novel pixel selection block that searches for salient regions at the pixel level, selecting every pixel with a weight between 0 and 1. In addition, we present an automatic feature search (AFS) algorithm, which automatically optimizes the search results. The whole search process is guided by a global information bottleneck (GIB). By adjusting the GIB trade-off, the network compresses redundant information while preserving identity-relevant information. SSN thus realizes effective, attentive, automatic, and adaptive representation extraction. Extensive experiments demonstrate that SSN boosts the performance of the related heterogeneous face recognition task. CMInfoNet realizes information complementarity between modalities and information selection within each modality by aligning human image features across modalities under a cross-modality information bottleneck (CIB). In addition, a weighted regularization quadruplet (WRQ) loss further maintains inter-modality consistency. By learning effective representations and avoiding information redundancy, CMInfoNet significantly improves the performance of visible-infrared person re-identification. 2. We propose two heterogeneous image synthesis algorithms based on structural priors, namely the face augmentation generative adversarial network (FA-GAN) and the depth-aware interaction learning network (InterDepthNet). To address the self-occlusion problem caused by different face deformations, FA-GAN uses a geometry-preserving module that extracts geometric information by exploring both spatial and semantic relations among different face regions with graph convolutional networks (GCNs). Under the guidance of this learned geometric information, FA-GAN further disentangles identity-related information from deformation-related attributes via a hierarchical disentangled representation learning scheme. During inference, given an arbitrarily deformed face and the corresponding attribute labels, the model realizes identity-preserving face synthesis with various deformations. To address instance-level occlusions in the task of pose transfer, InterDepthNet learns the relative depth between different instances. The interaction between a person and the context is defined by the instances' 2D coordinates on the X-Y axes, the human pose, and the relative depth. Moreover, we propose an unsupervised imitative contrastive learning strategy to infer the depth order implied by a 2D image. Finally, instance-level occlusion-aware human pose synthesis is realized under the guidance of this learned 3D structural relation. 3. We propose a novel reference-based heterogeneous image translation method, namely the memory-modulated Transformer network (MMTN). Considering the lack of information in the input images, the heterogeneous image translation problem is defined as a "one-to-many" rather than a "one-to-one" generation problem. To address the lack of controllability and diversity, we propose to learn style information from reference images. Specifically, we propose a memory module to explore and quantify the prototypical style patterns of the reference domain. In addition, we learn the style of the reference domain both locally and globally by introducing a style Transformer module, which explores long-range dependencies between the input and reference patches. During inference, the style information can come either from reference images or from the corresponding memory module. Extensive experiments are conducted on multiple datasets for various heterogeneous face image translation tasks, including NIR-VIS, thermal-VIS, sketch-photo, and gray-RGB. The results demonstrate that the proposed MMTN realizes diverse, photorealistic, identity-preserving heterogeneous image translation. The performance of the corresponding heterogeneous face recognition tasks is also boosted.
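Keywords	knowledge and data synergistically driven; heterogeneous image representation; heterogeneous image synthesis; heterogeneous image translation; generative adversarial networks; information bottleneck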
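Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/49666
Collection	Graduates_Doctoral Theses; Institute of Automation, Chinese Academy of Sciences, Pattern Recognition Laboratory Graduates
Recommended citation (GB/T 7714)	Luo Mandi. Heterogeneous Image Representation and Synthesis Driven Synergistically by Knowledge and Data[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2022.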
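
The abstract refers to an information bottleneck trade-off without writing it out. For orientation only, the standard information bottleneck objective that such methods usually build on is sketched below. The notation is an assumption and is not taken from the thesis: $X$ is the input image, $Y$ the identity label, $Z$ the learned representation, and $\beta$ the trade-off weight. The thesis's global and cross-modality variants may differ from this form.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

Here $I(X;Z)$ measures how much of the input the representation retains (compression of redundant information) and $I(Z;Y)$ measures how much identity-relevant information it preserves; a larger $\beta$ favours preservation over compression.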

Files in this item
File name/size	Document type	Version type	Access type	License
博士论文-知识与数据协同驱动的异质图像表（10423KB）	Thesis		Restricted access	CC BY-NC-SA