基于异质图像知识的视觉感知方法研究 (Research on Visual Perception Methods Based on Heterogeneous Image Knowledge)
Author: 严岚
Date Issued: 2022-05
Pages: 152
Degree Type: Doctoral
Abstract

With the development of image acquisition devices and multimedia technology, the ways in which people obtain images have become increasingly diverse, and heterogeneous images of many different forms keep emerging. This has given rise to a large number of practical applications built on specific kinds of heterogeneous images. Traditional computer vision models, however, mainly focus on natural scene images, such as portrait photos and street views captured by mobile phones or cameras, and there are often large cross-domain modality gaps between images of different forms. As a result, traditional computer vision algorithms usually struggle to handle the perception problems posed by specific heterogeneous images.

Facing these problems, this thesis attempts to exploit the domain knowledge associated with specific visual tasks, i.e., specific heterogeneous image knowledge, to better handle the corresponding visual perception problems, and proposes a visual perception framework based on heterogeneous image knowledge. On the one hand, the framework builds visual representations from heterogeneous images through feature extraction and representation learning; on the other hand, it discovers and establishes knowledge representations for the visual tasks associated with specific heterogeneous images, and combines the visual and knowledge representations to solve heterogeneous-image-related visual problems effectively. The knowledge representations are obtained through heterogeneous image knowledge models. The framework aims to acquire task-specific heterogeneous image knowledge from summarized human experience or from knowledge mining, and to establish a knowledge-guidance mechanism that integrates this knowledge into traditional visual perception models, deepening the models' understanding of the task and thereby solving heterogeneous-image-related visual problems more effectively. Models designed under this framework are explored and studied on heterogeneous-image-related tasks, namely image dehazing, face sketch synthesis, face caricature synthesis, and weakly supervised sketch-based person search. The results show that the framework performs well on traditional low-level, mid-level, and high-level vision tasks. The work can be summarized in the following three aspects:
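
The knowledge-guidance pattern described above can be made concrete with a minimal sketch. The PyTorch code below is purely illustrative: the module names (KnowledgeGuidedModel, visual_backbone, knowledge_model), the choice of a single-channel prior map as the knowledge input, and the element-wise fusion are assumptions made for exposition, not the exact architecture used in the thesis.

```python
# Minimal, illustrative sketch (PyTorch) of the knowledge-guidance pattern:
# a knowledge model turns a task-specific prior (e.g. a dark-channel map for
# dehazing, or a sketch-sparsity map) into a spatial guidance signal that
# modulates the visual representation produced by a conventional backbone.
import torch
import torch.nn as nn


class KnowledgeGuidedModel(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Visual model: extracts visual representations from the input image.
        self.visual_backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Knowledge model: maps the single-channel prior to guidance weights in [0, 1].
        self.knowledge_model = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid(),
        )
        self.head = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, image: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        visual = self.visual_backbone(image)      # visual representation
        knowledge = self.knowledge_model(prior)   # knowledge representation
        fused = visual + visual * knowledge       # knowledge-guided fusion
        return self.head(fused)


# Example usage with random tensors of matching spatial size:
# out = KnowledgeGuidedModel()(torch.randn(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
```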

1. Images captured in bad weather or poor environments often suffer from low contrast, dull colors, and missing details. This not only degrades subjective visual quality but also seriously harms the performance of intelligent systems such as security surveillance and autonomous driving. On the one hand, for single image dehazing, a heterogeneous image knowledge representation is established through an attention mechanism based on the dark channel prior and the characteristics of hazy images, and is combined with a convolutional neural network serving as the visual model. On the other hand, the problem of restoring dark, noisy images to bright, noise-free ones is treated as a special case of image-to-image translation with noise. Since VGG networks are sensitive to noise, a heterogeneous image knowledge model is constructed, i.e., a perceptual loss function is designed, and a generative adversarial network is used as the visual perception model. The perceptual loss mitigates the effect of noise and boosts performance by enforcing structural consistency at different feature levels. The main contributions are: (1) A feature aggregation attention network (FAAN) for single image dehazing is proposed, which incorporates attention mechanisms and residual learning and can adaptively aggregate features at different levels; (2) An enhanced generative adversarial network (EGAN) is proposed to solve image-to-image translation with noise end to end; (3) Experiments on dehazing datasets show that the proposed method achieves the best dehazing results, and that the proposed EGAN significantly outperforms other state-of-the-art methods and also achieves state-of-the-art performance when applied directly to image denoising.
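
To illustrate the multi-level perceptual loss described above, the sketch below compares VGG-16 features of the generated and target images at several depths. It is a hedged approximation: the specific layer indices, the uniform weights, and the L1 distance are common defaults assumed here, not necessarily the configuration adopted in the proposed EGAN.

```python
# Illustrative sketch (PyTorch): a multi-level VGG perceptual loss that compares
# images in feature space, which is less sensitive to pixel noise than a plain
# pixel-wise loss. Layers 3/8/15 correspond to relu1_2, relu2_2 and relu3_3.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class MultiLevelPerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(3, 8, 15), weights=(1.0, 1.0, 1.0)):
        super().__init__()
        vgg = vgg16(weights="DEFAULT").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)           # the VGG extractor stays frozen
        self.vgg = vgg
        self.layer_ids = list(layer_ids)
        self.weights = list(weights)
        self.criterion = nn.L1Loss()

    def _extract(self, x: torch.Tensor):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
            if i >= self.layer_ids[-1]:       # no need to run deeper layers
                break
        return feats

    def forward(self, output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Weighted sum of feature-space distances at several depths, i.e. a
        # structural-consistency constraint at different levels.
        loss = output.new_zeros(())
        for w, fo, ft in zip(self.weights, self._extract(output), self._extract(target)):
            loss = loss + w * self.criterion(fo, ft)
        return loss
```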

2. As one of the important subjects of biometric recognition, the human face carries rich information and is easy to acquire. In practice, however, face photos are not always available. In addition, with the growth of ACGN (animation, comics, games, and novels) culture and social media, sketches, caricatures, and other representative ACGN works appear ever more frequently in daily life. Face images naturally carry rich identity information, yet existing heterogeneous face synthesis methods rarely consider preserving this identity information during synthesis. Therefore, on the one hand, for face photo-sketch synthesis, this thesis introduces additional identity labels and constructs a heterogeneous image knowledge model, namely an identity recognition loss function, while using a cycle-consistency-based generative adversarial network as the visual model. On the other hand, for caricature synthesis, a heterogeneous image knowledge model, namely an identity preservation loss function, is constructed based on the implicit facial identity characteristics, and a generative adversarial network containing warping controllers is used to obtain the visual representations. The main contributions are: (1) An identity-sensitive generative adversarial network is proposed for face photo-sketch synthesis; (2) An identity-preserving generative adversarial network is proposed for unsupervised photo-to-caricature translation; (3) Extensive experiments show that, compared with other advanced methods, the proposed methods achieve the best performance, and the synthesized results are more realistic, more visually appealing, and retain more identity details.
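
The identity preservation idea above can be illustrated with the following minimal sketch, which penalizes the cosine distance between face embeddings of the input photo and the synthesized image. The face encoder here is a generic placeholder supplied by the caller; the actual recognizer and loss form used in the thesis may differ.

```python
# Illustrative sketch (PyTorch): an identity-preservation loss that keeps the
# face embedding of the synthesized image close to that of the input photo.
# `face_encoder` stands for any frozen, pretrained face recognition network
# returning (batch, dim) embeddings; it is an assumption of this sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IdentityPreservationLoss(nn.Module):
    def __init__(self, face_encoder: nn.Module):
        super().__init__()
        self.encoder = face_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)           # keep the recognizer frozen

    def forward(self, photo: torch.Tensor, synthesized: torch.Tensor) -> torch.Tensor:
        emb_real = F.normalize(self.encoder(photo), dim=1)
        emb_fake = F.normalize(self.encoder(synthesized), dim=1)
        # 1 - cosine similarity: zero when the identity is perfectly preserved.
        return (1.0 - (emb_real * emb_fake).sum(dim=1)).mean()
```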

3. Although existing person search methods have achieved good performance, they require the training images to carry fine-grained annotations, which are expensive and difficult to obtain in large-scale scenarios. To overcome this problem, a weakly supervised person search method is proposed in this thesis. In addition, considering that photos of the target person are not always available in practice, this thesis proposes and investigates the weakly supervised sketch-based person search problem, which uses a sketch instead of a photo as the query probe for retrieval. Based on the sparse pixel distribution of sketches, a heterogeneous image knowledge representation is built with an attention mechanism, and the proposed weakly supervised person search method serves as the visual perception model. The main contributions are: (1) The weakly supervised person search problem is studied, and a weakly supervised learning method based on clustering and patches is proposed; (2) The weakly supervised sketch-based person search problem is proposed and studied, and a solution based on clustering and feature attention is designed; (3) Extensive experiments on two publicly available datasets validate the feasibility of the proposed weakly supervised setting for person search and the effectiveness of the proposed methods.
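
For the clustering step that generates pseudo identity labels in the weakly supervised setting, a minimal sketch is given below. DBSCAN over L2-normalized features with a cosine metric is a common choice assumed here for illustration; the exact clustering procedure and hyper-parameters in the thesis may differ.

```python
# Illustrative sketch (scikit-learn): generating pseudo identity labels by
# clustering person features, in the spirit of clustering-based weakly
# supervised training for person search / re-identification.
import numpy as np
from sklearn.cluster import DBSCAN


def pseudo_identity_labels(features: np.ndarray, eps: float = 0.5,
                           min_samples: int = 4) -> np.ndarray:
    """Cluster L2-normalized person features and return pseudo identity labels.

    Samples labelled -1 are DBSCAN outliers; they can be discarded or assigned
    unique labels before the next training epoch.
    """
    normalized = features / np.linalg.norm(features, axis=1, keepdims=True)
    return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(normalized)
```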

Keywords: Heterogeneous Image; Computer Vision; Deep Learning; Generative Adversarial Network
Language: Chinese
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/48873
Collection: Graduates / Doctoral Dissertations
Recommended Citation (GB/T 7714):
严岚. 基于异质图像知识的视觉感知方法研究[D]. 中国科学院自动化研究所, 2022.
Files in This Item:
File Name/Size: 基于异质图像知识的视觉感知方法研究.pd (13252 KB); Document Type: Dissertation; Access: Restricted; License: CC BY-NC-SA