English Abstract

Image classification is a fundamental task in computer vision. Recently, with the development of deep learning, especially convolutional neural networks (CNNs), image classification methods have made substantial progress. However, CNN-based image classification models rely heavily on training over large-scale image datasets. Unlike CNN-based models, human beings can learn visual knowledge of objects from information in other modalities: they can recognize objects the first time they see them after merely reading or hearing some descriptive information. To endow image classification models with the capability to recognize objects that are not seen in the training phase, researchers introduced the problem of zero-shot learning (ZSL). The goal of ZSL is to design a model that can recognize object categories that do not appear in the training dataset.
The key problem in ZSL is how to realize cross-modal knowledge transfer under the guidance of auxiliary side information. In this dissertation, we study knowledge transfer at both the visual end and the semantic end of ZSL. The contributions of this dissertation are as follows.
- Encyclopedia knowledge of object categories is utilized to expand the semantic knowledge at the semantic end of ZSL methods. By enhancing the global representation and the discriminative capability of category prototypes with encyclopedia knowledge, the performance of the ZSL model improves. Many previous ZSL models directly use word vectors of the class labels as category prototypes; however, word vectors cannot sufficiently capture the global knowledge of an object category. In this dissertation, we propose an encyclopedia-enhanced semantic embedding (EESE) model that promotes the discriminative capability of word-vector prototypes with the global knowledge of each object category. The EESE model extracts term frequency-inverse document frequency (TF-IDF) keywords from encyclopedia articles to acquire the global knowledge of each object category, and a convex combination of the keywords' word vectors serves as the category prototype. The prototypes of seen and unseen classes build up the embedding space, where nearest-neighbour search is performed to recognize unseen images. Experimental results show that the proposed method achieves state-of-the-art performance on the challenging ImageNet Fall 2011 dataset.
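The prototype construction and nearest-neighbour classification described above can be sketched as follows. This is a minimal NumPy illustration; the function names, the toy 2-D "word vectors", and the cosine-similarity choice are assumptions for illustration, not the dissertation's actual implementation.

```python
import numpy as np

def build_prototype(keyword_vectors, tfidf_weights):
    """Category prototype as a convex combination of TF-IDF keyword word vectors."""
    w = np.asarray(tfidf_weights, dtype=float)
    w = w / w.sum()                       # normalize: weights >= 0, sum to 1
    return w @ np.asarray(keyword_vectors, dtype=float)

def classify(image_embedding, prototypes):
    """Nearest-neighbour search over class prototypes by cosine similarity."""
    names = list(prototypes)
    P = np.stack([prototypes[n] for n in names])
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    v = image_embedding / np.linalg.norm(image_embedding)
    return names[int(np.argmax(P @ v))]

# Toy example with hypothetical 2-D word vectors and TF-IDF scores:
cat = build_prototype([[1.0, 0.0], [0.8, 0.2]], tfidf_weights=[3.0, 1.0])
dog = build_prototype([[0.0, 1.0]], tfidf_weights=[1.0])
label = classify(np.array([0.9, 0.1]), {"cat": cat, "dog": dog})  # -> "cat"
```

Because unseen-class prototypes are built the same way from encyclopedia text alone, the same nearest-neighbour step covers classes never observed in training.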
- Generative models are employed to synthesize visual features of unseen image classes at the visual end of zero-shot learning methods. By augmenting the visual features of unseen classes, the ZSL model can effectively alleviate the projection domain shift problem. Previous mapping-based zero-shot learning methods suffer from this problem because unseen image classes are absent from the training stage. To alleviate it, a deep unbiased embedding transfer (DUET) model is proposed in this dissertation. The DUET model is composed of a deep embedding transfer (DET) module and an unseen visual feature generation (UVG) module. In the DET module, a novel combined embedding transfer net, which integrates the complementary merits of linear and nonlinear embedding mapping functions, is proposed to connect the visual space and the semantic space. Moreover, an end-to-end joint training process trains the visual feature extractor and the combined embedding transfer net simultaneously. In the UVG module, a visual feature generator trained with a conditional generative adversarial model synthesizes the visual features of unseen classes to alleviate the projection domain shift problem. Furthermore, a quantitative index, the score of resistance on domain shift (ScoreRDS), is proposed to evaluate different models' resistance to the projection domain shift problem. Experimental results on five zero-shot learning benchmarks verify the effectiveness of the proposed DUET model.
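A rough structural sketch of the two modules' data flow is given below, using untrained toy networks in NumPy. The layer sizes, initialization, and all function names are illustrative assumptions; they show only how the combined (linear plus nonlinear) mapping and the conditional generator connect the visual and semantic spaces, not the DUET architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def combined_embedding(x, W_lin, W1, W2):
    """DET-style combined mapping: a linear branch plus a small nonlinear branch."""
    return W_lin @ x + W2 @ relu(W1 @ x)

def conditional_generator(z, semantic, W_g):
    """UVG-style generator forward pass: noise conditioned on a class's
    semantic vector, producing a synthetic visual feature."""
    return relu(W_g @ np.concatenate([z, semantic]))

# Hypothetical dimensions for illustration only.
d_vis, d_sem, d_hid, d_z = 6, 4, 5, 3
W_lin = rng.normal(scale=0.1, size=(d_sem, d_vis))
W1 = rng.normal(scale=0.1, size=(d_hid, d_vis))
W2 = rng.normal(scale=0.1, size=(d_sem, d_hid))
W_g = rng.normal(scale=0.1, size=(d_vis, d_z + d_sem))

vis = rng.normal(size=d_vis)
sem_embedding = combined_embedding(vis, W_lin, W1, W2)   # visual -> semantic space
fake_vis = conditional_generator(rng.normal(size=d_z),   # synthetic unseen-class feature
                                 rng.normal(size=d_sem), W_g)
```

In the actual model these weights are learned, adversarially for the generator and jointly with the CNN feature extractor for the combined mapping; the sketch only traces the shapes moving between spaces.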
- Discriminative information is explored at both the visual end and the semantic end of zero-shot learning methods to achieve more effective cross-modal knowledge transfer. The attention mechanism plays an important role in the human cognitive system: human beings focus on and memorize the most discriminative information when watching and reading. In this dissertation, a dual focus network (DFN) model is proposed to embed both visual and semantic attention mechanisms in zero-shot learning tasks. The DFN model contains a visual focus (ViF) module and a semantic focus (SeF) module. The ViF module assigns greater weights to salient parts of images in the multi-resolution feature maps of CNN-based visual feature extractors. The SeF module generates semantic weights to reweight semantic attribute features under the guidance of visual features, so that semantic attributes with more visual discriminative capability receive greater weights. Extensive experiments on five zero-shot learning benchmarks demonstrate the superiority of the proposed DFN model over other state-of-the-art models.
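The SeF reweighting step can be illustrated schematically with softmax attention in NumPy. The bilinear scoring form and all names are assumptions, since the abstract only states that visual features guide the semantic weights.

```python
import numpy as np

def semantic_focus(visual_feat, attr_feats, W):
    """Visual-guided attention over semantic attributes: each attribute gets a
    score from the visual feature, softmax turns the scores into weights, and
    the attribute features are reweighted accordingly."""
    scores = attr_feats @ (W @ visual_feat)   # one score per attribute
    e = np.exp(scores - scores.max())         # numerically stable softmax
    alpha = e / e.sum()
    return alpha[:, None] * attr_feats, alpha

# Hypothetical sizes: 5 attributes of dimension 4, an 8-D visual feature.
rng = np.random.default_rng(1)
attrs = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 8))                   # projects visual feature into attribute space
weighted, alpha = semantic_focus(rng.normal(size=8), attrs, W)
```

Attributes whose projected score against the visual feature is higher receive larger softmax weights, which is the sense in which visually discriminative attributes are emphasized.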