English Abstract

Image classification is a fundamental task in computer vision. Recently, with the development of deep learning, especially convolutional neural networks (CNNs), image classification methods have made substantial progress. However, CNN-based image classification models rely heavily on training over large-scale image datasets. Unlike CNN-based models, human beings can learn visual knowledge of objects from information in other modalities: they can recognize objects the first time they see them after merely reading or hearing some descriptive information. To endow image classification models with the capability to recognize objects that are not seen in the training phase, researchers introduced the problem of zero-shot learning (ZSL). The goal of ZSL is to design a model that can recognize object categories that do not appear in the training dataset.
The key problem in ZSL is how to realize cross-modal knowledge transfer under the guidance of auxiliary side information. In this dissertation, we study knowledge transfer at both the visual end and the semantic end of ZSL. The contributions of this dissertation are as follows.
- Encyclopedia knowledge of object categories is utilized to expand the semantic knowledge at the semantic end of ZSL methods. By enhancing the global representation and the discriminative capability of category prototypes with encyclopedia knowledge, the performance of the ZSL model improves. Many previous ZSL models directly use word vectors of the class labels as category prototypes; however, word vectors cannot sufficiently capture the global knowledge of an object category. In this dissertation, we propose an encyclopedia-enhanced semantic embedding (EESE) model that promotes the discriminative capability of word-vector prototypes with the global knowledge of each object category. The EESE model extracts term frequency-inverse document frequency (TF-IDF) keywords from encyclopedia articles to acquire the global knowledge of each object category, and a convex combination of the keywords' word vectors serves as the category prototype. The prototypes of seen and unseen classes build up the embedding space, where nearest-neighbour search is performed to recognize unseen images. Experimental results show that the proposed method achieves state-of-the-art performance on the challenging ImageNet Fall 2011 dataset.
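The prototype construction and nearest-neighbour classification described above can be sketched as follows. This is a minimal NumPy illustration; the function names, the toy 2-D "word vectors", and the cosine-similarity choice are assumptions for illustration, not the dissertation's actual implementation.

```python
import numpy as np

def build_prototype(keyword_vectors, tfidf_weights):
    """Category prototype as a convex combination of TF-IDF keyword word vectors."""
    w = np.asarray(tfidf_weights, dtype=float)
    w = w / w.sum()                       # normalize: weights >= 0, sum to 1
    return w @ np.asarray(keyword_vectors, dtype=float)

def classify(image_embedding, prototypes):
    """Nearest-neighbour search over class prototypes by cosine similarity."""
    names = list(prototypes)
    P = np.stack([prototypes[n] for n in names])
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    v = image_embedding / np.linalg.norm(image_embedding)
    return names[int(np.argmax(P @ v))]

# Toy example with hypothetical 2-D word vectors and TF-IDF scores:
cat = build_prototype([[1.0, 0.0], [0.8, 0.2]], tfidf_weights=[3.0, 1.0])
dog = build_prototype([[0.0, 1.0]], tfidf_weights=[1.0])
label = classify(np.array([0.9, 0.1]), {"cat": cat, "dog": dog})  # -> "cat"
```

Because unseen-class prototypes are built the same way from encyclopedia text alone, the same nearest-neighbour step covers classes never observed in training.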
- Generative models are employed to synthesize visual features of unseen image classes at the visual end of zero-shot learning methods. By augmenting the visual features of unseen classes, the ZSL model can effectively alleviate the projection domain shift problem. Previous mapping-based zero-shot learning methods suffer from this problem because unseen image classes are absent from the training stage. To alleviate it, a deep unbiased embedding transfer (DUET) model is proposed in this dissertation. The DUET model is composed of a deep embedding transfer (DET) module and an unseen visual feature generation (UVG) module. In the DET module, a novel combined embedding transfer net, which integrates the complementary merits of linear and nonlinear embedding mapping functions, is proposed to connect the visual space and the semantic space. Moreover, an end-to-end joint training process trains the visual feature extractor and the combined embedding transfer net simultaneously. In the UVG module, a visual feature generator trained with a conditional generative adversarial model synthesizes the visual features of unseen classes to alleviate the projection domain shift problem. Furthermore, a quantitative index, the score of resistance on domain shift (ScoreRDS), is proposed to evaluate different models' resistance to the projection domain shift problem. Experimental results on five zero-shot learning benchmarks verify the effectiveness of the proposed DUET model.
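A rough structural sketch of the two modules' data flow is given below, using untrained toy networks in NumPy. The layer sizes, initialization, and all function names are illustrative assumptions; they show only how the combined (linear plus nonlinear) mapping and the conditional generator connect the visual and semantic spaces, not the DUET architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def combined_embedding(x, W_lin, W1, W2):
    """DET-style combined mapping: a linear branch plus a small nonlinear branch."""
    return W_lin @ x + W2 @ relu(W1 @ x)

def conditional_generator(z, semantic, W_g):
    """UVG-style generator forward pass: noise conditioned on a class's
    semantic vector, producing a synthetic visual feature."""
    return relu(W_g @ np.concatenate([z, semantic]))

# Hypothetical dimensions for illustration only.
d_vis, d_sem, d_hid, d_z = 6, 4, 5, 3
W_lin = rng.normal(scale=0.1, size=(d_sem, d_vis))
W1 = rng.normal(scale=0.1, size=(d_hid, d_vis))
W2 = rng.normal(scale=0.1, size=(d_sem, d_hid))
W_g = rng.normal(scale=0.1, size=(d_vis, d_z + d_sem))

vis = rng.normal(size=d_vis)
sem_embedding = combined_embedding(vis, W_lin, W1, W2)   # visual -> semantic space
fake_vis = conditional_generator(rng.normal(size=d_z),   # synthetic unseen-class feature
                                 rng.normal(size=d_sem), W_g)
```

In the actual model these weights are learned, adversarially for the generator and jointly with the CNN feature extractor for the combined mapping; the sketch only traces the shapes moving between spaces.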
- Discriminative information is explored at both the visual end and the semantic end of zero-shot learning methods to achieve more effective cross-modal knowledge transfer. The attention mechanism plays an important role in the human cognitive system: human beings focus on and memorize the most discriminative information when watching and reading. In this dissertation, a dual focus network (DFN) model is proposed to embed both visual and semantic attention mechanisms in zero-shot learning tasks. The DFN model contains a visual focus (ViF) module and a semantic focus (SeF) module. The ViF module assigns greater weights to salient parts of images in the multi-resolution feature maps of CNN-based visual feature extractors. The SeF module generates semantic weights to reweight semantic attribute features under the guidance of visual features, so that semantic attributes with more visual discriminative capability receive greater weights. Extensive experiments on five zero-shot learning benchmarks demonstrate the superiority of the proposed DFN model over other state-of-the-art models.
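The SeF reweighting step can be illustrated schematically with softmax attention in NumPy. The bilinear scoring form and all names are assumptions, since the abstract only states that visual features guide the semantic weights.

```python
import numpy as np

def semantic_focus(visual_feat, attr_feats, W):
    """Visual-guided attention over semantic attributes: each attribute gets a
    score from the visual feature, softmax turns the scores into weights, and
    the attribute features are reweighted accordingly."""
    scores = attr_feats @ (W @ visual_feat)   # one score per attribute
    e = np.exp(scores - scores.max())         # numerically stable softmax
    alpha = e / e.sum()
    return alpha[:, None] * attr_feats, alpha

# Hypothetical sizes: 5 attributes of dimension 4, an 8-D visual feature.
rng = np.random.default_rng(1)
attrs = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 8))                   # projects visual feature into attribute space
weighted, alpha = semantic_focus(rng.normal(size=8), attrs, W)
```

Attributes whose projected score against the visual feature is higher receive larger softmax weights, which is the sense in which visually discriminative attributes are emphasized.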