Research on Image Recognition Algorithms Based on Deep Learning (基于深度学习的图像识别算法研究)
Xie Guosen (谢国森)
2016-06
Pages: 974-983
Degree type: Doctoral
Chinese Abstract (translated)
Image recognition is one of the hot topics in computer vision. In recent years, deep learning-based methods have made great progress, but many shortcomings remain. This thesis targets problems in mainstream image recognition methods such as the Bag-of-Words (BoW) model and convolutional neural networks (CNNs), and proposes a series of effective improvement strategies and methods. The effectiveness of the proposed methods is validated on multiple image recognition tasks (object recognition, scene recognition, domain adaptation, multi-label classification, and fine-grained recognition). The main innovative contributions are as follows:
 
We propose a joint training algorithm that fuses the Restricted Boltzmann Machine (RBM) model with a discriminative subspace criterion. By imposing a discriminative constraint on the RBM hidden layer, discriminative and generative training can be carried out jointly. After encoding training and test samples through the RBM, the jointly trained model produces more discriminative representations that are better suited to classification. Moreover, traditional subspace learning methods can only reduce the feature dimensionality, whereas the proposed method can also increase it. Classification results on three databases verify that the joint training algorithm yields more discriminative representations than the RBM and other feature extraction algorithms (FDA, MFA, hkMFA).
 
We propose a local descriptor coding method for the Bag-of-Words (BoW) model in image recognition. A regularized auto-encoder (AE) network is used to perform a single nonlinear, sparse, and selective projection of local descriptors (SIFT) directly into a high-dimensional space (the number of AE hidden nodes controls the dimensionality). During AE training, sparsity and selectivity constraints are added to guide the hidden-layer activations, so that the activations are sparse for each input descriptor, and each hidden node is selective for a subset of input samples. Embedding the AE into the BoW model, experiments on three image databases show that this coding scheme outperforms traditional coding methods while greatly improving coding speed.

We propose an integrated representation that organically fuses CNNs with dictionary-based models. Given a trained CNN model, two dictionary-based representations are constructed: the mid-level local representation (MLR) and the convolutional Fisher vector representation (CFV). To build the MLR, an efficient two-stage clustering algorithm is proposed to generate a class-mixture or class-specific part dictionary, which is then used to encode the original image into a multi-scale, spatially partitioned, locally discriminative representation. To build the CFV, a Fisher vector is computed for each image from the features of the last convolutional layer of the CNN. Another useful CNN feature is the last fully connected layer representation (FCR). By integrating MLR, CFV, and FCR, we greatly improve performance on scene recognition and domain adaptation problems.
 
We propose a supervised strategy that automatically learns pooled representations, named Task-Driven feature Pooling (TDP). TDP learning minimizes the classification loss while making the learned representation as similar as possible to the other descriptors from the same image. The input to TDP can be the coding vectors of the traditional BoW model or the convolutional layers of current mainstream CNNs. TDP is further extended to a multi-task form to fully exploit the complementary information among different tasks: the intermediate CNN layer outputs at different scales are defined as different tasks and modeled within a multi-task learning framework. Since no label information is available for test samples, a self-training mechanism is proposed to learn the TDP representation of a test sample. The multi-task TDP representation achieves clear performance gains on three databases.
 
We propose a new image representation with Selective, Discriminative, and Equalizing properties, named SDE. The SDE representation is learned directly by optimizing a bilevel optimization model, and its acquisition can also be viewed as a joint feature learning and pooling process; this is the first time that feature learning and pooling have been modeled as a bilevel optimization problem. By embedding SDE learning into a trained CNN, we learn representations clearly superior to Max, Ave, GMP, and TDP pooling; combining multi-scale SDE with the fully connected layer representation yields a highly discriminative representation. Results on five single-label and two multi-label databases fully demonstrate the effectiveness of the SDE representation.
 
We propose a fine-grained image recognition system that fuses part-level and global information. First, a simple CNN is used to localize redundant part candidates; then a saliency detection method yields the salient object region, and only part candidates inside that region are kept. In addition, saliency detection and segmentation are used to locate the global object in the image. Finally, we construct a two-stream CNN trained jointly on local and global information (LG-CNN), with weight sharing in the early convolutional layers: the upper stream mines part information, the lower stream preserves global information, and each stream is connected to its own softmax loss. Using feature representations built from LG-CNN, we achieve the best performance on three fine-grained recognition databases.
English Abstract
Image recognition is one of the hottest topics in computer vision. Recent years have witnessed great progress in deep learning, but drawbacks still exist. In this thesis, we focus on the underlying drawbacks of the Bag-of-Words (BoW) and Convolutional Neural Network (CNN) models and propose several effective improvement strategies and methods. We validate the proposed methods on multiple image recognition tasks (object recognition, scene recognition, domain adaptation, multi-label classification, and fine-grained recognition). The main innovative contributions are as follows:
 
We propose a joint training method that combines the Restricted Boltzmann Machine (RBM) with discriminative supervised subspace models. Specifically, the hidden layer of the RBM is regularized by a supervised subspace criterion, and the joint learning model can then be efficiently optimized both generatively and discriminatively. By forwarding training and testing samples through the RBM, we obtain more powerful representations that are better suited to classification. More importantly, traditional subspace models can only reduce the dimensionality, while the proposed models can also increase it. Experiments on three databases demonstrate that the proposed hybrid models consistently outperform both the RBM and their counterpart subspace models (FDA, MFA, hkMFA).
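The joint criterion above can be sketched as a generative reconstruction term plus a Fisher-style discriminative penalty on the hidden activations. This is a minimal numpy sketch under assumed forms of both terms (mean-field reconstruction error and within-minus-between class scatter); the thesis' exact RBM energy and subspace criterion may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_objective(X, y, W, b, lam=0.1):
    """Sketch of the joint RBM + discriminative-subspace objective.
    Generative part: mean-field reconstruction error through the RBM weights.
    Discriminative part: Fisher-style scatter penalty on hidden activations
    (small when classes are well separated in the hidden space)."""
    H = sigmoid(X @ W + b)            # hidden activations, shape (n, n_hidden)
    X_rec = sigmoid(H @ W.T)          # mean-field reconstruction of the input
    gen = np.mean((X - X_rec) ** 2)   # generative (reconstruction) term
    mu = H.mean(axis=0)
    within = between = 0.0
    for c in np.unique(y):
        Hc = H[y == c]
        mc = Hc.mean(axis=0)
        within += ((Hc - mc) ** 2).sum()        # within-class scatter
        between += len(Hc) * ((mc - mu) ** 2).sum()  # between-class scatter
    return gen + lam * (within - between)
```

Note that `W` can map to more hidden units than input dimensions, reflecting the ability of the hybrid model to increase dimensionality, unlike classical subspace projections.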
 
We propose a local descriptor coding method within the traditional Bag-of-Words (BoW) framework for image categorization. The Auto-Encoder (AE) network serves as a local descriptor coding block, projecting each local descriptor (SIFT) into a high-dimensional feature space both selectively and sparsely. To make the hidden activations of the AE network both selective and sparse, we add an efficient and effective regularization term to the AE learning process, which promotes sparsity of the hidden layer for each input descriptor as well as selectivity of each hidden node. By incorporating the AE network coding into the BoW framework, we achieve better results and faster speeds than other state-of-the-art feature coding methods on three image datasets.
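The sparsity and selectivity constraints above act on the same hidden activation matrix along its two axes. The sketch below uses a KL-divergence form (familiar from sparse auto-encoders) as an assumed concrete choice; the thesis' exact regularizer may differ.

```python
import numpy as np

def sparse_selective_penalty(H, rho=0.05, eps=1e-8):
    """Sketch of the sparsity/selectivity regularizer on AE hidden
    activations H (n_descriptors x n_hidden), values in (0, 1).
    Row-wise term: each descriptor's code should be sparse.
    Column-wise term: each hidden node should fire for few descriptors."""
    Hc = np.clip(H, eps, 1 - eps)

    def kl(p):  # KL(rho || p) for a vector of mean activations p
        return rho * np.log(rho / p) + (1 - rho) * np.log((1 - rho) / (1 - p))

    sparsity = kl(Hc.mean(axis=1)).sum()     # per-descriptor sparsity
    selectivity = kl(Hc.mean(axis=0)).sum()  # per-node selectivity
    return sparsity + selectivity
```

An activation matrix whose mean activity matches the target `rho` incurs (near) zero penalty, while dense activations are penalized, which is what pushes the AE codes toward sparse, selective responses.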

We propose to combine CNNs with dictionary-based models for scene recognition and visual domain adaptation. Specifically, based on well-tuned CNN models, two dictionary-based representations are constructed, namely the mid-level local representation (MLR) and the convolutional Fisher vector representation (CFV). For MLR, an efficient two-stage clustering method is used to generate a class-mixture or class-specific part dictionary, which is then applied to multi-scale image inputs to generate the MLR. For CFV, we obtain Fisher vectors based on the last convolutional layer of the CNN. By integrating the complementary information of MLR, CFV, and the CNN features of the fully connected layer, state-of-the-art performance can be achieved on scene recognition and domain adaptation problems.
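The final integration step combines three heterogeneous representations into one feature vector. A common fusion recipe, sketched here as an assumption (the thesis may weight or kernel-combine them differently), is to L2-normalize each representation and concatenate:

```python
import numpy as np

def integrate_representations(mlr, cfv, fcr):
    """Sketch of fusing MLR, CFV, and the fully connected representation
    (FCR): L2-normalize each part so no single representation dominates
    by scale, then concatenate into one vector for a linear classifier."""
    def l2_normalize(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    return np.concatenate([l2_normalize(mlr),
                           l2_normalize(cfv),
                           l2_normalize(fcr)])
```

Per-block normalization is the usual safeguard when concatenating features of very different dimensionalities and magnitudes (here, a part-dictionary encoding, a Fisher vector, and CNN activations).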
 
We propose a novel task-driven pooling (TDP) model that directly learns the pooled representation from data in a discriminative manner. TDP elegantly integrates representation learning into the given classification task: its optimization equalizes the similarities between the descriptors and the learned representation while maximizing classification accuracy. TDP can be combined with traditional BoW models or recent state-of-the-art CNN models to achieve a much better pooled representation. A multi-task extension of TDP is also proposed to further improve performance, and a self-training mechanism is used to generate the TDP representation for a new test image. Experiments on three databases validate the effectiveness of our models.
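The TDP idea, pooling as an optimization that balances staying close to an image's descriptors against reducing the classification loss, can be sketched for one image with a linear classifier. The squared-hinge loss and the simple gradient scheme are illustrative assumptions; the thesis' exact objective and solver may differ.

```python
import numpy as np

def tdp_pool(D, w, y, lam=1.0, lr=0.05, steps=200):
    """Sketch of task-driven pooling for one image.
    D: descriptors, shape (n, d); w: linear classifier weights; y in {-1, +1}.
    The pooled vector z starts at the descriptor average and is refined so it
    stays similar to the descriptors while lowering a squared-hinge loss."""
    z = D.mean(axis=0)
    for _ in range(steps):
        sim_grad = 2.0 * (z - D).mean(axis=0)       # pull toward descriptors
        margin = 1.0 - y * (w @ z)
        cls_grad = -2.0 * y * w * max(margin, 0.0)  # squared-hinge gradient
        z -= lr * (sim_grad + lam * cls_grad)
    return z
```

At the starting point (the plain average), the similarity gradient vanishes, so the first updates move the pooled vector purely in the direction that improves the classification margin, which is exactly how TDP improves over average pooling. For a test image with no label, a self-trained (predicted) label would stand in for `y`.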
 
 
We propose a new feature representation, termed the Selective, Discriminative and Equalizing representation (SDE), obtained via bilevel optimization. The process of obtaining the SDE representation can also be seen as a feature learning and pooling mechanism that jointly optimizes the pooled representations toward more selective, discriminative, and equalizing features. When embedding SDE learning into a trained CNN model, we learn much better representations than Max, Ave, GMP, and TDP pooling, and combining multi-scale SDE with the fully connected representation yields even more discriminative power. Experiments on seven benchmark databases (five single-label and two multi-label) validate the effectiveness of our framework.
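The bilevel structure, an upper-level objective evaluated at the solution of a lower-level pooling problem, can be sketched minimally. Both concrete objectives here (softmax-weighted pooling below, a squared loss above) are illustrative assumptions standing in for the SDE constraints and the classification loss.

```python
import numpy as np

def lower_level_pool(D, alpha):
    """Lower level: pool descriptors D (n x d) into one vector as a convex
    combination, with selection weights given by a softmax over alpha.
    In SDE this is where the selective/equalizing constraints would live."""
    w = np.exp(alpha - alpha.max())
    w /= w.sum()
    return w @ D

def upper_level_loss(alpha, D, target):
    """Upper level: a classification-style loss evaluated at the lower-level
    solution. Optimizing alpha through this composition is the bilevel step
    (via alternation or hypergradients in practice)."""
    z = lower_level_pool(D, alpha)
    return ((z - target) ** 2).sum()
```

The key point the sketch shows is the nesting: the upper loss never sees the descriptors directly, only the pooled vector produced by the inner problem, so learning and pooling are optimized jointly rather than in separate stages.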
 
We propose a jointly trained Convolutional Neural Network (CNN) architecture that fuses global object and local part information. Specifically, we first detect part candidates with a simple CNN. Saliency detection methods are used to obtain the saliency map, and only part candidates within the salient region are preserved. Object locations are also obtained from the saliency and segmentation results. Finally, we construct a shared CNN architecture with local and global joint training (LG-CNN): the upper stream discovers part information in the input image, the bottom stream keeps the global information, and a softmax loss is attached to each of the two streams. Based on features constructed from LG-CNN, we achieve state-of-the-art results on three fine-grained datasets.
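The two-stream design with a shared trunk and a softmax loss per stream can be sketched with linear layers standing in for the real convolutional blocks. All layer shapes here are illustrative; only the topology (shared trunk, part head, global head, summed losses) follows the description above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def lg_cnn_forward(x, shared_W, part_W, global_W, y):
    """Sketch of an LG-CNN forward pass for one input x with label index y.
    A shared trunk (here one ReLU layer) feeds a part (upper) stream and a
    global (lower) stream; each stream ends in a softmax, and the training
    loss is the sum of the two cross-entropy losses."""
    h = np.maximum(x @ shared_W, 0.0)   # shared trunk with ReLU
    p_part = softmax(h @ part_W)        # part-stream class probabilities
    p_glob = softmax(h @ global_W)      # global-stream class probabilities
    loss = -np.log(p_part[y] + 1e-12) - np.log(p_glob[y] + 1e-12)
    return loss, p_part, p_glob
```

Because the trunk is shared, gradients from both losses update the same early weights, so part-level and global supervision jointly shape the low-level features, which is the point of the joint training.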
Keywords: Convolutional Neural Network; Image Recognition; Feature Representation; Pooling; Bilevel Optimization; Parts
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/11954
Collection: Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714):
Xie Guosen. Research on Image Recognition Algorithms Based on Deep Learning [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2016.
Files in this item:
File Name/Size | Document Type | Access | License
XIE_GUOSEN_v05_20160 (9753KB) | Thesis | Restricted access | CC BY-NC-SA