CASIA OpenIR > Graduates > Doctoral Dissertations
基于深度特征融合的图像分类方法研究 (Research on Image Classification Methods Based on Deep Feature Fusion)
Author: 李成华 (Li Chenghua)1,2
Degree type: Doctor of Engineering
Supervisor: 卢汉清 (Lu Hanqing)
2018-05-28
Degree-granting institution: University of Chinese Academy of Sciences
Degree conferred at: Institute of Automation, Chinese Academy of Sciences
Keywords: image classification; deep convolutional neural networks; feature fusion; BundleNet; consistency fusion; dynamic gating fusion
Abstract
Image classification, a fundamental problem in computer vision, has been widely applied in many fields: face recognition and intelligent video analysis in security, traffic scene recognition in transportation, content-based image retrieval and automatic photo-album organization on the Internet, medical image recognition, and so on. Generally, the performance of an image classification system depends on two factors: choosing discriminative image features and choosing an efficient classification method. Traditional hand-crafted image features generalize poorly: they are effective only for specific tasks and cannot be adapted to different application scenarios. For example, color-histogram features may work for image classification but are of little help for semantic segmentation. Moreover, hand-crafted features usually emphasize one particular aspect of an image, which further limits their generality: SIFT focuses on interest points in local appearance, the Histogram of Oriented Gradients (HOG) on edge information, and Local Binary Patterns (LBP) on texture. In contrast, features extracted by Deep Convolutional Neural Networks (DCNNs) generalize well: they not only achieve excellent performance on image classification but also transfer well to other computer vision tasks such as object detection, semantic segmentation, and visual tracking.
 
Feature fusion is a highly effective class of machine learning methods for strengthening feature representations, and it has been widely applied to image classification. Traditional fusion methods fall into global fusion strategies and dynamic fusion strategies. A global strategy applies the same fusion rule to all samples, e.g. voting, averaging, Boosting, and Bagging; a dynamic strategy applies different fusion rules to different samples, e.g. Stacking. When fusing deep features, traditional global fusion cannot select feature combinations dynamically, while traditional dynamic fusion does not combine naturally with DCNNs; moreover, deep feature fusion has received relatively little study in recent years. Given the outstanding performance of DCNNs across applications, studying classification methods and implementation techniques based on deep feature fusion, so as to further improve classification accuracy and promote applications in specialized domains, has important research significance and application value.
 
Starting from the hierarchical characteristics of DCNNs and drawing on traditional feature fusion methods, this dissertation studies image classification methods and implementation techniques based on deep feature fusion, proposing image classification methods based on low-level, middle-level, and high-level deep feature fusion, respectively. The main work and innovations are summarized as follows:
 
First, an image classification method based on bundle networks (BundleNet) is proposed. Manually annotating large-scale image data, especially data from specialized domains, is difficult, and a certain amount of annotation error is unavoidable; automatic labeling methods introduce even more label noise. Label noise is therefore a pressing problem in large-scale image classification. Traditional solutions either clean the noisy data by hand, which is costly, or build a dedicated noise-cleaning module, which is hard to design and train. This dissertation therefore proposes BundleNet to strengthen the robustness of DCNNs to label noise. BundleNet is a form of low-level deep feature fusion whose two core elements are image-bundle data and a corresponding image-bundle module. An image bundle is formed by preprocessing and concatenating a small batch of images that share the same label; multiple images carry richer information for characterizing the class and can effectively suppress a certain amount (<50%) of label noise. To endow a DCNN with these properties, the image-bundle module directly replaces the DCNN's input layer and first convolutional layer, yielding a new network called BundleNet. Compared with an ordinary DCNN, BundleNet differs in two respects: first, the input changes from a single image to an image bundle (say, of $m$ images); second, the first-layer convolution kernels have $m$ times as many channels. BundleNet combines the properties of image-bundle data with the feature extraction strength of deep networks and significantly improves robustness to label noise. The idea also applies to the Generalized Linear Model (GLM): the corresponding BundleGLM can be shown to add a "class regularization term" to the GLM objective, i.e. a regularizer that accounts for the correlations between similar images, and this also holds for DCNNs under certain assumptions. Experiments on public datasets show that, at the cost of a very small increase in computational complexity, the proposed method significantly improves the robustness of DCNNs to label noise; further analysis shows that BundleNet also suits classification problems with small training sets.
 
Second, an image classification method based on consistency fusion is proposed. Fusing multiple classifiers or multiple features is an important approach to image classification that effectively combines the strengths of several models to improve performance, and it has received broad attention from researchers. Averaging is one of the simplest and most effective fusion strategies, but its fusion rule is fixed and cannot fully exploit the inner relations among fusion members (such as correlation and consistency). This dissertation therefore designs Consistency-based Fusion (CF), which improves the fused classification model by accounting for the consistency among features. Specifically, the consistency of two features is defined as the inner product of their feature vectors, and a CF Module built on this measure updates the fusion weights for each sample. Assuming all fusion members are DCNNs, the proposed CF system is an image classification model based on middle-level deep feature fusion that can be trained end-to-end; it consists of four parts: feature extraction, class encoding, consistency-based fusion weight updating, and class prediction. Note that the prediction is made from the class vectors produced by the class-encoding step, which differs from conventional Softmax-based classification. Comparative experiments show that the proposed consistency fusion method effectively improves classification performance.
 
Third, an image classification method based on dynamic gating fusion is proposed. Traditional fusion strategies (e.g. voting and averaging) are mostly global: all samples are fused in the same way (for example, averaging uses fixed weights for every sample). A drawback of global fusion is suboptimal complementarity: the optimal fusion rule may differ from sample to sample, so a global rule cannot fully realize the potential of the fused model. This dissertation therefore proposes Dynamic Gating Fusion (DGF), whose core is a gating module (the DGF Module) that controls the fusion weights. The module is an independent deep convolutional neural network that dynamically outputs a fusion weight vector for each input sample and, together with the member models, forms a system that can be trained end-to-end. The proposed DGF system is an image classification model based on high-level deep feature fusion (it fuses the feature vectors output by the Softmax layer); because it outputs a different fusion rule for each sample, it is a sample-sensitive dynamic fusion strategy. Experiments show that, compared with low-level and middle-level deep feature fusion, this scheme fully combines the classification strengths of individual DCNN models and improves performance.
Other Abstract (English)
Image classification is an important and basic task in computer vision and has been applied in many fields, including face recognition and intelligent video analysis in security, traffic scene recognition in transportation, content-based image retrieval, automatic organization of photo albums, medical image recognition, and so on. Two key components of an image classification system are feature extraction and classifier design. Traditional hand-crafted features suffer from low generalization ability: they are typically effective only for specific applications and do not transfer to other tasks; color-histogram features, for example, may work for classification but are of little help for semantic segmentation. Besides, hand-crafted features commonly focus on one specific aspect of an image, which further limits generalization: SIFT on local interest points, HOG on object edges, LBP on textures, etc. In contrast, features extracted by deep convolutional neural networks (DCNNs) not only achieve much higher classification performance but also generalize well to other computer vision tasks, such as object detection, semantic segmentation, and visual tracking.
 
Besides, feature fusion technologies are very effective machine learning methods for increasing the representation ability of features and are widely applied to image classification problems. Previous fusion methods comprise global fusion policies and dynamic fusion policies. A global fusion policy applies one fixed fusion rule to all samples, e.g. voting, averaging, Boosting, and Bagging; a dynamic fusion policy generates different fusion rules for different samples, e.g. Stacking. When working with DCNN features (deep features), global policies cannot dynamically select how features are combined for different samples, while dynamic policies do not combine naturally with DCNNs; moreover, few studies target the fusion of deep features. As DCNNs achieve outstanding performance in various applications, studying classification and implementation techniques based on deep feature fusion has important research significance and application value.
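As a concrete illustration of the global policy described above, weighted averaging can be sketched in a few lines; the member outputs and weights below are made-up toy numbers, not results from the dissertation.

```python
def average_fusion(member_probs, weights):
    """Fuse per-member class-probability vectors for one sample using a
    fixed (global) weight vector: the same weights for every sample."""
    n_classes = len(member_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, member_probs))
            for c in range(n_classes)]

# Two fusion members, three classes; equal weights give plain averaging.
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3]]
fused = average_fusion(probs, [0.5, 0.5])  # [0.4, 0.4, 0.2]
```

Because the weights are fixed, every sample is fused the same way; the dynamic policies discussed later replace the fixed weight vector with one computed per sample.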
 
This dissertation starts from the hierarchical characteristics of DCNNs, combines them with traditional feature fusion technologies, and studies image classification methods and implementations based on deep feature fusion. The main work and innovations of this dissertation are summarized as follows:
 
Firstly, we propose an image classification model named BundleNet. For large-scale image datasets, labeling is difficult and error prone, especially for data from specialized domains; the label noise of some auto-labeling methods can be even larger. Label noise is thus a pressing problem in large-scale image classification. Previous methods mainly rely on hand-cleaning the data or constructing a dedicated "noise cleaning module"; hand-cleaning is expensive and time-consuming, while designing and training a noise-cleaning module is difficult. For this reason, this chapter proposes BundleNet to increase the robustness of DCNNs to label noise. BundleNet is a low-level deep feature fusion model with two key elements: image-bundle data and the image-bundle module. An image bundle is built by preprocessing and concatenating a small batch of images with the same label; such a group carries richer information for representing its class and can suppress the effect of label noise. To combine image bundles with DCNNs, the image-bundle module directly replaces the input layer and the first convolutional layer of a DCNN, which yields BundleNet. Compared with a common DCNN, BundleNet differs in its input (an $m$-image bundle) and its first-layer convolution kernels (with $m$ times as many channels as the original). BundleNet largely increases robustness to label noise by combining the strengths of image bundles with the strong feature extraction ability of DCNNs. In addition, we prove that the corresponding BundleGLM differs from the generalized linear model (GLM) by a "class regularization term", a term that accounts for the correlations between the images in a bundle; this also holds for DCNNs under some assumptions. Experiments on public datasets show that the proposed BundleNet largely increases the robustness of DCNNs to label noise; moreover, BundleNet also fits classification problems with small training data.
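A minimal sketch of the image-bundle construction described above, assuming images are stored as lists of channels; the helper name `make_image_bundle` is ours, not from the dissertation.

```python
def make_image_bundle(images):
    """Concatenate m same-label images along the channel axis, so the
    DCNN's first convolution sees m times as many input channels."""
    bundle = []
    for img in images:          # img: list of channels, each a 2-D list
        bundle.extend(img)
    return bundle

# Three single-channel 2x2 images sharing one label -> a 3-channel bundle
# (m = 3, c = 1, so the first-layer kernels would grow to 3 channels).
imgs = [[[[1, 2], [3, 4]]],
        [[[5, 6], [7, 8]]],
        [[[9, 0], [1, 2]]]]
bundle = make_image_bundle(imgs)
```

In a real BundleNet the concatenation happens in the replaced input layer, and the first convolutional kernels are widened accordingly; this sketch only shows the data layout.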
 
Secondly, we design an image classification system based on a consistency fusion method. Combining multiple classifiers or multiple features can effectively improve image classification performance and has drawn wide research attention. Among such methods, the averaging fusion policy is one of the simplest and most effective, but it lacks the ability to exploit the inner relations among the fusion members, such as correlation and consistency. To this end, we propose an image recognition approach based on the Consistency Fusion (CF) method, which aims to improve classification performance by considering the consistency among the member feature vectors. In detail, we first define the inner product as the consistency measure between feature vectors and then design a CF Module based on this measure; the CF Module updates the fusion weights for each sample. The whole system is an image classification model based on the fusion of middle-level deep features and can be trained end-to-end if all fusion members are DCNNs. The proposed CF system comprises four key steps: deep feature extraction, class encoding, consistency fusion, and prediction. Notably, the prediction is based on the class vectors obtained by the class-encoding operation, which differs from the Softmax method. Comparative experiments show that the proposed CF method effectively exploits the consistency between feature vectors and boosts classification performance.
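The inner-product consistency measure can be sketched as follows. This simplified version weights each member by its total consistency with the other members for one sample; it is our illustration of the idea, not the full CF Module with class encoding.

```python
def dot(a, b):
    """Inner product of two feature vectors (the consistency measure)."""
    return sum(x * y for x, y in zip(a, b))

def consistency_weights(features):
    """Per-sample fusion weights from inner-product consistency: members
    whose feature vectors agree with the others receive larger weights."""
    m = len(features)
    raw = [sum(dot(features[i], features[j]) for j in range(m) if j != i)
           for i in range(m)]
    total = sum(raw)
    return [r / total for r in raw]

# Members 0 and 1 agree; member 2 points elsewhere and is down-weighted.
w = consistency_weights([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

For this toy input the agreeing members each get weight 0.5 and the outlier gets 0; a production version would also need to handle the degenerate case where all pairwise consistencies are zero.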
 
Thirdly, we develop an image classification method based on a dynamic gating module. Most traditional feature fusion methods (such as voting and averaging) are global fusion policies: one overall fusion rule is applied to all samples (for example, the same weights are used in averaging fusion). This may suffer from the suboptimal-complementarity problem: under a global policy, different samples cannot each obtain their own optimal fusion rule. To solve this problem, we propose an image classification method based on a dynamic gating module, named Dynamic Gating Fusion (DGF). The key lies in the DGF Module, which dynamically generates the coefficients of a weighted-averaging policy. This module can be an independent DCNN that outputs a different weight vector for each input sample, and the whole classification system can be trained end-to-end if all fusion models are DCNNs. The proposed DGF method is an image classification model based on the ensemble of high-level deep features and forms a dynamic ensemble policy by generating different fusion weight vectors for different input samples. Experiments on public image datasets show that DGF can fully combine the advantages of the individual DCNNs and boost classification performance.
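The per-sample gating step can be sketched like this. In the dissertation the gate logits come from an independent DCNN that sees the input sample; here they are supplied directly as a stand-in, and `dgf_fuse` is our hypothetical name.

```python
import math

def softmax(z):
    """Numerically stable softmax, turning gate logits into weights."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def dgf_fuse(member_probs, gate_logits):
    """Weight the members' Softmax outputs by per-sample gate weights,
    so each sample gets its own fusion rule."""
    w = softmax(gate_logits)
    n_classes = len(member_probs[0])
    return [sum(wi * p[c] for wi, p in zip(w, member_probs))
            for c in range(n_classes)]

# A gate favouring member 0 shifts the fused prediction toward it.
probs = [[0.7, 0.3], [0.4, 0.6]]
fused = dgf_fuse(probs, [2.0, 0.0])
```

With equal gate logits this reduces to plain averaging; end-to-end training would backpropagate through both the gate and the member networks.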
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/21002
Collection: Graduates_Doctoral Dissertations
Author affiliations: 1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
Recommended citation (GB/T 7714):
李成华. 基于深度特征融合的图像分类方法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2018.