基于特征表示和度量学习的大规模目标检索 (Large-Scale Object Retrieval Based on Feature Representation and Metric Learning)
郭海云 1,2
2018-05-28
Degree Type: Doctor of Engineering
Abstract (Chinese): With the rapid development of the Internet and the growing prevalence of image acquisition devices, video and image data have been growing explosively. For such massive image data, how to efficiently and accurately retrieve the object information relevant to a user's query is the key problem of large-scale object retrieval, and a research focus shared by academia and industry. The two core elements of object retrieval are an efficient object feature representation and an effective feature distance metric. Feature representation learning aims to learn discriminative object features from images, while metric learning seeks a distance metric between features that faithfully reflects the semantic similarity between objects.
 
Traditional object retrieval methods usually describe objects with hand-crafted image features. However, such low-level features cannot capture the rich high-level semantic content of objects, which creates a "semantic gap". Moreover, traditional methods treat feature extraction and distance measurement as two separate steps, whereas in fact an efficient feature representation reduces the difficulty of metric learning, and effective metric learning in turn helps to learn a more discriminative feature representation. To address these shortcomings, and considering the advantages of convolutional neural networks (CNNs) in visual recognition tasks, this dissertation carries out in-depth research from the perspectives of feature representation and metric learning, and proposes several CNN-based object retrieval methods that significantly improve large-scale object retrieval performance. On the one hand, a CNN can extract rich semantic information from images and thus learn feature representations at a higher semantic level. On the other hand, with a suitable network architecture and supervision loss, feature representation and metric learning can be integrated into an end-to-end framework, so that a more discriminative representation and a more effective distance metric are learned simultaneously.
 
The main research contents and contributions of this dissertation are summarized as follows:
1. To address the low retrieval efficiency caused by high-dimensional CNN features in large-scale object retrieval, this dissertation proposes an object retrieval method based on compact feature representation learning. The method uses an auto-encoder to compress high-dimensional CNN features into low-dimensional binary codes. On this basis, bootstrap aggregating (Bagging) is adopted to combine multiple auto-encoders, which effectively reduces the generalization error and further improves retrieval accuracy. In addition, the bagged auto-encoders are well suited to parallel computation, which guarantees the efficiency of both training and retrieval. Experimental results show that the method significantly accelerates object retrieval while keeping the loss in retrieval accuracy small.
 
2. To address the limited ability of CNNs trained for semantic category classification to describe the color attributes of objects, this dissertation proposes an object retrieval method based on learning multiple deep convolutional features. Specifically, a color CNN is designed to extract deep convolutional color features of the object, which are more descriptive and discriminative with respect to color than traditional color features. These features are then fused with the deep convolutional structure features extracted by a CNN trained for semantic category classification, forming a comprehensive multi-view object representation. Experimental results show that this retrieval framework, built on the joint expression of color and structure attributes, effectively improves the retrieval performance over any single deep convolutional feature.
 
3. To address the insufficient discriminative power of object features extracted by CNNs trained only with a classification loss in multi-view 3D object retrieval, this dissertation proposes a multi-view 3D object retrieval method based on a deep embedding network. Specifically, a deep embedding network jointly optimized with a classification loss and a triplet loss maps object images into a Euclidean metric space, where the Euclidean distance between features directly reflects the semantic similarity between objects. The training of the embedding network is therefore equivalent to end-to-end learning of the feature representation and the distance metric, so the method learns a more discriminative representation and a more effective metric at the same time. Experimental results show that the method outperforms the previous best 3D object retrieval method by 12%.
 
4. To address the complex intra-class and inter-class variations in vehicle retrieval, this dissertation proposes a vehicle retrieval method based on structured feature metric learning. Specifically, a hierarchical ranking loss is designed to pull images of the same vehicle tightly together while enlarging the margins between different vehicles and between different vehicle models. Supervised by this loss, the CNN learns a structured feature metric space in a coarse-to-fine manner, which strengthens intra-class compactness and inter-class discrimination and thereby captures the multi-level semantic relations among vehicle images. Experimental results show that the method improves on the previous best vehicle retrieval method by about 10%. In addition, this dissertation releases the largest vehicle retrieval dataset to date, containing nearly one million vehicle images captured under different illumination, viewpoints, and surveillance scenes, which will effectively advance research on vehicle retrieval.
Abstract (English): With the rapid development of the Internet and the increasing popularity of image acquisition devices, recent years have witnessed an explosive growth of video and image data. Large-scale object retrieval, which aims to efficiently and accurately retrieve the objects relevant to a user's query from massive image data, is a challenging task and a research hotspot in both academia and industry. The two most crucial components of object retrieval are an efficient feature representation and an effective distance metric. Feature representation learning aims to learn discriminative object features from images, while metric learning aims to learn an effective distance metric between features that measures the semantic similarity between objects.
 
Traditional object retrieval methods usually adopt hand-crafted image features to describe objects. However, such low-level features cannot adequately describe the rich high-level semantic content of objects, which creates a large "semantic gap". In addition, traditional object retrieval pipelines treat feature extraction and distance measurement as two independent stages, whereas in fact an efficient feature representation can reduce the difficulty of metric learning, and effective metric learning can in turn benefit the learning of a more discriminative feature representation. To address these shortcomings, and considering the advantages of convolutional neural networks (CNNs) in visual recognition tasks, this dissertation conducts in-depth research on both feature representation and metric learning, and proposes several CNN-based object retrieval methods that significantly improve the performance of large-scale object retrieval. On the one hand, a CNN can extract rich semantic information from images and thus learn higher-level semantic feature representations. On the other hand, with a suitable CNN architecture and loss function, feature representation and metric learning can be integrated into an end-to-end learning framework, so that a more discriminative feature representation and a more effective distance metric are learned at the same time.
 
The main contributions of this dissertation are summarized as follows:
1. This dissertation proposes an object retrieval method based on compact feature representation learning, to address the low efficiency of high-dimensional CNN features in large-scale object retrieval. The method uses an auto-encoder to compress high-dimensional real-valued CNN features into low-dimensional binary codes. Bootstrap aggregating (Bagging) is then adopted to combine multiple auto-encoders, which reduces the generalization error and further improves retrieval accuracy. In addition, the bagged auto-encoders are well suited to parallel computation, so the efficiency of both training and retrieval is guaranteed. Experimental results demonstrate that the method significantly accelerates object retrieval with only a small loss in accuracy; a minimal code sketch of this scheme is given after this list.
 
2. This dissertation proposes an object retrieval method based on learning multiple deep convolutional features, to address the inadequate color description of features extracted from CNNs trained for semantic category classification. First, a color CNN is designed to learn deep convolutional color features of the object, which are more descriptive and discriminative with respect to color than traditional color features. These features are then fused with the deep convolutional structure features extracted from a CNN trained for semantic category classification, forming a comprehensive multi-view object representation. Experimental results demonstrate that this method, built on the co-expression of color and structure attributes, effectively improves the retrieval performance over any single deep convolutional feature; a fusion sketch follows the list below.
 
3. This dissertation proposes a multi-view 3D object retrieval method based on a deep embedding network, to address the insufficient discriminative power of CNN features supervised only by a classification loss. Specifically, a deep embedding network jointly supervised by a classification loss and a triplet loss maps the input object image into a feature embedding space, where the Euclidean distance between features directly corresponds to the semantic similarity between objects. The learning of the embedding network is therefore end-to-end learning of both the feature representation and the distance metric, so the method learns a more discriminative representation and a more effective metric at the same time. Experimental results show that the method outperforms the state-of-the-art 3D object retrieval approach by over 12%; a sketch of this joint supervision also follows the list.
 
4. This dissertation proposes a vehicle retrieval method based on structured feature embedding learning, to address the complex intra-class and inter-class variations in vehicle retrieval. Specifically, a hierarchical ranking loss pulls images of the same vehicle tightly together while pushing different vehicles, as well as different vehicle models, apart. Supervised by this loss, the CNN learns a structured feature embedding space in a coarse-to-fine manner, enhancing intra-class compactness and inter-class discrimination and thus characterizing the multi-level semantic similarity between vehicle images. Experimental results demonstrate that the proposed method improves on the state-of-the-art retrieval performance by about 10%. In addition, this dissertation releases the largest vehicle retrieval dataset to date, which contains nearly one million vehicle images captured in various surveillance scenarios and can effectively advance research in vehicle retrieval; a sketch of such a hierarchical loss closes the list below.
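The bagging auto-encoder scheme described in the first contribution could be organized roughly as in the following PyTorch sketch: each auto-encoder compresses CNN features into short binary codes, several encoders are trained on bootstrap samples, and retrieval ranks by Hamming distance over the concatenated codes. The feature dimension, layer sizes, code length, binarization threshold, and training loop are illustrative assumptions, not the dissertation's actual configuration.

```python
# Sketch of contribution 1: compress high-dimensional CNN features into short binary
# codes with auto-encoders, combine several of them by bagging, and rank by Hamming
# distance. Dimensions, thresholds and the training loop are illustrative assumptions.
import torch
import torch.nn as nn

class BinaryAutoEncoder(nn.Module):
    def __init__(self, in_dim=4096, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))

    def forward(self, x):
        code = self.encoder(x)                      # bottleneck activations in (0, 1)
        return code, self.decoder(code)

    def binarize(self, x):
        # Threshold the bottleneck activations to obtain a binary code.
        return (self.encoder(x) > 0.5).to(torch.uint8)

def train_bagged_encoders(features, num_models=4, epochs=20):
    """Train several auto-encoders, each on a bootstrap sample of the feature set."""
    models = []
    n = features.size(0)
    for _ in range(num_models):
        sample = features[torch.randint(0, n, (n,))]      # bootstrap resampling
        model = BinaryAutoEncoder(in_dim=features.size(1))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            _, recon = model(sample)
            loss = nn.functional.mse_loss(recon, sample)  # reconstruction loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        models.append(model)
    return models

def hamming_rank(models, query, database):
    """Concatenate the binary codes from all encoders and rank by Hamming distance."""
    with torch.no_grad():
        q = torch.cat([m.binarize(query) for m in models], dim=1)
        db = torch.cat([m.binarize(database) for m in models], dim=1)
    dist = (q.unsqueeze(1) != db.unsqueeze(0)).sum(dim=2)  # (num_query, num_db)
    return dist.argsort(dim=1)                             # ranked database indices
```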
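For the second contribution, one simple way to fuse a deep color feature with a deep structure feature is an L2-normalize-then-concatenate scheme, sketched below. The two backbone networks (`color_net`, `structure_net`) and the fusion weight are placeholders assumed for illustration; the dissertation's actual color CNN and fusion strategy may differ.

```python
# Sketch of contribution 2: fuse a "color" CNN feature with a "structure" CNN feature
# into a single descriptor and rank database images by cosine similarity.
import torch
import torch.nn.functional as F

def fuse_features(color_net, structure_net, images, weight=0.5):
    """Weighted concatenation of the two L2-normalized deep features."""
    with torch.no_grad():
        f_color = color_net(images)          # (N, Dc) deep convolutional color features
        f_struct = structure_net(images)     # (N, Ds) deep convolutional structure features
    f_color = F.normalize(f_color, p=2, dim=1)
    f_struct = F.normalize(f_struct, p=2, dim=1)
    # Normalization keeps the two views comparable in scale before concatenation.
    return torch.cat([weight * f_color, (1.0 - weight) * f_struct], dim=1)

def retrieve(query_desc, db_desc, topk=10):
    """Rank database descriptors by cosine similarity to each query descriptor."""
    sims = F.normalize(query_desc, dim=1) @ F.normalize(db_desc, dim=1).t()
    return sims.topk(topk, dim=1).indices
```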
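The joint classification-and-triplet supervision of the third contribution can be written as a combined loss on top of a shared embedding, as in this sketch. The backbone, embedding dimension, margin, and loss weight are assumptions, not the dissertation's exact settings.

```python
# Sketch of contribution 3: a deep embedding network jointly supervised by a
# classification loss and a triplet loss, so that Euclidean distances in the
# embedding reflect semantic similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    def __init__(self, backbone, feat_dim, embed_dim=128, num_classes=40):
        super().__init__()
        self.backbone = backbone                      # any CNN returning (N, feat_dim)
        self.embed = nn.Linear(feat_dim, embed_dim)   # embedding layer
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        emb = F.normalize(self.embed(self.backbone(x)), dim=1)
        return emb, self.classifier(emb)

def joint_loss(emb, logits, labels, anchor_idx, pos_idx, neg_idx,
               margin=0.2, weight=1.0):
    """Cross-entropy on the class logits plus a triplet loss on the embedding."""
    cls_loss = F.cross_entropy(logits, labels)
    trip_loss = F.triplet_margin_loss(emb[anchor_idx], emb[pos_idx], emb[neg_idx],
                                      margin=margin)
    return cls_loss + weight * trip_loss
```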
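Finally, a hierarchical ranking loss in the spirit of the fourth contribution might enforce increasing distance margins from images of the same vehicle, to a different vehicle of the same model, to a vehicle of a different model. The margins and the use of Euclidean distance below are assumptions; the dissertation's exact formulation may differ.

```python
# Sketch of contribution 4: a coarse-to-fine hierarchical ranking loss for vehicles.
import torch
import torch.nn.functional as F

def hierarchical_ranking_loss(anchor, same_vehicle, same_model, diff_model,
                              margin_id=0.2, margin_model=0.4):
    """Encourage d(anchor, same_vehicle) + margin_id  < d(anchor, same_model)
       and       d(anchor, same_model)  + margin_model < d(anchor, diff_model)."""
    d_pos   = F.pairwise_distance(anchor, same_vehicle)
    d_id    = F.pairwise_distance(anchor, same_model)
    d_model = F.pairwise_distance(anchor, diff_model)
    # Fine level: other vehicles of the same model must lie farther than the same vehicle.
    loss_fine = F.relu(d_pos - d_id + margin_id)
    # Coarse level: vehicles of a different model must lie farther still.
    loss_coarse = F.relu(d_id - d_model + margin_model)
    return (loss_fine + loss_coarse).mean()
```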


Keywords: Object Retrieval; Feature Representation; Metric Learning; Convolutional Neural Network
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/20940
Collection: Graduates_Doctoral Dissertations
Author Affiliations: 1. University of Chinese Academy of Sciences; 2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
First Author's Affiliation: National Laboratory of Pattern Recognition
Recommended Citation (GB/T 7714):
郭海云. 基于特征表示和度量学习的大规模目标检索[D]. 北京: 中国科学院研究生院, 2018.
Files in This Item:
郭海云博士学位论文.pdf (6963 KB) | Document Type: Thesis | Access: Restricted | License: CC BY-NC-SA