基于哈希的近似近邻搜索方法研究

CASIA OpenIR > 毕业生 > 博士学位论文

	基于哈希的近似近邻搜索方法研究
	冷聪
	2016-05-27
学位类型	工学博士
中文摘要	近年来，随着互联网的快速发展，网络上的数据呈爆炸式增长。如何从海量数据中快速搜索用户所需信息已成为一个至关重要的问题，然而传统的暴力搜索方法由于其高昂的计算代价变得不切实际。另一方面，由于实际问题中数据的维度通常比较高，使得基于树结构的搜索方法遇到维度灾难问题。在这个背景下，近似近邻搜索技术成为解决海量数据搜索问题的一个新的突破口，而哈希方法是一种重要的近似近邻搜索方法。这种方法将数据从原始空间映射到二值的汉明(Hamming)空间，使原始空间中相似的样本在汉明空间中也相近，反之亦然。通过这种方式，原始空间中的搜索问题可以转换到汉明空间中进行。由于哈希方法在搜索速度和存储上的显著优势，其已在机器学习、计算机视觉和数据挖掘等领域取得了广泛关注，一系列算法相继被提出。尽管如此，哈希方法中仍有一些重要的问题很少被提及，比如流式数据(streaming data)的哈希函数更新问题、分布式数据的哈希函数学习问题等。本论文围绕上述问题展开深入研究，并在计算机视觉问题中进行应用。具体工作如下： 1. 针对流式数据的哈希函数学习问题，本文提出了一种基于数据素描(data sketching)的在线哈希学习方法。实际检索系统中数据通常以流式的形式出现。然而，大多数现有的哈希方法都假设训练数据是事先给定且不会发生变化的，因此很难有效地处理实际问题中的流式数据。本文通过数据素描方法在线地为流式数据维护一个很小的素描，然后基于该素描进行哈希函数学习。为了保持哈希学习过程中数据的零均值特性，本文还提出了一种新的零均值素描方法，并从理论上证明了该素描方法能够保持数据中哈希学习所需要的主要特性。 2. 针对分布式数据的哈希函数学习问题，本文提出了一种分布式哈希方法。我们引入了一种基于最小化量化误差的集中式哈希学习模型，并基于该模型设计了一套分布式哈希学习框架。通过引入一致性约束，本文对集中式哈希模型进行形式上的分解，并通过ADMM算法对分解后的模型进行分布式求解。实验证明该求解方法能很快地收敛到一个较好的解。本文提出的分布式算法没有对网络做任何假设，因此可以适用于任意网络拓扑结构。另外，在整个训练的过程中各个计算节点间只需要交换一些中间变量而无需交换训练数据，因此本文方法的通信代价非常小。 3. 为了解决基于矩阵特征值分解的哈希方法中长编码噪声大、性能低的问题，本文提出了基于集成学习的哈希方法。该方法仅通过信息量最大的若干个特征向量生成短的哈希编码，然后将多段短编码串联成长编码。为了保证各段短编码间的多样性，本文借鉴了两种著名的集成学习方法，即Bagging算法和随机子空间算法，分别在样本空间和特征空间中随机采样，得到BPCAH和RPCAH两种哈希方法。本文将提出的方法和著名的LSH算法进行类比，并从理论上证明了基于本文方法生成的哈希编码性能会随着编码长度的增加而增强。 4. 为了加速三维重建中的图像匹配过程，本文提出一种层级哈希算法。本文将图像匹配问题抽象成搜索问题，通过哈希方法将图像的特征点从欧式空间映射到汉明空间，然后在汉明空间中进行快速匹配。本文提出的层级哈希算法通过多表哈希查找、哈希重投影以及快速哈希排序三层结构来实现由粗到细的搜索过程。本文详细分析对比了层级哈希算法和其它匹配方法的时间复杂度。在精度相当的情况下，层级哈希算法的匹配速度可以达到目前三维重建中最常用的$Kd$-tree方法的10倍以上。
英文摘要	With the rapid advances of internet, we have entered the age of big data. How to search the needed information fast from massive data becomes a crucial problem. In this case, traditional exhaustive search is infeasible due to its high computation complexity. Moreover, the real world data is often high dimensional, which incurs the “curse of dimensionality” problem of tree based search methods. Against this background, approximate nearest neighbor (ANN) search becomes a new breakthrough point for the search of massive data. Hashing is an important ANN technique. In hashing methods, data will be embedded from the original space to binary Hamming space so that similar samples in the original space will have close hashing codes, and vice versa. In this way, the search process can be efficiently conducted in the Hamming space. Owing to the remarkable advantages in both search speed and storage, hashing has attracted much attention from the researchers in machine learning, computer vision and data mining. A series of hashing algorithms have been proposed. However, there are still some important challenges rarely discussed in existing works, such as the learning of hashing functions for streaming data and distributed data. This thesis mainly focuses on these challenges and the application of hashing in computer vision. The main contributions can be summarized as follows: 1. We propose a novel online hashing method to address the hashing learning problem for streaming data. The proposed method is innovated from the idea of data sketching, and allows a dramatic reduction in computation and storage overhead. In real world applications, the data often comes in a streaming fashion, but most of existing hashing methods are batch based models. Our approach maintains a small size sketch for the streaming data online, and learns hashing functions on the sketch rather than the whole dataset. To overcome the mean-varying problem in online hash function learning, we propose a new algorithm to sketch the zero-meaned stream data online. We theoretically prove that the obtained sketch can preserve most of the useful information in data for hashing function learning. 2. We propose a novel distributed hashing method for data which is distributed across different nodes in a network. We introduce a centralized hashing model within the framework of minimizing quantization distortion. We then cast this centralized hashing model as a set of decentralized subproblems with consensus constraints and solve them with ADMM algorithm. Experiments demonstrate that our method can fast converge to a local optimum. Since we make no assumption about the network, the proposed distributed algorithm can adapt to arbitrary network topology. Besides, since there is no exchange of training data across the nodes in the learning process, the communication cost of our method is very small. 3. For the eigendecomposition based hashing methods, the information caught by different eigenvectors is unbalanced, which causes that longer hashing codes are often noisy with inferior performance. We propose ensemble learning based hashing methods to address this problem. Our methods use only the top eigenvectors to generate a piece of short but strong code and repeat this process several times, then we concatenate many pieces of short codes into one piece of long code. In order to guarantee the diversity among different pieces of short codes, we adopt the bagging and random subspace strategy to sample in the data and feature space respectively, leading to two hashing algorithms BPCAH and RPCAH. We compare the proposed methods with the famous LSH algorithm, and theoretically prove that longer codes tend to yield better performance than the shorter ones under the ensemble learning strategy. 4. We propose a cascade hashing structure to speed up the image matching process for 3D reconstruction. We find that the image matching problem can be essentially converted into a nearest neighbor search problem, and use hashing method to map the image descriptors from Euclidean space to Hamming space. The proposed cascade hashing is a coarse to fine strategy with three steps, i.e., multi-table hashing lookup, hashing remapping and hashing based top $k$ ranking. We compare the time complexity of our cascade hashing with other well-known matching methods. It is easy for our method to realize 10 times or more acceleration than $Kd$-tree based image matching with comparable accuracy.
关键词	近似近邻搜索哈希函数在线学习分布式学习集成学习三维重建
语种	中文
文献类型	学位论文
条目标识符	http://ir.ia.ac.cn/handle/173211/11783
专题	毕业生_博士学位论文
作者单位	中科院自动化研究所
第一作者单位	中国科学院自动化研究所
推荐引用方式 GB/T 7714	冷聪. 基于哈希的近似近邻搜索方法研究[D]. 北京. 中国科学院大学,2016.

条目包含的文件
文件名称/大小	文献类型	版本类型	开放类型	使用许可
博士论文-冷聪.pdf（11535KB）	学位论文		限制开放	CC BY-NC-SA