快速、鲁棒的半监督学习算法研究

CASIA OpenIR > 毕业生 > 博士学位论文

	快速、鲁棒的半监督学习算法研究
其他题名	A Study on Fast and Robust Semi-Supervised Learning Algorithm
	张燕明
	2011-06-01
学位类型	工学博士
中文摘要	机器学习作为智能信息处理的代表性方法在近几十年中取得了飞跃式的发展。然而经典的监督学习需要大量人工标记的数据作为训练样本；由于标记样本往往代价高昂，从而大大提高了监督学习方法的使用成本。因此，人们转而寻求能够使用少量标记样本和大量无标记样本的学习方法，由此产生了对半监督学习问题的广泛关注。早期对半监督学习的研究集中于如何在统一的框架内使用标记样本和无标记样本，如何建立新的模型以及新的学习方法；往往忽视了半监督学习方法的实用性。这具体表现在多数半监督学习方法的计算复杂度过高，而且在实际应用中的性能不稳定。本文的研究目的在于通过改进模型、使用近似算法等手段，提出实用有效的半监督学习算法，使其在精度、鲁棒性、计算复杂性方面达到应用的要求。具体地，本文针对半监督学习中最具代表性的基于图的半监督学习方法进行了全面的研究，并在构图、标记学习算法等方面提出了改进方法。本文的主要内容和贡献包括：提出一种针对保持数据局部性的构图方法(LPGC)。将欧式距离作为距离度量，LPGC通过最小化所有的节点间加权距离和来构造图。经过求解一个二次规划问题， LPGC可以同时学习图上的连接关系和边上的权重。由于该二次规划问题的解通常是稀疏的，我们利用这一性质设计了基于切平面方法的优化算法，从而大大提高了效率。经过聚类和分类实验，LGPC方法表现出了比常用的$k$近邻图和$\epsilon$近邻图更出色的性能。提出一种基于自适应图的直推学习方法(TLAG)，它的最大特点是同时学习样本标记和构造图。这样做的意义在于，TLAG方法可以利用样本的标记信息指导图的构造。TLAG使用一个迭代算法优化目标函数。每轮迭代完成两个操作：第一步，基于当前图和当前的预测标记，学习一张新图；第二步，使用新图，重新预测数据标记。实验表明，TLAG方法相比传统基于图的方法在精度方面有一定的提升。提出一种基于最小张成树的快速、鲁棒的直推学习方法(MTC)，并证明最小张成树结构对图参数选取的鲁棒性。MTC方法由两步构成：首先，使用最小张成树近似表示图；然后，通过割最小化准则对张成树进行标定。除去最小张成树构造简单之外，它对于图的构造十分鲁棒。对于$\epsilon$近邻图和径向基权重函数，我们严格地证明了最小张成树的连接结构对于$\epsilon$的选取具有不变性。更重要地，基于树的最小割标记算法具有线性的时间和空间复杂度，使MTC方法非常适宜在大规模问题中使用。大量的实验表明，MTC方法在鲁棒性和速度方面相对传统方法有明显改善。在半监督学习中使用低维嵌入假设，提出子空间正则化方法以及基于子空间正则化的半监督学习算法。方法的出发点在于：通过将高维数据投影到低维空间，缓解了维度灾难的影响，使分类器的训练更为简单。具体的，我们提出了一个优化准则用于同时学习降维子空间和定义在子空间上的分类器。优化目标含有两个部分：数据拟合项用于评价不同类别的标记样本在降维子空间中的可分程度；而正则项用于评价在降维过程中数据信息的损失程度。相比使用平滑假设的基于图的方法，新方法在处理图同类别相互重叠等情况时具有...
英文摘要	In many machine learning applications, labeled data are scarce because labeling data is both time-consuming and expensive. However, unlabeled data are very easy to collect in many applications such as text categorization and image classification. This has motivated machine learning researchers to develop learning methods that can exploit both labeled and unlabeled data during model learning. Such a learning paradigm developed over the past decade or so is referred to as semi-supervised learning. Although semi-supervised learning has been an active area of research, its use in real world application are still limited because most of existing methods are lack of scalability and robustness. This thesis proposes efficient and robust semi-supervised learning methods for real-world applications. Its main contributions are summarized as follows, We proposed a graph construction method with the aim of preserving local information of the data set. By formulating the task as a quadratic programming problem, the method learns the edges and weights simultaneously. Exploiting the sparsity of the graph, we further propose a more efficient cutting plane algorithm to solve the optimization problem. We proposed a transductive learning method based on adaptive graphs, which enhances the performance of graph-based transductive learning by learning the graph and label inference simultaneously. The underling idea is to improve the construction of graph by using label information. We gave a simple iterative algorithm to optimize the objective function. Each iteration contains two steps: first, we fix the label and learn a new graph; then, for the new graph, we update the prediction. We proposed an efficient and robust algorithm for graph-based transductive classification. After approximating a graph with a minimum spanning tree, we develop a linear-time algorithm to label the tree such that the cut size of the tree is minimized. In addition to its great scalability on large data, our proposed algorithm demonstrates high robustness and accuracy. We also prove the minimum spanning tree structure facilitates the graph parameter selection. We introduce into semi-supervised learning the classic low-dimensionality embedding assumption, and propose a semi-supervised learning algorithm based on this assumption. The motivation here is that we hope to find a low-dimensional representation of data so as to make the labeled data denser and therefore much easier for training...
关键词	半监督学习直推学习流形学习基于图的机器学习方法 Laplacian矩阵 Semi-supervised Learning Transductive Learning Manifold Learning Graph-based Machine Learning Methods Graph Laplacian
语种	中文
文献类型	学位论文
条目标识符	http://ir.ia.ac.cn/handle/173211/6383
专题	毕业生_博士学位论文
推荐引用方式 GB/T 7714	张燕明. 快速、鲁棒的半监督学习算法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2011.

条目包含的文件
文件名称/大小	文献类型	版本类型	开放类型	使用许可
CASIA_20071801462807（1488KB）			暂不开放	CC BY-NC-SA