The research on data characteristics of dimensionality reduction

	The research on data characteristics of dimensionality reduction
Alternative title	The research on data characteristics of dimensionality reduction
	毕华
	2009-06-03
Degree type	Doctor of Engineering
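Chinese abstract	Dimensionality reduction is an important research direction in machine learning. In recent years, as high-dimensional, massive, and uncontrolled data have become the norm, dimensionality reduction algorithms have once again become a focus of attention. High-dimensional data force us to confront the curse of dimensionality: the continual growth in the number of dimensions poses great challenges for pattern recognition and data analysis on high-dimensional data. At the same time, this growth also brings the "blessings of dimensionality": the rich information contained in high-dimensional data can open up new possibilities for solving problems. How to represent high-dimensional data in a low-dimensional space, and thereby discover its intrinsic structure, is one of the key problems in high-dimensional information processing. Traditional machine learning algorithms follow two different modeling paradigms: global learning and local learning. Global learning captures the global properties of the data and builds a single unified model, whereas local learning induces a model from the local characteristics of the data. Based on the different characteristics of local learning, we divide local learning algorithms into three types. In statistical inference, robustness means that when the source of the real data deviates from the model assumptions, the results of the algorithm are only slightly perturbed and its predictive performance is preserved. Introducing the research methods of statistical robustness into dimensionality reduction algorithms, we show by analysis that neighborhood locally weighted estimation, a form of local learning, converges to the Bayes optimal estimate in the large-sample case, and that the convergence conditions indicate that it is a robust estimator. Experiments on simulated data and real databases show that the generalization performance of supervised prediction is maintained even when certain outliers affect the model. Boosting algorithms attempt to approximate a complex natural model with a linear combination of weak learners and, owing to their excellent interpretability and predictive power, have attracted great attention in the computing community. However, Boosting has been regarded only as an optimization problem under a specific loss. We propose viewing Boosting from a statistical perspective: within the statistical framework, the Boosting algorithm is merely a special case of resampling methods. Machine learning algorithms at present emphasize algorithmic performance while neglecting the properties of the data, treating predictive accuracy as the sole measure. We hope to change this situation by improving predictive power while attending to statistical interpretability. The main results of this thesis are: 1. We analyze the two paradigms of machine learning, global learning and local learning; divide local learning algorithms into three classes (neighborhood locally weighted learning algorithms, local model algorithms, and local manifold algorithms); and describe the main algorithms in some detail. 2. We discuss the concept and classification of statistical robustness, propose the notion of algorithmic robustness in machine learning, analyze how different kinds of noise affect data, and analyze the robustness of one particular local learning algorithm, the neighborhood locally weighted algorithm. 3. We introduce the development of the Boosting algorithm and give a fairly detailed survey of the history of resampling methods; propose four steps of a machine learning algorithm (sample collection, sampling strategy, algorithm type, and ensemble method); and analyze the statistical nature of Boosting, namely that the Boosting algorithm is merely a special case of resampling methods.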
English abstract	Dimension reduction is one of the most important research directions in machine learning. In recent years especially, high-dimensional, large-volume data have been generated in an uncontrolled manner, and the study of dimension reduction has once again become a focus of attention. We must face the curse of dimensionality, which challenges pattern recognition and data analysis on high-dimensional data. At the same time, the blessings of dimensionality show that the abundant information in high-dimensional data sets opens up new possibilities. How to represent high-dimensional data in a low-dimensional space and discover its intrinsic structure is a pivotal problem in high-dimensional information processing. Dimension reduction, as an effective method for overcoming the curse of dimensionality, has attracted broad attention, and related research is growing rapidly. There are two different paradigms in machine learning: global learning and local learning. Global learning focuses on describing a phenomenon or modeling data in a global way. Local learning, on the other hand, does not attempt to summarize a phenomenon but builds learning systems by concentrating on local parts of the data. According to the different characteristics of local learning, local learning algorithms are divided into three types. Robustness in statistical inference means that when the real data depart from the assumed sample distribution, the results of the algorithm are only slightly perturbed and its prediction performance is preserved. The research methods of statistical robustness are introduced into dimension reduction. The neighborhood weighted estimation algorithm, which is a kind of local learning, converges to the Bayes optimal estimate when the number of samples is large. At the same time, the convergence conditions show that the neighborhood weighted estimation algorithm is a robust algorithm.
In addition, experimental results on synthetic and real-world data sets are given. The generalization performance of this algorithm is maintained when the model is affected by some outliers. The boosting algorithm approximates a complex natural model by a linear combination of weak learners. Owing to its excellent interpretability and predictive power, boosting has become a focus of intense interest in computer science. However, it has been considered only as an optimization procedure with a specific loss function. In essence, a statistic...
Keywords	Machine Learning; Dimension Reduction; Local Learning; Robustness; Resampling; Boosting
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6217
Collection	Graduates_Doctoral Dissertations
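Recommended citation (GB/T 7714)	毕华. The research on data characteristics of dimensionality reduction [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2009.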

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20041801462807 (3472KB)			Not yet open	CC BY-NC-SA