非参数核密度聚类与特征提取算法研究

CASIA OpenIR > 毕业生 > 博士学位论文

	非参数核密度聚类与特征提取算法研究
其他题名	Algorithm Study on Non-Parametric Kernel Density Clustering and Feature Extraction
	袁晓彤
	2009-05-15
学位类型	工学博士
中文摘要	聚类分析的目标是不依赖或者利用少量先验知识将给定的有限数据集合划分成一组反映数据自然结构的子集合，属于数据处理的基本问题之一。随着计算技术的不断发展，聚类分析方法已经成为机器学习，人工智能，模式识别以及数据挖掘等领域的研究热点，同时也是难点问题之一。特征提取是对原始数据特征进行变换产生一些有用的或者新的特征，以降低聚类或者分类学习问题的难度。本文主要研究基于非参数核密度估计的数据聚类和特征提取算法。我们首先简要回顾了聚类分析与特征提取的问题描述和研究现状。随后我们系统介绍了本文重点研究的均值漂移(Mean-Shift)聚类算法，以及若干密切相关的基础理论，包括共轭函数理论，随机梯度学习理论和Renyi熵理论。本文研究的主要贡献包括两个方面：一、Mean-Shift聚类算法新的理论解释和算法扩展。本文对Mean-Shift算法的数学本质和方法扩展进行深入分析和研究，相关贡献包括： 1. 证明当核函数为凸时，Mean-Shift等价于对核密度函数的半二次优化，从而很好地解释了其优越的数值性能。 2. 在半二次优化分析的框架内，推导出Mean-Shift的二次界优化本质，并在此基础上提出了一种核密度数据集覆盖方法。该集合覆盖方法通过构造一组超椭球对数据集进行稀疏覆盖，具有实现简单，运行高效的优点。该集合覆盖过程可以迭代运行直到收敛，从而得到一个称为Agglo-MS的非参数核密度凝聚聚类算法，可以显著加快Mean-Shift聚类的过程。作为Agglo-MS的算法扩展，进一步开发出一种增量非参数核密度聚类算法(IAgglo-MS) 和一种约束非参数核密度聚类算法(CAgglo-MS)。 3. 将传统的离线Mean-Shift算法扩展为一个基于可变学习率的随机梯度上升算法，并分析了其渐进收敛性质。该随机梯度Mena-Shift算法可以应用于大规模数据库或者实时数据流中的密度模式搜索。在人工和实际采集的数据库上大量的对比实验结果验证了本文提出的各种Mean-Shift改进算法的特点和优越性。二、基于信息理论学习的鲁棒特征提取方法。本文提出了一种称为Renyi熵辨别分析(REDA)的鲁棒特征提取算法框架。我们将特征提取的目标函数定义为目标特征的Renyi熵以及目标特征与标号之间的Renyi交叉熵的加和形式，并利用非参数核密度方法对Renyi二次熵/交叉熵进行估计。形式上看，算法框架具备流形正则化和鲁棒M-估计两方面的优点。利用半二次优化技术，所提出的目标函数以迭代的方式进行优化，并且理论上保证收敛。同时，一些常用的特征提取算法，如局部保留投影(LPP)，谱回归辨别分析(SRDA)和Laplacian正则化最小二乘回归(LapRLS)可以看作是REDA框架下的特例。在实际数据集上充分的对比实验结果表明了我们的方法对随机数据噪声和标记噪声都具有良好的鲁棒性。
英文摘要	Cluster analysis is one of the fundamental problems associated with data analysis. It aims to separate a finite data set into a discrete set of natural hidden data structures with little or no prior knowledge. With the rapid development of computational techniques, cluster analysis has become an attractive and challenging research topic in a wide variety of fields, ranging from machine learning, artificial intelligence, pattern recognition and data mining. Feature extraction utilizes some transformations to generate useful and novel features from the original ones, so that clustering or classification tasks can be eased. This thesis focuses on the non-parametric kernel density estimation based clustering and feature extraction methods. We begin by a problem description and background review of the cluster analysis and feature extraction. We then systemically introduce the Mean-Shift (MS) clustering method and some related theories that form the base of this work. The main contributions of this thesis include two: 1. Novel theoretical interpretation and algorithmic extensions of the MS clustering method} We give a deep analysis and study of the mathematical essence behind MS and its algorithmic extensions. The key contributions of this part include: (1) When kernel function is convex, we prove that MS is equivalent to a half-quadratic optimization procedure for the kernel density estimator, and thus interprets the elegant numerical performance of MS. (2) Inside the proposed HQ analysis framework, we revisit the quadratic bounding optimization (QBO) nature of MS. Motivated by the QBO perspective of MS, we develop a simple yet efficient kernel density set covering algorithm which can cover a given data set with a sparse group of hyperellipsoids. By iteratively running the proposed set covering algorithm on a query set, we obtain an agglomerative non-parametric kernel density clustering algorithm called as Agglo-MS, which may significantly accelerate MS clustering procedure. As the extension of Agglo-MS, we develop an incremental non-parametric kernel density clustering algorithm called as IAgglo-MS, and a constrained non-parametric kernel density clustering algorithm called as CAgglo-MS. (3) Furthermore, we extend the classical batch MS into its stochastic gradient version along with asymptotic property analysis. Our stochastic gradient Mean-Shift can be applied to efficient kernel density mode-seeking on large scale data sets or data streams. 2. A...
关键词	聚类分析非参数核密度估计均值漂移半二次优化随机梯度学习特征提取 Renyi熵鲁棒统计学 Cluster Analysis Non-parametric Kernel Density Estimation Mean-shift Half-quadratic Optimization Stochastic Gradient Learning Feature Extraction Renyi's Entropy Robust Statistics
语种	中文
文献类型	学位论文
条目标识符	http://ir.ia.ac.cn/handle/173211/6143
专题	毕业生_博士学位论文
推荐引用方式 GB/T 7714	袁晓彤. 非参数核密度聚类与特征提取算法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2009.

条目包含的文件
文件名称/大小	文献类型	版本类型	开放类型	使用许可
CASIA_20051801462805（10179KB）			暂不开放	CC BY-NC-SA