Probability Density Estimation of High-Dimensional Data Based on Gaussian Mixture Model (基于高斯混合模型的高维数据概率密度估计)


English title	Probability Density Estimation of High-Dimensional Data Based on Gaussian Mixture Model
Author	刘晓华 (Liu Xiaohua)
	2011-05-31
Degree type	Doctor of Engineering
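Abstract (translated from Chinese)	Probability density estimation is a fundamental problem in pattern recognition and machine learning, and it is essential for classification with the Bayes decision rule. Because Gaussian mixture models (GMMs) can closely approximate a wide range of data distributions, they are a well-suited modeling tool for density estimation. The maximum-likelihood EM algorithm is the basic method for fitting a GMM. In high-dimensional spaces, however, density estimation becomes difficult because samples are sparse, the so-called "curse of dimensionality". Feature dimensionality reduction can effectively mitigate this problem, but how to integrate it with a GMM remains an open question. Moreover, a GMM is a generative model: the parameters of each class are estimated independently and training does not take the decision boundaries into account, so the resulting model does not necessarily classify well, whereas discriminative learning can improve classification performance. To address these problems, this thesis studies the structure of GMMs and the discriminative learning of their parameters in depth. The main contributions are as follows. (1) A pooled (shared) subspace mixture density model is proposed. It represents the density over the full space, and the subspace computation and the density estimation are carried out together within the EM framework. Each Gaussian component is expressed as the product of an elliptical Gaussian in the principal subspace and a spherical Gaussian in the complementary subspace. EM first estimates the full-space model parameters (weights, means, and covariance matrices); the pooled covariance matrix and the pooled subspace are then computed, and each Gaussian component is projected into the pooled subspace. Within the pooled subspace each component is an elliptical Gaussian, while the complementary subspace is described by a single shared eigenvalue. To improve classification performance, the size of the complementary subspace is determined by cross-validation. Experiments on UCI datasets show that the proposed model outperforms earlier models. (2) A discriminative learning method is proposed for the subspace GMM. Minimum classification error (MCE) is chosen as the discriminative criterion, and the parameters are trained discriminatively by gradient descent. PCA dimensionality reduction and the EM algorithm first yield the projection matrix and the GMM parameters, which serve as initial values for discriminative learning; gradient descent then updates all parameters, both the subspace parameters and the GMM parameters. To obtain better generalization, regularized learning is introduced: a fraction of the likelihood function is added to the objective to prevent overfitting. Comparative experiments on the MNIST dataset and UCI datasets show that the proposed method classifies better than generative learning and also better than several other discriminative learning methods. (3) The EM algorithm can estimate the parameters of a GMM but cannot determine the number of mixture components. A fast discriminative model selection method based on heuristic cross-validation is proposed to determine the number of components for each class. Rival penalized competitive learning first gives an initial number of components, heuristic cross-validation then drives split or merge operations, and the final number of components is chosen by the classification error rate on a validation set. Because discriminative information is taken into account during model selection, the selected model gives better classification performance. The method is applied to the USPS dataset and several UCI datasets, covering both low-dimensional and high-dimensional data. The experimental results show that in most cases the proposed method gives better classification results.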
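The component structure in contribution (1), an elliptical Gaussian inside a subspace multiplied by a spherical Gaussian in its complement, can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation. The function name, argument names, and the choice of a full covariance matrix inside the subspace are assumptions made for illustration only.

```python
import numpy as np

def subspace_gaussian_logpdf(x, mean, basis, sub_cov, delta):
    """Log-density of one Gaussian component split across two subspaces.

    x, mean : (d,) arrays
    basis   : (d, k) matrix with orthonormal columns spanning the subspace
    sub_cov : (k, k) covariance of the component inside the subspace
    delta   : shared variance (eigenvalue) of the (d - k)-dim complement
    All names are hypothetical; the record does not specify them.
    """
    d, k = basis.shape
    diff = x - mean
    y = basis.T @ diff                        # coordinates inside the subspace
    residual = max(diff @ diff - y @ y, 0.0)  # squared distance to the subspace
    maha = y @ np.linalg.solve(sub_cov, y)    # elliptical (Mahalanobis) part
    _, logdet = np.linalg.slogdet(sub_cov)
    return -0.5 * (maha + residual / delta
                   + logdet + (d - k) * np.log(delta)
                   + d * np.log(2.0 * np.pi))
```

Under these assumptions the result equals the log-density of a full-space Gaussian with covariance `basis @ sub_cov @ basis.T + delta * (I - basis @ basis.T)`, but it only needs a k-by-k solve instead of a d-by-d one.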
English abstract	Density estimation is a fundamental problem in pattern recognition and machine learning. It is particularly important for classification using the Bayes decision rule. The Gaussian Mixture Model (GMM) is a popular model for density estimation because of its ability to approximate arbitrary distributions. The Expectation Maximization (EM) algorithm, based on maximum likelihood, is the basic approach to GMM parameter estimation. However, density estimation in high-dimensional data spaces is challenging because the data are sparse, a problem well known as "the curse of dimensionality". Reducing the dimensionality of the features can mitigate the curse of dimensionality, but how to combine dimensionality reduction with the GMM remains an open question. On the other hand, the GMM is a generative model whose parameters are estimated for each class independently; because decision boundaries are not considered in training, the resulting models do not necessarily give high classification accuracy, whereas discriminative learning can improve the classification accuracy of the model. To address these problems, this thesis studies model structure selection in high-dimensional space and discriminative learning for GMMs. The main contributions of this thesis are as follows. (1) We propose a Pooled Subspace Mixture Density (PSMD) model for classification, which represents the density in the full space and estimates the common subspace and the Gaussian mixture simultaneously under the EM framework. Each Gaussian component is represented as the product of an elliptical Gaussian in the subspace and a spherical Gaussian in the complementary subspace. First, the EM algorithm estimates the model parameters in the full space, including the weighting coefficients, the means, and the covariance matrices; then we compute the pooled covariance and the pooled subspace, and project each Gaussian component into the pooled subspace. In the pooled subspace, each component is a Gaussian model, and the density in the complementary subspace is characterized by a pooled eigenvalue. To improve the classification accuracy, the pooled eigenvalue is determined by cross-validation. The experimental results on UCI datasets demonstrate that in most cases the proposed method yields higher classification accuracy than previous methods. (2) For the subspace GMM density model, we propose a discriminative learning method. The minimum classification error (MCE) criterion is chosen to optimize all the parameters by stochastic gradient...
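The MCE criterion named in contribution (2) replaces the 0-1 classification error with a smooth function of the class scores so that gradient methods can minimize it. The sketch below follows a commonly used formulation from the MCE literature and is not necessarily the exact form used in the thesis. The function names, the best-rival simplification, and the sigmoid slope `xi` are assumptions.

```python
import math

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mce_loss(scores, true_class, xi=1.0):
    """Smoothed 0-1 loss for one sample.

    scores     : discriminant values g_j(x), e.g. class log-densities
    true_class : index of the correct class
    xi         : sigmoid slope (hypothetical default, not from the thesis)
    """
    # Best competing class; some formulations use a soft maximum instead.
    rival = max(s for j, s in enumerate(scores) if j != true_class)
    d = rival - scores[true_class]   # positive when the sample is misclassified
    return _sigmoid(xi * d)
```

The loss approaches 0 when the true class wins by a wide margin and approaches 1 when a rival wins, so averaging it over the training set gives a differentiable stand-in for the error rate.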
Keywords	Probability Density Estimation; Gaussian Mixture Model; EM Algorithm; Pooled Subspace Mixture Density Model; Discriminative Learning; Minimum Classification Error Criterion; Gradient Descent; Model Selection; Heuristic Cross-Validation
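Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6381
Collection	Graduates_Doctoral Theses
Recommended citation (GB/T 7714)	刘晓华 (Liu Xiaohua). Probability Density Estimation of High-Dimensional Data Based on Gaussian Mixture Model (基于高斯混合模型的高维数据概率密度估计) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences, 2011.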

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20051801462806 (1732KB)			Not yet open	CC BY-NC-SA