Cluster analysis is one of the fundamental problems in data analysis. It aims to partition a finite data set into a set of natural hidden structures with little or no prior knowledge. With the rapid development of computational techniques, cluster analysis has become an attractive and challenging research topic in a wide variety of fields, ranging from machine learning and artificial intelligence to pattern recognition and data mining. Feature extraction applies transformations to the original features to generate useful new ones, so that subsequent clustering or classification tasks are eased.

This thesis focuses on clustering and feature extraction methods based on non-parametric kernel density estimation. We begin with a problem description and a background review of cluster analysis and feature extraction. We then systematically introduce the Mean-Shift (MS) clustering method and the related theories that form the basis of this work. The main contributions of this thesis are twofold:

1. Novel theoretical interpretation and algorithmic extensions of the MS clustering method. We give a deep analysis of the mathematical essence behind MS and develop several algorithmic extensions. The key contributions of this part are: (1) When the kernel function is convex, we prove that MS is equivalent to a half-quadratic (HQ) optimization procedure for the kernel density estimator, which explains the elegant numerical performance of MS. (2) Within the proposed HQ analysis framework, we revisit the quadratic bounding optimization (QBO) nature of MS. Motivated by this QBO perspective, we develop a simple yet efficient kernel density set covering algorithm that can cover a given data set with a sparse group of hyperellipsoids.
By iteratively running the proposed set covering algorithm on a query set, we obtain an agglomerative non-parametric kernel density clustering algorithm, called Agglo-MS, which can significantly accelerate the MS clustering procedure. As extensions of Agglo-MS, we develop an incremental non-parametric kernel density clustering algorithm, called IAgglo-MS, and a constrained non-parametric kernel density clustering algorithm, called CAgglo-MS. (3) Furthermore, we extend the classical batch MS to its stochastic gradient version, along with an asymptotic property analysis. Our stochastic gradient Mean-Shift can be applied to efficient kernel density mode seeking on large-scale data sets or data streams.

2. A...
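The MS procedure analyzed in contribution 1 is, at its core, a fixed-point iteration that moves each point to the weighted mean of its neighbors until it reaches a mode of the kernel density estimate. The following is a minimal sketch of that iteration, assuming a Gaussian kernel; the function name `mean_shift_modes` and its parameters are illustrative, not the thesis's exact implementation:

```python
import numpy as np

def mean_shift_modes(X, bandwidth=1.0, tol=1e-5, max_iter=300):
    """Seek the KDE mode reached from each sample via Mean-Shift fixed-point iterations."""
    modes = X.copy()
    for i in range(len(modes)):
        x = modes[i].copy()
        for _ in range(max_iter):
            # Gaussian kernel weights of all samples around the current estimate.
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            x_new = w @ X / w.sum()  # weighted mean = next fixed-point iterate
            if np.linalg.norm(x_new - x) < tol:
                x = x_new
                break
            x = x_new
        modes[i] = x
    return modes
```

Points whose trajectories converge to the same mode are then assigned to the same cluster, which is what makes MS non-parametric: the number of clusters emerges from the density rather than being specified in advance.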
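The stochastic gradient extension in (3) replaces the batch weighted mean with one noisy ascent step per incoming sample, which is what makes mode seeking feasible on data streams. A sketch under a Robbins-Monro step-size assumption (the 1/t schedule and the function `sgd_mean_shift` are illustrative choices, not the thesis's exact algorithm):

```python
import numpy as np

def sgd_mean_shift(stream, x0, bandwidth=1.0):
    """Track a KDE mode from a sample stream, one stochastic update per sample."""
    x = np.array(x0, dtype=float)
    for t, s in enumerate(stream, start=1):
        eta = 1.0 / t  # Robbins-Monro step size (assumed schedule)
        w = np.exp(-np.sum((s - x) ** 2) / (2 * bandwidth ** 2))
        x += eta * w * (s - x)  # noisy ascent step toward the local density mode
    return x
```

Because each update touches only the current sample, the memory cost is constant in the stream length, in contrast to the batch iteration, which needs the whole data set at every step.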