English abstract | With the rapid growth of information technology and the World Wide Web, the Web has become a huge, complex, and fast-growing information database, and users often experience "information overload". To address this problem, data mining techniques have been proposed to help users combine information retrieval with the discovery of useful knowledge. Clustering is an unsupervised data mining tool that plays an important role in this task. Traditional clustering methods focus on homogeneous data. However, as users' demands on information processing rise and as users interact with Web systems, new kinds of data have emerged, for example texts and words, user browsing data, user query logs, and opinion data. These data contain many heterogeneous data objects, including users, queries, Web pages, texts, and words. These objects not only have their own content but also have relationships with objects of other types. Traditional clustering methods cannot meet the need to cluster heterogeneous data simultaneously: their precision is low and the readability of their labels is poor. Therefore, by studying the features of heterogeneous data, we build a model that can be solved mathematically and propose a new heterogeneous data mining method to solve the problem of simultaneously clustering heterogeneous data. We first introduce the basic theory of clustering and the classic clustering algorithms for homogeneous data, and then discuss the differences and relations between these algorithms as well as their merits and demerits. Next, we analyze the features of heterogeneous data and discuss their applications. Because traditional clustering methods cannot meet the need to cluster heterogeneous data simultaneously, we present a co-clustering algorithm for heterogeneous data based on a resistive network. 
In the algorithm, the related heterogeneous data are transformed into a resistive network with a multipartite graph structure, and the current/voltage characteristics of the resistive network are simulated to carry out the subsequent eigenvalue computation and clustering. The clustering yields a result structure in which each cluster contains multiple types of heterogeneous data objects that can serve as labels for one another, so the labels are highly readable. Experiments show that the clustering algorithm is feasible and effective. |
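
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the author's method. It assumes two object types (for example queries and Web pages), treats each relationship weight as a conductance between two nodes of a bipartite resistive network, uses the graph Laplacian as the network's conductance matrix, and splits all objects into two co-clusters by the sign of the eigenvector of the second-smallest eigenvalue. The function name `co_cluster_two_way` and the sample weights are hypothetical.

```python
import numpy as np


def co_cluster_two_way(weights):
    """Split the rows and columns of a bipartite weight matrix into two co-clusters.

    weights[i][j] is the (assumed) conductance between row object i and
    column object j, e.g. how often query i led to a click on page j.
    """
    w = np.asarray(weights, dtype=float)
    n_rows, n_cols = w.shape
    n = n_rows + n_cols

    # Adjacency of the bipartite graph: rows first, then columns.
    adjacency = np.zeros((n, n))
    adjacency[:n_rows, n_rows:] = w
    adjacency[n_rows:, :n_rows] = w.T

    # Graph Laplacian = conductance matrix of the resistive network.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    second_vector = eigenvectors[:, 1]

    # Objects of both types with the same sign fall into the same cluster.
    labels = (second_vector > 0).astype(int)
    return labels[:n_rows], labels[n_rows:]


# Hypothetical data: 4 queries x 4 pages, two dense blocks, weak cross links.
sample_weights = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.1, 0.0],
    [0.0, 0.1, 5.0, 4.0],
    [0.0, 0.0, 4.0, 5.0],
]
query_labels, page_labels = co_cluster_two_way(sample_weights)
```

In this sketch, queries and pages that share a label form one co-cluster, which mirrors the abstract's point that objects of different types in the same cluster can label one another. The network must be connected for the second eigenvector to be meaningful, and which cluster receives label 0 or 1 is arbitrary.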