With the rapid development of internet technology, diverse, heterogeneous and sparse big data is growing exponentially. How to represent and deeply understand such data has become an important research subject. A graphical model is a method for knowledge representation, learning and inference over the structure and relations of data, built on a probabilistic framework. Because it describes the uncertainty in data well, the graphical model provides an effective solution to the big data problem. According to the similarities and differences between data object types and connection modes, research in this area can be divided into three classes: multi-layer topic learning on homogeneous data, data object differentiation learning on multi-source heterogeneous data, and connection differentiation learning on multi-source heterogeneous data. Based on graphical model theory, this thesis analyzes the three problems from the perspectives of multi-level topic extraction, bidirectional information interaction between multi-source subdomains, and the representation of connections, respectively, and proposes three methods. The main work and contributions of this thesis are as follows:
Firstly, to address the problem that a single-hidden-variable model on homogeneous data can extract only single-level feature representations, a graphical model (SOM-CSM) for extracting multi-level topic states of data is proposed. Starting from a self-organizing map network, the model extracts the primary topic-state representation nodes. Next, the model feeds these nodes into the improved content structure model to extract the advanced topic-state representation nodes. Finally, the model uses the EM algorithm to obtain the labels. We prove that the SOM-CSM method has polynomial time complexity. The algorithm was tested on two widely used international sentiment analysis datasets: the Amazon dataset and the TripAdvisor dataset.
The experimental results show that the SOM-CSM algorithm improves the average accuracy by at least 4.6% on the Amazon dataset. In addition, four rating prediction metrics show that the SOM-CSM algorithm outperforms the four other methods, which include LDA, HUCRF and JointCM, on the TripAdvisor dataset.
Secondly, to address the problem that data objects in multi-source heterogeneous data differ, a graphical model that fuses multi-source feature spaces through bidirectional information interaction, the FHMM-LDA algorithm, which ca...