Research on Graphical Model Machine Learning for Data with Different Structures

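Title	Research on Graphical Model Machine Learning for Data with Different Structures (不同结构数据的图模型机器学习研究)
Alternative title	Study of Graphical Model Machine Learning on the Different Structure Data
Author	吴蕾 (Wu Lei)
Date	2014-05-28
Degree type	Doctor of Engineering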
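Abstract (translated from Chinese)	With the rapid development of Internet technology, diverse, heterogeneous, sparse, and massive data are growing exponentially. How to represent such big data effectively and understand it in depth is receiving increasing attention and has become an important research topic. A graphical model is a method, based on a probabilistic framework, for knowledge representation, learning, and inference over the structure and relations among data. It describes the uncertainty of data well, so graphical models offer an effective approach to big data problems. According to the similarities and differences in data object types and connection modes, the research problems fall into three classes: multi-level topic learning on homogeneous data, data-object differentiation learning on multi-source heterogeneous data, and connection-relation differentiation learning on multi-source heterogeneous data. Drawing on graphical model theory, this thesis analyzes these three problems from the perspectives of multi-level topic extraction, bidirectional information interaction among multi-source subdomains, and connection-relation representation, and proposes three algorithms. The main work and contributions are as follows. First, to address the limitation that single-layer hidden variable models on homogeneous data can extract only single-level feature representations, the SOM-CSM algorithm, a graphical model for extracting multi-level topic states of data, is proposed. A self-organizing map network first extracts primary topic state representation nodes from the input word layer. These primary nodes are then fed into an improved content structure model, which uses feature functions extracted from first-order probabilistic logic clauses, to extract high-level topic state representation nodes. Finally, the EM algorithm is used to obtain the labels. The SOM-CSM algorithm is proved theoretically to have polynomial time complexity. Experiments on two widely used international sentiment analysis datasets, the Amazon data and the Tripadvisor data, show that SOM-CSM improves accuracy on the Amazon dataset by at least 4.6% on average. On the Tripadvisor data, four rating prediction metrics show that SOM-CSM outperforms LDA, HUCRF, and the original JointCM algorithm. Second, for the problem of data-object differentiation learning on multi-source heterogeneous data, the FHMM-LDA algorithm is proposed, a graphical model that fuses multi-source feature spaces through bidirectional information interaction while accounting for both the uniqueness of and the differences among domains. The learning problem over the heterogeneous domains is first converted into solving the parameters of a separate HMM-LDA model for each domain, and semantic topic features are extracted on each domain's sub-feature space. A global HMM-LDA model then maps the semantic topic features of each sub-feature space into the full-domain feature space, and a geometric interpretation of this mapping is given. Finally, the parameters of the global model are solved and inference is performed with the EM algorithm and Gibbs sampling, which realizes bidirectional interaction of information among the subdomains. Experiments on the user behavior datasets MovieLens and Book-Crossing show that, compared with the I-GP, CMF, and M-GP algorithms, FHMM-LDA reduces the prediction error on user behavior data. Compared with the typical I-GP algorithm, FHMM-LDA reduces the relative mean absolute error by 44%. Third, for the problem of connection-relation differentiation learning on multi-source heterogeneous data, the ATLDA-MLN algorithm is proposed, a graphical model that fuses multi-source heterogeneous subdomains and represents connection relations with first-order probabilistic logic clauses. ATLDA-MLN first partitions the multi-source heterogeneous data according to the network schema of the data, builds a separate ATLDA model on each subdomain, and extracts author or conference topic distributions. It then describes the connection relations of each subdomain with first-order probabilistic logic clauses and uses a Markov logic network to fuse the multiple ATLDA models, which have different parameters. Finally, ATLDA-MLN performs parameter learning and inference on the model using Gibbs sampling. Experiments on the DBLP heterogeneous information network dataset show that, compared with the nLB, wvRN, GNetMine, and RankClass algorithms, ATLDA-MLN improves classification performance...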
English abstract	With the rapid development of Internet technology, diverse, heterogeneous, sparse, and massive data are growing exponentially. Representing and deeply understanding such big data is increasingly important and has become a major research topic. A graphical model is a method for knowledge representation, learning, and inference over the structure and relations of data within a probabilistic framework, and it describes the uncertainty of data well. Graphical models therefore offer an effective approach to big data problems. According to the similarities and differences among data object types and connection modes, the research problems fall into three classes: multi-level topic learning on homogeneous data, data-object differentiation learning on multi-source heterogeneous data, and connection differentiation learning on multi-source heterogeneous data. Based on graphical model theory, this thesis analyzes the three problems from the perspectives of multi-level topic extraction, bidirectional information interaction among multi-source subdomains, and connection representation, and proposes three methods. The main work and contributions of this thesis are as follows. First, because single-layer hidden variable models on homogeneous data can extract only single-level feature representations, a graphical model (SOM-CSM) for extracting multi-level topic states of data is proposed. The model first uses a self-organizing map network to extract primary topic state representation nodes. It then feeds these nodes into an improved content structure model to extract high-level topic state representation nodes. Finally, it uses the EM algorithm to obtain the labels. We prove that the SOM-CSM method has polynomial time complexity. The algorithm was tested on two widely used international sentiment analysis datasets: the Amazon dataset and the Tripadvisor dataset. The experimental results show that the SOM-CSM algorithm improves average accuracy by at least 4.6% on the Amazon dataset. On the Tripadvisor dataset, four rating prediction metrics show that the SOM-CSM algorithm outperforms three other methods: LDA, HUCRF, and JointCM. Second, because data objects in multi-source heterogeneous data differ, a bidirectional information interaction fusion multi-source feature space graphical model FHMM-LDA algorithm which ca...
Keywords	Data With Different Structure; Graphical Model; First-order Logic; Topic Model; Markov Logic Network
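Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6629
Collection	Graduates_Doctoral theses
Recommended citation (GB/T 7714)	吴蕾 (Wu Lei). Study of Graphical Model Machine Learning on the Different Structure Data [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2014.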

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20101801462806 (1609KB)			Not yet open	CC BY-NC-SA