Research on Collective Factorization Models and Algorithms for Heterogeneous Relational Data


	Zhao Yangyang (赵洋洋)
	2016-05-28
Degree type	Doctor of Engineering
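Chinese abstract	Heterogeneous relational data (for example, the data in social networks, knowledge graphs, and gene-protein interaction networks) are becoming the mainstream form of data in the information industry and an important component of big data, and they carry rich semantic value. The term generally refers to data with more than one entity type or more than one relation type, in which entities and relations form links of different types and thereby give rise to complex dependency patterns. Because most traditional machine learning methods assume that data are independent and identically distributed, applying them to heterogeneous relational data often causes severe loss of structural information. Taking collective factorization as its main research method, this thesis studies heterogeneous relational data in depth from the perspectives of information representation, fusion, classification, clustering, and relation prediction. Through research on models and algorithms together with experimental validation, it obtains a series of results. These results offer a reference for future research on models and algorithms for heterogeneous relational data and should also help the concrete analysis and use of such data in different fields. The main contents and contributions are as follows.
First, following existing methods, the thesis describes heterogeneous relational data as a Heterogeneous Information Network and characterizes the link patterns between nodes with Meta Paths. On this basis it proposes a quantization module for structural information (combined with attribute information) at the level of meta-path features, so that structure and attribute information are fused at the level of mid-level semantics. Specifically, it proposes a general method for computing meta-path weights and a new link similarity measure; through meta-path weighting and filtering, followed by link similarity computation and weighted combination, the module produces a weighted structure-attribute semantic matrix. The module effectively fuses the structure, attribute, and (partial) label information of heterogeneous relational data at the feature level, and it is an indispensable part of the classification and clustering framework in the later chapters. Experiments show that both the proposed meta-path weighting method and the link similarity method have advantages over current methods. (Chapter 2: Information Representation of Heterogeneous Relational Data)
Second, heterogeneous relational data commonly suffer from high sparsity and the curse of dimensionality, so the statistical significance of their data patterns is weak. In this situation it becomes essential to make full use of both structural and attribute information so that the two complement and reinforce each other. The thesis makes the idea of "statistics + structure" explicit and proposes an algorithmic framework for classification and clustering of heterogeneous relational data. The framework fuses structural and attribute information so that each permeates the other: algorithmically, the relevant information is first fused into the weighted structure-attribute semantic matrix, then exploited from a unified, integrated viewpoint through collective factorization, finally yielding a distributed representation suited to the classification or clustering task. The framework is realized as follows:
(1) For node classification in heterogeneous relational data, a collaborative matrix factorization model that seamlessly fuses structural and attribute information is proposed. It fuses structure, attribute, and label information in the weighted semantic matrix and factorizes that matrix jointly with the attribute information matrix under manifold constraints, yielding low-dimensional latent factor representations with high expressive power. (Chapter 3: Classification of Heterogeneous Relational Data)
(2) For node clustering in heterogeneous relational data, a meta-path-based collective non-negative matrix factorization model is proposed. It fuses structure and attribute information in the weighted semantic matrix and, under dual graph Laplacian regularization and a cluster indicator matrix constraint, lets the latent factors of the clustering target gradually express the cluster structure during optimization. (Chapter 4: Clustering of Heterogeneous Relational Data)
Finally, for link prediction in heterogeneous relational data, the thesis studies the following:
(1) In many real-world heterogeneous relational datasets, the concept labels of entities are arranged hierarchically in a tree or a directed acyclic graph (DAG), so predicting "is-a" links between entities and their concept labels is essentially equivalent to hierarchical multilabel classification of those entities. The thesis uses partial least squares (PLS) to estimate the high-dimensional label vector; PLS projects the feature space and the label space simultaneously and builds an effective predictive model between them. It then proves that optimal label prediction under hierarchy constraints can be reasonably transformed into optimal path prediction under a structured sparsity penalty. Introducing the path selection model further allows the use of efficient network flow solvers with polynomial time complexity. Experiments show that the algorithm outperforms existing algorithms on datasets whose labels form either a tree or a DAG. (Chapter 5: Hierarchical Multilabel Classification Based on Optimal Path Prediction)
(2) For link prediction in knowledge graph data expressed as triplets (that is, knowledge completion), a learning framework is proposed that combines an explicit feature model with an implicit feature model (a factorization-based link prediction model). Specifically, a triplet link prediction model based on L1-regularized biased Logistic Regression is proposed. Effective explicit features are first extracted from the correlations, on the training set, among the three elements of a triplet (h, r, t): the head node (subject) h, the tail node (object) t, and the relation (predicate) r. The known triplets are then used as positive examples for learning from positive and unlabeled examples (PU-learning). The learned model is used both to predict triplets directly and to help generate reliable negative examples for the factorization-based relation prediction model. In addition, a factorization-based link prediction model with semantic similarity weighting is proposed: explicit features are used to compute the semantic similarity between positive and negative triplet pairs, and the data points (positive-negative triplet pairs) that have greater influence on prediction receive larger weights in the model, which yields latent factor representations with better predictive performance than existing implicit feature models. Finally, prediction combines the outputs of the implicit and explicit feature models; when the results of the two models are close, the combination performs better than either one. (Chapter 6: Link Prediction for Heterogeneous Relational Data)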
English abstract	Heterogeneous Relational Data (HRD), such as the data in social networks, knowledge graphs, and gene-protein interaction networks, has rich semantic value and is becoming a mainstream data form in the information industry and an important part of big data. HRD generally refers to data with more than one entity type or more than one relation type. The entities and relations of HRD tend to form different types of links, which in turn imply complex dependency patterns. Because most traditional machine learning methods assume that data points are independent and identically distributed (IID), they often suffer severe loss of structural information when applied to HRD. Based on collective factorization, we study HRD in the following aspects: representation, integration, classification, clustering, and link prediction. Through theoretical study and experimental verification, we obtain a series of results. These results offer insight for future research on HRD models and algorithms and should also support the analysis and use of HRD in various application domains. The main contents and innovations of this thesis are as follows.
Firstly, inspired by existing analysis models that use a Heterogeneous Information Network to describe HRD and Meta Paths to describe its link patterns, we put forward a quantization component for structural information (combined with attribute information), achieving semantic fusion of structure(-attribute) information at the level of meta path features. Specifically, we propose a general meta path weighting method and a novel link similarity calculation method. Through meta path weighting and filtering, link similarity calculation, and weighted combination, we ultimately generate a weighted structure-attribute semantic matrix. The quantization component effectively integrates the structure, attribute, and (partial) label information of HRD and is an indispensable part of the analysis framework for HRD classification and clustering in the following chapters. Experimental results show that the proposed meta path weighting method and link similarity calculation method both compare favorably with current methods. (Chapter 2: Information Representation of Heterogeneous Relational Data)
Secondly, because HRD is highly sparse and subject to the curse of dimensionality, the statistical significance of its data patterns is weak. In this case, it is necessary to make full use of the structure and attribute information of HRD so that the two complement and enhance each other. Following the "statistics + structure" concept, we put forward a framework for HRD classification and clustering. Algorithmically, we integrate the structure and attribute information so that each permeates the other: first, the information is fused into the weighted structure-attribute semantic matrix; then it is exploited through collective factorization from a unified, integrated perspective; finally, we obtain a distributed representation suited to the classification or clustering task. The framework is implemented as follows:
(1) For the HRD classification task, we put forward a collaborative matrix factorization model with seamless integration of structure and attribute information. The model integrates structure, attribute, and (partial) label information into the weighted structure-attribute semantic matrix and factorizes it simultaneously with the attribute information matrix under manifold constraints. The model yields low-dimensional latent factors with high expressive power. (Chapter 3: Heterogeneous Relational Data Classification)
(2) For the HRD clustering task, we put forward a meta path based collective non-negative matrix factorization model, which integrates the structure and attribute information into the weighted structure-attribute semantic matrix and factorizes it simultaneously with the attribute information matrix under a cluster indicator matrix constraint with dual graph Laplacian regularization. During optimization, the latent factors of the data gradually reveal the clustering characteristics. (Chapter 4: Heterogeneous Relational Data Clustering)
Finally, for the HRD link prediction task, we conduct research in the following two aspects:
(1) Because the concept labels of many real-world HRD are arranged hierarchically in a tree or directed acyclic graph (DAG), the "is-a" typed link prediction problem for these data is in fact equivalent to the hierarchical multilabel classification problem. We adopt partial least squares (PLS) techniques to estimate the high-dimensional label vector; these techniques project the feature and label spaces simultaneously and construct sound predictive models between them. We then prove that the optimal label prediction problem with hierarchy constraints can be reasonably transformed into an optimal path prediction problem with structured sparsity penalties. The introduction of path selection models further allows us to leverage efficient network flow solvers with polynomial time complexity. The experimental results show that the proposed algorithm performs better than existing algorithms on data sets with both tree- and DAG-structured labels. (Chapter 5: Hierarchical Multilabel Classification with Optimal Path Prediction)
(2) Regarding the link prediction task for triplets in a knowledge graph (i.e., Knowledge Base/Graph Completion), we put forward an ensemble learning framework that combines an observed feature model and a latent feature model (a factorization based link prediction model). Specifically, we first propose a novel observed feature model based on L1-regularized biased Logistic Regression. We extract effective observed features from the relevance among the three elements of the triplets (h, r, t) in the training set (i.e., the head (subject) node h, the relation (predicate) r, and the tail (object) node t), and then use the known triplets as positive samples for PU-learning (Learning from Positive and Unlabeled Examples). The resulting model is used to carry out link prediction on the test set and, at the same time, to assist in generating reliable negative examples for the factorization based link prediction model. Furthermore, we put forward a semantic similarity weighted factorization based link prediction model. In this model, we calculate the semantic similarity of positive and negative triplet pairs with observed features, and then increase the weights of the data points (i.e., the positive and negative triplet pairs) that have greater impact on the prediction. The proposed latent feature model achieves better prediction performance than other factorization based link prediction models. Finally, the results of the two models are combined; when the results of the two models are close to each other, the combination performs better than either single model. (Chapter 6: Heterogeneous Relational Data Link Prediction)
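The idea of factorizing two matrices with a shared latent factor, which the abstract calls collective factorization, can be illustrated with a minimal sketch. This is not the thesis's model: it omits the meta path weighting, the manifold or graph Laplacian regularization, and the label or cluster indicator constraints, and it uses plain alternating least squares as a stand-in optimizer. The function name `collective_factorization` and the parameters `rank`, `alpha`, `reg`, and `iters` are hypothetical. The sketch assumes a semantic matrix S and an attribute matrix X whose rows index the same nodes, and it minimizes ||S - U V^T||^2 + alpha ||X - U W^T||^2 plus a ridge penalty on U, V, and W.

```python
import numpy as np

def collective_factorization(S, X, rank=2, alpha=1.0, reg=1e-3, iters=100, seed=0):
    """Jointly fit S ~ U @ V.T and X ~ U @ W.T with a shared row factor U.

    Illustrative alternating least squares; `alpha` weights the attribute
    term and `reg` is a ridge penalty on all three factors (assumed names).
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((S.shape[0], rank))
    I = reg * np.eye(rank)
    for _ in range(iters):
        G = U.T @ U
        # With U fixed, V and W are solutions of ridge regression problems.
        V = np.linalg.solve(G + I, U.T @ S).T
        W = np.linalg.solve(alpha * G + I, alpha * (U.T @ X)).T
        # With V and W fixed, U solves U @ A = B, where both matrices contribute.
        A = V.T @ V + alpha * (W.T @ W) + I
        B = S @ V + alpha * (X @ W)
        U = np.linalg.solve(A, B.T).T  # A is symmetric, so this equals B @ inv(A)
    return U, V, W

if __name__ == "__main__":
    # Synthetic data in which both matrices share the same row factor.
    rng = np.random.default_rng(1)
    U0 = rng.standard_normal((30, 2))
    S = U0 @ rng.standard_normal((20, 2)).T
    X = U0 @ rng.standard_normal((5, 2)).T
    U, V, W = collective_factorization(S, X)
    print(np.linalg.norm(S - U @ V.T) / np.linalg.norm(S))
```

The rows of U are the low-dimensional node representations; in a pipeline like the one the abstract describes, such rows would be handed to a classifier or a clustering step. Sharing U is what lets the attribute matrix inform the factorization of the semantic matrix and vice versa.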
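Keywords	heterogeneous relational data; collective factorization; structure(-attribute) information quantization; information fusion; classification; clustering; link prediction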
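Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/11953
Collection	Graduates_Doctoral Dissertations
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zhao Yangyang. Research on Collective Factorization Models and Algorithms for Heterogeneous Relational Data[D]. Beijing: University of Chinese Academy of Sciences, 2016.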

Files in this item
File name/size	Document type	Version type	Access type	License
面向异质关系数据的协同因子化模型与算法研（3497KB）	Thesis		Restricted access	CC BY-NC-SA