Research on Key Techniques of Relation Discovery for Knowledge Graphs (面向知识图谱的关系发现关键技术研究)
Ji Guoliang (纪国良)
2017-05-24
Degree type: Doctor of Engineering
Chinese abstract

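A knowledge graph describes real-world entities and the relations between them in structured form, and is an important form of knowledge representation. Since Google formally introduced the concept in May 2012, knowledge graphs have been widely applied in query understanding, intelligent question answering, personalized recommendation, and other areas. Although current knowledge graphs contain a large amount of structured information, the number of entities and relations in the real world is enormous, and most knowledge graphs are built through manual collaboration or (semi-)automatic methods. As a result, knowledge graphs remain far from complete. One important problem is that a large number of relations are missing, which greatly limits their usability. Research on relation discovery for knowledge graphs aims to solve this problem.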

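Relation discovery for knowledge graphs aims to predict the relations missing from a knowledge graph and to complete them. Its methods fall into two main categories: (1) reasoning over the structured facts already in the knowledge graph to obtain missing relations; (2) extracting relations between entities from unstructured text and adding them to the knowledge graph. Research on relation discovery has made some progress. In the first category there are traditional reasoning methods based on logic rules and reasoning methods based on representation learning. Logic-rule-based methods are easily affected by data sparsity, and their computational complexity is high both when generating rules and when applying them, so they struggle to meet the needs of today's large-scale knowledge graphs. Representation-learning-based methods learn dense vectors for entities and relations in a low-dimensional vector space; they effectively alleviate data sparsity and are computationally efficient, which makes them better suited to relation completion on large-scale knowledge graphs. For the first category, this dissertation therefore focuses on representation-learning-based reasoning. In the second category, working from text, there are unsupervised, supervised, and weakly supervised (distantly supervised) relation extraction methods. Unsupervised methods represent a relation by the word string between entities, which is hard to map onto a specific knowledge graph. Supervised methods require manually annotated data, which limits their application domains and scale. Weakly supervised methods use the knowledge graph as the supervision signal, so large-scale training data can be obtained automatically, and because their relation categories are based on the relations in the knowledge graph, there is no relation-mapping difficulty. For the second category, this dissertation therefore focuses on weakly supervised relation extraction.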

This dissertation studies the key techniques of relation discovery for knowledge graphs. The main contents and results are as follows:

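1. Traditional representation learning methods do not account for the fact that entities and relations in knowledge graphs commonly belong to many types, so entities and relations of different types share mapping matrices, which hurts the accuracy of relation prediction. To address this, the dissertation proposes TransD, a representation learning reasoning method based on dynamic mapping matrices. The method assigns two vectors to each entity and relation: one represents the general meaning of the entity or relation, and the other is used to construct mapping matrices dynamically, flexibly projecting the part of an entity vector relevant to the current relation into the relation vector space, where the mapping from head entity to tail entity is then completed. The mapping matrix of each (relation, entity) pair is therefore determined jointly by the two, which not only takes full account of the multiple types of entities and relations but also replaces the matrix-vector multiplication of earlier methods with operations between vectors, improving computational efficiency. Experiments on WordNet and Freebase show that the method significantly outperforms the baseline systems on triple classification and link prediction.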

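2. Data in knowledge graphs is heterogeneous (different relations link different numbers of entity pairs) and imbalanced (the same relation links different numbers of head and tail entities). To address this, the dissertation proposes TranSparse, a representation learning reasoning method based on adaptive sparse mapping matrices. The method uses sparse matrices as mapping matrices and comprises two models, share and separate. The share model mainly addresses heterogeneity: head and tail entities share one mapping matrix, whose sparse degree is determined by the number of entity pairs the relation links; the more entity pairs linked, the smaller the sparse degree, and vice versa. The separate model builds on the share model to further address imbalance: head and tail entities each have their own mapping matrix, whose sparse degree is determined by the number of entities the relation links at that position (head or tail). Sparse matrices let the model adapt well to the data, and because zero elements do not take part in computation, they reduce the amount of computation and make the model easy to apply to large-scale knowledge graphs. Experiments show that TranSparse significantly improves relation prediction.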

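3. For weakly supervised relation extraction, the training data contains noise from automatic labeling and lacks entity background knowledge. To address this, the dissertation proposes a piecewise convolutional neural network model based on sentence-level attention and entity descriptions. In the sentence-level attention module, the difference between the vectors of the two given entities is first used as the feature vector of the relation between them; a piecewise convolutional neural network then extracts a feature vector for each sentence in the multi-instance bag; a hidden layer then computes the similarity between the relation feature vector and each sentence feature vector (that is, the attention weight), and the weights are used to select valid sentences and filter out noise. In addition, entity descriptions are extracted from Freebase and Wikipedia to give entities richer background knowledge and to give the sentence-level attention mechanism better entity representations. Experiments show that the attention mechanism assigns higher weights to valid sentences and that the entity descriptions provide additional useful background knowledge. The method outperforms all baseline systems in both automatic and manual evaluation.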

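Parts 1 and 2 of the above work address relation prediction based on representation learning, and Part 3 addresses relation extraction from unstructured text. The two complement each other and together make up the relation discovery work of this dissertation.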

English abstract

Knowledge graphs (KGs) describe real-world entities and their relations in structured form and are an important form of knowledge representation. Since Google formally proposed the concept in May 2012, KGs have been widely applied in query understanding, intelligent question answering, personalized recommendation, and other fields. Although current KGs contain large amounts of structured information, they are far from complete, because the number of entities and relations in the real world is huge and most KGs have been built either collaboratively or (partly) automatically. As a result, many relations are missing from KGs, and this defect limits their usability. Research on relation discovery for KGs aims to solve this problem.


Relation discovery for KGs aims to predict and complete the missing relations. The task involves two aspects: (1) inferring missing relations from the structured facts in KGs; (2) extracting relations between entity pairs from unstructured text and adding them to KGs. Much work has already made progress on this task. The first aspect has two main families of methods: traditional logic-rule methods and KG representation learning methods. Methods based on logic rules are vulnerable to data sparsity, and the computational complexity of rule generation and reasoning is very high, so they cannot meet the needs of large-scale KGs. Representation learning methods alleviate the data sparsity problem by learning dense vectors for entities and relations in a low-dimensional vector space. They are more computationally efficient than logic rules and therefore better suited to relation completion on large-scale KGs. For the first aspect, we therefore mainly study representation learning methods. The second aspect comprises unsupervised, supervised, and distantly supervised relation extraction. Unsupervised methods extract the words between the given entity pairs as relations, and these words are difficult to map to a specific KG. Supervised methods need manually annotated data, which limits their application fields and scale. Distant supervision labels datasets automatically using KGs, and its relations (labels) are already those of the KG, so no mapping step is needed. This makes it well suited to extracting relations for KGs. For the second aspect, we therefore focus on distant supervision.
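
As a rough illustration of how distant supervision can label text automatically, the sketch below aligns KG triples with sentences and groups the matches into bags. It assumes the common heuristic that any sentence mentioning both entities of a triple is labelled with that triple's relation, and it matches mentions by plain substring search. The function name `build_bags`, the relation name `born_in`, and the example data are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def build_bags(triples, sentences):
    """Group sentences into bags keyed by the KG triple they are aligned with.

    A sentence joins a bag when it mentions both the head and the tail entity
    (plain substring match, an assumption made for this sketch).
    """
    bags = defaultdict(list)
    for head, relation, tail in triples:
        for sentence in sentences:
            if head in sentence and tail in sentence:
                bags[(head, relation, tail)].append(sentence)
    return dict(bags)


# Hypothetical toy data, for illustration only.
triples = [("Barack Obama", "born_in", "Hawaii")]
sentences = [
    "Barack Obama was born in Hawaii.",
    "Barack Obama visited Hawaii last week.",
    "Paris is the capital of France.",
]

for triple, bag in build_bags(triples, sentences).items():
    print(triple, len(bag))
```

The second sentence lands in the bag even though it does not express the relation. This is the kind of labelling noise that distantly supervised extractors have to tolerate.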

In this dissertation, we study the key methods of relation discovery for KGs. The main contents and achievements are as follows.

1. Previous representation learning methods do not consider the multiple types of entities and relations and let entities and relations of different types share the same mapping matrices, which reduces the accuracy of relation prediction. To overcome this problem, we propose TransD, a representation learning method for reasoning based on dynamic mapping matrices. It defines two vectors for each entity and relation. The first represents the general meaning of the entity or relation; the second is used to construct mapping matrices dynamically, which provide a flexible way to project the relation-specific component of an entity vector into the relation vector space, where the translation is completed. Every entity-relation pair thus has a unique mapping matrix determined by the pair itself, which adapts to the multiple types of entities and relations. In addition, TransD is more efficient than previous work because its matrix-by-vector operations can be replaced by vector operations. We evaluate TransD on triple classification and link prediction on WordNet and Freebase. The experimental results show that our method improves significantly over previous models.

2. To alleviate the negative effects of heterogeneity (different relations link different numbers of entity pairs) and imbalance (a relation links different numbers of head entities and tail entities) in KGs, we propose TranSparse, a representation learning method based on adaptive sparse mapping matrices. TranSparse defines all mapping matrices as sparse matrices and contains two models, "share" and "separate". The "share" model addresses heterogeneity: head and tail entities share the same mapping matrix, whose sparse degree is determined by the number of entity pairs linked by the relation. The more entity pairs a relation links, the smaller the sparse degree of its mapping matrix, and vice versa. The "separate" model builds on "share" and addresses imbalance. It has two mapping matrices for each relation, one for head entities and one for tail entities, and their sparse degrees are determined by the number of entities the relation links at the head or tail position. Sparse matrices not only adapt to different KGs but also reduce computation, which makes the model easy to scale to large KGs. In experiments, TranSparse improves prediction results significantly.
 
3. To reduce the influence of noisy data and supply sufficient background knowledge about entities, we propose a distantly supervised relation extraction method with sentence-level attention and entity descriptions. For a multi-instance bag, it uses the difference between the two given entities' vectors to represent relation features, extracts each sentence's feature vector with a Piecewise Convolutional Neural Networks (PCNNs) module, and computes the similarities (attention weights) between the relation features and each sentence's features through a hidden layer. Sentences with higher weights are treated as valid instances and the others as noise. Finally, the weighted sum of all sentence feature vectors is taken as the bag's features and fed into a softmax classifier. In addition, we extract entity descriptions from Freebase and Wikipedia. The descriptions not only provide more information for predicting relations but also yield better entity representations for the attention module. The experimental results show that the attention mechanism selectively focuses on relevant sentences by assigning higher weights to valid sentences, and that the model obtains more useful background knowledge from the entity descriptions. Our model outperforms all the baseline systems in both held-out and manual evaluations.
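
The attention step described in Part 3 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the dissertation's implementation: the sentence features are random stand-ins for PCNN outputs, the hidden layer is assumed to be a tanh over the concatenated sentence and relation features followed by a linear scoring vector, the weights are normalized with a softmax, and all names (`bag_attention`, `classify_bag`, `W_a`, `b_a`, `W_c`, `b_c`) are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def bag_attention(sent_feats, head_vec, tail_vec, W_a, b_a):
    """Return attention weights and the weighted bag feature.

    sent_feats: (n_sentences, d_sent) sentence features (stand-ins for PCNN output)
    head_vec, tail_vec: (d_ent,) entity vectors
    W_a: (d_sent + d_ent,) scoring vector, b_a: scalar bias (assumed layer shape)
    """
    rel_feat = head_vec - tail_vec  # relation feature = difference of entity vectors
    rel_tiled = np.tile(rel_feat, (sent_feats.shape[0], 1))
    hidden = np.tanh(np.concatenate([sent_feats, rel_tiled], axis=1))
    weights = softmax(hidden @ W_a + b_a)  # one weight per sentence
    bag_feat = weights @ sent_feats  # weighted sum of sentence features
    return weights, bag_feat


def classify_bag(bag_feat, W_c, b_c):
    """Softmax classifier over relation labels."""
    return softmax(W_c @ bag_feat + b_c)


rng = np.random.default_rng(0)
n_sentences, d_sent, d_ent, n_relations = 3, 6, 4, 5
sent_feats = rng.normal(size=(n_sentences, d_sent))
head_vec, tail_vec = rng.normal(size=(2, d_ent))
W_a, b_a = rng.normal(size=d_sent + d_ent), 0.0
W_c, b_c = rng.normal(size=(n_relations, d_sent)), np.zeros(n_relations)

weights, bag_feat = bag_attention(sent_feats, head_vec, tail_vec, W_a, b_a)
probs = classify_bag(bag_feat, W_c, b_c)
print("attention weights:", weights)
print("relation probabilities:", probs)
```

With untrained random parameters the weights carry no meaning; the sketch only shows how the pieces fit together, with noisy sentences expected to receive small weights after training.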

In the above work, Parts 1 and 2 address relation prediction based on representation learning, and Part 3 addresses relation extraction from unstructured text. They complement each other and together constitute the relation discovery work of this dissertation.
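
The two projection strategies from Parts 1 and 2 can be contrasted in a short NumPy sketch. It assumes the formulations usually given for TransD (a mapping matrix of the form r_p e_p^T + I, here with equal entity and relation dimensions) and for TranSparse (a sparse degree that falls linearly with the number of links a relation has, bounded below by `theta_min`). The random placement of zeros and all function names are illustrative assumptions rather than details stated in this abstract.

```python
import numpy as np


def transd_project(e, e_p, r_p):
    """TransD-style projection using vector operations only."""
    return e + np.dot(e_p, e) * r_p


def transd_project_explicit(e, e_p, r_p):
    """Same projection with the mapping matrix built explicitly, for comparison."""
    mapping = np.outer(r_p, e_p) + np.eye(len(e))
    return mapping @ e


def sparse_degree(n_links, n_links_max, theta_min=0.0):
    """Fraction of zero entries: fewer links give a sparser matrix."""
    return 1.0 - (1.0 - theta_min) * n_links / n_links_max


def sparse_mask(shape, theta, rng):
    """0/1 mask with round(theta * size) zeros at random positions (assumed pattern)."""
    size = shape[0] * shape[1]
    n_zero = int(round(theta * size))
    mask = np.ones(size)
    mask[rng.choice(size, size=n_zero, replace=False)] = 0.0
    return mask.reshape(shape)


def translation_score(h_proj, r, t_proj):
    """Higher is better; 0 means h_proj + r equals t_proj exactly."""
    return -float(np.sum((h_proj + r - t_proj) ** 2))


rng = np.random.default_rng(0)
dim = 4
h, h_p, t, t_p, r, r_p = rng.normal(size=(6, dim))

# TransD: head and tail are projected with their own dynamic mappings.
print(translation_score(transd_project(h, h_p, r_p), r, transd_project(t, t_p, r_p)))

# TranSparse "share": one sparse matrix per relation, sparsity set by pair count.
theta = sparse_degree(n_links=100, n_links_max=1000, theta_min=0.1)
M = rng.normal(size=(dim, dim)) * sparse_mask((dim, dim), theta, rng)
print(translation_score(M @ h, r, M @ t))

# TranSparse "separate": head and tail matrices with their own sparse degrees.
M_h = rng.normal(size=(dim, dim)) * sparse_mask((dim, dim), sparse_degree(20, 500), rng)
M_t = rng.normal(size=(dim, dim)) * sparse_mask((dim, dim), sparse_degree(400, 500), rng)
print(translation_score(M_h @ h, r, M_t @ t))
```

The pair `transd_project` and `transd_project_explicit` shows why the dynamic mapping avoids a matrix-vector product: when the dimensions match, multiplying by r_p e_p^T + I reduces to one dot product and one scaled vector addition.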


 

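Keywords: knowledge graph; relation discovery; representation learning; weakly supervised (distantly supervised) relation extraction; convolutional neural network
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/14782
Collection: Graduates_Doctoral dissertations
Author affiliation: Institute of Automation, Chinese Academy of Sciences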
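Recommended citation
GB/T 7714
Ji Guoliang (纪国良). Research on Key Techniques of Relation Discovery for Knowledge Graphs (面向知识图谱的关系发现关键技术研究) [D]. Beijing: University of Chinese Academy of Sciences, 2017.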
Files in this item
File name/size: 面向知识图谱的关系发现关键技术研究.pd (5538KB)
Document type: Dissertation
Access type: Restricted
License: CC BY-NC-SA