Research on Key Technologies of Knowledge Extraction for Predefined Relation Types


	Zheng Suncong (郑孙聪)
	2017-05-27
Degree type	Doctor of Engineering
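Chinese abstract (translated)	With the rapid development of Internet technology, the volume of text data on the web has grown sharply. This massive body of text contains rich knowledge, but it also carries a great deal of redundant information, leaving users struggling with information overload. How to extract entities and the relations between them from unstructured text efficiently and accurately, so as to form structured knowledge that helps people reach key information quickly and enriches the knowledge resources on which intelligent applications depend, is therefore an active research topic in knowledge extraction. This thesis takes unstructured text as its object of study, centers on relation types predefined by people, and aims to obtain structured knowledge units. With text topic extraction as preliminary work, it studies knowledge extraction from three different angles. In this thesis, a structured knowledge unit is a special kind of triple: the subject and the object are both entity words in the source text, and the predicate is a predefined relation type. The three extraction approaches are pipelined extraction, which first recognizes entities and then extracts relations; joint extraction of entities and relations; and end-to-end extraction of triples. The research content and results fall into the following four parts.

1. An unsupervised text topic embedding method is proposed for topic extraction. The topic of a text summarizes its main semantic information. Extracting topic information from a text corpus helps to define the relation type system and to extract domain-specific knowledge, and is groundwork for knowledge extraction. Drawing on the idea of word2vec, this thesis proposes an unsupervised topic embedding method for texts (sentences and documents). The method automatically maps texts into a low-dimensional semantic space in which texts with similar topics lie as close together as possible, addressing the high dimensionality, sparsity, and semantic gap of traditional topic extraction methods. Extensive experiments on datasets for related tasks show that the proposed method has a clear advantage in topic extraction.

2. A relation extraction framework is proposed that mines relation pattern features and entity semantic features at the same time. Relation extraction determines the relation between entities that are already known in a sentence, and is the key step of pipelined knowledge extraction. For this task, the thesis designs a relation pattern feature module based on a convolutional neural network and a long short-term memory network, and an entity semantic feature module based on a convolutional network with a diffusion window. The two modules are fused into a framework for extracting entity relations, which is evaluated on the public relation extraction dataset SemEval-2010. The proposed method achieved the best result reported at the time, and the experiments also confirm the effectiveness of the relation pattern extraction module and the entity semantic representation module.

3. A method is proposed for jointly extracting entities and relations. Traditional pipelined knowledge extraction optimizes the two subtasks, entity recognition and relation extraction, separately and tends to overlook the connection between them, while most existing joint extraction methods rely on handcrafted features, which are costly to build and not robust. To address these problems, the thesis proposes a joint entity and relation extraction model based on a hybrid neural network. The method avoids heavy reliance on handcrafted features and strengthens the coupling between entity recognition and relation extraction. Experiments on ACE05, a public information extraction evaluation dataset, verify its effectiveness.

4. An end-to-end triple extraction method based on sequence tagging is proposed. Existing triple extraction algorithms all obtain knowledge units (triples) by first obtaining entity tuples and relation tuples. This causes error accumulation and information redundancy to some degree and hurts extraction quality. To solve these problems, the thesis designs a new tagging scheme that converts triple extraction into a sequence tagging task, so that triples can be extracted by an end-to-end algorithm. In addition, building on this tagging scheme and the characteristics of triple tags, the thesis improves an end-to-end model based on a bidirectional long short-term memory network. Experiments on a large public dataset show that the end-to-end extraction method based on the proposed tagging scheme outperforms the other current algorithms.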
English abstract	With the rapid development of the Internet, the amount of text data on the network has increased dramatically. On the one hand, this massive text data contains a wealth of knowledge, which can support the development of a variety of intelligent applications. On the other hand, it carries a large amount of redundant information, which makes it difficult for people to find the information they want. Extracting entities and their relationships from unstructured text data to form structured knowledge can promote the development of intelligent industry and help people find information quickly. In this thesis, we focus on the problem of extracting knowledge from unstructured texts given a predefined relation set. In order to define the relation set and preprocess the text to be extracted, we first study topic extraction. We then propose three different kinds of knowledge extraction methods: a pipelined knowledge extraction method, a joint knowledge extraction method, and an end-to-end knowledge extraction method. The main achievements of this thesis are as follows.

First, an unsupervised text topic embedding method is proposed for topic extraction. The topic of a text contains its main semantic information. Extracting topic information from text helps to define the relation set and to extract domain-specific knowledge, so topic extraction is groundwork for knowledge extraction. We propose a text topic embedding method, based on the idea of word2vec, to extract text topics. The method automatically embeds texts into a low-dimensional semantic space in which texts with similar topics are close to each other. Compared with conventional topic models, it alleviates the data sparsity problem and can capture the semantic relevance between different texts. The proposed method is efficient and unsupervised, so it is suitable for topic extraction problems with large-scale textual data. The experimental results show that the proposed method has an obvious advantage in terms of topic representation.

Second, a neural network based framework is proposed for relation extraction, which simultaneously learns two kinds of important features: the semantic properties of entities and the relation pattern of the sentence. Relation extraction identifies the relationship between two given entities in a text, and is an important step of the pipelined knowledge extraction method. To extract relations, we propose a neural network based framework that contains a relation pattern extraction module (RPE) and an entity semantic extraction module (ESE). RPE focuses on extracting the features of relation patterns, and ESE focuses on learning the features of the entities' semantic properties. We conduct experiments on a public dataset, the SemEval-2010 Task 8 dataset. This relation extraction method achieves the state-of-the-art result without using any external information. Additionally, the experimental results show that ESE can represent the semantic relationship of the given entities effectively.

Third, a hybrid neural network model is proposed to jointly extract entities and their relationships. Traditional pipelined knowledge extraction methods treat the task as two separate tasks, i.e., named entity recognition and relation extraction, and neglect the relevance of these two subtasks. Besides, most existing joint methods are feature-based structured systems that need complicated feature engineering and rely heavily on supervised NLP toolkits. Based on the above analysis, we propose a hybrid neural network model to jointly extract entities and their relationships without any handcrafted features. We conduct experiments on the public dataset ACE05 (Automatic Content Extraction program) to verify the effectiveness of our method. The proposed method achieves state-of-the-art results on the entity and relation extraction task.

Finally, an end-to-end knowledge extraction method based on a novel tagging scheme is proposed. Most existing methods for knowledge extraction identify entities and their relationships separately. These methods suffer from error propagation or produce redundant information, which can affect the extraction results. To solve these problems, we propose a novel tagging scheme that converts the knowledge extraction task into a tagging task, without extracting entities and relations separately. Since the tags contain the information of the extracted triples, the triples can be extracted directly by end-to-end models. We conduct experiments on a public dataset produced by distant supervision, and the experimental results show that the tagging based methods are better than most of the pipelined methods and joint learning methods. Moreover, the end-to-end model proposed in this thesis achieves the best results on the public dataset.
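The idea of turning triple extraction into a tagging task can be illustrated with a small sketch. The abstract does not spell out the tag format, so the layout used here (entity position B/I/E/S, relation label, and role 1 for subject or 2 for object, with "O" for all other tokens) is an assumption for illustration, as are the function names and the relation label `BornIn`. The decoder pairs each subject with the first object that shares its relation, which is a simplification and not necessarily the thesis's matching rule.

```python
from typing import List, Tuple

Span = Tuple[int, int]           # token indices [start, end), end exclusive
Triple = Tuple[Span, str, Span]  # (subject span, relation label, object span)


def span_tags(span: Span, relation: str, role: int) -> List[str]:
    """Build tags for one entity: position (B/I/E/S), relation, role (1 or 2)."""
    length = span[1] - span[0]
    if length == 1:
        positions = ["S"]
    else:
        positions = ["B"] + ["I"] * (length - 2) + ["E"]
    return [f"{p}-{relation}-{role}" for p in positions]


def encode(tokens: List[str], triples: List[Triple]) -> List[str]:
    """Turn annotated triples into one tag per token (hypothetical tag layout)."""
    tags = ["O"] * len(tokens)
    for subj, relation, obj in triples:
        tags[subj[0]:subj[1]] = span_tags(subj, relation, 1)
        tags[obj[0]:obj[1]] = span_tags(obj, relation, 2)
    return tags


def decode(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, str]]:
    """Recover (subject, relation, object) strings from a tag sequence."""
    entities = []  # collected as (text, relation, role)
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = []
            continue
        position, rest = tag.split("-", 1)
        relation, role = rest.rsplit("-", 1)
        if position in ("B", "S"):
            current = [token]
        else:
            current.append(token)
        if position in ("E", "S"):
            entities.append((" ".join(current), relation, int(role)))
            current = []
    triples = []
    for text, relation, role in entities:
        if role != 1:
            continue
        # Simplified pairing: first object that carries the same relation.
        for other, other_relation, other_role in entities:
            if other_relation == relation and other_role == 2:
                triples.append((text, relation, other))
                break
    return triples


if __name__ == "__main__":
    tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw"]
    tags = encode(tokens, [((0, 2), "BornIn", (5, 6))])
    print(tags)
    print(decode(tokens, tags))
```

Because every token receives exactly one tag, any sequence tagging model can be trained on the output of `encode`, and `decode` turns its predictions back into triples without a separate entity recognition or relation classification step.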
Keywords	knowledge extraction; text embedding; convolutional neural network; long short-term memory network; end-to-end model
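Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/14672
Collection	Graduates_Doctoral Dissertations
Author affiliation	Digital Content Research Center, Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zheng Suncong. Research on Key Technologies of Knowledge Extraction for Predefined Relation Types[D]. Beijing: Graduate University of Chinese Academy of Sciences, 2017.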

Files in this item
File name/size	Document type	Version type	Access type	License
面向预定义关系类型的知识抽取关键技术研究 (6564 KB)	Thesis		Restricted access	CC BY-NC-SA