With the rapid development of the Internet, text data on the network is increasing dramatically. This massive text data contains a wealth of knowledge that can support a variety of intelligent applications, but it also brings a huge amount of redundant information, which makes it difficult for people to find the information they want. We need to discover knowledge from large-scale text data and convert that knowledge into a form that computers can understand. Knowledge extraction aims to solve this problem.
In this thesis, we focus on the problem of extracting knowledge from unstructured text given a predefined relation set. In order to define the relation set and preprocess the text, we first study the technology of topic extraction. Three different kinds of knowledge extraction methods are then proposed: a supervised relation extraction method, a distant supervised relation extraction method, and a joint knowledge extraction method. The main contributions of this thesis are as follows:
Firstly, a hybrid neural network model is proposed for topic extraction. The topic of a text carries its main semantic information, and extracting this topic information is helpful both for defining the relation set and for extracting domain-specific knowledge. We propose a neural model based on two-dimensional convolution and two-dimensional pooling to extract topic information; this model is able to capture the semantic information in the text. We conduct experiments on six public datasets, and the experimental results show that the proposed method outperforms most of the state-of-the-art methods.
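The two-dimensional convolution and pooling over a sentence's embedding matrix can be sketched as follows. This is a minimal illustration only; the input size, kernel size, and activation are assumptions for demonstration, not the exact configuration used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.standard_normal((10, 8))   # 10 words, 8-dim embeddings (assumed sizes)

def conv2d_valid(x, kernel):
    """Valid 2D cross-correlation of a single-channel input with one kernel."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(x, size):
    """Non-overlapping 2D max-pooling (input dims assumed divisible by size)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

kernel = rng.standard_normal((3, 3))      # one illustrative 3x3 filter
feat = np.tanh(conv2d_valid(sentence, kernel))  # (8, 6) feature map
pooled = maxpool2d(feat, 2)                     # (4, 3) after 2x2 pooling
```

Because the convolution and pooling slide over both the word axis and the embedding axis, the extracted features combine information across adjacent words and across embedding dimensions, rather than along the word axis alone.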
Secondly, a neural network based on an attention mechanism is proposed for supervised relation extraction. Relation extraction is the task of identifying the relationship between two given entities in a text, which is an important step of the pipelined knowledge extraction method. The main weakness of most existing methods is that their features are explicitly derived from Natural Language Processing (NLP) tools, so errors generated by these tools propagate through the methods, and features constructed for one domain cannot be utilized in another. Another weakness is that most existing methods treat all words in the text as equally important, ignoring the fact that keywords are more crucial to the relation than other words. Based on the above analysis, we propose a novel neural network based on an attention mechanism to extract relations without any handcrafted features. Experimental results on the SemEval-2010 relation classification task show that the proposed method, using only word embeddings, outperforms most existing methods.
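The word-level attention idea, weighting each word by a learned score so that keywords dominate the sentence representation, can be sketched in a few lines. The dimensions and the single attention query vector here are illustrative assumptions, not the thesis's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((7, 16))  # hidden states for 7 words, 16-dim each (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w = rng.standard_normal(16)       # trainable attention query vector (assumption)
scores = np.tanh(H) @ w           # one unnormalized relevance score per word
alpha = softmax(scores)           # attention weights over words, summing to 1
sent = alpha @ H                  # weighted sum -> sentence representation (16,)
```

A relation classifier (e.g. a softmax layer) would then be applied to `sent`; during training, words that are predictive of the relation receive larger weights in `alpha`, while uninformative words are suppressed.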
Thirdly, a hierarchical selective attention mechanism based neural network is proposed for distant supervised relation extraction. In relation extraction, one challenge faced when building a machine learning system is the generation of training examples, as manually labelling text is time-consuming. Distant supervised relation extraction methods avoid manual labelling but suffer from the wrong label problem: a sentence that mentions two entities does not necessarily express their relation. To solve these problems, we propose a hierarchical selective attention mechanism based neural network, which does not rely on manually annotated text. We conduct experiments on a widely used dataset, and the experimental results demonstrate that the proposed method performs significantly better than most existing methods.
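The selective attention step against the wrong label problem can be sketched as follows: all sentences mentioning the same entity pair form a bag, and each sentence is weighted by how well it matches a relation query, so wrongly labelled sentences get small weights. The bag size, dimensions, and dot-product scoring are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
bag = rng.standard_normal((4, 16))  # 4 sentence vectors for one entity pair (assumed)
r = rng.standard_normal(16)         # query embedding of the candidate relation (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

alpha = softmax(bag @ r)   # sentences that match the relation get larger weights
bag_repr = alpha @ bag     # bag representation; noisy sentences are de-emphasized
```

The bag representation, rather than any single (possibly mislabelled) sentence, is then fed to the relation classifier, which softens the impact of sentences that mention the entities without expressing the relation.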
Finally, a hybrid neural network is proposed to jointly extract entities and their relations. Traditional pipelined knowledge extraction methods treat the task as two separate subtasks, i.e., named entity recognition and relation extraction, and thus neglect the relevance between them. Besides, most existing joint methods are feature-based structured systems, which need complicated feature engineering and rely heavily on supervised NLP tools. We propose a hybrid neural network to jointly extract entities and their relations without using any handcrafted features. Experimental results on the CoNLL04 dataset demonstrate that the proposed model, using only word embeddings as input, achieves state-of-the-art performance.
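The joint setup, where one shared encoder feeds both an entity-tagging head and a relation-classification head so the two subtasks inform each other, can be sketched as below. The layer sizes, tag set, relation set, and max-pooling readout are illustrative assumptions, not the thesis's exact design.

```python
import numpy as np

rng = np.random.default_rng(3)
E = rng.standard_normal((6, 8))        # 6 words, 8-dim word embeddings (only input)
W_shared = rng.standard_normal((8, 16))
H = np.tanh(E @ W_shared)              # shared representation used by both subtasks

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_ner = rng.standard_normal((16, 5))   # 5 entity tags, e.g. a BIO scheme (assumption)
W_rel = rng.standard_normal((16, 6))   # 6 relation types (assumption)

tag_probs = softmax(H @ W_ner)              # per-word entity tag distribution
rel_probs = softmax(H.max(axis=0) @ W_rel)  # sentence-level relation distribution
```

Joint training would sum the cross-entropy losses of both heads, so gradients from entity supervision and relation supervision both update the shared encoder, which is what lets the model exploit the relevance between the two subtasks that pipelined methods ignore.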