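Research on Key Techniques of Relation Extraction from Unstructured Text (面向非结构化文本的关系抽取关键技术研究)
Zeng Xiangrong (曾祥荣)
Subtype: Doctoral
Thesis Advisor: Zhao Jun (赵军)
2019-05-28
Degree Grantor: Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Information Extraction; Relation Classification; Relation Extraction; Multiple Relation Extraction; Deep Learning; Reinforcement Learning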
Abstract

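The widespread adoption of computers and smartphones has led vast numbers of users to generate massive amounts of unstructured text on the Internet. This unstructured text contains rich knowledge that is difficult to use directly. Information extraction can automatically extract knowledge from unstructured text and thereby support tasks such as semantic retrieval and question answering.
Relation extraction is an important component of information extraction and also one of the hard problems in the field. It aims to find entities and the semantic relations between them in given unstructured text. This dissertation focuses on extracting the relation between two entities, where the relations are defined in advance; the task is therefore called predefined binary relation extraction. Relation extraction can also be divided into subtasks along other dimensions, such as whether the entities in a sentence are already known and how many triplets the sentence contains. Because natural language is flexible and diverse, the same semantic relation can be expressed in many different ways, and the same expression may denote different semantic relations in different contexts. This poses a major challenge for relation extraction from unstructured text and has attracted broad attention from both academia and industry. Results from relation extraction research have already been applied in some Internet products, such as the knowledge graphs used in semantic search.
This dissertation studies key techniques for relation extraction from unstructured text. The main contributions are:
1. In distantly supervised data, individual sentences have no direct supervision signal (relation label); only bags carry explicit supervision, so ordinary supervised methods are hard to apply. To address this, we propose using reinforcement learning for sentence-level relation classification. We treat the process of predicting a bag's relation as a reinforcement learning process. Given a bag, a relation classifier based on a convolutional neural network classifies each sentence in it independently, and the relations predicted for the individual sentences are then combined to predict the relation of the bag. The predicted bag relation is compared with the gold bag relation to judge how well the relation classifier performed over the whole process, which determines the reward. Finally, the reward is used to train the relation classifier. Experimental results show that the method outperforms the baseline methods in two different types of experiments.
2. To address the problem of overlapping entities among triplets in multiple relation extraction, we propose a sequence-to-sequence model that incorporates a copy mechanism. The method uses the sequence-to-sequence model to generate the triplets directly. When generating a triplet, it first generates the relation, then uses the copy mechanism to copy the first entity from the source sentence, and finally copies the second entity from the source sentence. Because an entity is produced by copying, it can be copied again whenever it needs to take part in another triplet, so the method can handle overlapping entities. The method also adopts two different strategies for generating the triplets. Experimental results show that it achieves performance close to the baselines when a sentence contains only one triplet and significantly outperforms them when a sentence contains multiple triplets.
3. To address the problem of extraction order in multiple relation extraction, we propose using reinforcement learning to guide the extraction of triplets. When a sentence contains multiple triplets, the order in which they are extracted affects the final result, because some triplets are easier to extract and, once extracted, can help with the extraction of the others. An ordinary supervised model requires the extraction order to be specified for each sentence in advance, but it is difficult to specify the best order for every sentence. We therefore train with reinforcement learning and tie the reward to the number of triplets that are ultimately extracted correctly. To obtain the highest reward, the model automatically extracts the triplets in the best order. Experiments on two public datasets verify the effectiveness of the method.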
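As a rough illustration of the bag-level reward described in the first contribution, the Python sketch below combines per-sentence predictions into a bag prediction, scores it against the gold label, and forms a REINFORCE-style loss. The aggregation rule (majority vote over non-NA predictions), the +1/-1 reward values, the `NA` label, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

NA = "NA"  # assumed label meaning "no relation"


def predict_bag_relation(sentence_relations):
    """Combine per-sentence predictions into one bag-level prediction.

    Assumed rule: ignore NA votes and take the most frequent remaining
    relation; the bag is NA only if every sentence was predicted as NA.
    """
    votes = Counter(r for r in sentence_relations if r != NA)
    if not votes:
        return NA
    return votes.most_common(1)[0][0]


def bag_reward(sentence_relations, gold_relation):
    """Assumed reward: +1 if the bag prediction matches the gold label, else -1."""
    predicted = predict_bag_relation(sentence_relations)
    return 1.0 if predicted == gold_relation else -1.0


def reinforce_loss(chosen_probs, reward):
    """REINFORCE-style loss for one bag.

    chosen_probs holds the classifier's probability for the relation it
    picked in each sentence; every sentence shares the bag-level reward.
    """
    return -reward * sum(math.log(p) for p in chosen_probs)


# Hypothetical bag of four sentences mentioning the same entity pair.
predictions = ["NA", "founder", "founder", "born_in"]
reward = bag_reward(predictions, gold_relation="founder")
loss = reinforce_loss([0.9, 0.6, 0.7, 0.4], reward)
print(reward, round(loss, 4))
```

Minimising this loss raises the probability of the chosen sentence-level relations when the bag prediction is correct and lowers it otherwise, which is how bag-level supervision can reach a sentence-level classifier.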

Other Abstract

With the spread of computers and smartphones, Internet users are generating large amounts of unstructured text. This text contains a great deal of knowledge, but that knowledge is not easy to use directly. Information extraction aims to extract the knowledge from unstructured text so that other tasks, such as question answering, can benefit from it. As an important and challenging subtask of information extraction, relation extraction aims to automatically recognize a pair of entities and determine the semantic relation between them.
In this dissertation, we focus on predefined relations between two entities, a task called predefined binary relation extraction. Relation extraction can also be divided along other dimensions, such as whether the entities are given or how many relational facts a sentence contains. Because natural language is flexible, a semantic relation can be expressed in various ways, which makes extraction difficult. The relation extraction task has attracted attention from both researchers and industry: many related research papers are published every year, and some companies have already applied relation extraction in their products.
This dissertation focuses on relation extraction from unstructured text. The main contributions are as follows:
1. Supervised methods are not well suited to distantly supervised datasets because the sentences in such datasets are not labeled. To address this problem, we apply reinforcement learning to sentence-level relation classification on distantly supervised data. The bag relation prediction process is cast as a reinforcement learning process. Given a bag, a classifier based on a convolutional neural network predicts the relation of each sentence separately. We then combine the predicted relations of the sentences to predict the bag relation. The reward is based on a comparison of the predicted bag relation with the gold relation and is used to judge the performance of the relation classifier. We conduct two different types of experiments, and our method outperforms the baseline methods in both.
2. To address the overlapping problem in multiple relational fact extraction, we propose a sequence-to-sequence model with a copy mechanism. The model generates relational facts directly. When generating a triplet, it first generates the relation, then copies the first entity from the source sentence, and finally copies the second entity from the source sentence. Because entities are generated with the copy mechanism, an entity can participate in other relational facts when necessary, which resolves the overlapping problem. Two different strategies are used to generate the relational facts. Experiments show that our method achieves performance comparable to the baseline methods when a sentence contains only one relational fact and significantly outperforms them when a sentence contains multiple relational facts.
3. To address the extraction order problem in multiple relational fact extraction, we use reinforcement learning to guide the extraction of relational facts. When a sentence contains multiple relational facts, the extraction order can affect extraction performance, because some relational facts are easier to extract and can help with the extraction of others. A supervised model requires a predefined extraction order for each sentence, but it is difficult to assign the best order to each sentence manually. We use reinforcement learning to train a sequence-to-sequence model and tie the reward to the number of correctly extracted relational facts. To achieve the highest reward, the model learns to extract relational facts in their best order automatically. Extensive experiments verify the effectiveness of this method.
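The generation order in the second contribution (relation first, then the two copied entities) and the count-based reward in the third can be sketched together as follows. This is a minimal illustration, not the dissertation's model: it assumes that each entity is a single source token, that decoder outputs arrive as a flat list of indices, and that the reward is simply the number of distinct correct triplets. The function names and the example sentence are hypothetical.

```python
def decode_triplets(steps, relations, tokens):
    """Group flat decoder outputs into (relation, entity1, entity2) triplets.

    Every three steps form one triplet: the first index selects a relation
    from the predefined relation list, and the next two indices are copy
    positions in the source sentence. Incomplete trailing steps are dropped.
    """
    triplets = []
    usable = len(steps) - len(steps) % 3
    for i in range(0, usable, 3):
        rel_id, pos1, pos2 = steps[i:i + 3]
        triplets.append((relations[rel_id], tokens[pos1], tokens[pos2]))
    return triplets


def order_free_reward(predicted, gold):
    """Assumed reward: count of distinct predicted triplets found in gold.

    The gold triplets are treated as a set, so no extraction order is
    imposed on the model.
    """
    return len(set(predicted) & set(gold))


# Hypothetical example: "Honolulu" is copied twice, so it can take part
# in two triplets (the overlapping case).
tokens = ["Obama", "was", "born", "in", "Honolulu", ",", "Hawaii"]
relations = ["born_in", "located_in"]
predicted = decode_triplets([0, 0, 4, 1, 4, 6], relations, tokens)
gold = [("located_in", "Honolulu", "Hawaii"), ("born_in", "Obama", "Honolulu")]
print(predicted)
print(order_free_reward(predicted, gold))
```

Because the reward compares sets, the same score is obtained whichever triplet is produced first, leaving the model free to settle on whatever order it finds easiest.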

Pages: 98
Language: Chinese
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23770
Collection: National Laboratory of Pattern Recognition, Natural Language Processing
Recommended Citation
GB/T 7714
曾祥荣. 面向非结构化文本的关系抽取关键技术研究[D]. 中科院自动化所. 中科院自动化所,2019.
Files in This Item:
File Name/Size: 面向非结构化文本的关系抽取关键技术研究 (8436KB)
DocType: Dissertation
Access: Open Access
License: CC BY-NC-SA