Research on the Key Methods for Relation Extraction from Unstructured Texts
Original Title: 面向非结构化文本的关系抽取关键技术研究
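Zeng Daojian (曾道建)
Subtype: Doctor of Engineering
Thesis Advisor: Zhao Jun (赵军)
2015-05-27
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Information Extraction; Relation Extraction; Unstructured Texts; Convolutional Neural Networks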
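Abstract: With the development and spread of Internet technology, the Web has become an indispensable part of most people's daily lives. The Internet holds a vast amount of unstructured electronic text. Faced with ever-growing Web data, it is increasingly important to help people understand this data, to quickly discover knowledge in massive unstructured text, and to represent that knowledge in a form computers can "understand", thereby reducing the human cost of learning. Research on Information Extraction addresses exactly this problem.

Relation Extraction is an important component of information extraction and one of the field's fundamental and most difficult tasks. Its goal is to automatically identify, from structured and unstructured text, a pair of concepts and the semantic relation connecting them, forming a relation triple. Relation extraction not only helps with managing and serving Internet information but also provides important support for understanding text content: it lifts text analysis from the language level to the content level and has potential applications in large-scale knowledge base construction, question answering, and semantic search. Relation extraction has therefore drawn wide attention from academia and industry and is becoming an increasingly popular research topic. In recent years, research on relation extraction from structured text has made some progress. However, natural language is flexible: the same semantic relation can be expressed in different ways, and the same expression often describes different semantic relations in different contexts. This ambiguity poses a major challenge for relation extraction from unstructured text, whose performance has remained low, and many problems remain worth studying.

This dissertation studies key techniques for relation extraction from unstructured text. The main research content includes:

1. For supervised relation extraction, to address error accumulation in the feature extraction process, we propose a relation extraction method based on Convolutional Neural Networks (CNNs). The method does not rely on existing natural language processing tools. It uses a convolutional network to automatically learn features that represent semantic relations from raw text and, in particular, uses position features to model the words to be assigned a semantic relation. Specifically, the input text is first converted to vector form by looking up word embeddings. The vectors of the words to be assigned a semantic relation are then extracted as lexical-level features, while a convolutional network performs semantic composition to obtain sentence-level features. Finally, the two kinds of features are concatenated to form the final feature vector. Experimental results show that, compared with the baseline systems, the method significantly improves performance on the relation extraction task and significantly alleviates the error accumulation of traditional feature extraction, and that system performance improves further when position features are used.

2. For distantly supervised (DS, Distant Supervision) relation extraction, to address the labeling noise introduced when training data is annotated automatically and the loss of word-order information when using convolutional networks, we propose a distantly supervised relation extraction method based on Piece-wise Convolutional Neural Networks (PCNNs). The method replaces the pooling operation of a traditional convolutional network with piecewise max pooling in order to capture word-order information and obtain structured features. In addition, the method treats distantly supervised relation extraction as a multi-instance problem: each sample is input as a multi-instance bag, the piecewise convolutional network automatically learns features for each instance in the bag, the objective function is defined over bags, and the network parameters are trained with multi-instance learning, which reduces the effect of labeling noise on the experimental results. Experimental results show that, under both held-out evaluation and manual evaluation, the method outperforms the baseline systems and effectively overcomes both the labeling noise and the loss of word-order information in convolutional networks.

3. For distantly supervised relation extraction, to address the problem of exploiting redundant information, ...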
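The piecewise max pooling and bag-level objective described in item 2 of the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so the following are assumptions made only for illustration: the convolution output of one filter is split into three segments by the positions of the two words being related, the maximum is taken within each segment, and the bag-level loss uses only the most confident instance in the bag. The function names `piecewise_max_pool` and `bag_negative_log_likelihood` are hypothetical.

```python
import math
from typing import List, Sequence


def piecewise_max_pool(feature_map: Sequence[float], e1: int, e2: int) -> List[float]:
    """Max-pool one filter's output separately over three segments.

    feature_map: output of a single convolution filter, one value per token.
    e1, e2: token indices of the two words being related (requires e1 < e2).
    Segment boundaries are an assumed convention: [0, e1], (e1, e2], (e2, end].
    """
    if not 0 <= e1 < e2 < len(feature_map):
        raise ValueError("expected 0 <= e1 < e2 < len(feature_map)")
    segments = [
        feature_map[: e1 + 1],
        feature_map[e1 + 1 : e2 + 1],
        feature_map[e2 + 1 :],
    ]
    # The last segment is empty when the second word ends the sentence;
    # fall back to 0.0 in that case (an assumption, not from the abstract).
    return [max(seg) if seg else 0.0 for seg in segments]


def bag_negative_log_likelihood(instance_probs: Sequence[float]) -> float:
    """Bag-level loss that uses only the most confident instance.

    instance_probs: for each sentence in the bag, the model's probability
    of the bag's relation label. Picking the maximum is one common
    multi-instance choice, assumed here for illustration.
    """
    if not instance_probs:
        raise ValueError("a bag must contain at least one instance")
    return -math.log(max(instance_probs))


if __name__ == "__main__":
    fmap = [0.1, 0.9, 0.3, 0.2, 0.7, 0.4]
    # Ordinary max pooling keeps one value and discards where it occurred.
    print(max(fmap))
    # Piecewise pooling keeps one value per segment around the two words.
    print(piecewise_max_pool(fmap, e1=1, e2=3))
    print(bag_negative_log_likelihood([0.2, 0.5, 0.1]))
```

Compared with a single maximum over the whole sentence, the per-segment maxima retain coarse information about where a strong feature occurs relative to the two words, which is the word-order information the abstract refers to.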
Other Abstract: With the development and popularization of the Internet, the network has become an essential part of everyday life. There are large amounts of unstructured text on the Internet. Faced with ever-growing Web data, we need to quickly discover knowledge from large-scale unstructured texts and convert that knowledge into a form that computers can understand. Information extraction aims to solve this problem.

Relation extraction is a critical part of information extraction and one of the most fundamental and difficult tasks in the field. It aims to automatically recognize a pair of concepts and the semantic relation between them from structured or unstructured texts. Relation extraction not only helps to manage the information and services on the Internet, but also supports text comprehension. It can lift text analysis from the language level to the content level and has the potential to help large-scale knowledge base construction, question answering systems and semantic search. Thus, relation extraction has received widespread attention in academia and industry, and is becoming an increasingly popular research topic. Recently, relation extraction from structured texts has made good progress. However, due to the flexibility of natural languages, relation extraction from unstructured texts is much more difficult. On the one hand, a semantic relation can be expressed by different linguistic expressions. On the other hand, a linguistic expression often has entirely different meanings in different contexts. Therefore, relation extraction from unstructured texts faces a great challenge and its performance remains low. There are many problems worthy of study.

In this dissertation, we focus on extracting relations from unstructured texts. The main achievements are as follows:

1. To address error propagation in the feature extraction procedure of supervised relation extraction approaches, we propose a relation extraction approach based on Convolutional Neural Networks (CNNs). In this approach, CNNs are used to automatically learn features from raw texts, independently of existing Natural Language Processing (NLP) tools. In addition, position features (PF) are exploited to specify the pairs of words to which we expect to assign relation labels. The input of the system is raw text. First, word tokens are transformed into vectors by looking up word embeddings. Then, we directly rega...
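The position features (PF) mentioned above can be sketched as follows. The abstract does not define the encoding, so this sketch assumes that each token receives two signed relative distances, one to each of the two words to be related, and that distances are clipped to a fixed range. The function name `position_features`, the clipping parameter `max_dist`, and the example sentence are hypothetical.

```python
from typing import List, Tuple


def position_features(num_tokens: int, e1: int, e2: int,
                      max_dist: int = 30) -> List[Tuple[int, int]]:
    """Return (distance to e1, distance to e2) for every token index.

    e1, e2: token indices of the two words to be related.
    max_dist: distances are clipped to [-max_dist, max_dist]
              (an assumed convention, not from the abstract).
    """
    if not (0 <= e1 < num_tokens and 0 <= e2 < num_tokens):
        raise ValueError("entity indices must lie inside the sentence")

    def clip(d: int) -> int:
        return max(-max_dist, min(max_dist, d))

    return [(clip(i - e1), clip(i - e2)) for i in range(num_tokens)]


if __name__ == "__main__":
    # Hypothetical sentence with the two marked words at indices 0 and 2.
    tokens = ["Jobs", "founded", "Apple", "in", "1976"]
    for tok, pf in zip(tokens, position_features(len(tokens), e1=0, e2=2)):
        print(tok, pf)
```

In a full model each distance would typically be mapped to a learned embedding and concatenated with the word vector before convolution. That step is omitted here because the abstract is truncated before it describes how the features are combined.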
Other Identifier: 201118014628068
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/6717
Collection: Graduates_Doctoral Dissertations
Recommended Citation
GB/T 7714
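Zeng Daojian (曾道建). Research on the Key Methods for Relation Extraction from Unstructured Texts (面向非结构化文本的关系抽取关键技术研究) [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2015.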
Files in This Item:
File Name/Size: CASIA_20111801462806 (5088KB)
Access: Not yet open
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.