Research on the Key Methods for Relation Extraction from Semi-structured and Unstructured Texts

Title	面向半结构和无结构文本的实体关系抽取关键技术研究
Alternative title	Research on the Key Methods for Relation Extraction from Semi-structured and Unstructured Texts
Author	Liu Yang (刘洋)
Date	2014-05-29
Degree type	Doctor of Engineering
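Chinese abstract (translated)	Entity relation extraction is the task of extracting entity attribute relations and relations between entities from semi-structured and unstructured Web text. It is one of the important foundational tasks and hard problems in information extraction, with potential applications in large-scale knowledge base construction, question answering, and semantic search, and it has therefore drawn wide attention from both industry and academia. However, massive heterogeneous Web data and the varied expression of Web language pose great challenges to entity relation extraction. This thesis studies entity relation extraction from large-scale semi-structured and unstructured Web text. The main contributions are as follows:
1. To address template inconsistency in weakly semi-structured text, the thesis proposes a method for extracting entity attribute relations from weakly semi-structured text using site-level template and attribute knowledge. First, a graph-based random walk algorithm is used to obtain high-confidence templates and attributes, which form the site-level knowledge. Then, this knowledge is used to identify the weakly semi-structured text regions in each article, from which attribute-value pairs are extracted and combined with the corresponding entity to form entity attribute relations. Experiments show that, on weakly semi-structured text, the method significantly improves extraction performance and robustness compared with an extraction method that does not use site-level knowledge.
2. Weakly supervised relation extraction from unstructured text suffers from noise in the automatically labeled training data. Many current methods model this labeling noise with multi-instance models, but the multi-instance assumption often does not hold, which leads to poor extraction performance. To address this, the thesis proposes a weakly supervised entity relation extraction method based on discovering distinctive features. First, the features of the automatically labeled training data are used to join the samples that satisfy the labeling condition with those that do not. Then, a topic model is used to model features and latent relations in the joined samples in order to find the clarity of each feature; clarity is fused with informativeness, which characterizes the amount of information a feature carries, to obtain the feature's distinctiveness. Finally, distinctiveness is used as the feature value when learning the relation extraction model, so that highly distinctive features receive larger weights than noisy ones. Experiments show that the method outperforms previous methods in both held-out evaluation and manual evaluation.
3. The features used by traditional weakly supervised relation extraction methods are built from four entity types, and such coarse-grained types often cannot distinguish the relation between two entities. To address this, the thesis proposes a weakly supervised relation extraction method for unstructured text based on discovering fine-grained entity type features, which makes the features more discriminative. First, the correspondence between fine-grained entity types and Wikipedia articles is used to obtain training data and automatically train a fine-grained entity type classifier. Then, search engine results are used to expand entity mentions, and the learned classifier predicts the fine-grained entity type of each expanded mention. Finally, fine-grained entity types are incorporated into the relation extraction model, and three incorporation methods are studied and compared: substitution, augmentation, and selection. Experiments show that, compared with the original extraction model, the augmentation method yields a smoother precision/recall curve in aggregate extraction experiments, which indicates a more stable system, and substantially improves accuracy in sentence-level extraction experiments.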
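As a rough illustration of the graph-based random walk mentioned in the first contribution, the sketch below scores templates and attributes on a small bipartite graph using a random walk with restart. The abstract does not specify the graph construction, the walk variant, or any parameters, so the function name `random_walk_scores`, the restart probability, the edge weights, and the seed set are all assumptions made for this example and not the thesis's actual algorithm.

```python
from collections import defaultdict


def random_walk_scores(edges, seeds, restart=0.15, iterations=50):
    """Score graph nodes by a random walk with restart (illustrative only).

    edges: iterable of (node_a, node_b, positive_weight) tuples, e.g. a
           template node linked to an attribute node it was observed with.
    seeds: nodes assumed to be trustworthy; the walk restarts from them.
    ASSUMPTION: restart=0.15 and the fixed iteration count are arbitrary
    choices for this sketch, not values taken from the thesis.
    """
    # Build a symmetric weighted adjacency map.
    neighbours = defaultdict(dict)
    for a, b, weight in edges:
        neighbours[a][b] = neighbours[a].get(b, 0.0) + weight
        neighbours[b][a] = neighbours[b].get(a, 0.0) + weight

    nodes = list(neighbours)
    seeds = [s for s in seeds if s in neighbours]
    if not seeds:
        raise ValueError("at least one seed must appear in the graph")

    # Restart distribution: uniform over the seed nodes.
    restart_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart_mass)

    for _ in range(iterations):
        # Part of the mass jumps back to the seeds ...
        updated = {n: restart * restart_mass[n] for n in nodes}
        # ... and the rest flows along edges in proportion to their weight.
        for n in nodes:
            total = sum(neighbours[n].values())
            for m, weight in neighbours[n].items():
                updated[m] += (1.0 - restart) * scores[n] * weight / total
        scores = updated
    return scores


# Hypothetical toy graph: two templates (T1, T2) and three candidate attributes.
toy_edges = [
    ("T1", "birthplace", 5.0),
    ("T1", "occupation", 4.0),
    ("T2", "birthplace", 1.0),
    ("T2", "click here", 1.0),
]
toy_scores = random_walk_scores(toy_edges, seeds={"birthplace"})
```

In this toy graph, nodes that are strongly connected to the seed attribute end up with higher scores, and a threshold on those scores would pick the "high-confidence" templates and attributes; how the thesis actually selects them is not stated in the abstract.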
English abstract	Relation extraction aims to extract attribute relations and relations between entities from semi-structured and unstructured texts. It is one of the core tasks in information extraction. Because of its potential impact on large-scale knowledge base construction, question answering, semantic search, and similar applications, it has gained much attention in both the academic and industrial communities. However, massive heterogeneous Web data and the diversity of Web language pose great challenges to relation extraction techniques. In this thesis, we focus on extracting relations from semi-structured and unstructured texts. The main contributions are as follows:
1. To handle the problem of template inconsistency in weakly semi-structured texts, we propose a method that leverages site-level knowledge of templates and attributes to extract attribute relations from weakly semi-structured texts. First, we use a graph-based random walk model to acquire templates and attributes with high confidence, which constitute the site-level knowledge. Then we use this knowledge to identify weakly semi-structured texts in each page and extract attribute-value pairs, which are combined with the corresponding entities to form attribute relations. The experiments show that, compared with a baseline method that does not use site-level knowledge, our method improves extraction performance significantly.
2. Distant supervision (DS) for relation extraction suffers from the problem of noisy labeling. Most solutions try to model the noisy instances in the form of multi-instance learning. However, the distant supervision assumption may fail, which leads to poor performance. In this thesis, we address this problem with a novel approach that explores distinctive features. First, we make use of all the training data (both the labeled part that satisfies the DS assumption and the part that does not). Then, we employ an unsupervised method based on a topic model to discover the feature-relation distribution. We use the distribution to compute the clarity of a feature, and we compute the distinctiveness of a feature by combining its clarity with its informativeness, which is measured by the length and frequency of the feature. Finally, we train the extractor using distinctiveness as the feature value, so that distinctive features get greater weight than noisy ones. Experimental results show that the approach significantly outperforms the baseline methods in both th...
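To make the second contribution's scoring more concrete, the sketch below combines a clarity score with an informativeness score into a distinctiveness value. The abstract only says that clarity comes from the feature-relation distribution and that informativeness is measured by feature length and frequency; the entropy-based clarity, the direction and normalisation of the length and frequency terms, the product used to combine them, and all function and parameter names are assumptions for illustration, not the formulas used in the thesis.

```python
import math


def clarity(relation_probs):
    """Return 1 minus the normalised entropy of P(relation | feature).

    ASSUMPTION: an entropy-based measure is used here for illustration; a
    feature concentrated on one relation scores 1, a uniform one scores 0.
    relation_probs is expected to sum to 1.
    """
    k = len(relation_probs)
    if k < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in relation_probs if p > 0)
    return 1.0 - entropy / math.log(k)


def informativeness(length, frequency, max_length, max_frequency):
    """Combine feature length and frequency into a score in [0, 1].

    ASSUMPTION: longer and more frequent features are treated as more
    informative, with frequency damped by a logarithm.
    """
    length_part = length / max_length
    frequency_part = math.log1p(frequency) / math.log1p(max_frequency)
    return length_part * frequency_part


def distinctiveness(relation_probs, length, frequency, max_length, max_frequency):
    """ASSUMPTION: clarity and informativeness are combined by a product."""
    return clarity(relation_probs) * informativeness(
        length, frequency, max_length, max_frequency
    )


# Hypothetical features of equal length and frequency: one peaked on a
# single relation, one spread almost evenly over three relations.
peaked = distinctiveness([0.9, 0.05, 0.05], 4, 10, 8, 100)
flat = distinctiveness([0.4, 0.3, 0.3], 4, 10, 8, 100)
```

Under these assumptions the peaked feature receives the larger value, so using the value as the feature weight in the extractor would favour it over the ambiguous one, which matches the intent stated in the abstract.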
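Keywords	Information Extraction; Relation Extraction; Semi-structured Texts; Unstructured Texts; Distant Supervision
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6641
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Liu Yang. Research on the Key Methods for Relation Extraction from Semi-structured and Unstructured Texts [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2014.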

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20111801462908 (3732 KB)			Not yet open	CC BY-NC-SA