面向中文知识抽取的语料库构建技术研究 (Research on Corpus Construction Techniques for Chinese Knowledge Extraction)
Author: 郝悦星
Degree type: Master of Engineering
Supervisor: 徐波
Date: 2017-05-27
Degree grantor: 中国科学院研究生院
Degree conferral place: Beijing
Keywords: knowledge extraction, distant supervision, end-to-end memory network, SSM framework
Abstract
In recent years, knowledge extraction has gradually become a hot topic in natural language processing. Existing knowledge extraction algorithms are mainly based on statistical machine learning and therefore depend heavily on training corpora. Because manually annotating knowledge extraction data is costly, there is currently no publicly available large-scale Chinese knowledge extraction dataset. As a result, research on Chinese knowledge extraction algorithms lacks a sound evaluation dataset, which hinders progress in related fields.
To address this problem, this thesis targets Chinese knowledge extraction tasks such as relation extraction and triple extraction. Using a human-machine collaborative approach built around corpus acquisition, filtering, and manual screening, it aims to construct a large-scale, high-quality Chinese knowledge extraction corpus. The main contributions are the following three points:
(1) A corpus construction method based on distant supervision. Following the idea of back-labeling, triples are matched against sentences: if a sentence contains both entities of a triple, the sentence is assumed to express that triple, and it is annotated as a sentence-triple pair and stored in the corpus. To obtain large numbers of sentence-triple pairs automatically, the thesis first designs a crawler that collects pages from online encyclopedic knowledge bases and some entertainment websites; the semi-structured information and the unstructured text parsed from these pages form a triple database and a raw corpus, respectively. Next, an SVM-based incomplete-sentence filtering model is designed, drawing on features such as information gain, TF-IDF values, part-of-speech tags, and syntactic rules to keep only syntactically complete sentences. Finally, the triple database is used to back-label the raw corpus in the distant-supervision manner, producing an initial sentence-triple corpus. Illustrative sketches of the matching step and the sentence filter are given after this item.
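As a concrete illustration of the back-labeling step, the following Python sketch pairs a sentence with a triple whenever both of the triple's entities occur in that sentence. The data structures, field names, and toy examples are assumptions for illustration only; the thesis pipeline additionally applies the completeness filter before matching.

```python
# Minimal sketch of distant-supervision back-labeling: a sentence is
# paired with a triple when it mentions both entities of that triple.
# Data and field names are hypothetical, for illustration only.
from typing import List, NamedTuple

class Triple(NamedTuple):
    head: str      # subject entity
    relation: str  # relation name
    tail: str      # object entity

def back_label(sentences: List[str], triples: List[Triple]):
    """Return candidate (sentence, triple) pairs by entity-pair matching."""
    pairs = []
    for sent in sentences:
        for t in triples:
            # Distant-supervision assumption: co-occurrence of both
            # entities implies the sentence expresses the relation.
            if t.head in sent and t.tail in sent:
                pairs.append((sent, t))
    return pairs

if __name__ == "__main__":
    triples = [Triple("刘德华", "出生地", "香港")]
    sentences = ["刘德华1961年出生于香港。", "刘德华是一名演员。"]
    for sent, t in back_label(sentences, triples):
        print(sent, "->", t)
```

Note that the second toy sentence is not paired because it lacks the tail entity; conversely, any sentence that happens to mention both entities without expressing the relation would still be paired, which is exactly the noise problem addressed in point (2).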
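The completeness filter can likewise be sketched. The fragment below trains a linear SVM on TF-IDF features with scikit-learn; the information-gain, part-of-speech, and syntactic-rule features described above are omitted, and the tiny labeled dataset is invented for illustration, not taken from the thesis.

```python
# Minimal sketch of an SVM-based incomplete-sentence filter using
# TF-IDF features only. Toy training data is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# 1 = syntactically complete sentence, 0 = fragment (invented labels).
train_texts = ["刘德华 1961 年 出生 于 香港 。", "代表 作品",
               "主要 成就 金像奖", "他 毕业 于 北京大学 。"]
train_labels = [1, 0, 0, 1]

# Tokens are assumed pre-segmented and whitespace-separated, so a
# simple non-whitespace token pattern suffices for Chinese text here.
clf = make_pipeline(TfidfVectorizer(token_pattern=r"\S+"), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["获奖 记录", "她 在 上海 长大 。"]))
```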
(2) A corpus filtering method based on an end-to-end neural network. Although distant-supervision back-labeling avoids the manual annotation that supervised methods require, its underlying assumption is imperfect: the semantic relation an entity pair expresses in a sentence does not necessarily match the relation in the corresponding triple. A sentence-triple corpus built by back-labeling alone therefore contains many noisy samples, which severely degrade the performance of extraction algorithms. To address this, the thesis casts the matching decision as a binary classification problem and proposes a classification model based on an end-to-end memory network: the informative content of the sentence is stored in memory slots, and the triple selects the memories relevant to the matching decision before classification, so that correct sentence-triple pairs can be screened out. Experimental results show that this model classifies better than traditional machine learning methods; the experiments also analyze the characteristics of sentence-triple pairs under different relations. A minimal sketch of such a model is given below.
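To make the idea concrete, here is a minimal single-hop memory-network classifier in PyTorch. The dimensions, the mean-pooled triple encoding, the single memory hop, and the token-ID inputs are all simplifying assumptions for illustration; this is not a reproduction of the thesis model.

```python
# Single-hop end-to-end memory network sketch for classifying whether a
# sentence-triple pair matches. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemN2NMatcher(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.emb_in = nn.Embedding(vocab_size, dim)     # input memory (A)
        self.emb_out = nn.Embedding(vocab_size, dim)    # output memory (C)
        self.emb_query = nn.Embedding(vocab_size, dim)  # triple encoder (B)
        self.classifier = nn.Linear(dim, 2)             # noise / match

    def forward(self, sent_ids, triple_ids):
        # sent_ids: (batch, sent_len); triple_ids: (batch, triple_len)
        m = self.emb_in(sent_ids)                # (batch, sent_len, dim)
        c = self.emb_out(sent_ids)               # (batch, sent_len, dim)
        u = self.emb_query(triple_ids).mean(1)   # (batch, dim) query
        # Attention of the triple query over the sentence memory slots.
        p = F.softmax(torch.bmm(m, u.unsqueeze(2)).squeeze(2), dim=1)
        o = torch.bmm(p.unsqueeze(1), c).squeeze(1)  # weighted readout
        return self.classifier(o + u)            # logits over two classes

model = MemN2NMatcher(vocab_size=5000)
sent = torch.randint(0, 5000, (2, 20))   # two toy sentences, 20 tokens each
trip = torch.randint(0, 5000, (2, 3))    # head / relation / tail token ids
print(model(sent, trip).shape)           # torch.Size([2, 2])
```

The attention step is what the abstract describes informally: the triple query decides which sentence memories are relevant before the pair is classified as a correct match or as noise.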
(3) A manual screening platform for the Chinese knowledge extraction corpus. Corpus quality directly affects the performance of downstream extraction models, and a sentence-triple corpus generated automatically by algorithms cannot reach the required accuracy. The thesis therefore builds a manual screening platform on a B/S architecture with the SSM framework (Spring, Spring MVC, and MyBatis), which supports a final human confirmation of the back-labeled and filtered sentence-triple pairs and thus guarantees a high-quality corpus. The platform provides three main functions, confirm, modify, and delete, with which users screen and revise sentence-triple pairs. It also allows multiple users to confirm sentence-triple matches online at the same time, improving the efficiency of manual screening. A sketch of the review states follows this list.
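Although the platform itself is a Java web application on the SSM stack, the review workflow it implements can be pictured abstractly. The following Python fragment is a hypothetical illustration of the confirm/modify/delete state transitions, not the platform's actual data model or API.

```python
# Illustrative data model for the manual screening workflow. The real
# platform is a Java SSM web application; this sketch only shows the
# states a reviewed sentence-triple pair moves through.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PENDING = "pending"      # produced by back-labeling and filtering
    CONFIRMED = "confirmed"  # reviewer verified the sentence expresses the triple
    DELETED = "deleted"      # reviewer judged the pair to be noise

@dataclass
class SentenceTriplePair:
    sentence: str
    triple: tuple                   # (head, relation, tail)
    status: Status = Status.PENDING

    def confirm(self):
        self.status = Status.CONFIRMED

    def modify(self, new_triple: tuple):
        # The reviewer corrects the triple; the pair then counts as confirmed.
        self.triple = new_triple
        self.status = Status.CONFIRMED

    def delete(self):
        self.status = Status.DELETED
```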
Other Abstract
In recent years, the knowledge extraction task has gradually become a hot topic in the field of natural language processing. Due to the high cost of manually labeled data, there is no large-scale Chinese knowledge extraction dataset at present. This leads to a lack of reasonable verification datasets for Chinese knowledge extraction algorithms, which hinders the development of related fields.
In this paper, we construct a Chinese knowledge extraction corpus for knowledge extraction tasks such as relation extraction and joint entity and relation extraction. Through human-computer collaboration, the work proceeds along the following aspects: corpus acquisition, filtering, and manual labeling. The main contents include the following three points:
(1) Firstly, a distant supervision method is proposed for building the Chinese knowledge extraction corpus. If a sentence contains the entity pair of a triple, the sentence is assumed to express the corresponding triple and is marked as a sentence-triple pair and stored in the corpus. First, a crawler is designed to collect page information from online encyclopedias and some entertainment websites. After parsing the pages, we build a triple database and a sentence corpus from the semi-structured information and the unstructured text, respectively. Then, we design an incomplete-sentence filtering model based on SVM, with features including information gain, TF-IDF values, part of speech, and syntactic rules. Finally, we construct the sentence-triple pair corpus based on the idea of distant supervision.
(2) Secondly, a corpus filtering method based on an end-to-end memory network is proposed to filter wrongly labeled sentence-triple pairs. Although the construction method based on distant supervision avoids manual labeling, its hypothesis is not perfect: the semantic relation of the entities in a sentence does not necessarily match the relation in the corresponding triple. This leads to a large number of noisy instances, which seriously affect the performance of the extraction algorithm. Aiming at this problem, we transform the matching problem into a binary classification problem and propose a classification model based on the end-to-end memory network. The effective information of the sentence is stored in the memory, and the triple selects the relevant memories from it to screen out correct instances. The experimental results show that the model is more effective than traditional machine learning methods, and the characteristics of sentence-triple pairs are analyzed in the experiments.
(3) Thirdly, a manual screening platform for the Chinese knowledge extraction corpus is built. The quality of the corpus directly affects the performance of the subsequent extraction model, but the sentence-triple pairs generated automatically by the algorithms cannot meet the requirement of high accuracy. Therefore, we build a manual screening platform based on the B/S architecture and the SSM framework for constructing a high-quality Chinese knowledge extraction corpus. The platform includes three functions: confirmation, modification, and deletion; users can apply these functions to filter and modify sentence-triple pairs. In addition, the platform supports multiple people confirming pairs online at the same time, which improves the efficiency of manual labeling.
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14847
Collection: Master's Theses (Graduates)
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
郝悦星. 面向中文知识抽取的语料库构建技术研究[D]. 北京: 中国科学院研究生院, 2017.
Files in this item:
面向中文知识抽取的语料库构建技术研究.p (2266 KB) · Thesis · Not open access · License: CC BY-NC-SA