Research on Web-based Chinese Entity Relation Extraction (面向网络的中文实体关系抽取的研究)

Title	面向网络的中文实体关系抽取的研究
Alternative Title	Research on Web-based Chinese Entity Relation Extraction
Author	Wang Hao (王昊)
Date	2015-05-25
Degree Type	Master of Engineering
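Chinese Abstract	Entity relation extraction is an important task in information extraction. Its input is multi-structured text data: structured infoboxes, semi-structured tables, and unstructured free text. Its output is a set of entity relations, each expressible as a triple (entity 1, relation, entity 2). Relation triples can be parsed directly from structured and semi-structured data, so current research on entity relation extraction focuses mainly on unstructured text. For example, given the sentence "姚明出生于上海" ("Yao Ming was born in Shanghai"), an entity relation extraction algorithm should extract the relation <姚明, 出生地, 上海> (<Yao Ming, birthplace, Shanghai>). The extracted triples can be used to build a knowledge base, which is of great value to question answering systems, the Semantic Web, machine translation, and related applications. The Internet now holds a massive amount of Chinese data and the number of Chinese Internet users is very large, so research on Chinese entity relation extraction has good application prospects. However, most current work on entity relation extraction handles English data, and little work is based on Chinese corpora. Compared with English, Chinese sentences require word segmentation, and Chinese lacks features such as tense and letter case, so entity relation extraction for Chinese is harder and more challenging. This thesis explores methods for Chinese entity relation extraction. The main innovations and results are: 1. A Chinese semantic knowledge base is built. Web pages from Baidu Encyclopedia and Hudong Encyclopedia are crawled, and their structured parts are extracted and stored as relation triples of the form <entity 1, relation word, entity 2>, forming a Chinese semantic knowledge base. A relation word to be extracted is treated as a high-frequency relation word if its frequency in the knowledge base exceeds a threshold, and as a low-frequency relation word otherwise. 2. Extraction for high-frequency relation words is cast as a sequence labeling problem. A high-frequency relation word corresponds to a rich set of relation triples in the knowledge base; using a scoring strategy, these triples are matched back against text to label candidate sentences, which builds a training corpus automatically. A keyword matching strategy locates the sentences to extract from in the target entry page, a conditional random field model is trained to label the part to be extracted, and relation triples are then derived from the labeling results. Experiments compare different strategies for selecting candidate sentences and give different recommendations depending on whether precision or recall is emphasized. 3. Domain knowledge and rules are applied to entity relation extraction for low-frequency relation words, which avoids the problem that training corpora cannot be labeled automatically for such words. The categories of the entities before and after the relation word are determined, the lexicon of keywords expressing the relation is expanded, and, with the help of entity category lexicons, the corresponding relation triples are extracted based on the co-occurrence of entity pairs and keywords in text. In addition, rules are learned through association analysis, which can mine a very rich set of relation word templates. 4. Word vectors trained with word2vec are used to judge and extract Chinese entity relations. Using Google's open-source toolkit word2vec together with text data from Baidu Encyclopedia, word vectors are learned and their quality is evaluated experimentally. From the word vectors, a relation matrix for the relation word to be extracted is learned and used to train a classifier, which turns entity relation extraction into a binary classification problem: the classification result determines whether a given relation holds for an entity pair, yielding relation triples.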
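The frequency split in contribution 1 can be illustrated with a minimal Python sketch. The function name `split_relation_words`, the toy triples, and the threshold value are assumptions made for illustration only; the abstract does not state the threshold or the storage format used in the thesis.

```python
from collections import Counter

# Toy knowledge base of (entity 1, relation word, entity 2) triples.
# Only the first triple comes from the abstract; the rest are placeholders.
triples = [
    ("Yao Ming", "birthplace", "Shanghai"),
    ("entity_a", "birthplace", "place_x"),
    ("entity_b", "birthplace", "place_y"),
    ("entity_a", "spouse", "entity_c"),
]


def split_relation_words(triples, threshold):
    """Split relation words into (high_frequency, low_frequency) sets.

    A relation word is high-frequency when its count in the knowledge base
    is strictly greater than `threshold`, and low-frequency otherwise.
    """
    counts = Counter(relation for _, relation, _ in triples)
    high = {relation for relation, n in counts.items() if n > threshold}
    low = set(counts) - high
    return high, low


high, low = split_relation_words(triples, threshold=1)
```

With this toy data, `birthplace` would be routed to the sequence labeling approach and `spouse` to the rule-based approach.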
English Abstract	Entity relation extraction is one of the main tasks of information extraction. The input is multi-structured data, including structured data (infoboxes), semi-structured data (tables and lists), and unstructured data (free text). The output is a set of fact triples extracted from the input. Entity relation triples can be extracted easily from structured and semi-structured data, so current research mainly concerns extracting relation triples from unstructured text. For example, given the sentence "Yao Ming was born in Shanghai" as input, a relation extraction algorithm should extract <Yao Ming, birthplace, Shanghai> from it. These fact triples can be used to build a large, high-quality knowledge base, which can benefit a wide range of NLP tasks, such as question answering, the Semantic Web, and machine translation. A massive amount of Chinese information now exists on the Internet, so research on Chinese entity relation extraction is of considerable significance. However, current research focuses mainly on English resources, and studies conducted on Chinese corpora are scarce. Compared with English, Chinese requires word segmentation, and proper nouns are not capitalized, so Chinese entity relation extraction is more difficult and more challenging. This thesis explores Chinese entity relation extraction approaches, and its main contributions are as follows: 1. A Chinese knowledge base is built. The structured parts of Chinese text from Baidu Encyclopedia and Hudong Encyclopedia are extracted, transformed into triples of the form <entity 1, relation word, entity 2>, and stored. A relation word is identified as a high-frequency relation word when its frequency in the knowledge base exceeds a threshold, and otherwise as a low-frequency relation word. 2. The extraction of high-frequency relation words is converted into a sequence labeling problem. Enough triples containing a high-frequency relation word can be obtained by traversing the knowledge base, and these triples can be used to tag training data automatically. The sentences used for extraction are located in the specified entry page by a keyword matching strategy. A conditional random field model is trained to label the part to be extracted, and the corresponding relation triples are then generated. 3. The method based on some simple rules and knowledge base is...
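The automatic tagging of training data in contribution 2 can be sketched roughly as follows. This is a simplified illustration, not the thesis's procedure: it uses exact string matching instead of the scoring strategy mentioned in the Chinese abstract, treats each character as a token to sidestep word segmentation, and the label names (`B-E1`, `I-E1`, `B-E2`, `I-E2`, `O`) and the function `back_label` are assumptions.

```python
def back_label(sentence, triple):
    """Tag each character of `sentence` using a knowledge-base triple.

    Returns a list of (character, label) pairs, or None when either entity
    of the triple does not occur in the sentence (the sentence is skipped).
    """
    entity1, _relation, entity2 = triple
    labels = ["O"] * len(sentence)
    for entity, tag in ((entity1, "E1"), (entity2, "E2")):
        start = sentence.find(entity)
        if start < 0:
            return None  # entity not mentioned: not a usable training sentence
        labels[start] = "B-" + tag
        for i in range(start + 1, start + len(entity)):
            labels[i] = "I-" + tag
    return list(zip(sentence, labels))


# The example sentence and triple come from the abstract.
tagged = back_label("姚明出生于上海", ("姚明", "出生地", "上海"))
```

Sequences produced this way could then serve as training input for a conditional random field toolkit; the choice of toolkit and feature templates is not specified in the abstract.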
Keywords	Information Extraction; Entity Relation Extraction; Conditional Random Field Model; Knowledge Base; Word Embedding
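Language	Chinese
Document Type	Thesis
Item Identifier	http://ir.ia.ac.cn/handle/173211/7768
Collection	Graduates_Master's Theses
Recommended Citation (GB/T 7714)	Wang Hao. Research on Web-based Chinese Entity Relation Extraction [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2015.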

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
CASIA_2012E801466109 (1783KB)			Not yet open	CC BY-NC-SA