实体链接关键技术研究 (Research on Key Technologies of Entity Linking)


	实体链接关键技术研究 (Research on Key Technologies of Entity Linking)
Alternative title	Research on Key Problems of Entity Linking
	张涛 (Zhang Tao)
	2013-05-27
Degree type	Doctor of Engineering
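Abstract (translated from Chinese)	Named entity ambiguity is the linguistic phenomenon in which the same entity mention refers to different real-world entities in different contexts. It causes serious problems for many information processing tasks: information retrieval, information extraction, and knowledge engineering all need a capable entity disambiguation system behind them, so research on high-performance entity disambiguation has significant academic and practical value. Entity linking is an important approach to named entity ambiguity: it resolves the ambiguity by linking an ambiguous entity mention to an entry in a given knowledge base. This thesis studies the core problem of the entity linking task, namely computing the semantic similarity between an entity mention and its candidate entities. The main work and contributions are summarized as follows:
1. An entity linking method based on a Wikipedia concept language model is proposed, which effectively improves the accuracy of semantic similarity computation between the mention text and candidate entities. The key problem in entity linking is computing the semantic similarity between the mention text and the candidate entities. Traditional bag-of-words similarity measures consider only the surface textual features of the mention and the candidate entity and cannot capture the semantic information inside the text. To make the similarity computation more accurate, this thesis proposes a method based on a Wikipedia concept language model. The mention text and the candidate entity are each mapped into the Wikipedia concept semantic space to obtain their semantic representations in that space. The thesis further gives methods for estimating the concept language model from the structured information in Wikipedia, and designs and implements an entity linking system based on this model. Experiments on the KBP dataset show a 6.1% improvement over the word-based language model method and a 1.8% improvement over the state-of-the-art system.
2. An entity linking method based on a learning-to-rank framework is proposed, together with a Wikipedia concept similarity measure that fuses category relations and link relations, effectively improving the performance of the entity linking system. To make full use of the various kinds of structured information in Wikipedia for semantic similarity computation, this thesis proposes a Wikipedia concept similarity measure that fuses category relations and link relations. First, a Wikipedia concept graph is defined from the structured information among Wikipedia concepts. Then a random walk algorithm on this graph determines the similarity between Wikipedia concepts. On this basis, the thesis designs and implements an entity linking system based on a learning-to-rank framework and incorporates the similarity feature into it, with good results. Experiments on the KBP dataset show a 4.3% improvement over traditional Wikipedia concept similarity measures and results competitive with the state-of-the-art system.
3. A cross-lingual entity linking method based on a bilingual latent topic model is proposed, which avoids the dependence of cross-lingual entity linking on a machine translation system. Traditional cross-lingual entity linking methods usually rely on a statistical machine translation system: they translate the mention text into the language of the knowledge base, turning the problem into conventional monolingual entity linking. The drawback is the demanding training data requirement, typically a large amount of sentence-aligned bilingual parallel text. This thesis proposes a cross-lingual entity linking method based on a bilingual latent topic model. The method mines latent topic information from a large-scale, semantically related bilingual parallel corpus, trains a latent topic model, and then uses the bilingual latent topic model to map the mention text and the candidate entity text into the same latent topic space, where similarity is computed at the topic level. On the KBP evaluation data...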
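The concept-graph similarity in contribution 2 can be illustrated with a small sketch. The abstract does not say which random walk variant the thesis uses, so this example assumes a random walk with restart over an undirected graph whose edges come from both link relations and category relations. The function names, edge weights, restart probability, and toy concepts are all hypothetical.

```python
from collections import defaultdict


def build_concept_graph(link_edges, category_edges, w_link=1.0, w_cat=1.0):
    """Merge link and category relations into one weighted, undirected graph.

    The weights w_link and w_cat are illustrative assumptions.
    """
    graph = defaultdict(dict)
    for edges, weight in ((link_edges, w_link), (category_edges, w_cat)):
        for a, b in edges:
            graph[a][b] = graph[a].get(b, 0.0) + weight
            graph[b][a] = graph[b].get(a, 0.0) + weight
    return graph


def random_walk_with_restart(graph, source, restart=0.15, iters=50):
    """Return the visiting probability of every node for a walk started at source."""
    nodes = list(graph)
    prob = {n: 0.0 for n in nodes}
    prob[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, p in prob.items():
            total = sum(graph[n].values())
            # Spread the non-restart mass over neighbours in proportion to edge weight.
            for m, w in graph[n].items():
                nxt[m] += (1 - restart) * p * w / total
        # The walker jumps back to the source with probability `restart`.
        nxt[source] += restart
        prob = nxt
    return prob


# Toy data (hypothetical, not taken from the thesis).
link_edges = [
    ("Michael Jordan", "Chicago Bulls"),
    ("Chicago Bulls", "NBA"),
    ("Machine learning", "Statistics"),
]
category_edges = [
    ("Michael Jordan", "Category:NBA players"),
    ("Scottie Pippen", "Category:NBA players"),
]

graph = build_concept_graph(link_edges, category_edges)
similarity = random_walk_with_restart(graph, "Michael Jordan")
```

Here `similarity[c]` serves as the relatedness of concept `c` to the source concept. Concepts reachable only through a shared category still receive a nonzero score, which is the point of fusing the two relation types.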
Abstract (English)	Named entity ambiguity means that the same entity mention can refer to different entities in different contexts. It causes serious problems across the information processing community, including in machine translation and information extraction. Entity linking is one approach to resolving named entity ambiguity. The task of an entity linking system is to link an entity mention in a background source document to the corresponding real-world entity in an existing knowledge base. Research on entity linking has great academic and practical value in knowledge engineering, information retrieval, and natural language processing. This thesis focuses on the key problem of entity linking: the semantic similarity between the context of an entity mention and a candidate entity. The main work and contributions of this thesis are summarized as follows:
1. A concept-based language model is proposed for the entity linking task. To overcome the limitations of the traditional bag-of-words (BOW) method and obtain a better semantic relatedness measure between the context of an entity mention and a candidate entity, this thesis proposes a concept-based language model for entity linking. This language model represents both the query and the entity using Wikipedia concepts instead of single words. The concepts are taken from a very comprehensive, human-defined ontology, Wikipedia. We believe that mapping the query and the entity to high-level concepts will result in a model that is less dependent on the specific terms used in the query text and in the entity's document. It can yield matches even when the same concept is described by different terms in the query and the entity. To better capture the semantic knowledge in Wikipedia's structural information, we develop two methods to estimate the concept language model for the entity. One is based on the link structure between the entity and the Wikipedia concept. The other is based on the category information of the entity. To evaluate the effectiveness of the proposed method, we conduct experiments on the standard KBP datasets. Experimental results show that the proposed method obtains a 6.1% improvement over the traditional word-based language model. Compared with the state-of-the-art approach, the proposed method also achieves a 1.8% improvement.
2. A learning-to-rank framework is proposed for the entity linking task. In order to capture the structure information from Wikipedia to better estimate the semantic relate...
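The concept-based language model in contribution 1 can be sketched as query-likelihood scoring over Wikipedia concepts instead of words. This is a minimal illustration, not the thesis's estimator: it assumes the mention context and each candidate have already been mapped to lists of concepts, and it assumes Jelinek-Mercer smoothing against an add-one background model. The function names, the value of `lam`, and the toy concepts are hypothetical.

```python
import math
from collections import Counter


def concept_lm_score(query_concepts, entity_concepts, background, lam=0.8):
    """Log-likelihood of the query's concepts under the entity's smoothed concept model."""
    entity_counts = Counter(entity_concepts)
    entity_total = sum(entity_counts.values())
    bg_total = sum(background.values())
    score = 0.0
    for concept in query_concepts:
        p_entity = entity_counts[concept] / entity_total if entity_total else 0.0
        # Add-one background estimate keeps unseen concepts from scoring log(0).
        p_bg = (background.get(concept, 0) + 1) / (bg_total + len(background) + 1)
        score += math.log(lam * p_entity + (1 - lam) * p_bg)
    return score


def link(query_concepts, candidates, lam=0.8):
    """Return the candidate entity whose concept model best explains the query."""
    background = Counter()
    for concepts in candidates.values():
        background.update(concepts)
    return max(
        candidates,
        key=lambda e: concept_lm_score(query_concepts, candidates[e], background, lam),
    )


# Toy data (hypothetical): concepts assumed to be extracted beforehand.
query = ["Basketball", "Chicago Bulls", "NBA"]
candidates = {
    "Michael Jordan (basketball)": ["NBA", "Chicago Bulls", "Basketball", "Slam dunk"],
    "Michael I. Jordan": ["Machine learning", "Statistics", "University of California, Berkeley"],
}

best = link(query, candidates)
print(best)
```

Because the score is computed over concepts, two texts that name the same concept with different surface terms can still match once both are mapped to that concept.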
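Keywords	Named Entity Disambiguation; Entity Linking; Language Model; Information Retrieval; Topic Model
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/6522
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	张涛 (Zhang Tao). 实体链接关键技术研究 (Research on Key Technologies of Entity Linking)[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2013.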

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20091801462807 (1596KB)			Not yet open	CC BY-NC-SA