Semantics-Based Analysis of Text Correlation


	Semantics-Based Analysis of Text Correlation (基于语义的文本关联性分析)
Alternative title	Correlation between Texts Based on Semantic Analysis
	刘贤达 (Liu Xianda)
	2011-05-27
Degree type	Master of Engineering
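Abstract (translated from Chinese)	With the rapid growth of online information, improving the ability of information retrieval systems to process natural language has become a research focus. Text correlation computation is a fundamental technique in information retrieval and directly affects the quality of retrieval results. Traditional methods based on matching word strings are no longer adequate for today's complex language correlation problems. This thesis therefore proposes a semantics-based method for analyzing text correlation: with semantics at its core, it builds a keyword network between texts and analyzes the semantic correlation between them. The main contents are:
1. Building the keyword network. The thesis analyzes the elements and structure of papers, introduces keyword features, and details the basic idea and computation of features including first position, position of first occurrence (POS), term frequency, TF×IDF, part of speech, and document length. It discusses four common keyword extraction methods and, given the resources available, adopts a statistics-based method. Finally, it defines the keyword network and its "core word", "leaf word", and "potential word" nodes.
2. Studying and explaining two knowledge representation systems: HowNet and the Conceptual Knowledge Tree. In HowNet, the sememe is the basic unit of expression and each word sense is composed of sememes; HowNet describes every concept with a knowledge description language. In the Conceptual Knowledge Tree, the concept is the basic unit of expression and is described through its attributes, relations, and behaviors. The two systems are combined to analyze the semantics of automation-discipline vocabulary.
3. Analyzing text correlation. First, an improved HowNet-based algorithm for word similarity is proposed. For similarity between sememes, the improved algorithm accounts for the depth of the concept hierarchy tree and its regional density. For similarity between word senses, the sememe weighting problem is handled case by case. Next, the structure of automation-discipline terms is analyzed, and algorithms are proposed for determining their semantics and for computing the similarity between them. Finally, combined with the keyword network, a semantic analysis algorithm for text correlation is proposed.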
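As a rough illustration of two of the statistical keyword features named in the abstract (TF×IDF and position of first occurrence), the following is a minimal sketch. It is not the thesis's implementation: the function names, the normalization by document length, and the plain `log(N / df)` weighting are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed definitions (not taken from the thesis):
      tf  = occurrences of the term in the document / document length
      idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def first_position(doc, term):
    """Relative position of the term's first occurrence (0.0 = first token)."""
    return doc.index(term) / len(doc)


def top_keywords(doc_scores, k=3):
    """Pick the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(doc_scores, key=lambda t: (-doc_scores[t], t))[:k]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["semantic", "network", "keyword", "network"],
    ["semantic", "retrieval"],
    ["semantic", "similarity"],
]
scores = tf_idf_scores(docs)
print(top_keywords(scores[0], k=2))
print(first_position(docs[0], "keyword"))
```

A term that appears in every document, such as "semantic" in the toy corpus, receives an IDF of zero under this weighting, which is why it drops out of the keyword candidates.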
Abstract (English)	With the rapid growth of information on the Internet, improving the performance of information retrieval (IR) systems through Natural Language Processing (NLP) has become a research focus. As a fundamental technique in IR systems, correlation computation between texts directly affects retrieval results. However, the traditional way to compute correlation is keyword string matching, which is of little help with complex text correlation problems. This thesis therefore addresses the correlation between papers in the automation discipline through semantic analysis. The main contents are:
1. Build Keyword Networks. First, I analyze the elements and structures of papers. Then I introduce the characteristics of keywords and explain five of them in detail: the first position, the term sequence, the TFIDF value, the part of speech, and the document length. I also discuss four common ways of extracting keywords and decide to use the statistics-based method. Finally, I define keyword networks and put forward three kinds of nodes in them: the "core-word" node, the "leaf-word" node, and the "potential-word" node.
2. Explain the Structure of Knowledge Representation: HowNet and Conceptual Knowledge Tree. In HowNet, the sememe is the unit of semantic meaning and a concept is made up of sememes. Each concept is expressed in a knowledge representation language. In the Conceptual Knowledge Tree, attributes, relations, and behaviors are used to describe a concept. I use the two knowledge representation systems mentioned above to analyze automation discipline words on the basis of semantics.
3. Analyze Correlation between Texts. First, I use HowNet as the semantic representation foundation to compute the similarity between common words. I improve the algorithms for computing similarity between sememes and between concepts. When computing similarity between sememes, I take the height and density of the sememe tree into consideration. When computing similarity between concepts, I solve the sememe weighting problem by treating each case separately. Second, I analyze the structure of automation discipline words, put forward an algorithm that determines their semantic meaning with the help of the Conceptual Knowledge Tree, and then compute the similarity between automation discipline words. Finally, I present the algorithm for computing correlation between papers: map papers...
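For orientation, a widely used baseline for HowNet-style sememe similarity turns the path distance d between two sememes in the hierarchy into a score α / (d + α). The sketch below implements only that baseline on a toy tree. The thesis's improved algorithm, which additionally accounts for tree depth and regional density, is not specified in this record and is not reproduced here. The tree contents, the function names, and the value of α are illustrative assumptions.

```python
ALPHA = 1.6  # adjustable smoothing parameter; illustrative value, not from this record


def ancestors(node, parent):
    """Path from a node up to its root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def path_distance(a, b, parent):
    """Edges between a and b through their lowest common ancestor.

    Returns None when the two nodes share no ancestor (different trees).
    """
    steps_from_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for steps_from_a, n in enumerate(ancestors(a, parent)):
        if n in steps_from_b:
            return steps_from_a + steps_from_b[n]
    return None


def sememe_similarity(a, b, parent, alpha=ALPHA):
    """Baseline distance-based similarity: alpha / (d + alpha)."""
    d = path_distance(a, b, parent)
    if d is None:
        return 0.0  # assumption: unrelated trees get zero similarity
    return alpha / (d + alpha)


# Hypothetical toy hierarchy (child -> parent); not real HowNet entries.
parent = {"animal": "entity", "plant": "entity", "dog": "animal", "cat": "animal"}
print(sememe_similarity("dog", "cat", parent))
print(sememe_similarity("dog", "plant", parent))
```

Under this baseline, siblings score higher than nodes that meet only at the root, and identical sememes score 1. A depth or density adjustment of the kind the abstract mentions would modify how d or the final score is weighted.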
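Keywords	Correlation; HowNet; Conceptual Knowledge Tree; Knowledge Representation System; Keyword Network; Similarity
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7560
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	Liu Xianda. Semantics-Based Analysis of Text Correlation [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2011.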

Files in This Item
File name/size	Document type	Version type	Access type	License
CASIA_20082800902901 (1713KB)			Restricted access	CC BY-NC-SA