Title: 基于语义的文本关联性分析 (Semantics-Based Analysis of Text Correlation)
Alternative title: Correlation between Texts Based on Semantic Analysis
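Author: 刘贤达 (Liu Xianda)
Degree type: Master of Engineering
Advisor: 杨一平 (Yang Yiping)
2011-05-27
Degree-granting institution: Graduate University of the Chinese Academy of Sciences
Place of conferral: Institute of Automation, Chinese Academy of Sciences
Major: Computer Application Technology
Keywords: correlation; HowNet; Conceptual Knowledge Tree knowledge representation system; keyword network; similarity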
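Abstract: With the rapid growth of online information, improving the ability of information retrieval systems to process natural language has become a research focus. Text correlation computation is a basic technique in information retrieval and directly affects the quality of retrieval results. Traditional methods based on matching word strings are no longer adequate for today's complex language correlation problems. This thesis therefore proposes a semantics-based method for text correlation analysis: with semantics at its core, it builds a keyword network between texts and analyzes their semantic correlation. The main contents are:

1. Building the keyword network. The thesis analyzes the elements and structure of papers, introduces keyword features, and explains in detail the basic idea and computation of the first-position feature, the first-occurrence position feature (POS), term frequency, TF×IDF, part of speech, and document length. It discusses four common keyword extraction methods and, given the available resources, adopts a statistics-based extraction method. Finally, it defines the keyword network and its "core word", "leaf word", and "potential word" nodes.

2. Studying and explaining two knowledge representation systems: HowNet and the Conceptual Knowledge Tree. In HowNet, the sememe is the basic unit of expression, and senses are composed of sememes; HowNet describes each concept through a knowledge description language. In the Conceptual Knowledge Tree, the concept is the basic unit of expression, and each concept is described in terms of attributes, relations, and behaviors. The thesis combines the two systems to analyze the semantics of automation discipline vocabulary.

3. Analyzing text correlation. The thesis first proposes an improved HowNet-based algorithm for similarity between words. The improved sememe similarity computation takes into account the depth of the concept hierarchy tree and its regional density. The improved sense similarity computation handles the weighting of sememes through case-by-case analysis. The thesis then analyzes the structure of automation discipline terms and proposes an algorithm for determining their semantics and an algorithm for computing the similarity between them. Finally, building on the keyword network, it proposes a semantic analysis algorithm for text correlation.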
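As an illustration of the TF×IDF feature named in the abstract, the following is a minimal sketch of the textbook formulation (term frequency multiplied by the logarithm of inverse document frequency). It is not taken from the thesis and may differ from the weighting the author used; the function names `tf_idf` and `top_keywords` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumption: plain TF x log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

def top_keywords(docs, k=3):
    """Pick the k highest-scoring terms per document (ties broken alphabetically)."""
    return [
        [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for s in tf_idf(docs)
    ]

# Toy corpus (hypothetical): each document is already split into words.
docs = [
    ["semantic", "similarity", "keyword", "network", "semantic"],
    ["keyword", "extraction", "statistics", "keyword"],
    ["control", "system", "keyword", "feedback"],
]
print(top_keywords(docs, k=1))
```

A term that occurs in every document, such as "keyword" in the toy corpus, receives a score of zero, which is why a statistics-based extractor of this kind favors terms that are frequent in one text but rare across the collection.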
Alternative abstract: With the rapid growth of information on the Internet, improving the performance of information retrieval (IR) systems through natural language processing (NLP) has become a research focus. As a fundamental technique in IR systems, the computation of correlation between texts directly affects retrieval results. However, the traditional way to compute correlation is keyword string matching, which is inadequate for complex text correlation problems. This thesis therefore addresses the correlation between automation discipline papers on the basis of semantic analysis. The main contents are:

1. Build keyword networks. First, I analyze the elements and structure of papers. Then I introduce the characteristics of keywords and explain five of them in detail: the first position, the term sequence, the TF-IDF value, the part of speech, and the document length. I also discuss four common ways of extracting keywords and decide to use the statistics-based method. Finally, I define keyword networks and put forward three kinds of nodes in them: the "core-word" node, the "leaf-word" node, and the "potential-word" node.

2. Explain the structure of knowledge representation: HowNet and the Conceptual Knowledge Tree. In HowNet, the sememe is the unit of semantic meaning, and a concept is made up of sememes. Each concept is expressed in a knowledge representation language. In the Conceptual Knowledge Tree, attributes, relations, and behaviors are used to describe a concept. I use these two knowledge representation systems to analyze automation discipline words semantically.

3. Analyze correlation between texts. First, I use HowNet as the semantic representation foundation to compute similarity between common words, and I improve the algorithms for computing similarity between sememes and between concepts. When computing similarity between sememes, I take the height and density of the sememe tree into consideration. When computing similarity between concepts, I solve the problem of sememe weighting by treating each case separately. Second, I analyze the structure of automation discipline words, put forward an algorithm that determines their semantic meaning with the help of the Conceptual Knowledge Tree, and then compute the similarity between automation discipline words. Finally, I present the algorithm for computing correlation between papers: map papers...
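Holding number: XWLW1615
Other identifier: 200828009029013
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/7560
Collection: Graduates_Master's theses
Recommended citation
GB/T 7714
刘贤达 (Liu Xianda). 基于语义的文本关联性分析 [Correlation between Texts Based on Semantic Analysis][D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences, 2011.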
Files in this item
File name/size: CASIA_20082800902901 (1713KB)
Access type: not currently open
License: CC BY-NC-SA