面向信息检索的语义计算技术 (Semantic Computing Techniques for Information Retrieval)
Alternative title: Semantic Computing in Information Retrieval
金千里
2004-05-01
Degree type: Master of Engineering
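Chinese abstract: Information retrieval covers the organization, presentation, querying, and access of information, and gives people a fast and accurate way to obtain the information they need. It usually refers to text retrieval, whose core task is to find the texts relevant to a user query and which rests on two key technologies: "indexing" and "similarity computation". As the information society, and the Internet in particular, develops, users' demands on retrieval keep rising. Traditional retrieval based on keyword matching often suffers from low recall, low precision, and poor overall quality. Intelligent retrieval has therefore become a research hotspot and will be one of the core technologies supporting the next-generation Internet.
Because most text is written in non-formalized natural language, the key to intelligent retrieval is to understand natural language to some degree and to uncover the "semantics" hidden behind the text. Judging from the state of research, word-based semantic models are a fairly good shallow semantic representation and have already seen many successful applications. One way to bring intelligent techniques into information retrieval is therefore to introduce word-based semantic models into the two key technologies, indexing and similarity computation, and to use shallow semantics to guide the retrieval process and improve retrieval accuracy. This is the motivation and the focus of this thesis.
The thesis first briefly reviews the state of research on information retrieval and semantic models, and explains why combining the two is necessary and reasonable. It then discusses the application of three classes of semantic models (latent semantic indexing, semantic trees, and semantic tensors) to information retrieval. Finally, it presents the framework, modules, and implementation of the information retrieval system of the National Laboratory of Pattern Recognition (NLPR), and uses the TREC evaluation to test the system's functionality and performance. In summary, the main work of the thesis is as follows.
(1) Discussed how semantic models can be combined with the two key technologies of information retrieval, indexing and similarity computation.
(2) Improved the latent semantic indexing model by proposing a weakly supervised statistical latent semantic indexing model, which gives a more reasonable distribution of the semantic space and is more efficient. The model can be applied on a small scale to query topic term construction.
(3) Proposed a semantic space model based on semantic trees. The semantic space is no longer static but is built in real time, and its flexibility and operability exceed those of the various latent semantic indexing models. In query topic term expansion in particular, it outperforms common expansion algorithms.
(4) Proposed the concept of the semantic tensor, clarified its physical meaning, and summarized it in two core ideas. These ideas are then expressed through a series of window models and applied to computing the similarity between queries and texts. Experiments show that these models are more effective than the traditional vector model.
(5) Built the NLPR retrieval system framework and completed the module design and programming. Besides modules related to retrieval techniques such as indexing and similarity computation, the system includes language processing techniques such as Chinese word segmentation and English lemmatization.
(6) Participated in the 2003 TREC evaluation (Robust Track and Nov…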
English abstract: Information retrieval, which covers the organization, representation, querying, and access of information, supplies a set of technologies for obtaining information rapidly and accurately. Information retrieval systems usually focus on text retrieval and find relevant documents based on users' queries. Two key technologies are involved: "Indexing" and "Similarity Computing". Traditional retrieval methods based on keyword matching often yield low precision. With the development of the information society and the World Wide Web, these methods can no longer satisfy users' requirements. Intelligent retrieval has therefore become a research hotspot and will be a key technology in the next generation of the World Wide Web.
In most cases, the content of a text is expressed in natural language. The key challenge of intelligent retrieval is natural language understanding, that is, finding the meaning behind the text. We believe that word-based semantic models are suitable for representing the shallow meaning of a text. We therefore use word-based semantic models to guide the retrieval process in order to improve retrieval performance.
The thesis first gives a brief introduction to the background of information retrieval and semantic models. Then three kinds of semantic models (the Latent Semantic Indexing series, the Semantic Tree, and the Semantic Tensor) are proposed and evaluated in the context of information retrieval. After that, we present the NLPR IR System, including its architecture, module definitions, and implementation. TREC evaluations are used to evaluate the system. In summary, the contributions of the thesis are as follows.
(1) It explains how to use word-based semantic models to guide the process of information retrieval.
(2) Building on the earlier LSI and PLSI, weakly-supervised probabilistic latent semantic indexing (SPLSI) is presented and evaluated. It yields a more reasonable semantic space and can be used in the indexing process.
(3) The Semantic Tree Model (STM) is developed to create a dynamic, flexible, controllable, and real-time semantic space. As a new indexing technology, STM outperforms most existing methods.
(4) The Semantic Tensor is put forward as a new theory, expressed by two key notions. Three window-based models of this theory are developed to compute the similarities between documents and queries. The experiments show that they outperform the traditional word-based vector space models.
(5) We build the NLPR IR System, including architecture design, module definitions, and implementation.
(6) We participate in the 2003 TREC Evaluation (Robust Track and Novelty Track) in order to test the NLPR IR System, and obtain excellent results in the Novelty Track.
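For orientation, the two key technologies named in the abstract, indexing and similarity computing, can be illustrated with the traditional word-based vector space baseline that the thesis compares against. The following is a minimal sketch of that baseline only, not of the thesis's semantic models: the function names (`index`, `cosine`, `rank`) and the toy documents are assumptions made for illustration, and plain term frequency stands in for whatever weighting the NLPR system actually used.

```python
import math
from collections import Counter


def index(text):
    """Indexing: map a text to a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Similarity computing: cosine between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query; best match first."""
    q = index(query)
    scored = [(cosine(q, index(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical toy collection (not from the thesis).
docs = [
    "car engine repair manual",
    "automobile motor maintenance guide",
    "information retrieval evaluation",
]
ranking = rank("car repair", docs)
```

In this toy collection the second document concerns the same topic as the query but shares no keyword with it, so pure keyword matching scores it exactly as low as the unrelated third document. This is the kind of recall failure that the abstract attributes to keyword matching and that word-based semantic models are meant to reduce.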
Keywords: Information Retrieval; Text Retrieval; Semantic Model; Semantic Computing; TREC Evaluation
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/6762
Collection: Graduates_Master's Theses
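Recommended citation
GB/T 7714
Jin Qianli (金千里). Semantic Computing Techniques for Information Retrieval (面向信息检索的语义计算技术) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences, 2004.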