Semantic Computing Techniques for Information Retrieval


	Semantic Computing Techniques for Information Retrieval
Alternative title	Semantic Computing in Information Retrieval
	金千里
	2004-05-01
Degree type	Master of Engineering
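Abstract (translated from Chinese)	Information retrieval covers the organization, presentation, querying, and access of information, and gives people a fast, precise way to obtain the information they need. It usually refers to text retrieval, whose core task is to find the texts relevant to a user's query and which rests on two key techniques: "indexing" and "similarity computation". With the growth of the information society, and of the Internet in particular, demands on retrieval keep rising. Traditional retrieval based on keyword matching often suffers from incomplete recall, inaccurate results, and low overall retrieval quality. Intelligent retrieval has therefore become a research hotspot and will be one of the core technologies supporting the next-generation Internet. Because most text is written in informal natural language, the key to intelligent retrieval is to understand natural language to some degree and to uncover the "semantics" hidden behind the text. Judging from the current state of research, word-based semantic models are a fairly good way to represent shallow semantics and have already seen many successful applications. One way to bring intelligent techniques into information retrieval is therefore to introduce lexical semantic models into the two key techniques, "indexing" and "similarity computation", and to use shallow semantics to guide the retrieval process and improve retrieval precision. This is the motivation and focus of this thesis. The thesis first briefly reviews the state of research on information retrieval and semantic models and explains why combining the two is necessary and reasonable. It then discusses the application of three kinds of semantic models (latent semantic indexing, semantic trees, and semantic tensors) to information retrieval. Finally, it describes the framework, modules, and implementation of the information retrieval system of the National Laboratory of Pattern Recognition (NLPR), and uses the TREC evaluation to test the system's functionality and performance. In summary, the main work of the thesis is as follows. (1) It discusses how semantic models can be combined with the two key techniques of information retrieval ("indexing" and "similarity computation"); (2) It improves the latent semantic indexing model by proposing a weakly supervised statistical latent semantic indexing model, which yields a more reasonable distribution of the semantic space and higher efficiency. This model can be applied on a small scale to query topic-term construction; (3) It proposes a semantic space model based on semantic trees. The semantic space is no longer static but built in real time, and its flexibility and operability exceed those of the various latent semantic indexing models. In query topic-term expansion in particular, it outperforms common expansion algorithms; (4) It proposes the concept of the semantic tensor, clarifies its physical meaning, and summarizes it in two core ideas. It then expresses these two ideas through a series of window models and applies them to computing the similarity between queries and texts. Experiments show that these models are more effective than the traditional vector model; (5) It builds the NLPR retrieval system framework and completes the module design and programming. Besides modules related to retrieval techniques such as indexing and similarity computation, the system includes language processing techniques such as Chinese word segmentation and English lemmatization; (6) By participating in the 2003 TREC evaluation (Robust Track and Nov...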
English abstract	Information retrieval, which covers organizing, representing, querying, and accessing information, supplies a set of technologies for obtaining information rapidly and accurately. Information retrieval systems usually focus on text retrieval and find relevant documents based on users' queries. Two key technologies are involved: "indexing" and "similarity computing". Traditional retrieval methods based on keyword matching often yield low precision. With the development of the information society and the World Wide Web, these methods can no longer satisfy users' requirements. Intelligent retrieval has become a research hotspot and will be a key technology in the next generation of the World Wide Web. In most cases, the content of a text is expressed in natural language. The key challenge of intelligent retrieval is natural language understanding, that is, finding the meaning behind the text. We believe that word-based semantic models are suitable for representing the shallow meaning of text. We therefore use word-based semantic models to guide the retrieval process in order to improve retrieval performance. The thesis first gives a brief introduction to the background of information retrieval and semantic models. Then three kinds of semantic models (the Latent Semantic Indexing series, Semantic Tree, and Semantic Tensor) are proposed and evaluated in the field of information retrieval. After that, we present the NLPR IR System, including its architecture, module definitions, and implementation. TREC evaluations are used to evaluate the system. In summary, the contributions of the thesis are as follows.
(1) It explains how to use word-based semantic models to guide the process of information retrieval; (2) Building on LSI and PLSI, weakly supervised probabilistic latent semantic indexing (SPLSI) is presented and evaluated; it yields a more reasonable semantic space and can be used in the indexing process; (3) The Semantic Tree Model (STM) is developed to create a dynamic, flexible, controllable, real-time semantic space. As a new indexing technology, STM outperforms most existing methods; (4) The Semantic Tensor is put forward as a new theory, expressed by two key notions. Three window-based models of this theory are developed to compute the similarity between documents and queries. Experiments show that they outperform traditional word-based vector space models; (5) We build the NLPR IR System, including architecture design, module definitions, and implementation; (6) We participated in the 2003 TREC evaluation (Robust Track and Novelty Track) to test the NLPR IR System and obtained excellent results in the Novelty Track.
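The keyword-matching baseline that the abstract contrasts with semantic models can be sketched as follows. This is a minimal illustration, assuming a whitespace tokenizer, smoothed TF-IDF weights, and cosine similarity; the function names and toy documents are hypothetical, and none of it reproduces the thesis's SPLSI, STM, or window-based models, whose details are not given in this record.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Indexing step: map each document to a sparse TF-IDF term vector."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in counts.items()} for counts in term_counts]
    return vectors, idf


def cosine(u, v):
    """Similarity step: cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = build_index(docs)
    q_counts = Counter(tokenize(query))
    # Query terms that never occur in the collection are dropped.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical toy collection, not data from the thesis.
docs = [
    "semantic models for text retrieval",
    "the car is fast",
    "automobile engines and fuel",
]
ranked = rank("car", docs)
```

With these toy documents, the query "car" matches only the document that contains that exact word; the document about "automobile" scores zero, which is the kind of recall gap that semantic models aim to close.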
Keywords	Information Retrieval; Text Retrieval; Semantic Model; Semantic Computing; TREC Evaluation
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6762
Collection	Graduates_Master's Theses
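Recommended citation (GB/T 7714)	金千里. Semantic Computing Techniques for Information Retrieval [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences, 2004.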