文本分类和检索研究
Alternative Title: Research on Text Retrieval and Text Categorization
李小兵
Subtype: Master of Engineering
Thesis Advisor: 台宪青
2005-04-30
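Degree Grantor: Graduate School of the Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Computer Application Technology
Keyword: Text Retrieval; Text Categorization; Conceptual Network; Knowledge Tree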
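Abstract: With the rapid development of the Internet, the volume of electronic information is growing geometrically. Faced with this vast sea of information, more and more people urgently need a way to find what they are looking for. Text accounts for a large share of electronic information, so research on text retrieval and classification has significant theoretical value and meets a practical need. This thesis studies text retrieval and text classification for large-scale Chinese text. First, on text retrieval: Chinese text retrieval has several models, including the Boolean logic model, the statistics-based vector space model (VSM), probabilistic models, and models based on semantic networks. Building on an analysis of these models, the thesis uses a conceptual network as a tool to explore text retrieval, describing how to organize domain knowledge with a conceptual network and how to apply that knowledge to retrieval. Second, on automatic text classification: most current text classification systems are based on the VSM, that is, a text is represented as a vector and its category is decided by computing distances between vectors. Because the VSM generally ignores relationships among features and relationships in the structure of the text, which leads to inaccurate classification, the thesis studies a text classification method based on a knowledge tree. The method imitates how people behave when classifying and uses the knowledge organized in the knowledge tree as the basis for classification. When computing the degree of association between a text and a category, it takes the structural information of the text into account and weights keywords dynamically. Experimental results show that, compared with the KNN classification method based on the vector space model, this method markedly improves classification recall. The results also indicate that classification performance can improve further as the knowledge tree is further refined.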
Other Abstract: With the rapid development of the Internet, the amount of electronic information is growing tremendously every day. How to obtain useful information from this huge volume of e-information is an urgent problem. Text holds a very important place in e-information, and research on text retrieval and text categorization has great value both in theory and in practice. In this thesis, we focus on text retrieval and automatic text categorization. First, we concentrate on text retrieval. There are many models for Chinese text retrieval: Boolean indexing, the statistics-based vector space model (VSM), probabilistic retrieval, retrieval based on semantic networks, and so on. After analyzing these models, we explore text retrieval using the conceptual network as a tool, examining how to organize domain knowledge with a conceptual network and how to use that domain knowledge in text retrieval. Second, we study automatic text categorization. Most current text categorization systems are based on the VSM: the text is expressed as a vector, and the class it belongs to is determined by the distance between vectors. Because the VSM does not take the relationships between features into account, the results are sometimes imprecise. To address this, a text categorization algorithm based on a knowledge tree is proposed. It simulates human behavior in text classification and uses the knowledge tree as the basis for categorizing text. When computing the degree of association between a text and a class, it considers the structure of the text and applies dynamic weighting to the keywords. The experiments show that this algorithm achieves better recall than the VSM-based KNN algorithm. They also show that better results can be obtained as the knowledge tree is further refined.
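Illustrative note (not part of the original record): the abstract describes the VSM-based baseline only in words, namely representing a text as a vector and assigning the class by distance between vectors, with KNN as the comparison method. The following is a minimal sketch of that general idea, not the thesis's implementation. It assumes whitespace tokenization, raw term-frequency weights, cosine similarity, and a toy English training set; all function names and data are hypothetical. Chinese text would additionally need word segmentation, which is not shown.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a sparse term-frequency vector (term -> count).

    Assumption: whitespace tokenization; real Chinese text needs a segmenter.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(text, labeled_docs, k=3):
    """Label a text by majority vote among its k most similar training docs."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), label) for doc, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training data, for illustration only.
TRAINING = [
    ("stock market shares trading", "finance"),
    ("bank interest rate loan", "finance"),
    ("market price of shares falls", "finance"),
    ("football match goal score", "sports"),
    ("basketball team wins match", "sports"),
    ("team score in final match", "sports"),
]

if __name__ == "__main__":
    print(knn_classify("shares trading on the market", TRAINING, k=3))
```

Each term is treated as an independent dimension here, which is the limitation the abstract attributes to the VSM: relationships between features and the structure of the text play no role in the similarity score.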
Shelf Number: 872
Other Identifier: 14605T302
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/6882
Collection: Graduates_Master's Theses
Recommended Citation
GB/T 7714
李小兵. 文本分类和检索研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2005.
Files in This Item:
There are no files associated with this item.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.