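Research on Text Classification Algorithms Based on Word-Relationship Analysis (基于词间关系分析的文本分类算法研究)
Alternative Title: Research of Text Classification Algorithm based on the Relationship Analysis between Words
Author: Wu Shuang (吴双)
Subtype: Master of Engineering
Thesis Advisor: Zhang Wensheng (张文生)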
2011-05-25
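Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Software Engineering
Keyword: Text Classification; Relationship Between Words; Association Rule; Correlation Analysis; Voting Method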
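Abstract: With the rapid growth of Internet resources, people care not only about the validity of information but increasingly about the cost of obtaining it. Acquiring valid information accurately and applying valuable information efficiently have therefore become very important. Text classification is a key technology for information retrieval and knowledge mining, and making classification more accurate has become a research focus in information retrieval.
A major difficulty in text classification is improving how well feature terms represent document content. The traditional vector space model treats independent words as semantic units; it does not adequately reveal the relationships between words and struggles to highlight the features that play a key role in a document's content. Current work on word-relationship analysis concentrates on term-frequency-based co-occurrence rate computation and on building standalone associative classifiers, and has not yet formed a comprehensive body of research.
Starting from an analysis of the features of document categories, this thesis raises the topic of studying relationships between words in text classification and analyzes those relationships from two angles: association and correlation.
The main work and contributions of this thesis are:
1. To address the shortcomings of traditional feature selection methods, a new feature selection algorithm based on relationships between words is proposed. The method takes the positions at which keywords occur into account, uses an association rule mining algorithm to discover associations between words, and filters the strong association rules through nonlinear correlation analysis, finally producing a feature space closely related to the category attribute. Experimental results show that the method outperforms traditional feature selection methods in classification accuracy.
2. For the linear analysis of correlation between words, linear least squares fit (LLSF) classification is used to complement the K-nearest-neighbor (KNN) classifier, forming a combined LLSF-KNN classification algorithm, and the voting function of the KNN algorithm is improved. Experimental results show that the new combined classification algorithm achieves good classification efficiency and results.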
Other Abstract: With the rapid development of Internet resources, people are paying more attention to information. It is therefore very important to access and use information effectively. Text classification is a key technology in information retrieval and knowledge mining, and how to improve classification performance has become a research focus in information retrieval.
For text classification, improving the representation capability of features is the key research problem. The vector space model takes the separate word as its unit, so the words that are key to a document's content and the associations between words are not captured. Few studies have so far addressed relationship analysis between words.
Proceeding from an analysis of document category features, we put forward the subject of research on the relations between words and analyze it in two aspects: association and correlation.
The work done in this thesis includes:
1. To address the shortcomings of traditional feature selection, a new feature selection algorithm based on association rules and keywords is presented. The algorithm checks association rules by nonlinear correlation analysis to produce a feature space that is closely related to the category attribute. The experiments indicate that this method gives better categorization results than the traditional one.
2. A linear analysis of term correlation and a text classifier combining the LLSF and KNN classifiers are proposed. Furthermore, a new voting method for KNN is designed. The experimental results show that the new classifier achieves higher classification accuracy and efficiency.
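As a rough illustration of the association-rule step mentioned in the abstract, the sketch below computes the standard support and confidence measures for word-to-word rules over a small document collection. It is not the thesis's algorithm: the function name `word_pair_rules`, the thresholds, and the toy documents are assumptions made for this example, and the keyword-position weighting and nonlinear correlation filtering described in the abstract are not shown.

```python
from collections import Counter
from itertools import combinations


def word_pair_rules(docs, min_support=0.5, min_confidence=0.8):
    """Return rules (a, b, support, confidence), each meaning 'a -> b'.

    support(a -> b)    = fraction of documents containing both a and b
    confidence(a -> b) = (documents with a and b) / (documents with a)
    """
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]

    # Count the documents containing each word and each unordered word pair.
    word_count = Counter(w for s in doc_sets for w in s)
    pair_count = Counter(
        pair for s in doc_sets for pair in combinations(sorted(s), 2)
    )

    rules = []
    for (a, b), both in pair_count.items():
        support = both / n_docs
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for head, tail in ((a, b), (b, a)):
            confidence = both / word_count[head]
            if confidence >= min_confidence:
                rules.append((head, tail, support, confidence))
    return sorted(rules)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["stock", "market", "price"],
    ["stock", "market"],
    ["stock", "price"],
    ["football", "match"],
]
rules = word_pair_rules(docs, min_support=0.5, min_confidence=0.8)
```

On this toy corpus the rules that survive both thresholds are "market -> stock" and "price -> stock"; the reverse rule "stock -> market" is dropped because only two of the three documents containing "stock" also contain "market".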
Shelf Number: XWLW1624
Other Identifier: 200828009029075
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/7567
Collection: Graduates_Master's Theses
Recommended Citation
GB/T 7714
吴双. 基于词间关系分析的文本分类算法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2011.
Files in This Item:
File Name/Size: CASIA_20082800902907 (1109KB)
Access: Not yet open
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.