Research of Text Classification Algorithm based on the Relationship Analysis between Words

Title	基于词间关系分析的文本分类算法研究
Alternative title	Research of Text Classification Algorithm based on the Relationship Analysis between Words
Author	吴双 (Wu Shuang)
	2011-05-25
Degree type	Master of Engineering
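Abstract (translated from Chinese)	With the rapid growth of Internet resources, people care not only about the validity of information but increasingly about the cost of obtaining it. Accurately acquiring valid information and efficiently using valuable information have therefore become very important. Text classification is a key technology for information retrieval and knowledge mining, and making classification more accurate has become a research focus in information retrieval. A major difficulty in text classification is improving how well feature terms represent document content. The traditional vector space model treats independent words as semantic units; it does not adequately reveal the relationships between words and has difficulty highlighting the features that play a key role in a document's content. At present, analysis of relationships between words concentrates mainly on frequency-based co-occurrence calculation and on building standalone associative classifiers, and no comprehensive research framework has emerged. Starting from an analysis of text category features, this thesis raises the topic of studying relationships between words in text classification and analyzes those relationships from two perspectives: association and correlation. The main work and contributions are: 1. To address the shortcomings of traditional feature selection methods, a new feature selection algorithm based on relationships between words is proposed. The method takes the positions at which keywords occur into account, uses an association rule mining algorithm to discover associations between words, and filters the strong association rules through nonlinear correlation analysis, ultimately producing a feature space closely related to the category attribute. Experimental results show that the method achieves higher classification accuracy than traditional feature selection methods. 2. For linear analysis of correlation between words, linear least squares fit (LLSF) classification is used as a complement to the K-nearest-neighbor (KNN) classifier to form an LLSF-KNN combined classification algorithm, and the voting function in the KNN algorithm is improved. Experimental results show that the new combined algorithm achieves good classification efficiency and results.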
English abstract	With the rapid development of Internet resources, people are paying more attention to information. It is therefore very important to access and use information effectively. Text classification is a key technology in information retrieval and knowledge mining, and improving classification effectiveness has become a research focus in information retrieval. For text classification, improving the representation capability of features is the key research problem. The vector space model takes the individual word as its unit, so the words that are key to a document's content, and the associations between words, are not captured. Relatively few studies have so far analyzed the relationships between words. Starting from an analysis of document category features, we raise the subject of research on the relationships between words and analyze them in two respects: association and correlation. The work in this thesis includes: 1. To address the shortcomings of traditional feature selection, a new feature selection algorithm based on association rules and keywords is presented. The algorithm checks association rules by nonlinear correlation analysis to produce a feature space closely related to the category attribute. Experiments indicate that the method gives better categorization results than the traditional one. 2. Term correlation is analyzed linearly, and a text classifier combining the LLSF and KNN classifiers is proposed. Furthermore, a new voting method for KNN is designed. Experimental results show that the new classifier achieves higher classification accuracy and efficiency.
Keywords	Text Classification; Relationship Between Words; Association Rule; Correlation Analysis; Voting Method
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7567
Collection	Graduates_Master's Theses
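Recommended citation (GB/T 7714)	Wu Shuang. Research of Text Classification Algorithm based on the Relationship Analysis between Words [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2011.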
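
The first contribution in the abstract relies on association rules between words. As a rough illustration only, the Python sketch below mines rules of the form {term_a, term_b} -> class from a toy corpus using plain support and confidence. The function name `mine_pair_rules`, the toy documents, and both thresholds are assumptions made for this example; the thesis's keyword-position weighting and nonlinear correlation filter are not described in enough detail in this record to reproduce, and they are not shown.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(docs, labels, min_support=0.5, min_confidence=0.8):
    """Return rules (term_a, term_b) -> label as (pair, label, support, confidence).

    docs: list of token lists; labels: list of class names of the same length.
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    n = len(docs)
    pair_counts = Counter()        # documents that contain both terms
    pair_label_counts = Counter()  # documents that contain both terms and carry the label
    for tokens, label in zip(docs, labels):
        for pair in combinations(sorted(set(tokens)), 2):
            pair_counts[pair] += 1
            pair_label_counts[(pair, label)] += 1

    rules = []
    for (pair, label), joint in pair_label_counts.items():
        support = joint / n                    # P(pair and label)
        confidence = joint / pair_counts[pair] # P(label | pair)
        if support >= min_support and confidence >= min_confidence:
            rules.append((pair, label, support, confidence))
    # Strongest rules first; ties are broken alphabetically by term pair.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["stock", "market", "bank"],
    ["stock", "market", "trade"],
    ["match", "goal", "team"],
    ["match", "team", "market"],
]
toy_labels = ["finance", "finance", "sports", "sports"]

rules = mine_pair_rules(toy_docs, toy_labels)
for pair, label, support, confidence in rules:
    print(pair, "->", label, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

On this toy corpus only the pairs (market, stock) and (match, team) pass both thresholds. The terms that appear in surviving rules could then serve as a candidate feature set, which is the general idea the abstract describes.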

Files in this item
File name/size	Document type	Version	Access	License
CASIA_20082800902907 (1109 KB)			Not currently open	CC BY-NC-SA
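
The second contribution in the abstract above mentions a modified voting function for KNN. The record does not say what that function is, so the Python sketch below shows only one common variant, similarity-weighted voting, in which each of the k nearest neighbours contributes its cosine similarity rather than a single vote. The names `cosine` and `knn_weighted_vote`, the toy vectors, and k=3 are assumptions; the LLSF component of the combined classifier is not sketched here.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_weighted_vote(train_vectors, train_labels, query, k=3):
    """Classify `query` by summing, per class, the similarities of its k nearest neighbours."""
    scored = sorted(
        ((cosine(vec, query), label) for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, label in scored[:k]:
        votes[label] += similarity  # weighted vote instead of a plain count
    return max(votes, key=votes.get), dict(votes)


# Hypothetical two-term document vectors.
train_vectors = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
train_labels = ["finance", "sports", "sports"]

predicted, votes = knn_weighted_vote(train_vectors, train_labels, query=[1.0, 0.1], k=3)
print(predicted, votes)
```

With these toy vectors, two of the three neighbours are labelled sports, so a plain majority vote would pick sports. The weighted vote picks finance because the single finance neighbour is far more similar to the query.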