Approaches to Sentiment Text Classification (情感文本分类方法研究)

	情感文本分类方法研究
Alternative title	Approaches to Sentiment Text Classification
	李寿山
	2008-06-01
Degree type	Doctor of Engineering
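Abstract (translated from the Chinese)	Sentiment text classification touches on several problems in natural language understanding and pattern recognition, including text content understanding and pattern classification methods. Research on this technique not only advances related work in natural language understanding but also enriches the theory of pattern recognition and artificial intelligence, so it has significant academic value and theoretical significance. People are increasingly accustomed to expressing their opinions and sentiments online. As a result, a large volume of text carrying sentiment information has appeared on the web, in forms such as product reviews, forum comments, and blogs. Faced with this growing body of sentiment-bearing text, traditional topic-based text classification systems can no longer meet users' needs, and there is an urgent need to study and analyze such text. Research on sentiment text classification methods therefore also has significant practical value. The main contributions of this thesis are as follows: (1) Building on a theoretical analysis of six feature selection methods commonly used in text classification, we propose two feature evaluation measures, the document frequency measure and the category document ratio measure. On this basis we propose a new feature selection method, called the weighted document frequency and ratio method, and study in depth how these feature selection methods apply to sentiment text classification. Extensive experiments show that the new method achieves good classification results across different domains, overcoming the domain dependence of existing methods. (2) We give a theoretical derivation of the two basic fusion rules used in combining multiple classifiers (the product rule and the sum rule). The derivation places them within the framework of Bayes theory and gives the independence conditions each rule requires. On this basis we implement a combined classifier system for sentiment text classification that fuses different feature subsets. Experimental results show that both fusion rules effectively improve sentiment text classification. (3) We pose the problem of multi-domain sentiment text classification and give two different solutions. Sentiment classification is a domain-dependent problem. A practical sentiment text classification system generally needs training corpora collected from several domains so that it performs well in all of them. To address this, we propose fusion at two levels, the feature level and the classifier level, fusing feature sets and classification results respectively, so that a classifier can be built from the training corpora of multiple domains at once. Experimental results show that, compared with training separately on single-domain corpora, both fusion methods make full use of the corpora from all domains and greatly improve overall classification performance. (4) For the domain adaptation problem in sentiment text classification, we propose a combined classifier method for multi-domain adaptation. The thesis focuses on adaptation learning with multiple source domains and proposes a semi-supervised multi-domain adaptation method called ensemble-driven self-training. Experimental results show that multi-domain adaptation with this method classifies better than single-domain adaptation.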
English abstract	Sentiment text classification involves the theory of both text content understanding and pattern recognition. Studying this subject is academically valuable: it not only assists the development of natural language understanding but also enriches the content of pattern recognition. People are increasingly accustomed to expressing their opinions and sentiment on the web. As a result, there is a huge number of documents in the form of product reviews, forum reviews, and personal blog articles. To deal with text that carries sentiment, research on text classification has shifted from traditional topic-based classification to sentiment-based classification. Studying sentiment text classification is therefore also valuable for real applications. The main contributions are summarized as follows: (1) We theoretically analyze six popular feature selection methods for text classification and propose two basic measurements, the document frequency and category ratio measurements. Based on the theoretical analysis, we propose a new feature selection method called the weighted log likelihood ratio (WLLR) method. The experimental results show that this new method performs very well in sentiment classification across different domains. (2) We give a theoretical explanation of the two important fusion rules (the product and sum rules) for combining multiple classifiers. The explanation places them in the framework of Bayes theory and gives the dependence conditions they need. Moreover, we implement a multiple classifier system for sentiment text classification to fuse different feature sets. Experimental results show that both fusion rules improve classification performance. (3) We address the problem of multi-domain sentiment classification and present two methods for it. Sentiment classification is a domain-specific problem. When designing a real application system for sentiment text classification, we need to collect annotated data from multiple domains to guarantee good performance. Given training data from multiple domains, we propose two methods, feature-level and classifier-level fusion, to train classifiers using all the data simultaneously. Experimental results show that multi-domain sentiment classification using these two methods performs much better than single-domain classification (using each domain's training data individually). (4) We apply classifier combination methods to multi-domain adaptation for sentiment text classification. Domain adaptation for sentiment classification is a very practical problem. We focus on multi-domain adaptation, where there is more than one source domain, and propose a method called ensemble-driven self-training to deal with it. Experimental results show that our proposed method makes multi-domain adaptation perform better than single-domain adaptation for sentiment text classification.
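The product and sum rules named in contribution (2) can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the two-class setup, and the posterior values are hypothetical, and equal class priors are assumed so that the product rule reduces to a plain product of per-classifier posteriors.

```python
import math

def product_rule(posteriors):
    """Multiply the per-classifier posteriors class by class, then renormalize.

    Assumes equal class priors (an assumption of this sketch).
    """
    n_classes = len(posteriors[0])
    scores = [math.prod(p[c] for p in posteriors) for c in range(n_classes)]
    total = sum(scores)
    return [s / total for s in scores]

def sum_rule(posteriors):
    """Average the per-classifier posteriors class by class."""
    n_classes = len(posteriors[0])
    n_classifiers = len(posteriors)
    return [sum(p[c] for p in posteriors) / n_classifiers for c in range(n_classes)]

# Hypothetical posteriors [P(negative | x), P(positive | x)] from three
# classifiers, e.g. each trained on a different feature subset.
posteriors = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]

fused_product = product_rule(posteriors)
fused_sum = sum_rule(posteriors)

# The fused label is the class with the largest combined score.
label_product = max(range(2), key=lambda c: fused_product[c])
label_sum = max(range(2), key=lambda c: fused_sum[c])
```

In this made-up example both rules pick the positive class even though the first classifier alone would have chosen negative.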
Keywords	Text Classification; Topic-based Text Classification; Sentiment Text Classification; Feature Selection Methods; Classifier Combination Methods; Domain Adaptation
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6110
Collection	Graduates_Doctoral Dissertations
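Recommended citation (GB/T 7714)	李寿山. 情感文本分类方法研究 (Approaches to Sentiment Text Classification)[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2008.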

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20051801462809 (1107KB)			Not yet open	CC BY-NC-SA