Approaches to Sentiment Classification Based on Linguistic Knowledge and Ensemble Learning (基于语言知识和集成学习的情感文本分类方法研究)

Title	基于语言知识和集成学习的情感文本分类方法研究
Alternative title	Approaches to Sentiment Classification Based on Linguistic Knowledge and Ensemble Learning
Author	夏睿 (Xia Rui)
Date	2011-05-26
Degree type	Doctor of Engineering
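Chinese abstract (translated)	As Internet technology continues to develop, users are increasingly shifting from passively receiving information published by websites to actively acquiring, publishing, sharing, and disseminating information. As a result, the Internet now contains a large volume of text carrying subjective opinions and sentiment, which we call sentiment text. Analyzing, mining, and managing such text is of great importance. The study of sentiment text analysis is also very broad, drawing on natural language processing, pattern recognition, machine learning, information retrieval, data mining, and other fundamental research areas. This research therefore has both academic significance and practical value. Sentiment classification is an important part of sentiment text analysis: it is the task of classifying the subjective information in a text, such as opinions and sentiment. Mainstream sentiment classification methods inherit the traditional approach to topical text classification: texts are represented with the vector space model and then classified with statistical machine learning algorithms. This traditional approach has many shortcomings. To address them, this thesis investigates how to combine linguistic knowledge with ensemble learning, how to find features that are more effective for sentiment classification, and how to make full use of these features to build a robust, high-performance sentiment classification system. The main contributions and innovations of the thesis are summarized as follows:
(1) A sentiment classification method based on an ensemble of part-of-speech information. Our survey and analysis found that different parts of speech play different roles in sentiment analysis. In this method, we therefore first divide the unigram features into several feature subsets according to part of speech, then build base classifiers with different classification algorithms, and finally combine these classifiers with ensemble learning so that they compensate for one another's weaknesses and improve classification performance. Extensive experiments on five corpora with three types of ensemble algorithms and three ensemble strategies show that the method significantly improves classification results.
(2) A sentiment classification method based on an ensemble of word-pair relations. Building on the previous work, the thesis further introduces bigram features and dependency word-pair features, which capture word order and dependency relations respectively, and establishes an ensemble method over these word-pair relations. Extensive comparative experiments show that the method further improves classification performance. On this basis, the thesis analyzes and discusses in depth the effectiveness of ensemble algorithms in sentiment classification, the relative performance of the various ensemble algorithms, and their efficiency.
(3) Traditional word-pair relation features suffer from three problems: a high-dimensional feature space, data sparsity, and low performance when used alone. To address them, the thesis proposes a method for extracting generalized word-pair features, a fast feature selection method, and a corresponding ensemble method. Compared with traditional word-pair features, the generalized word-pair extraction method shrinks the original feature space while significantly improving the classification performance of the features. The fast feature selection method greatly reduces the dimensionality of the feature space while maintaining or even improving classification performance, and it is also much more computationally efficient than the traditional information gain method. Experiments show that these methods further improve sentiment classification performance.
(4) The ensemble learning approach is extended to the cross-domain sentiment classification task, yielding an ensemble-based cross-domain sentiment classification method. The basic idea is to first divide the features into subsets according to part of speech, since different types of feature subsets differ in cross-domain performance, and then to use ensemble learning to reassign the feature weights, thereby achieving domain transfer learning. Experimental results show that the ensemble method assigns sensible weights to each group of features and significantly improves the performance of the cross-domain sentiment classification system. The thesis further concludes that, based on linear weighting...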
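As a point of reference for contribution (3), the traditional information gain criterion that the abstract compares against can be sketched as follows. This is only a minimal illustration under stated assumptions: it shows plain information gain over binary term presence, not the thesis's fast feature selection method, whose details are not given in this record. The toy documents, labels, and the names `entropy`, `information_gain`, and `select_top_k` are hypothetical.

```python
import math

# Hypothetical toy data: each document is a set of terms, with a sentiment label.
DOCS = [
    {"great", "movie"},
    {"great", "plot"},
    {"bad", "movie"},
    {"bad", "plot"},
]
LABELS = ["pos", "pos", "neg", "neg"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]
```

On this toy data, the terms "great" and "bad" separate the two classes perfectly, while "movie" and "plot" carry no class information, so `select_top_k(DOCS, LABELS, 2)` keeps the two sentiment-bearing terms.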
English abstract	With the development of Internet technology, a large number of subjective texts containing opinion or sentiment have appeared on the Internet. Analyzing, mining, and managing these texts has become an important research topic. The scope of sentiment analysis is very broad: it relates to many fundamental research directions, such as natural language processing, pattern recognition, machine learning, information retrieval, and data mining. It therefore also has important research value. Sentiment classification is an important part of sentiment analysis. Its task is to recognize the opinion or sentiment expressed in subjective text. Current research in sentiment classification generally follows the methodology of topical text classification, where the vector space model is employed for text representation and statistical machine learning methods are then used for classification. To address the drawbacks of these traditional methods, this thesis focuses on integrating linguistic knowledge with ensemble learning techniques, and tries to solve the following two problems: how to find features that are significant for sentiment classification, and how to effectively integrate these features with classification models. The main contributions of this thesis can be summarized as follows: (1) We propose a part-of-speech information based ensemble model for sentiment classification. The unigram features are divided into several subsets according to their parts of speech. Different classification algorithms are then employed to construct several base classifiers. Finally, we use ensemble learning methods to integrate these base classifiers efficiently. Three types of ensemble methods, namely fixed combination, weighted combination, and meta-learning classifier, are evaluated on five widely used datasets with three ensemble strategies. The experimental results show that the proposed method can significantly improve classification performance.
(2) We extend the set of feature sources and propose a word relation based ensemble model for sentiment classification. In particular, we explore the use of bigrams and word dependency relations, which can, to some extent, capture word order and syntactic information respectively. As before, we experiment with three types of ensemble methods and three ensemble strategies. The results show that the word relation based ensemble model gains a further improvement in classification accuracy. Furthermore, we made in-dep...
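The part-of-speech based ensemble described in contribution (1) can be illustrated with a small sketch. It rests on several assumptions that are not taken from the thesis: the toy corpus and its hand-assigned POS tags, the choice of naive Bayes as the only base classifier, and the hand-picked weights in `WEIGHTS` (a weighted combination would normally learn them from data). All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (word, POS tag) pairs plus a sentiment label.
TRAIN = [
    ([("great", "ADJ"), ("movie", "NOUN"), ("love", "VERB")], "pos"),
    ([("good", "ADJ"), ("plot", "NOUN"), ("enjoy", "VERB")], "pos"),
    ([("bad", "ADJ"), ("movie", "NOUN"), ("hate", "VERB")], "neg"),
    ([("awful", "ADJ"), ("plot", "NOUN"), ("regret", "VERB")], "neg"),
]


def train_nb(docs, pos_tag):
    """Count unigrams per class, restricted to a single POS subset."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in docs:
        counts[label].update(w for w, t in tokens if t == pos_tag)
    vocab = set(counts["pos"]) | set(counts["neg"])
    return counts, vocab


def log_odds(model, tokens, pos_tag):
    """Naive Bayes log-odds (pos vs. neg) with add-one smoothing.

    Class priors are omitted because the toy corpus is balanced.
    """
    counts, vocab = model
    total_pos = sum(counts["pos"].values())
    total_neg = sum(counts["neg"].values())
    score = 0.0
    for w, t in tokens:
        if t != pos_tag or w not in vocab:
            continue
        p_pos = (counts["pos"][w] + 1) / (total_pos + len(vocab))
        p_neg = (counts["neg"][w] + 1) / (total_neg + len(vocab))
        score += math.log(p_pos / p_neg)
    return score


def ensemble_predict(models, weights, tokens):
    """Weighted sum of the base classifiers' scores, one per POS subset."""
    total = sum(weights[tag] * log_odds(models[tag], tokens, tag)
                for tag in models)
    return "pos" if total >= 0 else "neg"


# One base classifier per POS subset; the weights are illustrative assumptions.
MODELS = {tag: train_nb(TRAIN, tag) for tag in ("ADJ", "NOUN", "VERB")}
WEIGHTS = {"ADJ": 0.5, "NOUN": 0.2, "VERB": 0.3}
```

For example, `ensemble_predict(MODELS, WEIGHTS, [("great", "ADJ"), ("plot", "NOUN")])` is decided by the adjective subset, because the noun "plot" occurs equally often in both classes and contributes a score of zero.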
Keywords	Sentiment Analysis; Sentiment Classification; Language Knowledge; Ensemble Learning; Transfer Learning
Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/6347
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	夏睿. 基于语言知识和集成学习的情感文本分类方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2011.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20071801462909 (839KB)			Not currently open	CC BY-NC-SA