Research on Orientation Analysis of Texts (文本倾向性分析技术研究)

Title	文本倾向性分析技术研究
Alternative title	Research on Orientation Analysis of Texts
Author	王根 (Wang Gen)
Date	2007-06-19
Degree type	Master of Engineering
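Abstract (translated from Chinese)	The Internet, as an interactive medium, is used by more and more people to express their opinions and attitudes. This opinionated, subjective language is concentrated in blogs, forums, and comment threads and carries a large amount of information, so automatically mining the opinions and attitudes it contains is of great significance. Focusing on text orientation analysis, this thesis starts from several application settings and studies techniques for two tasks: polarity (positive/negative) classification and subjective/objective classification.
(1) Polarity classification of short reviews. Short reviews are a common form of commercial product review. Traditional polarity classification relies on a polarity lexicon. In short-review polarity classification, however, the reviewed objects belong to a specific domain, and a general-purpose polarity lexicon does not cover the orientation-bearing words of that domain, so classification performance is poor. The traditional approach has a second weakness on this task: the attitude a polarity word expresses depends on the object it describes, so the orientation of many words is hard to judge independently of that object. To address these problems, the thesis adopts a classification method based on combined features within a supervised learning framework. It first builds a corpus of short reviews labeled with polarity, then automatically mines the objects being described and binds each object to the polarity word that describes it, and finally learns a short-review polarity classifier from these bound, combined features under the supervision of the corpus. Experiments show that, on the task of classifying electronic product reviews, the method achieves better results than traditional polarity classification methods.
(2) Polarity classification of long reviews. Long reviews are a common genre in reviews of literary, film, and television works. Compared with short reviews such as product evaluations, long reviews usually cover more aspects, are longer, use richer language, and express stronger emotion. More importantly, long reviews contain certain structures at the discourse level, one of which is the structure of sentiment expression in the document. Drawing on this sentiment structure, the thesis proposes a classification method for long-review polarity called the Roof-CRF model. The model jointly represents the sentiment relations in a document: between a sentence's sentiment and the words in that sentence, between the sentiments of sentences, between document sentiment and sentence sentiment, and between document sentiment and the words in the document. It thereby classifies the global document-level sentiment and the local sentence-level sentiment in one integrated model. Experiments show that, compared with traditional methods, the method yields some improvement in polarity classification of both documents and sentences.
(3) Orientation grading of reviews. In orientation classification of reviews, the polarity classes are ordered by strength, and classification into ordered classes is an ordinal regression problem. Current methods, however, treat review grading as multi-class classification, so the learned models do not fully fit the ordered-class task. The thesis proposes a method based on multiple redundant labels that lets a CRF address sentiment grading as an ordinal regression problem. In addition, using this method, the thesis integrates three tasks (subjective/objective classification, sentiment polarity classification, and sentiment strength classification) into one unified model, which avoids the accumulation and propagation of errors found in step-by-step approaches. Experiments on an English movie review corpus show that, compared with the standard CRF method, the proposed method handles review orientation grading better.
(4) Subjective/objective classification. Targeting the opinion search task of TREC-BLOG07, the thesis presents a scheme for subjective/objective classification. First, to address the difficulty of collecting training data for subjective/objective classification, it adopts a text classification method based on single-class examples. Second, following the idea of active learning, it selects training samples dynamically. Finally, in the stage that fuses subjectivity and relevance, it uses support vector regression. Experimental results on the TREC BLOG06 opinion search data verify the effectiveness of the scheme.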
English abstract	Nowadays, more and more people express personal opinions on public blogs, forums, wikis, and similar sites. The available reviews and comments, too numerous to collect manually, contain a large amount of valuable information, and automatic analysis of these texts will benefit both groups and individuals. Against this background, the thesis studies two essential issues in text orientation analysis, namely polarity classification and subjective/objective classification. Our main contributions are summarized as follows.
(1) Polarity classification of short reviews. Traditional classification methods rely on a feature-independence assumption that does not hold for short reviews, so they perform poorly on this task. To address this problem, the thesis proposes a combined-feature-based classification method, in which the object of the opinion is first extracted, the object is then combined with the opinion word, and finally the polarity of the short review is classified based on the combined features. An experiment on polarity classification of electronic product reviews shows that the method performs better than the traditional method.
(2) Polarity classification of long reviews. Long reviews contain deeper structures, one of which is the sentiment/opinion structure. To model this structure, we present a new model named Roof-CRF, which can accommodate various kinds of features related to sentiment/opinion states and their transitions. The model classifies the sentiments of documents and sentences simultaneously, and the accuracy of both is improved.
(3) Sentiment grading of reviews. Viewing review grading as multi-class classification neglects the ordinal relationship between the labels. Therefore, we propose a novel model, Redundant-labeled-CRFs, which can deal with ranking or ordinal regression problems in a more appropriate way. In addition, the model can integrate the subjective/objective classification task and the opinion grading task, reducing error accumulation by deciding over all of these subtasks jointly. Experiments show that the presented method outperforms the basic CRF model in review grading.
(4) Subjective/objective classification. For the opinion search task of TREC-Blog-07, we present a solution for subjective/objective classification of blog texts. A partially supervised learning method that uses only positive and unlabeled examples is adopted for subjective/objective classification, after better training examples have been selected with an active learning approach. We then fuse relevance and subjectivity with Support Vector Regression. Detailed experiments demonstrate the effectiveness of this solution.
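Keywords	Orientation Analysis; Polarity Classification; Review Grading; Subjective/Objective Classification
Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/11961
Collection	Graduates: Master's Theses
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	王根. 文本倾向性分析技术研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2007.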

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20042801462806 (1362KB)	Thesis		Restricted access	CC BY-NC-SA
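
Illustrative sketch (not part of the record). The abstract describes binding each described object to its opinion word and classifying short reviews on those combined features. The Python sketch below shows why such binding can help, under loud assumptions: the four training pairs are invented toy data, each review is assumed to be already reduced to one (object, opinion word) pair (the thesis mines objects automatically, which is not shown here), and a small naive Bayes classifier with add-one smoothing stands in for whatever learner the thesis actually uses. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy data, NOT from the thesis: ((object, opinion_word), label).
# "big" and "long" flip polarity depending on the object they describe.
TRAIN = [
    (("screen", "big"), "pos"),
    (("noise", "big"), "neg"),
    (("battery", "long"), "pos"),
    (("boot", "long"), "neg"),
]

def word_features(pair):
    """Baseline: the opinion word alone, independent of its object."""
    _obj, word = pair
    return [word]

def combined_features(pair):
    """Combined feature: the object bound to its opinion word."""
    obj, word = pair
    return [f"{obj}|{word}"]

def train_nb(examples, featurize):
    """Fit a multinomial naive Bayes model (stand-in learner, an assumption)."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for pair, label in examples:
        label_counts[label] += 1
        for feat in featurize(pair):
            feat_counts[label][feat] += 1
            vocab.add(feat)
    return label_counts, feat_counts, vocab

def log_scores(model, featurize, pair):
    """Return the add-one-smoothed log score of each label for one review."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for feat in featurize(pair):
            score += math.log((feat_counts[label][feat] + 1) / denom)
        scores[label] = score
    return scores

review = ("screen", "big")

# Opinion word alone: "big" occurs once per class, so the labels tie.
word_model = train_nb(TRAIN, word_features)
word_scores = log_scores(word_model, word_features, review)

# Object bound to word: "screen|big" was only seen in a positive review.
pair_model = train_nb(TRAIN, combined_features)
pair_scores = log_scores(pair_model, combined_features, review)

print("word only:", word_scores)
print("combined: ", pair_scores)
```

On this toy data the word-only model cannot separate the two labels for "big", while the combined feature favors the positive label. The sketch says nothing about the accuracy figures or the corpus used in the thesis.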