Research on Sentiment Analysis (文本倾向性分析技术研究)

Title	文本倾向性分析技术研究
Alternative Title	Research on Sentiment Analysis
Author	刘康 (Liu Kang)
Date	2010-06-03
Degree Type	Doctor of Engineering
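Chinese Abstract	Sentiment analysis is an emerging task in natural language processing whose main goal is to identify, analyze, and understand the subjective information contained in text and its sentiment orientation. The task draws on key techniques from pattern classification, machine learning, information retrieval, information extraction, and other fields, and research on it can advance those fields, so it is of considerable academic significance. In addition, with the development of Web 2.0 and the arrival of Web 3.0, sentiment analysis matters for online public opinion analysis, product market research, and social information security, so it also has significant practical value. This dissertation focuses on the opinion polarity classification problem in sentiment analysis. The main work and contributions are as follows. (1) Sentence-level sentiment classification: To capture the influence of context on sentence sentiment, and unlike traditional sentiment classification methods, this dissertation no longer treats each sentence in a document as an isolated point but as part of a sentiment chain that makes up the document. Sentence sentiment classification is thereby converted from a point-based classification task into a sequence labeling task, so that context can be taken into account when identifying sentence sentiment. In addition, to address the problem that redundant correlations among sentiment labels degrade classification performance, this dissertation proposes, within the conditional random field framework, a sentence sentiment classification method that incorporates redundant label features. Based on the characteristics of sentiment labels, the method designs redundant label features to capture the redundant correlations among them, thereby improving sentence sentiment recognition. Experiments on several tasks, including subjective/objective classification, polarity classification, and sentiment strength rating, demonstrate the effectiveness of the method. (2) Document-level sentiment classification: Determining the overall sentiment of a document faces two difficulties. First, a document contains multiple opinions with different polarities; how can the overall polarity be identified from these local opinions? Second, when a classifier is trained on document-level text, only the overall sentiment label of each document is observed in the training corpus, whereas the polarity of local opinions that disagree with the overall opinion is not. As a result, features expressing local opinions cannot be associated with their true sentiment labels, their weights become biased during training, and classification performance suffers. For the first problem, this dissertation regards the overall opinion of a document as an integration of all its topic opinions (local opinions) and therefore proposes a topic-based method for document sentiment classification, which determines the sentiment of the whole document by fusing the sentiment of each topic according to the topic's importance in the document. For the second problem, this dissertation proposes a method that combines a statistical classifier with a sentiment lexicon: lexicon information is incorporated when the classifier is trained on the corpus in order to correct the feature weights biased by training, thereby improving classification accuracy. Experiments show that, compared with traditional sentiment classification methods, both methods effectively improve document-level sentiment classification performance. (3) Domain adaptation in sentiment classification: The cross-domain performance gap of a classifier has two causes: inconsistent sample distributions across domains and inconsistent feature spaces across domains. To address them, this dissertation proposes two methods, one based on instance weighting and one based on feature mapping. First, a domain adaptation method based on a hybrid generative/discriminative model is proposed. Within the hybrid model framework, the weights of training samples are adjusted according to the sample distributions of the different domains (increasing the weights of samples close to the target-domain distribution and decreasing the weights of samples inconsistent with it), thereby reducing the influence on the classifier of information inconsistent with the target-domain distribution. Second, a domain adaptation method based on feature mapping is proposed, which uses topics shared across domains to establish correspondences between the features of different domains, so that the training and testing of the statistical classifier in...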
English Abstract	Sentiment analysis is a new task in natural language processing. Its aim is to recognize, analyze, and understand the opinions in texts, which involves key technologies of pattern recognition, machine learning, information retrieval, information extraction, etc. Thus, research on sentiment analysis has significant academic value. Furthermore, with the development of Web 2.0 and the rise of Web 3.0, automatic sentiment analysis of texts will benefit public opinion analysis, market research, social information security, and so on. Therefore, it is also very useful for real applications. Against this background, this dissertation focuses on the sentiment classification task in sentiment analysis. The main contributions are summarized as follows: (1) Sentence-Level Sentiment Classification: To capture the contextual constraints on sentence sentiment, this dissertation regards the sentences in a passage as a sentiment flow instead of isolated points. The original sentence-level sentiment classification task is thereby converted into a sequence labeling task, so that contextual information can be considered. At the same time, to capture label redundancy, this dissertation introduces redundant labels into the original sentiment label set and adds redundant label features to the statistical model, so that the performance of sentiment classification can be improved. Experiments on several sentiment analysis tasks (including subjective classification, polarity classification, and sentiment strength rating) demonstrate the effectiveness of the approach. (2) Document-Level Sentiment Classification: There are two difficulties in identifying the overall sentiment of a document: 1) A document may contain multiple sentiments toward different facets of an object (local sentiments). How can the overall sentiment of the document be summarized from these diverse local sentiments? 2) When a document-level sentiment classifier is trained, only the overall sentiment labels can be observed in the training dataset, and the local sentiments (which may be inconsistent with the overall sentiment) are unobserved, so some feature weights may be biased after training because these features are not connected with their real sentiment labels. How can these biased feature weights be revised? For the first problem, we regard the overall sentiment as an integration of all the local sentiments in a document with different weights. Therefore, we present a novel ...
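Keywords	Sentiment Analysis; Subjective Classification; Polarity Classification; Sentiment Strength Rating
Language	Chinese
Document Type	Thesis
Item Identifier	http://ir.ia.ac.cn/handle/173211/6285
Collection	Graduates_Doctoral Dissertations
Recommended Citation (GB/T 7714)	刘康 (Liu Kang). 文本倾向性分析技术研究 (Research on Sentiment Analysis) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2010.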

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
CASIA_20061801462804 (2035KB)	Thesis		Restricted access	CC BY-NC-SA