Research on Fine-Grained Opinion Mining Methods for Large-Scale Internet Data

Alternative title	Fine-Grained Opinion Mining Methods for Large-Scale Online Reviews
	徐立恒
	2014-05-29
Degree type	Doctor of Engineering
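Chinese abstract (translated)	With the rapid expansion of the mobile internet, online shopping has greatly improved people's quality of life. Against this background, many e-commerce websites provide product review platforms so that users can share their experience with products and rate their satisfaction. These reviews are valuable references for both consumers and businesses. However, the sheer volume of reviews makes manual reading impractical, and automatic opinion mining systems have emerged in response. Opinion mining studies methods for automatically analyzing product review text and summarizing users' opinions on each function of a product. The opinion information mined in this thesis consists mainly of two parts: opinion words (words expressing the user's opinion polarity) and opinion targets (usually functions or attributes of a product). Conventional opinion mining methods rely mainly on dependency parsing, extracting opinion information by capturing the modification relations between opinion words and opinion targets. However, syntax-based opinion mining methods have many problems. Addressing the shortcomings of existing syntax-based methods, this thesis studies automatic opinion word and opinion target extraction for large-scale online review text. The specific research contents and results are as follows:
(1) This thesis proposes a two-step algorithm that remedies some shortcomings of conventional syntax-based opinion mining. Conventional methods often depend on many syntactic patterns; because the patterns differ in accuracy, low-quality patterns tend to introduce many noise terms. To address this, the first step incorporates the syntactic patterns into an opinion relation graph and estimates a confidence for each pattern, so that low-quality patterns receive low confidence. In addition, conventional methods tend to rank candidate terms by term frequency, which cannot filter out high-frequency noise terms and easily loses low-frequency terms. To address this, the second step filters the opinion target list with a semi-supervised binary classifier, so that the algorithm does not depend on term frequency. Experiments show that the first step effectively improves precision and the second step effectively reduces the adverse effect of term frequency.
(2) This thesis proposes replacing the syntactic parser with a monolingual word alignment model. Existing syntactic parsers are often not accurate enough on complex online reviews. To address this, the thesis uses a monolingual word alignment model that captures the modification relations between opinion words and opinion targets through unsupervised word co-occurrence statistics. Compared with syntax-based methods, the word alignment model effectively reduces erroneous modification relations when analyzing colloquial text and also effectively improves recall. However, the unsupervised word alignment model is sensitive to insufficient training data. The thesis therefore further proposes an opinion mining algorithm based on a semi-supervised word alignment model, which fuses a portion of reliable dependency relations into the word alignment model. Experiments show that this method effectively improves performance on small corpora.
(3) This thesis proposes replacing the syntactic parser with word vector learning. Existing syntax-based methods treat words as discrete variables, which leads to data sparsity. To address this, the thesis introduces word vector learning in place of syntactic parsing to capture contextual semantics. Because semantically similar words have similar word vectors, the adverse effect of data sparsity is effectively reduced. The thesis also uses word vector distance to measure semantic similarity between words, replacing the pattern-word co-occurrence relations used in conventional graph-based methods. Experiments show that, for product feature term extraction, word vector distance is significantly better than pattern-word co-occurrence relations.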
English abstract	With the rapid growth of the mobile internet, online shopping has greatly improved life for consumers. Against this background, many e-commerce websites provide online review platforms where consumers share their purchase experiences and opinions on products. These reviews are of great value to both consumers and businesses. However, manually reading through large volumes of review text is an arduous task, and automatic opinion mining systems have emerged in response. Generally, opinion mining systems summarize consumers' opinions by automatically analyzing review texts. This thesis focuses on mining opinion words (terms that indicate sentiment polarity) and opinion targets (usually attributes or functions of products). Conventional opinion mining methods often rely on syntactic dependency parsing to capture the modification relations between opinion words and opinion targets, which has many limitations. This thesis provides several opinion mining methods that overcome shortcomings of conventional syntax-based opinion mining systems. The main contents and contributions include:
(1) This thesis proposes a two-stage method to improve conventional syntax-based opinion mining methods. Previous work often uses many syntactic patterns to mine opinion words and opinion targets. However, some patterns are of low quality and may introduce many noise terms. To alleviate this issue, we incorporate syntactic patterns into a Sentiment Graph and apply a random walk on the graph to estimate the confidence of each pattern. In this way, low-quality patterns receive low confidence, which improves accuracy. On the other hand, previous work tends to rank candidates by term frequency, which may introduce high-frequency noise terms and lose low-frequency opinion terms. To solve this problem, we employ a semi-supervised binary classifier to refine opinion targets, which does not rely on term frequency to rank candidates. Experimental results show that the first stage effectively improves precision and the second stage significantly reduces the adverse effects of term frequency.
(2) This thesis introduces a monolingual word alignment model, which replaces the syntactic parser in capturing opinion relations. Current syntactic parsers can easily suffer from informal expressions in online reviews. To tackle this problem, instead of using syntactic parsers, this thesis employs an unsupervised monol...
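As a rough illustration of the word-vector idea in contribution (3) of the Chinese abstract, the sketch below ranks candidate terms by their similarity to a seed product-feature term. The toy vectors, the example terms, the function names, and the choice of cosine similarity are all assumptions made for illustration; the abstract does not state which embedding model or distance measure the thesis uses.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings; real word vectors would be
# learned from a review corpus and have many more dimensions.
toy_vectors = {
    "screen": [0.9, 0.1, 0.0],
    "display": [0.8, 0.2, 0.1],
    "battery": [0.1, 0.9, 0.2],
    "yesterday": [0.0, 0.1, 0.9],
}


def rank_by_similarity(seed, vectors):
    """Rank every other term by cosine similarity to the seed term."""
    seed_vec = vectors[seed]
    scored = [
        (term, cosine_similarity(seed_vec, vec))
        for term, vec in vectors.items()
        if term != seed
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("screen", toy_vectors)
for term, score in ranking:
    print(f"{term}\t{score:.3f}")
```

With these toy values, "display" ranks closest to "screen" even if the two words never share a syntactic pattern, which is the kind of sparsity relief the abstract attributes to word vectors.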
Keywords	Opinion Mining; Sentiment Polarity Analysis; Opinion Word Extraction; Opinion Target Extraction; Product Feature Mining
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6643
Collection	Graduates_Doctoral Dissertations
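Recommended citation (GB/T 7714)	徐立恒. Research on Fine-Grained Opinion Mining Methods for Large-Scale Internet Data[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2014.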

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20111801462909 (2513KB)			Not yet open	CC BY-NC-SA