Research on Key Technologies for Spam Review Detection in Internet Opinions (面向互联网观点的垃圾评论检测关键技术研究)

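	Research on Key Technologies for Spam Review Detection in Internet Opinions (面向互联网观点的垃圾评论检测关键技术研究)
	王雪鹏 (Wang Xuepeng)
	2017-05
Degree type	Doctor of Engineering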
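English abstract	Translated from the Chinese: With the rapid development of social media, opinions posted in online reviews increasingly influence how organizations and individuals make purchase decisions, vote in elections, and design products for the market. For businesses and individuals, positive reviews often mean higher profits and a better reputation. Unfortunately, the pursuit of profit and market share has also produced a growing number of fake reviews and fake opinions (collectively, spam reviews) on commercial review sites. How to detect spam reviews effectively, protect users’ interests, and maintain the credibility of review sites has become a problem that both industry and academia urgently need to solve, and the task of spam review detection has emerged in response.
Spam review detection is an important task in opinion mining. Broadly speaking, spam detection has been studied in many fields. Compared with web spam and e-mail spam detection, spam review detection is harder, because spam reviews, especially those meant to promote a target product or service, are often highly covert and are usually disguised as honest reviews from real users. It is hard to tell spam reviews from normal ones using the review text alone. A large body of prior work therefore draws on both review text and user behavior data: it looks for suspicious clues that indicate spam, extracts textual and behavioral features, represents the target review with these two kinds of features, and relies on statistical models to detect spam. Experiments in existing work show that behavioral features are more effective than textual features for this task. However, existing research has concentrated on feature engineering, which leaves several problems: feature extraction depends on expert knowledge; feature extraction depends on rich information and so cannot handle the cold-start problem; and important features cannot be selected dynamically. Addressing these shortcomings, this dissertation starts from user behavior information and studies key techniques for spam review detection on Internet opinion data. The main results are:
1. To address the over-reliance on expert knowledge and prior assumptions in traditional statistical feature extraction, a user behavior representation learning method based on tensor factorization is proposed. The method does not depend on expert knowledge; working directly from the data, it uses global multi-relational information to jointly and automatically learn representations of users’ behavior and of the products they review. Specifically, without making any assumption about spam-like tendencies, the dissertation defines two types of basic relation and, on that basis, records comparative information between two entities along temporal, spatial, social, and other dimensions, yielding 11 concrete relations. To use the multi-relational information jointly and to represent a review implicitly (reviewer representation + product representation), tensor factorization is applied over these 11 relations with a global loss function across relations, so that the reviewer and product representations are learned jointly and more fully. Experimental results show that the learned review representations are more effective than traditional statistical features, free the representation of spam reviews from dependence on expert knowledge, and exhibit strong robustness and domain adaptability.
2. To address the dependence of traditional statistical features on rich behavioral information, which leaves the cold-start problem unsolved, a model based on graph structure and convolutional neural networks is proposed. This is the first attempt in the field to quantitatively analyze and handle the cold-start problem caused by the lag of traditional spam review detection. Specifically, quantitative experiments show that traditional statistical features must be built on a user’s rich behavioral record; for a new user who has posted only one review, a system based on such features cannot make a timely decision. Traditional feature-based methods therefore cannot handle spam review detection in the cold-start setting. The proposed model jointly encodes the review text and the user behavior information in the review system, finds existing users whose review text is similar to the new user’s, and uses their behavioral information to supplement the new user’s insufficient behavioral information in order to detect spam reviews. Experimental results show that, compared with traditional statistical features, the method detects spam reviews in the cold-start setting effectively and promptly.
3. To address the fact that existing work focuses heavily on feature engineering but cannot dynamically select important features after extraction, a neural network model based on a bi-directional attention mechanism is proposed. Spam reviews include reviews whose behavioral features are suspicious, reviews whose textual features are suspicious, and reviews in which both are suspicious. Prior work concentrated on feature engineering and applied existing models directly after extracting features. The weight matrices these models learn are static, the same for every test sample. Yet for a review in which only the behavioral features are suspicious, the normal textual features used alongside them become noise, and vice versa. The static weight matrix is thus a global compromise reached in training. By adding an attention module to the neural network, this work lets the model learn a dynamic feature attention weight for each sample and so determine at a finer granularity whether a spam review is suspicious in its behavioral features or in its textual features. Experiments show that the model selects, review by review, the features that help detection, uses textual and behavioral features together more fully, and detects spam reviews more effectively.
Author’s English version: With the rapid development of social media, opinions from the Internet have a great influence on how individuals and organizations make purchase decisions, make choices at elections, and carry out marketing and product design. Positive opinions usually raise reputation and bring more profit to businesses and individuals. Unfortunately, driven by profit and market share, more and more deceptive opinions, or opinion spam, have emerged. How to detect such opinion spam effectively, protect the interests of consumers, and maintain the credibility of review hosting websites has become an urgent problem for both academia and industry. The task of opinion spam detection arose in response. Opinion spam detection is an important task in the domain of opinion mining. In general, spam detection has been widely studied in many fields. Compared with web spam and e-mail spam, opinion spam is very difficult to detect. This is because opinion spam, especially spam aimed at promoting target products or services, is highly implicit. These opinions usually pretend to be honest opinions from real customers or users. It is hard to distinguish them by relying on the opinion texts alone. So previous work turned to analyzing the linguistic information of the opinions and the behavioral information of the users. It explored suspicious clues that can indicate opinion spam and took these clues as linguistic and behavioral features. It then represented the opinions with these two types of features and trained models to detect opinion spam. Previous work has shown in experiments that behavioral features are more effective than linguistic features. However, existing work devoted most of its effort to feature engineering and directly applied off-the-shelf algorithms to the feature representations of reviews.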
Many problems remain to be investigated: the extraction of features relies heavily on experts’ knowledge; the statistical features require abundant behavioral information about users and so fail to solve the cold-start problem; and previous work fails to dynamically select linguistic and behavioral features for different reviews. In this dissertation, we focus on the problems above, start from the behavioral information of users, and explore the key methods for opinion spam detection in online reviews. The main achievements are as follows:
1. To handle the problem that the extraction of features relies heavily on experts’ knowledge or prior assumptions, we propose a representation learning method based on tensor decomposition. This method collectively learns the embeddings of reviewers and products from the global relations in a data-driven manner, instead of relying heavily on experts’ knowledge. More specifically, we define two basic patterns without any experts’ knowledge, developers’ ingenuity, or spammer-like assumptions. Based on the two basic patterns, we derive 11 interactive relations between entities (reviewers and products) in terms of time, location, social contact, etc. Then we apply tensor factorization, and the representations of reviewers and products are embedded in a latent vector space by collective learning. Finally, these representations are fed into a classifier to detect review spam. The experimental results show that the representations of reviews learned by our method are more effective than the traditional statistical features. Our method represents reviews without relying on experts’ knowledge, is more robust, and adapts better across domains. This work mainly focuses on users who have abundant behavioral information.
2. To handle the problem that feature extraction requires abundant behavioral information about users, which gives rise to the cold-start problem, we propose a model based on graph structure and convolutional neural networks. To the best of our knowledge, we make the first attempt to quantitatively analyze and handle the cold-start problem in opinion spam detection. More specifically, we carried out experiments and showed that the traditional features can only be extracted from abundant behavioral information. But for a new user who has posted only one review, models based on traditional features fail to identify opinion spam in time. So the traditional features cannot help detect opinion spam in the cold-start task. By jointly embedding texts and behaviors, the proposed model augments the behavioral information of new users with the behavioral information of existing users who have similar linguistic information. The experiments showed that, compared with the traditional statistical features, our model can effectively detect opinion spam in the cold-start task. This work mainly focuses on users who have little behavioral information.
3. Existing work paid most attention to feature engineering and fails to selectively utilize the linguistic and behavioral features, so we propose a bi-directional attention-based neural network. In a review system, some opinion spam is linguistically suspicious, some is behaviorally suspicious, and some is both. Previous work focused on extracting effective features and directly adopted off-the-shelf algorithms. But the weight matrices trained in these models are static, identical for every data sample. For behaviorally suspicious opinion spam, however, the normal linguistic features are actually noise for the model, and vice versa. To some extent, these models are compromises trained on the whole dataset.
In this work, we adopt an attention module to learn a dynamic attention weight for each review, and further distinguish whether the review spam is linguistically suspicious, behaviorally suspicious, or both. The experimental results showed that our bi-directional attention-based model can dynamically select the linguistic and behavioral features for different reviews, make full use of the two feature types together, and detect opinion spam more effectively.
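The per-review weighting described in item 3 can be illustrated with a minimal sketch. It is not the dissertation's bi-directional attention model. It assumes that each review already has a linguistic feature vector and a behavioral feature vector of equal length, and it uses a single hypothetical `query` vector in place of the parameters a real model would learn. The names `softmax`, `dot`, and `attend` are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(text_vec, behavior_vec, query):
    """Weight the two feature views of ONE review and fuse them.

    `query` is a hypothetical stand-in for learned attention parameters.
    Because the scores depend on the review's own vectors, the weights
    differ from review to review instead of being fixed after training.
    """
    scores = [dot(text_vec, query), dot(behavior_vec, query)]
    w_text, w_behavior = softmax(scores)
    fused = [w_text * t + w_behavior * b
             for t, b in zip(text_vec, behavior_vec)]
    return (w_text, w_behavior), fused


if __name__ == "__main__":
    # Toy vectors, invented for illustration; not data from the dissertation.
    weights, fused = attend([1.0, 0.0], [0.0, 1.0], [2.0, 0.0])
    print(weights, fused)
```

With a static weight matrix, both views would contribute in the same proportion for every review. Here the view that scores higher against `query` receives the larger weight for that review, which is the behavior the abstract attributes to the attention module.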
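Keywords	Internet opinions; spam review detection; textual features; behavioral features; neural networks
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/14663
Collection	Graduates_Doctoral dissertations
Author affiliation	National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
First author affiliation	National Laboratory of Pattern Recognition
Recommended citation (GB/T 7714)	王雪鹏. 面向互联网观点的垃圾评论检测关键技术研究[D]. 北京. 中国科学院研究生院,2017.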

Files in this item
File name/size	Document type	Version type	Access type	License
wxp终版打印.pdf (2998KB)	Dissertation		Restricted access	CC BY-NC-SA