Research on Media Reprint Effect Analysis Methods Based on Multi-Source Data

	Research on Media Reprint Effect Analysis Methods Based on Multi-Source Data
	Liu Hejing (刘贺静)
	2021-05-22
Pages	91
Degree Type	Master's
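Chinese Abstract (translated)	Rapid advances in mobile network technology and the widespread adoption of smartphones have reshaped online media, posing many challenges for the automatic analysis of key factors such as media reprint patterns, news influence, and content credibility. In-depth analysis of multi-channel media reprint effects helps to anticipate news propagation paths, track news hotspots, and gauge the direction of audience opinion, which supports management departments in making timely adjustments and decisions. This thesis applies natural language processing and deep learning, and draws on recent work in heterogeneous network representation learning, short-text topic discovery, and cause clause identification for sarcastic feedback, to study reprint effect analysis methods based on multi-source data from three aspects: media reprint behavior prediction, reprint hotspot mining, and cause clause identification for reprint feedback. The main work and contributions are summarized as follows:
1. An entity-information-enhanced method for predicting media reprint behavior. Entity information in news content, such as people, places, and organizations, reflects the differing focuses of media outlets and becomes dynamically associated with them through their reprint behavior; media outlets also share attribute features such as region. To make full use of the entity associations implicit in news content and the associations among media outlets, this thesis proposes an entity-information-enhanced method for predicting media reprint behavior. The method maps the multiple types of media-entity relations into a heterogeneous association network, uses the HIN2Vec algorithm to learn relation-aware embeddings of media and entities, and uses these embeddings as the attention source that guides the model to learn news content representations carrying both association attributes and deep semantics, from which media reprint behavior is predicted.
2. A reprint hotspot mining method based on semantic association and syntactic dependency. Existing keyword-based topic analysis methods adapt poorly to reprint patterns when mining reprint hotspots, and the distinctiveness and interpretability of the resulting hotspots need improvement. To address these challenges, this thesis proposes a reprint hotspot mining method based on semantic association and syntactic dependency. Targeting the topic diversity and differences in description that arise when news is reprinted, the method considers both the semantic associations of word pairs in reprint titles and the co-occurrence of syntactic components, builds a semantic-syntactic component co-occurrence matrix, and integrates it as a constraint into the latent vector representation learning of reprint hotspots. This yields clusters of similar news and representative phrases for each reprint hotspot.
3. A multi-view method for identifying the cause clauses of sarcastic reprint feedback. Reprint feedback and its causes carry the audience's real sentiment and stance. To address the complexity that sarcastic feedback introduces into reprint effect analysis, this thesis proposes a multi-view method for identifying the cause clauses of sarcastic reprint feedback. The method designs a hybrid semantic measurement mechanism that explicitly models the semantic association between sarcastic feedback and cause sentences in several different ways, and uses an attention mechanism to automatically capture the importance of each semantic measure in different contexts. In the sentence-pair-level representation learning stage, the model pairs contextual feedback with sarcastic feedback to better capture the causal logic between sentences. A dataset for cause clause identification in sarcastic reprint feedback was constructed to validate the method; in comparison with baseline methods, the proposed method attains an F1 score of 71.14% and an AUC of 62.92%.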
English Abstract	The rapid innovation of mobile network technology and the widespread adoption of smartphones have reshaped online media and brought many challenges to the automatic analysis of key elements such as media reprint patterns, news influence, and credibility. In-depth analysis of the effects of multi-channel media reprinting helps to predict the path of news dissemination, understand the trend of news hotspots, grasp the direction of audience opinion, and assist management departments in making timely adjustments and decisions. By applying natural language processing and deep learning algorithms, and drawing on the latest research in heterogeneous network representation learning, short text topic modeling, and sarcastic feedback cause detection, this thesis studies methods for analyzing media reprint effects. Based on multi-source data, the research is carried out from three perspectives: media reprint prediction, reprint hotspot mining, and sarcastic reprint feedback cause detection. The major work and contributions of this thesis are summarized as follows:
1. Entity Association Network Enhanced Reprint Prediction Model. Entity information in news content, such as people, locations, and organizations, reflects the different focuses of media outlets and forms dynamic associations with them through their reprinting behavior. Media outlets also have geographic and other attributes. To make full use of the implicit entity associations in news content and the associations between media outlets, this thesis proposes a method for predicting media reprint behavior that is enhanced by an entity association network. The method first maps the multiple association relationships between media and entities into a single heterogeneous information network, then applies the HIN2Vec algorithm to learn embeddings of the media and entities that encode these relationships. The model then uses these embeddings as the source of the attention mechanism, guiding it to learn news content representations with deep semantics.
2. Semantics-POS-assisted Reprint Hotspot Model. Existing keyword-based topic modeling methods are not sufficiently adaptable when mining reprint hotspots because reprint patterns vary, and the distinctiveness and interpretability of reprint hotspots need to be improved. To tackle these challenges, this thesis proposes a reprint hotspot mining method based on the semantic associations of biterms and on syntactic dependency. Considering the topic diversity and description differences that arise during news reprinting, the proposed method combines the semantic associations of biterms in reprinted news titles with the co-occurrence relationships of syntactic components, constructs a semantic-syntactic component co-occurrence matrix, and integrates it as a constraint into the latent representation learning of reprint hotspots. The method obtains clusters of similar news and generates representative key phrases for each hotspot.
3. Attention-based Multi-view Sarcasm Cause Detection Model. Reprint feedback and its causes reflect the audience's actual emotions and opinions. To tackle the complexity that sarcastic feedback brings to reprint effect analysis, this thesis proposes a multi-view method for detecting the cause of sarcasm. The proposed model combines four different measures to learn semantic relevance at both the word level and the sentence level. The importance of each measure is identified automatically through an attention mechanism. In addition, long-range semantics are captured at the sentence-pair level to uncover deeper causal relationships across the views. Compared with the baseline methods, the F1 score of the proposed method is 71.14%, and the AUC value is 62.92%.
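Keywords	multi-source data; media reprint effect; reprint behavior prediction; hotspot mining; feedback analysis
Language	Chinese
Seven Major Directions: Sub-direction Classification	Social Computing
Document Type	Thesis
Item Identifier	http://ir.ia.ac.cn/handle/173211/44936
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Internet Big Data and Information Security
Recommended Citation (GB/T 7714)	Liu Hejing (刘贺静). Research on Media Reprint Effect Analysis Methods Based on Multi-Source Data[D]. Beijing: Institute of Automation, Chinese Academy of Sciences, 2021.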

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
刘贺静毕业论文0606定稿.pdf (2597KB)	Thesis		Open Access	CC BY-NC-SA