Research on Media Reprint Behavior Analysis Based on Deep Neural Networks


	Research on Media Reprint Behavior Analysis Based on Deep Neural Networks
	姚日恒
	2020-05-30
Pages	100
Degree type	Master's
中文摘要	随着新一代信息技术的快速发展，新闻传播领域发生了革命性变化。在新媒体环境下，传播内容丰富且开放，传播过程迅速且交互性强。媒体间的转载行为是信息传播的主要方式之一，分析转载模式有助于管理部门及时跟踪舆论发展态势从而提供决策支持。本文旨在借鉴深度神经网络在网络表示学习、文本表示学习等领域的研究成果，从媒体转载预测、新闻转载识别、转载群体分析三个角度展开对媒体转载行为的分析研究，主要工作如下： 1. 基于注意力机制的媒体转载预测方法。不同媒体对相同新闻文本关注的内容存在差异，针对如何为不同媒体生成特定新闻语义表征的挑战性问题，本文提出基于注意力机制融合媒体转载关联以及媒体发布内容的转载预测方法。该方法首先基于网络表示学习方法分析转载网络拓扑结构生成媒体向量表示；在新闻语义表示学习过程中，基于注意力机制将媒体向量表示作为注意力来源以定位媒体关注内容，实现对相同的内容特征为不同媒体赋予差异化的语义权重，从而生成媒体特定的内容表示；最后联合媒体向量表示以及新闻内容表示预测媒体转载关系。实验表明，本文提出的媒体转载预测模型实现了媒体转载关联和媒体发布内容的有效融合，能够准确捕获媒体对新闻内容的关注差异，便于更好地理解媒体转载行为。 2. 基于多层次语义建模的新闻转载识别方法。新闻转载过程中存在句子、段落、篇章等不同层面的摘编，可能存在词语变体、句式转换、文章结构重排等多种转述表达形式。针对如何捕获新闻间不同层次的深度语义相似性信息的挑战性问题，本文提出的转载识别方法分别从词语-句子层次、句子-段落-篇章层次全面度量新闻标题、正文之间的语义相似性。对于标题信息，模型通过词语移动距离衡量词语层次的相似性，句子层次的相似性特征采用双向长短期记忆网络获得语义表示之后映射到匹配空间中进行学习。对于正文信息，模型采用层次化双向长短期记忆网络学习句子、段落、篇章三个层次的语义表达然后通过交互匹配提取相似性特征。最终，模型联合标题以及正文的所有层次相似性特征识别新闻转载关系。实验表明，相比传统方法，本文提出的新闻转载识别模型能够全面建模新闻间多层次的语义相似性，有助于识别更丰富的新闻转载模式。 3. 基于BERT和变分图自编码模型的转载群体分析方法。转载过程中关联紧密且发布内容相似的媒体形成不同的群体。针对如何构建有效的媒体语义特征从而更好地捕获其与媒体关联之间的潜在内部关联的挑战性问题，本文提出基于预训练语言模型BERT以及变分图自编码模型的转载群体分析方法。该方法首先基于BERT对媒体发布内容进行编码并进一步构建媒体语义特征；然后采用变分图自编码模型在迭代信息传递框架中显式利用媒体关联关系聚合媒体语义特征，从而学习集成结构以及语义信息的媒体向量表示；最后通过聚类媒体向量表示实现群体划分。实验表明，本文提出的转载群体分析方法能够有效表征媒体语义，挖掘其与媒体关联间的深层联系，从而学习出准确反映媒体特征的向量表示，提升群体划分的性能。
English abstract	With the rapid development of new-generation information technology, the field of news dissemination has changed fundamentally. In the new-media environment, content is rich and open, and dissemination is fast and highly interactive. Reprinting among media outlets is one of the main ways information spreads, and analyzing reprint patterns helps the relevant authorities track the development of public opinion in a timely manner and supports their decision-making. Drawing on advances in deep neural networks for network representation learning and text representation learning, this thesis studies media reprint behavior from three perspectives: media reprint prediction, news reprint identification, and reprint group analysis. The main contributions are as follows:
1. A method for media reprint prediction based on the attention mechanism. Different media outlets attend to different content in the same news text, so the challenge is how to generate a media-specific semantic representation of the news. To address it, we propose a reprint prediction method that uses attention to fuse the reprint relations among media with the content that media publish. The method first applies network representation learning to the topology of the reprint network to generate media representations. During news representation learning, the media representation serves as the attention source to locate the content that a given outlet focuses on, so that the same content features receive different semantic weights for different media, yielding a media-specific content representation. Finally, the model combines the media representation and the news content representation to predict reprint relations. Experiments show that the proposed model effectively integrates reprint relations and published content and accurately captures differences in how media attend to news content, which helps to better understand media reprint behavior.
2. A method for news reprint identification based on multi-level semantic modeling. When reprinting news, media may, apart from copying the original in full, excerpt it at the sentence, paragraph, or article level, and may rewrite it in various ways, such as word variation, sentence transformation, and rearrangement of the article structure. To capture deep semantic similarity between news articles at different levels, the proposed method measures the semantic similarity of headlines at the word and sentence levels and of bodies at the sentence, paragraph, and article levels. For headlines, word-level similarity is measured with the word mover's distance, and sentence-level similarity features are obtained by encoding the headline pair with a bidirectional long short-term memory network (BiLSTM) and then mapping the representations into a matching space for learning. For bodies, the model applies a hierarchical BiLSTM to learn representations of sentences, paragraphs, and the whole article, and then extracts similarity features through interactive matching. Finally, the model combines the similarity features from all levels of the headlines and bodies to identify reprint relations. Experiments show that, compared with traditional methods, the proposed model captures multi-level semantic similarity between news articles more comprehensively and helps identify a richer set of reprint patterns.
3. A method for reprint group analysis based on BERT and the variational graph auto-encoder. In the course of reprinting, media that are closely connected and publish similar content form distinct groups. The challenge is how to construct effective semantic features for media so as to better capture the latent relation between those features and the reprint links among media. We propose a reprint group analysis method based on the pre-trained language model BERT and the variational graph auto-encoder. It first encodes the content published by each outlet with BERT and builds media semantic features from the encodings. It then applies a variational graph auto-encoder that explicitly uses the relations among media to aggregate these semantic features within an iterative message-passing framework, learning media representations that integrate structural and semantic information. Finally, groups are obtained by clustering the media representations. Experiments show that the method effectively represents media semantics and uncovers their deeper connection with the relations among media, so that it learns representations that accurately reflect media characteristics and improves the performance of group partitioning.
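Keywords	media reprint behavior; deep neural networks; reprint prediction; reprint identification; group analysis
Language	Chinese
Seven major directions: sub-direction classification	Social computing
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/39054
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Internet Big Data and Information Security
Recommended citation (GB/T 7714)	姚日恒. 基于深度神经网络的媒体转载行为分析研究[D]. 北京. 中国科学院大学,2020.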

Files in this item
File name/size	Document type	Version type	Access type	License
姚日恒-硕士论文-基于深度神经网络的媒体（2738KB）	Thesis		Open access	CC BY-NC-SA
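
Illustrative note on the first contribution in the abstract: the idea of using a media vector as the attention source, so that the same news features receive different weights for different media, can be sketched as below. This is only a minimal sketch under stated assumptions and not the thesis's actual model: the dot-product scoring, the function name `media_attention_pool`, and all vectors are hypothetical and chosen for illustration.

```python
import math

def media_attention_pool(word_vecs, media_vec):
    """Pool word vectors into one news vector, weighted by relevance to a media vector."""
    # Dot-product score between each word vector and the media vector
    # (an assumed scoring function; the thesis may use a different one).
    scores = [sum(w * m for w, m in zip(vec, media_vec)) for vec in word_vecs]
    # Numerically stable softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of word vectors is the media-specific news representation.
    dim = len(word_vecs[0])
    pooled = [sum(wt * vec[i] for wt, vec in zip(weights, word_vecs)) for i in range(dim)]
    return pooled, weights

# Toy 2-dimensional features for a three-word news text (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
media_a = [2.0, 0.0]  # hypothetical media embedding favouring the first dimension
media_b = [0.0, 2.0]  # hypothetical media embedding favouring the second dimension

rep_a, w_a = media_attention_pool(words, media_a)
rep_b, w_b = media_attention_pool(words, media_b)
```

With the same `words` input, `media_a` puts most weight on the first word and `media_b` on the second, so the two pooled vectors differ. In the method the abstract describes, the media vectors would come from network representation learning on the reprint network rather than being set by hand.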