Research on Fake Information Detection Based on Evidence Reasoning

	Xu Weizhi (许伟志)
	2023-05-19
Pages	84
Degree type	Master's
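Abstract (translated from Chinese)	Fake information is information that is fabricated outright or produced by deliberately tampering with genuine content, and it has become a serious problem in modern society. Its spread not only affects individual decision-making but also harms public order, for example in financial markets and public health. The popularity of social media and internet technology has widened the reach and increased the speed of its spread, which makes fake information detection more challenging. Many researchers have proposed methods that improve detection accuracy and robustness, and these have advanced the task. However, several open problems remain unsolved by prior methods. This thesis studies four aspects: learning long-distance semantic dependencies in evidence, the noise introduced by redundant information, modeling the interactions among multi-structured evidence, and the robustness of model predictions. The research content and results are summarized as follows:

(1) Fake information detection model based on graph structure learning. In scenarios that contain only textual evidence, the evidence includes a large amount of redundant information that is irrelevant to verifying the claim. To address this, we take real-world news of various kinds and its related evidence as the research object and propose a fake information detection model based on graph structure learning. We first model unstructured text as structured graph data and learn the long-distance semantic dependencies in the evidence through a neighbor information propagation algorithm. On the constructed text graph, we also design a graph structure learning algorithm that treats the removal of redundant information as a process of simplifying the graph structure, which mitigates the negative impact of redundant information on the model. The proposed model can therefore provide fine-grained semantic representations of evidence to any downstream semantic interaction model and thereby improve its accuracy. We ran extensive experiments with different downstream models, and the proposed model outperformed the baseline methods in every case.

(2) Multi-structured fake information detection model based on a heterogeneous graph neural network. In the real world, evidence exists not only as text but also in semi-structured formats such as tables and lists. Because semi-structured data has different characteristics from text, directly transferring models from the text-only scenario to the multi-structured scenario rarely yields good results, so models must be designed specifically for multi-structured evidence. To this end, we propose a multi-structured fake information detection model based on a heterogeneous graph neural network. First, we design a heterogeneous graph in which words from text and words from tables are treated as two different types of nodes, connected by three types of edges: edges within text, edges within tables, and edges between text and table evidence. We then apply a heterogeneous graph neural network to propagate neighbor information and capture the heterogeneous associations among differently structured evidence. Extensive experiments on the large-scale benchmark dataset FEVEROUS demonstrate the effectiveness of the proposed model.

(3) Robustness enhancement method for fake information detection based on counterfactual reasoning. Gains in the accuracy of fake information detection models do not come entirely from better semantic understanding; part of the gain comes from the model fitting biases present in the training dataset. Fitting these biases causes a marked drop in performance on datasets with different distributions, which is a sign of poor robustness. To address this, this thesis proposes a method based on counterfactual reasoning. Specifically, we construct a factual scenario and a counterfactual scenario, learn the corresponding model outputs, and finally, based on the potential outcome model, subtract the outputs of the two scenarios to obtain a prediction with the bias cancelled out. Compared with methods based on data augmentation and weight adjustment, this offers a new approach to improving robustness. Extensive experiments on three datasets with different distributions show that the proposed method improves average accuracy by more than 14%.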
English abstract	Fake information is information that is fabricated or intentionally altered from its original form, and it has become a serious problem in modern society. The spread of fake information not only affects individual decision-making but also harms public order, for example in financial markets and public health. With the popularity of social media and internet technology, the scope and speed of fake information dissemination have greatly increased, making fake information detection more challenging. Many researchers have proposed methods that advance fake information detection in terms of detection accuracy and robustness. However, several open problems in this task remain unsolved by previous methods. This thesis focuses on four aspects: learning the long-distance semantic dependencies of evidence, the noise impact of redundant information, modeling the interaction between multi-structured evidence, and the robustness of model prediction. The specific research contents and results are summarized as follows:

(1) Fake information detection model based on graph structure learning. In the single-structured scenario containing only textual evidence, we propose a fake information detection model based on graph structure learning to address the large amount of irrelevant, redundant information in the evidence. We take various types of real-world news and related evidence as the research objects. We first model unstructured text as structured graph data and then learn the long-distance semantic dependencies of the evidence through neighbor information propagation. Meanwhile, based on the constructed text graph, we design a graph structure learning algorithm that models the removal of redundant information as a process of simplifying the graph structure, which mitigates the negative impact of redundant information on the model. Based on this, the proposed model can provide fine-grained semantic representations of evidence for any downstream semantic interaction model, thereby improving model accuracy. We conduct extensive experiments on different downstream models, and the results show that the proposed model outperforms the baseline methods.

(2) Multi-structured fake information detection model based on a heterogeneous graph neural network. In the real world, evidence often exists not only as text but also in semi-structured formats such as tables and lists. Because semi-structured data differs from text, it is difficult to transfer models directly from single-structured scenarios to multi-structured scenarios and obtain good results. Researchers therefore need to design specific models for multi-structured evidence. To this end, we propose a multi-structured fake information detection model based on a heterogeneous graph neural network. First, we design a heterogeneous graph in which words in text and words in tables are regarded as two different types of nodes, with three types of edges between them: intra-text edges, intra-table edges, and edges between text and table evidence. We then use a heterogeneous graph neural network for neighbor information propagation to capture the heterogeneous relations between differently structured evidence. Extensive experiments on the benchmark dataset FEVEROUS demonstrate the effectiveness of the proposed model.

(3) Robustness enhancement method for fake information detection based on counterfactual reasoning. The improvement in the accuracy of fake information detection models is not entirely due to better semantic understanding; it is also partly due to the model fitting biases in the training dataset. Fitting these biases can lead to a significant decline in model performance on datasets with different distributions, which indicates poor robustness. To address this problem, we propose a method based on counterfactual reasoning. Specifically, we construct a factual scenario and a counterfactual scenario, learn the corresponding model outputs, and then use a potential outcome model to subtract the outputs of the two scenarios, obtaining predictions from which the bias has been removed. Compared with methods based on data augmentation and weight adjustment, we introduce a new pipeline for enhancing the robustness of fake information detection models, and we conduct extensive experiments on three datasets with different distributions. The results show that the proposed method achieves an average accuracy improvement of more than 14% over the baseline methods.
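Keywords	fake information detection; graph neural network; graph structure learning; counterfactual reasoning
Indexed by	Other
Language	Chinese
Seven major research directions: sub-direction classification	Knowledge representation and reasoning
State Key Laboratory planned research direction	Speech and language processing
Associated dataset requiring deposit	No
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/52121
Collection	Graduates_Master's theses
Recommended citation (GB/T 7714)	许伟志. 基于证据推理的虚假信息检测研究[D],2023.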
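
The debiasing step in part (3) of the abstract, subtracting the counterfactual output from the factual output, can be illustrated with a minimal Python sketch. The abstract does not say how the two scenarios are built, so the sketch assumes the factual branch sees the claim together with its evidence while the counterfactual branch sees the claim alone. The label set, the logit values, and the `alpha` scaling factor are likewise hypothetical and are not taken from the thesis.

```python
import math

# Hypothetical label set (FEVER-style); the thesis may use different labels.
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debias(factual, counterfactual, alpha=1.0):
    """Subtract the counterfactual scores from the factual scores.

    alpha is an assumed knob that scales how much of the bias-only
    output is removed; alpha=1.0 is a plain subtraction.
    """
    return [f - alpha * c for f, c in zip(factual, counterfactual)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for one claim.
# Factual scenario (assumed): model sees claim + evidence.
factual = [2.0, 1.5, 0.1]
# Counterfactual scenario (assumed): model sees the claim only, so its
# output reflects what can be guessed from claim wording alone.
counterfactual = [1.8, 0.2, 0.1]

adjusted = debias(factual, counterfactual)
biased_label = LABELS[argmax(factual)]
debiased_label = LABELS[argmax(adjusted)]
debiased_probs = softmax(adjusted)

print("before:", biased_label)
print("after: ", debiased_label)
```

In this toy case the claim-only branch already leans strongly toward the first label, so removing its contribution flips the prediction to the label that the evidence actually favors. Whether the thesis subtracts logits, probabilities, or some other quantity is not stated in the abstract.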

Files in this item
File name/size	Document type	Version type	Access type	License
许伟志-毕业论文终版.pdf (7056KB)	Thesis		Restricted access	CC BY-NC-SA