Information Mining and Transferring Based on Causal Relations and Its Applications in Dynamics and Visual Question Answering Scenarios
李宗钊
2023-05-23
Pages: 65
Degree type: Master
Abstract

Reasoning is one of the key capabilities that allow a model to represent, understand, and analyze complex environments. In current artificial intelligence research, models already handle perception tasks at a high level, but on cognitive tasks such as scene understanding and visual reasoning their performance still falls short of human expectations. It is therefore necessary to develop practical artificial intelligence methods that strengthen a model's reasoning ability, enabling it to handle cognitive tasks more effectively. One important way to improve reasoning ability is to exploit reasoning-relevant information that cannot be observed directly. Concretely, in dynamics scenarios this information takes the form of the physical properties of objects; in visual question answering scenarios it takes the form of visual commonsense features. By mining and exploiting such information, a model can not only predict the motion trajectories of objects in dynamics scenes more accurately, but also produce more reasonable answers in visual question answering tasks. To date, mining and exploiting this kind of information to better complete cognitive tasks such as causal inference and temporal reasoning has not been studied sufficiently. This thesis therefore investigates how to improve a model's reasoning ability, focusing on mining and exploiting information that is hard to observe directly in the data yet can support reasoning. Specifically, starting from two typical settings, dynamics scenarios and visual question answering scenarios, and with the help of causal tools, this thesis studies causal-relation-based information mining, transferring, and utilization, and validates the proposed methods with extensive experiments in both application scenarios. The main contributions of this thesis are as follows:

1. A counterfactual prediction model based on a global causal relation attention mechanism and a spatiotemporal transfer structure for physical information.

For counterfactual prediction tasks in dynamics scenarios, existing studies have two main shortcomings: they lack in-depth mining of latent causal chains, which makes it hard for models to estimate the physical properties of objects in the scene accurately, and their prediction modules cannot transfer and utilize those physical properties efficiently. To address these problems, this thesis focuses on the physical properties in dynamics scenarios, including mass, friction coefficient, and gravity, and proposes a counterfactual prediction model based on a global causal relation attention mechanism and a spatiotemporal transfer structure for physical information. The global causal relation attention mechanism helps the model capture causal relations between objects across distant frames, mining physical properties by capturing spatial and temporal information simultaneously. The spatiotemporal transfer structure propagates the mined physical properties along both the spatial and the temporal dimension, helping the model make more precise counterfactual predictions. Without access to the ground-truth values of the physical properties, the proposed model fully exploits the constraints these properties impose, achieves state-of-the-art performance on multiple benchmarks, and generalizes well to unseen environments while maintaining good prediction accuracy. A minimal sketch of such a global attention module is given below.
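As a concrete illustration of the first component, the sketch below shows one plausible way to realize global cross-frame attention in PyTorch: object slots from all frames are flattened into a single sequence so that attention spans the spatial and temporal axes at once, and a small linear head reads out latent physical-property estimates. All module names, tensor shapes, and the three-dimensional property head are assumptions made for illustration, not the architecture from the thesis.

```python
import torch
import torch.nn as nn


class GlobalCausalRelationAttention(nn.Module):
    # Minimal sketch of "global" attention over object features: per-frame
    # object slots are flattened into one sequence, so every object can attend
    # to every other object at every time step (long-range, cross-frame
    # relations). Names, shapes, and the 3-dim property head (mass, friction,
    # gravity) are illustrative assumptions.

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.property_head = nn.Linear(dim, 3)  # latent mass/friction/gravity

    def forward(self, obj_feats: torch.Tensor):
        # obj_feats: (batch, frames * objects, dim)
        attended, _ = self.attn(obj_feats, obj_feats, obj_feats)
        h = self.norm(obj_feats + attended)        # residual + layer norm
        props = self.property_head(h.mean(dim=1))  # pooled property estimate
        return h, props


# Usage: 2 videos, 5 frames x 4 objects, 64-dim object slots.
x = torch.randn(2, 5 * 4, 64)
feats, props = GlobalCausalRelationAttention(dim=64)(x)
print(feats.shape, props.shape)  # torch.Size([2, 20, 64]) torch.Size([2, 3])
```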

2. A heterogeneous graph contrastive learning framework based on visual commonsense information.

In visual question answering, existing work mainly focuses on aligning and fusing cross-modal information in multimodal interaction modules and pays little attention to visual commonsense features; the studies that do consider visual commonsense do not transfer and utilize these features appropriately. To address this, this thesis proposes a visual commonsense based heterogeneous graph contrastive learning framework (VC-HGCL), which consists of two sub-modules: contrastive learning and a heterogeneous graph relation network. By introducing contrastive learning, the framework encourages the model to pay more attention to visual commonsense features when answering reasoning-related questions and to assign reasonable weights to the different objects in a scene. The heterogeneous graph relation network is designed to combine and exploit visual commonsense features, object visual features, and textual features efficiently, establishing reasonable causal relations between objects within the same modality and across modalities. In addition, the framework is designed in a plug-and-play fashion, which greatly improves its extensibility. The proposed framework was combined with seven classical visual question answering models and evaluated on four different visual question answering tasks. The experimental results show that, with the help of VC-HGCL, both the robustness and the prediction accuracy of the classical models improve significantly, especially on reasoning-related cognitive tasks such as causal inference and temporal reasoning. A sketch of a contrastive objective of this kind follows.
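To make the contrastive component concrete, the following sketch gives a common InfoNCE-style formulation that such a framework could use: each question representation is pulled toward the visual-commonsense embedding of its own image and pushed away from those of the other images in the batch. The function name, the batch-diagonal pairing, and the temperature value are illustrative assumptions rather than VC-HGCL's exact objective.

```python
import torch
import torch.nn.functional as F


def commonsense_contrastive_loss(question_emb: torch.Tensor,
                                 commonsense_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # InfoNCE-style objective: matched (question, commonsense) pairs sit on
    # the diagonal of the similarity matrix and act as positives; all other
    # in-batch pairs act as negatives. Inputs are assumed to be (batch, dim)
    # projections into a shared embedding space.
    q = F.normalize(question_emb, dim=-1)
    c = F.normalize(commonsense_emb, dim=-1)
    logits = q @ c.t() / temperature                    # (batch, batch)
    targets = torch.arange(q.size(0), device=q.device)  # diagonal positives
    return F.cross_entropy(logits, targets)


# Usage: a batch of 8 paired 128-dim embeddings.
loss = commonsense_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```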


Keywords: Information Mining, Information Transferring, Causal Learning, Visual Reasoning, Visual Question Answering
Language: Chinese
Subdirection classification (seven major directions): Image and Video Processing and Analysis
State Key Laboratory planned direction classification: Visual Information Processing
Associated dataset to be deposited:
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52326
Collection: Graduates / Master's Theses
Recommended citation (GB/T 7714):
李宗钊. 基于因果关联的信息挖掘与传递及其在动力学和视觉问答场景中的应用[D]. 2023.
Files in this item:
李宗钊硕士论文.pdf (27623 KB) | Document type: Thesis | Access: Restricted | License: CC BY-NC-SA