面向多模态语义理解与推理的视觉问答研究 (Research on Visual Question Answering for Multimodal Semantic Understanding and Reasoning)
张熙
2024-05
Pages: 148
Degree type: Doctoral
Chinese Abstract

As society enters the era of the mobile Internet, massive amounts of multimodal data such as images, videos, audio, and text are everywhere. These data convey diverse and varied information through different modalities, greatly enriching people's work and lives. However, traditional data processing approaches can no longer cope with multimodal data of such scale. Therefore, with the development of artificial intelligence, there is a strong desire to use techniques from computer vision, natural language processing, and deep learning to achieve in-depth understanding, accurate correlation, and reliable cognitive reasoning over multimodal data, thereby promoting the digital and intelligent transformation of multimodal application domains such as manufacturing, healthcare, and education. As a typical task of multimodal understanding and reasoning, visual question answering (VQA) aims to automatically infer the correct answer from a visual input (such as an image or a video) and a related natural language question. Owing to its significant research value in multimodal semantic understanding and reasoning, and its broad range of applications in intelligent assistants, e-commerce, and other fields, VQA has attracted considerable attention from researchers.

Although current deep learning-based VQA research has achieved remarkable results, it still faces three major challenges in practical applications: (1) The high semantic complexity and diversity of multimodal inputs. Multimodal inputs typically contain diverse content and rich semantic information. The visual input may include visual information from different scenes, styles, and lighting conditions, while the textual input may contain high-level semantics and complex vocabulary that are difficult to understand directly. Moreover, the visual and textual modalities differ in how they express information and are therefore heterogeneous. (2) Correlation biases in datasets. Due to bias, subjective choices, or human annotation during data collection, there are non-random statistical correlations among the multimodal inputs of VQA datasets. Models may over-rely on these correlations during training and prediction, neglecting in-depth analysis of and reasoning over the multimodal inputs, which limits their generalizability. (3) The complexity of open, dynamic environments. In the real world, multimodal data such as visual scenes and natural language expressions are constantly changing and being updated, with new information or data distributions appearing continually. This requires models to possess good continual learning and out-of-domain generalization abilities. To address these challenges, this dissertation studies VQA along three dimensions: semantic mining, reliable correlation, and generalized reasoning. First, for the high semantic complexity and diversity of multimodal inputs, a multi-level counterfactual contrastive method for multimodal semantic mining is studied. Second, for the correlation biases in datasets, an explicit multimodal correlation method based on neural module networks and a graph-matching-based method for removing multimodal correlation biases are explored. Finally, for the complexity of open, dynamic scenes, a feature-decoupling-based continual VQA method and a low-resource, efficient fine-tuning method for open scenarios are investigated.

The main contributions and innovations of the dissertation are summarized as follows:

1. Multi-level counterfactual contrastive learning for multimodal semantic mining. To understand complex and heterogeneous multimodal inputs, existing methods usually build models based on holistic attention mechanisms or large-scale pre-training, which suffer from high complexity and heavy computation, and their understanding of the visual input is not comprehensive. This dissertation therefore proposes a multi-level counterfactual contrastive method for multimodal semantic mining. Built on a simple model structure, the method jointly models fine-grained visual content, the global visual scene, and cross-modal semantic correlations through instance-level, image-level, and semantic-level contrastive learning. It further introduces counterfactual thinking to improve the quality of contrastive learning, enabling a comprehensive understanding of multimodal inputs.

2. Explicit multimodal correlation based on neural module networks. Existing methods usually perform multimodal reasoning implicitly, so in the presence of correlation bias they cannot demonstrate their true correlation and reasoning abilities. This dissertation therefore proposes an explicit multimodal correlation method based on neural module networks. Using the syntactic structure of the text as reasoning clues, the method performs sequential reasoning through a node attention module, an edge attention module, and a transfer module. By exposing the intermediate results of these neural modules, it provides fine-grained visual evidence of correlation reasoning, improving the interpretability and reliability of VQA models.

3. Graph-matching-based removal of multimodal correlation biases. VQA datasets contain many unexplored correlation biases, which limit research on robust VQA models. This dissertation thoroughly investigates existing multiple-choice VQA datasets, identifies two types of multimodal correlation bias, and constructs NExT-OOD, an evaluation dataset for measuring a model's ability to overcome correlation biases. To reduce the model's reliance on these biases, a graph-matching-based cross-sample debiasing method is proposed, which provides debiasing guidance from the perspective of the whole dataset and effectively improves the generalizability of correlation reasoning models.

4. Feature-decoupling-based continual visual question answering. Most existing VQA methods are trained offline and cannot handle the dynamically updated multimodal data of the real world. For multimodal continual learning, this dissertation therefore constructs VQACL, a novel continual VQA setting that measures the continual learning ability and compositional generalization of VQA models. It also proposes a feature-decoupling-based continual learning method that learns sample-specific and sample-invariant features of the two modalities in a decoupled manner, effectively improving the model's ability to handle dynamic multimodal data.

5. Low-resource, efficient fine-tuning for open scenarios. When fine-tuning large pre-trained models for specific domains, existing methods usually require substantial computational resources and easily overfit to domain-specific data, leading to poor generalization. This dissertation therefore proposes a low-resource, efficient fine-tuning method for open scenarios. Based on two frozen unimodal pre-trained models, the method introduces a set of generalized prompts and a set of specialized prompts to achieve modality alignment and downstream task adaptation simultaneously. It further designs a contrastive learning loss based on invariant risk minimization, which effectively enhances the model's effectiveness and generalization across different scenarios.

English Abstract

As society enters the era of the mobile Internet, massive multimodal data such as images, videos, audio, and text can be seen everywhere. These data convey diverse and heterogeneous information through different modalities, greatly enriching people's work and lives. However, traditional data processing approaches are no longer capable of handling such vast amounts of multimodal data. Therefore, with the development of artificial intelligence technology, there is a strong desire to leverage techniques from computer vision, natural language processing, and deep learning to achieve in-depth understanding, accurate correlation, and reliable cognitive reasoning over multimodal data. This, in turn, promotes the digital and intelligent transformation of multimodal application domains such as manufacturing, healthcare, and education. As a typical task of multimodal understanding and reasoning, visual question answering (VQA) aims to automatically predict the correct answer given a visual input and a related natural language question. Due to its significant research value in multimodal semantic understanding and reasoning, and its wide range of applications in intelligent assistants, e-commerce, and other fields, VQA has received considerable attention from researchers.

Although current deep learning-based methods have achieved remarkable results on VQA, they still face three challenges in practical applications: (1) The high semantic complexity and diversity of multimodal inputs. Multimodal inputs typically contain diverse content and rich semantic information. The visual input may include visual information from different scenes, styles, and lighting conditions, while the textual input may contain high-level semantics and complex words that are difficult to understand directly. Moreover, the visual and textual modalities differ in how they represent information and are therefore heterogeneous. (2) The presence of correlation biases in datasets. Due to factors such as bias, subjective choices, or human annotation in the data collection process, there are non-random statistical correlations between the multimodal inputs in VQA datasets. These may lead a model to over-rely on such correlations during training and prediction, neglecting comprehensive understanding of and reasoning over the multimodal inputs, and thus limiting its generalizability. (3) The complexity of open, dynamic environments. In the real world, multimodal data such as visual scenes and natural language expressions are constantly changing and updating, with new information or data distributions emerging continuously. This requires the model to possess good continual learning and out-of-domain generalization capabilities. To deal with the above challenges, this dissertation investigates VQA along three dimensions: semantic mining, reliable correlation, and generalized reasoning. Firstly, for the high semantic complexity and diversity of multimodal inputs, this dissertation proposes a multi-level counterfactual contrastive learning method for multimodal semantic mining. Secondly, for the correlation biases in datasets, it designs a neural module network to perform explicit multimodal correlation and reasoning, and proposes a contrastive graph matching method to mitigate multimodal correlation biases. Finally, for the complexity of open, dynamic scenes, it investigates a decoupled representation learning method for continual VQA and proposes a parameter- and data-efficient prompt tuning method for open scenarios.

The main contributions of this dissertation are summarized as follows:

1. Multi-level counterfactual contrastive learning method for multimodal semantic mining. When understanding complex and heterogeneous multimodal inputs, existing methods often rely on holistic attention mechanisms or large-scale pre-training, resulting in high complexity and computational cost; besides, they cannot comprehensively understand the visual input. To address these issues, this dissertation proposes a multi-level counterfactual contrastive learning method for multimodal semantic mining. Without introducing excessive parameters or computation, the method designs an instance-level, an image-level, and a semantic-level contrastive learning layer to jointly perceive fine-grained visual content, global visual scenes, and cross-modal semantic correlations. At the same time, counterfactual thinking is introduced to further improve the quality of contrastive learning, which facilitates a comprehensive understanding of multimodal inputs.
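To make the multi-level idea concrete, the following is a minimal PyTorch sketch of how instance-level, image-level, and semantic-level InfoNCE terms could be combined. All feature names, the temperature, and the level weights are assumptions for illustration, and the counterfactual sample construction described in the dissertation is omitted here.

```python
# Illustrative sketch of a multi-level contrastive objective (not the
# dissertation's implementation). Assumes paired embeddings are precomputed.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE: each anchor is matched to its positive within the batch."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def multi_level_contrastive_loss(region_feat, region_aug,    # instance level
                                 image_feat, image_aug,      # image level
                                 image_global, text_global,  # semantic (cross-modal) level
                                 weights=(1.0, 1.0, 1.0)):
    l_inst = info_nce(region_feat, region_aug)    # fine-grained visual content
    l_img = info_nce(image_feat, image_aug)       # global visual scene
    l_sem = info_nce(image_global, text_global)   # cross-modal semantic alignment
    w1, w2, w3 = weights
    return w1 * l_inst + w2 * l_img + w3 * l_sem

# Toy usage with random features (batch of 8, dimension 256).
B, D = 8, 256
loss = multi_level_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                    torch.randn(B, D), torch.randn(B, D),
                                    torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```

In a counterfactual variant, harder positives and negatives (e.g., inputs with key regions or words masked) would replace the simple augmented pairs assumed above.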

2. Neural-module-based explicit multimodal correlation method. Existing methods usually perform implicit reasoning for VQA, making it difficult to determine whether a model performs correct correlation and reasoning in the presence of correlation bias. To this end, this dissertation proposes a neural-module-based explicit multimodal correlation method. The method uses the syntactic structure of the text as reasoning clues and employs a node attention module, an edge attention module, and a transfer module to perform sequential reasoning. It can provide fine-grained visual evidence for multimodal correlation, thereby enhancing the interpretability and reliability of VQA models.
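The sketch below illustrates, under simplified assumptions, what sequential reasoning with a node attention module, an edge attention module, and a transfer module over a scene graph might look like. The module designs, feature shapes, and the way clues are derived from the question are hypothetical and not the dissertation's exact architecture.

```python
# Hedged sketch of sequential graph reasoning with node/edge attention and a
# transfer step; a simplified illustration, not the proposed model itself.
import torch
import torch.nn.functional as F

def node_attention(node_feats, clue):
    """Attend over graph nodes given a textual clue vector -> node distribution."""
    scores = node_feats @ clue                            # (N,)
    return F.softmax(scores, dim=0)

def edge_attention(edge_feats, clue):
    """Attend over the N x N relation (edge) features given a clue -> edge weights."""
    scores = torch.einsum('ijd,d->ij', edge_feats, clue)  # (N, N)
    return F.softmax(scores.reshape(-1), dim=0).reshape(scores.shape)

def transfer(node_dist, edge_weights):
    """Shift attention from current nodes to related nodes along attended edges."""
    new_dist = edge_weights.t() @ node_dist               # (N,)
    return new_dist / (new_dist.sum() + 1e-8)

# Toy run: 5 nodes, 64-d features, two reasoning steps driven by two clues
# (e.g., clue vectors parsed from the question's syntactic structure).
N, D = 5, 64
nodes = torch.randn(N, D)
edges = torch.randn(N, N, D)
clues = [torch.randn(D), torch.randn(D)]

dist = node_attention(nodes, clues[0])                    # ground the first entity
for clue in clues[1:]:
    dist = transfer(dist, edge_attention(edges, clue))    # hop along attended relations
print(dist)  # intermediate distributions serve as visualizable reasoning evidence
```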

3. Contrastive graph matching method for mitigating multimodal correlation biases. VQA datasets contain numerous under-explored correlation biases, which limit the study of robust VQA models. This dissertation thoroughly investigates existing multiple-choice VQA datasets and identifies two types of multimodal correlation bias. Besides, an evaluation dataset called NExT-OOD is constructed to measure a model's ability to overcome these multimodal correlation biases. Moreover, to prevent the model from exploiting the biases, this dissertation proposes a contrastive graph matching method. Specifically, it provides debiasing guidance from the perspective of the whole dataset, which effectively improves the model's generalizability in correlation reasoning.
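As a rough illustration of matching between feature graphs, the sketch below scores two graphs using a Sinkhorn-normalized node-similarity matrix. This is a generic soft graph-matching example with assumed shapes and temperature, not the NExT-OOD construction or the cross-sample debiasing method itself.

```python
# Minimal sketch of soft graph matching between two feature graphs; an
# assumption-level illustration of the graph-matching ingredient only.
import torch

def sinkhorn(log_alpha, n_iters=20):
    """Alternately normalize rows and columns in log space (soft assignment)."""
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=0, keepdim=True)
    return log_alpha.exp()

def matching_score(graph_a, graph_b, temperature=0.1):
    """Graph similarity = soft-assignment-weighted sum of node similarities."""
    sim = torch.nn.functional.normalize(graph_a, dim=-1) @ \
          torch.nn.functional.normalize(graph_b, dim=-1).t()
    assignment = sinkhorn(sim / temperature)
    return (assignment * sim).sum() / graph_a.size(0)

# Toy usage: a question graph with 4 nodes scored against two candidate graphs.
q = torch.randn(4, 128)
print(matching_score(q, torch.randn(4, 128)), matching_score(q, torch.randn(4, 128)))
```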

4. Decoupled representation learning method for continual visual question answering. Most existing VQA methods are trained offline and cannot handle dynamically updated multimodal data in real-world scenarios. To address this issue, this dissertation focuses on multimodal continual learning and introduces a novel continual visual question answering setting called VQACL to evaluate a model's continual learning ability and compositional generalizability. At the same time, a decoupled representation learning method is proposed for continual VQA. The method separately extracts sample-specific and sample-invariant features for the visual and textual inputs, which successfully alleviates forgetting and enhances the model's compositional ability.
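A minimal sketch of the decoupling idea is given below: each modality feature is split into a sample-invariant component (explained by learned prototypes) and a sample-specific residual. The prototype mechanism, dimensions, and fusion step are assumptions made for illustration, not the VQACL implementation.

```python
# Illustrative feature decoupling into sample-invariant and sample-specific
# parts; a sketch under stated assumptions rather than the proposed method.
import torch
import torch.nn.functional as F

class FeatureDecoupler(torch.nn.Module):
    def __init__(self, dim, num_prototypes):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, feat):
        # Sample-invariant component: soft assignment to learned prototypes.
        attn = F.softmax(F.normalize(feat, dim=-1) @
                         F.normalize(self.prototypes, dim=-1).t(), dim=-1)
        invariant = attn @ self.prototypes
        # Sample-specific component: the residual the prototypes cannot explain.
        specific = feat - invariant
        return invariant, specific

# Toy usage for visual and textual streams (batch 8, dimension 256).
vis_dec, txt_dec = FeatureDecoupler(256, 10), FeatureDecoupler(256, 10)
v_inv, v_spec = vis_dec(torch.randn(8, 256))
t_inv, t_spec = txt_dec(torch.randn(8, 256))
fused = torch.cat([v_inv + v_spec, t_inv + t_spec], dim=-1)  # fed to an answer head
print(fused.shape)
```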

5. Parameter- and data-efficient prompt tuning method for open scenarios. When fine-tuning large pre-trained models on specific domains, existing methods often require significant computational resources and are prone to overfitting to domain-specific data, resulting in poor generalization. To this end, based on two frozen unimodal pre-trained models, this dissertation proposes a parameter- and data-efficient prompt tuning method for open scenarios, which introduces a set of generalized prompts and a set of specialized prompts to perform accurate modality alignment and efficient task-specific adaptation. Besides, an alignment objective based on invariant risk minimization is incorporated to enhance the effectiveness and generalization of the model in different scenarios.
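The sketch below shows, under stated assumptions, the two ingredients named above: learnable prompts prepended to the inputs of frozen encoders, and an IRMv1-style penalty added to a contrastive alignment loss. The encoder interfaces are abstracted away, and all names, shapes, and coefficients are hypothetical.

```python
# Hedged sketch of prompt tuning with frozen encoders plus an IRM-style
# alignment penalty; an illustration only, not the dissertation's method.
import torch
import torch.nn.functional as F

def irm_penalty(loss_fn, logits, targets):
    """IRMv1 penalty: squared gradient of the loss w.r.t. a dummy scale of 1."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = loss_fn(logits * scale, targets)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

class PromptTuner(torch.nn.Module):
    def __init__(self, dim, n_general=8, n_specific=8):
        super().__init__()
        # Only these prompts are trained; the pre-trained encoders stay frozen.
        self.general = torch.nn.Parameter(torch.randn(n_general, dim) * 0.02)
        self.specific = torch.nn.Parameter(torch.randn(n_specific, dim) * 0.02)

    def forward(self, token_embeds):
        prompts = torch.cat([self.general, self.specific], dim=0)
        prompts = prompts.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        # Prepend prompts to the token sequence before the frozen encoder.
        return torch.cat([prompts, token_embeds], dim=1)

# Toy usage: prompted token sequence, then an image-text alignment loss with
# an IRM penalty (pooled features below stand in for frozen-encoder outputs).
tuner = PromptTuner(dim=256)
prompted = tuner(torch.randn(8, 20, 256))            # (8, 16 + 20, 256)
img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
logits = img @ txt.t() / 0.07
targets = torch.arange(8)
loss = F.cross_entropy(logits, targets) + 1.0 * irm_penalty(F.cross_entropy, logits, targets)
print(loss.item())
```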

Keywords: multimodality; visual question answering; semantic mining; reliable correlation; generalized reasoning
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/58517
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
张熙. 面向多模态语义理解与推理的视觉问答研究[D],2024.
Files in this item: 面向多模态语义理解与推理的视觉问答研究. (39126 KB), dissertation, restricted access, license CC BY-NC-SA