面向多模态语义理解与推理的视觉问答研究 (Research on Visual Question Answering for Multimodal Semantic Understanding and Reasoning)
张熙
2024-05
Pages: 148
Degree type: Doctoral
Chinese Abstract

As society enters the era of the mobile Internet, massive amounts of multimodal data such as images, videos, audio, and text are everywhere. These data convey diverse and varied information through different modalities, greatly enriching people's work and lives. However, traditional data processing approaches can no longer cope with multimodal data of such scale. Therefore, with the development of artificial intelligence, there is a strong desire to use techniques from computer vision, natural language processing, and deep learning to achieve in-depth understanding, accurate correlation, and reliable cognitive reasoning over multimodal data, thereby promoting the digital and intelligent transformation of multimodal application domains such as manufacturing, healthcare, and education. As a typical task of multimodal understanding and reasoning, visual question answering (VQA) aims to automatically infer the correct answer from a visual input (such as an image or a video) and a related natural language question. Owing to its significant research value in multimodal semantic understanding and reasoning, and its broad range of applications in intelligent assistants, e-commerce, and other fields, VQA has attracted considerable attention from researchers.

Although current deep learning-based VQA research has achieved remarkable results, it still faces three major challenges in practical applications: (1) The high semantic complexity and diversity of multimodal inputs. Multimodal inputs typically contain diverse content and rich semantic information. The visual input may include visual information from different scenes, styles, and lighting conditions, while the textual input may contain high-level semantics and complex vocabulary that are difficult to understand directly. Moreover, the visual and textual modalities differ in how they express information and are therefore heterogeneous. (2) Correlation biases in datasets. Due to bias, subjective choices, or human annotation during data collection, there are non-random statistical correlations among the multimodal inputs of VQA datasets. Models may over-rely on these correlations during training and prediction, neglecting in-depth analysis of and reasoning over the multimodal inputs, which limits their generalizability. (3) The complexity of open, dynamic environments. In the real world, multimodal data such as visual scenes and natural language expressions are constantly changing and being updated, with new information or data distributions appearing continually. This requires models to possess good continual learning and out-of-domain generalization abilities. To address these challenges, this dissertation studies VQA along three dimensions: semantic mining, reliable correlation, and generalized reasoning. First, for the high semantic complexity and diversity of multimodal inputs, a multi-level counterfactual contrastive method for multimodal semantic mining is studied. Second, for the correlation biases in datasets, an explicit multimodal correlation method based on neural module networks and a graph-matching-based method for removing multimodal correlation biases are explored. Finally, for the complexity of open, dynamic scenes, a feature-decoupling-based continual VQA method and a low-resource, efficient fine-tuning method for open scenarios are investigated.

The main contributions and innovations of the dissertation are summarized as follows:

1. Multi-level counterfactual contrastive learning for multimodal semantic mining. To understand complex and heterogeneous multimodal inputs, existing methods usually build models based on holistic attention mechanisms or large-scale pre-training, which suffer from high complexity and heavy computation, and their understanding of the visual input is not comprehensive. This dissertation therefore proposes a multi-level counterfactual contrastive method for multimodal semantic mining. Built on a simple model structure, the method jointly models fine-grained visual content, the global visual scene, and cross-modal semantic correlations through instance-level, image-level, and semantic-level contrastive learning. It further introduces counterfactual thinking to improve the quality of contrastive learning, enabling a comprehensive understanding of multimodal inputs.

2. Explicit multimodal correlation based on neural module networks. Existing methods usually perform multimodal reasoning implicitly, so in the presence of correlation bias they cannot demonstrate their true correlation and reasoning abilities. This dissertation therefore proposes an explicit multimodal correlation method based on neural module networks. Using the syntactic structure of the text as reasoning clues, the method performs sequential reasoning through a node attention module, an edge attention module, and a transfer module. By exposing the intermediate results of these neural modules, it provides fine-grained visual evidence of correlation reasoning, improving the interpretability and reliability of VQA models.

3. Graph-matching-based removal of multimodal correlation biases. VQA datasets contain many unexplored correlation biases, which limit research on robust VQA models. This dissertation thoroughly investigates existing multiple-choice VQA datasets, identifies two types of multimodal correlation bias, and constructs NExT-OOD, an evaluation dataset for measuring a model's ability to overcome correlation biases. To reduce the model's reliance on these biases, a graph-matching-based cross-sample debiasing method is proposed, which provides debiasing guidance from the perspective of the whole dataset and effectively improves the generalizability of correlation reasoning models.

4. Feature-decoupling-based continual visual question answering. Most existing VQA methods are trained offline and cannot handle the dynamically updated multimodal data of the real world. For multimodal continual learning, this dissertation therefore constructs VQACL, a novel continual VQA setting that measures the continual learning ability and compositional generalization of VQA models. It also proposes a feature-decoupling-based continual learning method that learns sample-specific and sample-invariant features of the two modalities in a decoupled manner, effectively improving the model's ability to handle dynamic multimodal data.

5. Low-resource, efficient fine-tuning for open scenarios. When fine-tuning large pre-trained models for specific domains, existing methods usually require substantial computational resources and easily overfit to domain-specific data, leading to poor generalization. This dissertation therefore proposes a low-resource, efficient fine-tuning method for open scenarios. Based on two frozen unimodal pre-trained models, the method introduces a set of generalized prompts and a set of specialized prompts to achieve modality alignment and downstream task adaptation simultaneously. It further designs a contrastive learning loss based on invariant risk minimization, which effectively enhances the model's effectiveness and generalization across different scenarios.

English Abstract

As society enters the era of the mobile Internet, massive multimodal data such as images, videos, audio, and text can be seen everywhere. These data convey diverse and heterogeneous information through different modalities, greatly enriching people's work and lives. However, traditional data processing approaches are no longer capable of handling such vast amounts of multimodal data. Therefore, with the development of artificial intelligence technology, there is a strong desire to leverage techniques from computer vision, natural language processing, and deep learning to achieve in-depth understanding, accurate correlation, and reliable cognitive reasoning over multimodal data. This, in turn, promotes the digital and intelligent transformation of multimodal application domains such as manufacturing, healthcare, and education. As a typical task of multimodal understanding and reasoning, visual question answering (VQA) aims to automatically predict the correct answer given a visual input and a related natural language question. Due to its significant research value in multimodal semantic understanding and reasoning, and its wide range of applications in intelligent assistants, e-commerce, and other fields, VQA has received considerable attention from researchers.

Although current deep learning-based methods have achieved remarkable results on VQA, they still face three challenges in practical applications: (1) The high semantic complexity and diversity of multimodal inputs. Multimodal inputs typically contain diverse content and rich semantic information. The visual input may include visual information from different scenes, styles, and lighting conditions, while the textual input may contain high-level semantics and complex words that are difficult to understand directly. Moreover, the visual and textual modalities differ in how they represent information and are therefore heterogeneous. (2) The presence of correlation biases in datasets. Due to factors such as bias, subjective choices, or human annotation in the data collection process, there are non-random statistical correlations between the multimodal inputs in VQA datasets. These may lead a model to over-rely on such correlations during training and prediction, neglecting comprehensive understanding of and reasoning over the multimodal inputs, and thus limiting its generalizability. (3) The complexity of open, dynamic environments. In the real world, multimodal data such as visual scenes and natural language expressions are constantly changing and updating, with new information or data distributions emerging continuously. This requires the model to possess good continual learning and out-of-domain generalization capabilities. To deal with the above challenges, this dissertation investigates VQA along three dimensions: semantic mining, reliable correlation, and generalized reasoning. Firstly, for the high semantic complexity and diversity of multimodal inputs, this dissertation proposes a multi-level counterfactual contrastive learning method for multimodal semantic mining. Secondly, for the correlation biases in datasets, it designs a neural module network to perform explicit multimodal correlation and reasoning, and proposes a contrastive graph matching method to mitigate multimodal correlation biases. Finally, for the complexity of open, dynamic scenes, it investigates a decoupled representation learning method for continual VQA and proposes a parameter- and data-efficient prompt tuning method for open scenarios.

The main contributions of this dissertation are summarized as follows:

1. Multi-level counterfactual contrastive learning method for multimodal semantic mining. When understanding complex and heterogeneous multimodal inputs, existing methods often rely on holistic attention mechanisms or large-scale pre-training, resulting in high complexity and computational cost; besides, they cannot comprehensively understand the visual input. To address these issues, this dissertation proposes a multi-level counterfactual contrastive learning method for multimodal semantic mining. Without introducing excessive parameters or computation, the method designs an instance-level, an image-level, and a semantic-level contrastive learning layer to jointly perceive fine-grained visual content, global visual scenes, and cross-modal semantic correlations. At the same time, counterfactual thinking is introduced to further improve the quality of contrastive learning, which facilitates a comprehensive understanding of multimodal inputs.
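To make the multi-level idea concrete, the following is a minimal PyTorch sketch of how instance-level, image-level, and semantic-level InfoNCE terms could be combined. All feature names, the temperature, and the level weights are assumptions for illustration, and the counterfactual sample construction described in the dissertation is omitted here.

```python
# Illustrative sketch of a multi-level contrastive objective (not the
# dissertation's implementation). Assumes paired embeddings are precomputed.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE: each anchor is matched to its positive within the batch."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def multi_level_contrastive_loss(region_feat, region_aug,    # instance level
                                 image_feat, image_aug,      # image level
                                 image_global, text_global,  # semantic (cross-modal) level
                                 weights=(1.0, 1.0, 1.0)):
    l_inst = info_nce(region_feat, region_aug)    # fine-grained visual content
    l_img = info_nce(image_feat, image_aug)       # global visual scene
    l_sem = info_nce(image_global, text_global)   # cross-modal semantic alignment
    w1, w2, w3 = weights
    return w1 * l_inst + w2 * l_img + w3 * l_sem

# Toy usage with random features (batch of 8, dimension 256).
B, D = 8, 256
loss = multi_level_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                    torch.randn(B, D), torch.randn(B, D),
                                    torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```

In a counterfactual variant, harder positives and negatives (e.g., inputs with key regions or words masked) would replace the simple augmented pairs assumed above.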

2. Neural-module-based explicit multimodal correlation method. Existing methods usually perform implicit reasoning for VQA, making it difficult to determine whether a model performs correct correlation and reasoning in the presence of correlation bias. To this end, this dissertation proposes a neural-module-based explicit multimodal correlation method. The method uses the syntactic structure of the text as reasoning clues and employs a node attention module, an edge attention module, and a transfer module to perform sequential reasoning. It can provide fine-grained visual evidence for multimodal correlation, thereby enhancing the interpretability and reliability of VQA models.
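The sketch below illustrates, under simplified assumptions, what sequential reasoning with a node attention module, an edge attention module, and a transfer module over a scene graph might look like. The module designs, feature shapes, and the way clues are derived from the question are hypothetical and not the dissertation's exact architecture.

```python
# Hedged sketch of sequential graph reasoning with node/edge attention and a
# transfer step; a simplified illustration, not the proposed model itself.
import torch
import torch.nn.functional as F

def node_attention(node_feats, clue):
    """Attend over graph nodes given a textual clue vector -> node distribution."""
    scores = node_feats @ clue                            # (N,)
    return F.softmax(scores, dim=0)

def edge_attention(edge_feats, clue):
    """Attend over the N x N relation (edge) features given a clue -> edge weights."""
    scores = torch.einsum('ijd,d->ij', edge_feats, clue)  # (N, N)
    return F.softmax(scores.reshape(-1), dim=0).reshape(scores.shape)

def transfer(node_dist, edge_weights):
    """Shift attention from current nodes to related nodes along attended edges."""
    new_dist = edge_weights.t() @ node_dist               # (N,)
    return new_dist / (new_dist.sum() + 1e-8)

# Toy run: 5 nodes, 64-d features, two reasoning steps driven by two clues
# (e.g., clue vectors parsed from the question's syntactic structure).
N, D = 5, 64
nodes = torch.randn(N, D)
edges = torch.randn(N, N, D)
clues = [torch.randn(D), torch.randn(D)]

dist = node_attention(nodes, clues[0])                    # ground the first entity
for clue in clues[1:]:
    dist = transfer(dist, edge_attention(edges, clue))    # hop along attended relations
print(dist)  # intermediate distributions serve as visualizable reasoning evidence
```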

3. Contrastive graph matching method for mitigating multimodal correlation biases. VQA datasets contain numerous under-explored correlation biases, which limit the study of robust VQA models. This dissertation thoroughly investigates existing multiple-choice VQA datasets and identifies two types of multimodal correlation bias. Besides, an evaluation dataset called NExT-OOD is constructed to measure a model's ability to overcome these multimodal correlation biases. Moreover, to prevent the model from exploiting the biases, this dissertation proposes a contrastive graph matching method. Specifically, it provides debiasing guidance from the perspective of the whole dataset, which effectively improves the model's generalizability in correlation reasoning.
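As a rough illustration of matching between feature graphs, the sketch below scores two graphs using a Sinkhorn-normalized node-similarity matrix. This is a generic soft graph-matching example with assumed shapes and temperature, not the NExT-OOD construction or the cross-sample debiasing method itself.

```python
# Minimal sketch of soft graph matching between two feature graphs; an
# assumption-level illustration of the graph-matching ingredient only.
import torch

def sinkhorn(log_alpha, n_iters=20):
    """Alternately normalize rows and columns in log space (soft assignment)."""
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=0, keepdim=True)
    return log_alpha.exp()

def matching_score(graph_a, graph_b, temperature=0.1):
    """Graph similarity = soft-assignment-weighted sum of node similarities."""
    sim = torch.nn.functional.normalize(graph_a, dim=-1) @ \
          torch.nn.functional.normalize(graph_b, dim=-1).t()
    assignment = sinkhorn(sim / temperature)
    return (assignment * sim).sum() / graph_a.size(0)

# Toy usage: a question graph with 4 nodes scored against two candidate graphs.
q = torch.randn(4, 128)
print(matching_score(q, torch.randn(4, 128)), matching_score(q, torch.randn(4, 128)))
```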

4. Decoupled representation learning method for continual visual question answering. Most existing VQA methods are trained offline and cannot handle dynamically updated multimodal data in real-world scenarios. To address this issue, this dissertation focuses on multimodal continual learning and introduces a novel continual visual question answering setting called VQACL to evaluate a model's continual learning ability and compositional generalizability. At the same time, a decoupled representation learning method is proposed for continual VQA. The method separately extracts sample-specific and sample-invariant features for the visual and textual inputs, which successfully alleviates forgetting and enhances the model's compositional ability.
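A minimal sketch of the decoupling idea is given below: each modality feature is split into a sample-invariant component (explained by learned prototypes) and a sample-specific residual. The prototype mechanism, dimensions, and fusion step are assumptions made for illustration, not the VQACL implementation.

```python
# Illustrative feature decoupling into sample-invariant and sample-specific
# parts; a sketch under stated assumptions rather than the proposed method.
import torch
import torch.nn.functional as F

class FeatureDecoupler(torch.nn.Module):
    def __init__(self, dim, num_prototypes):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, feat):
        # Sample-invariant component: soft assignment to learned prototypes.
        attn = F.softmax(F.normalize(feat, dim=-1) @
                         F.normalize(self.prototypes, dim=-1).t(), dim=-1)
        invariant = attn @ self.prototypes
        # Sample-specific component: the residual the prototypes cannot explain.
        specific = feat - invariant
        return invariant, specific

# Toy usage for visual and textual streams (batch 8, dimension 256).
vis_dec, txt_dec = FeatureDecoupler(256, 10), FeatureDecoupler(256, 10)
v_inv, v_spec = vis_dec(torch.randn(8, 256))
t_inv, t_spec = txt_dec(torch.randn(8, 256))
fused = torch.cat([v_inv + v_spec, t_inv + t_spec], dim=-1)  # fed to an answer head
print(fused.shape)
```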

5. Parameter- and data-efficient prompt tuning method for open scenarios. When fine-tuning large pre-trained models on specific domains, existing methods often require significant computational resources and are prone to overfitting to domain-specific data, resulting in poor generalization. To this end, based on two frozen unimodal pre-trained models, this dissertation proposes a parameter- and data-efficient prompt tuning method for open scenarios, which introduces a set of generalized prompts and a set of specialized prompts to perform accurate modality alignment and efficient task-specific adaptation. Besides, an alignment objective based on invariant risk minimization is incorporated to enhance the effectiveness and generalization of the model in different scenarios.
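The sketch below shows, under stated assumptions, the two ingredients named above: learnable prompts prepended to the inputs of frozen encoders, and an IRMv1-style penalty added to a contrastive alignment loss. The encoder interfaces are abstracted away, and all names, shapes, and coefficients are hypothetical.

```python
# Hedged sketch of prompt tuning with frozen encoders plus an IRM-style
# alignment penalty; an illustration only, not the dissertation's method.
import torch
import torch.nn.functional as F

def irm_penalty(loss_fn, logits, targets):
    """IRMv1 penalty: squared gradient of the loss w.r.t. a dummy scale of 1."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = loss_fn(logits * scale, targets)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

class PromptTuner(torch.nn.Module):
    def __init__(self, dim, n_general=8, n_specific=8):
        super().__init__()
        # Only these prompts are trained; the pre-trained encoders stay frozen.
        self.general = torch.nn.Parameter(torch.randn(n_general, dim) * 0.02)
        self.specific = torch.nn.Parameter(torch.randn(n_specific, dim) * 0.02)

    def forward(self, token_embeds):
        prompts = torch.cat([self.general, self.specific], dim=0)
        prompts = prompts.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        # Prepend prompts to the token sequence before the frozen encoder.
        return torch.cat([prompts, token_embeds], dim=1)

# Toy usage: prompted token sequence, then an image-text alignment loss with
# an IRM penalty (pooled features below stand in for frozen-encoder outputs).
tuner = PromptTuner(dim=256)
prompted = tuner(torch.randn(8, 20, 256))            # (8, 16 + 20, 256)
img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
logits = img @ txt.t() / 0.07
targets = torch.arange(8)
loss = F.cross_entropy(logits, targets) + 1.0 * irm_penalty(F.cross_entropy, logits, targets)
print(loss.item())
```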

Keywords: multimodality; visual question answering; semantic mining; reliable correlation; generalized reasoning
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/58517
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
张熙. 面向多模态语义理解与推理的视觉问答研究[D],2024.
Files in this item: 面向多模态语义理解与推理的视觉问答研究. (39126 KB), dissertation, restricted access, license CC BY-NC-SA