Visual Question Answering Based on Multimodal Interaction and Attention Mechanisms


	刘飞 (Liu Fei)
	2022-05-18
Pages	146
Degree type	Doctoral
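Chinese abstract (translated)	In recent years, as artificial intelligence has advanced, researchers have placed growing emphasis on giving machines human-like intelligence, so that they can "observe" and "read" the information around them as people do. Visual question answering arose in this context. It is a cross-disciplinary task spanning computer vision and natural language processing: the machine must "observe" image or video content, "read" a related question, and then give an answer in natural language. Compared with visual semantic parsing or natural language understanding alone, visual question answering requires joint understanding of visual and textual content, which is a higher level of cognition, and it therefore has significant research value. It can also provide important technical support for many practical applications, such as assisting visually impaired people, human-computer interaction, and navigation assistants.
Current visual question answering methods can be roughly divided into four stages: first, visual and textual feature representation; second, the attention mechanism; third, multimodal interaction; fourth, answer prediction. The second and third stages are especially important, so most methods seek improvements in the attention mechanism and in multimodal interaction. Although these methods have made notable progress, substantial limitations remain. Regarding the attention mechanism, each question usually involves only a small part of the visual content, so relying on attention to distill the most relevant content from a large amount of content is key to the task. Current methods choose the regions to attend to only by separately computing the similarity between visual and textual content. This is easily affected by factors such as training data, module design, and supervision losses, and it cannot guarantee that the model learns a sufficiently robust and accurate attention mechanism. Regarding multimodal interaction, visual question answering involves two heterogeneous modalities, vision and text (natural language), with an inherent semantic gap between them, so associating their semantics accurately and effectively is both necessary and challenging. Current methods still lack more effective mechanisms for modeling multimodal interaction, which leads to insufficient and inaccurate multimodal semantic understanding and association.
To address these limitations, this thesis designs suitable deep neural networks and algorithms around multimodal interaction and the attention mechanism in visual question answering. They effectively improve both the learning of attention and the modeling of multimodal interaction, and thereby achieve more accurate visual question answering. The main work and contributions are summarized as follows:
1. Visual question answering based on erasing-based attention learning. Traditional attention mechanisms tend to produce mixed attention distributions. To address this, the thesis proposes a new training strategy that improves how the attention mechanism is learned. Specifically, the method first introduces an attention-guided erasing mechanism to obtain features of the attended regions and of the non-attended regions separately, then takes the attended-region features as positive samples and the non-attended-region features as negative samples. It further imposes a set of constraints based on metric learning losses to separate positive from negative samples, so that a more discriminative attention distribution is learned. Experimental results show that the method effectively improves visual question answering accuracy. Attention visualization shows that the resulting attention distributions are more reasonable and more discriminative.
2. Visual question answering based on dense multimodal interaction. In visual question answering, inter-modality interaction is very limited and intra-modality interaction is not modeled. To address this, the thesis designs a network structure based on dense multimodal interaction. Specifically, for inter-modality interaction, a connector component based on bidirectional attention connects features of different modalities at any level; for intra-modality interaction, a connector component based on unidirectional attention connects features of the same modality at any level. Combining the two components lets every pair of levels interact, forming dense multimodal interaction. Experimental results show that the proposed method achieved the best performance among contemporaneous methods on several public datasets.
3. Visual question answering based on fine-grained, dynamic multimodal interaction. In current video question answering methods, multimodal interaction is coarse and static and uses a single temporal scale. To address this, the thesis proposes a visual question answering method based on fine-grained, dynamic multimodal interaction. Specifically, the method builds a hierarchical temporal convolutional network that produces features at multiple temporal scales, and at each layer of the network introduces a batch-normalization-based mechanism to model fine-grained, dynamic multimodal interaction at that scale. On two commonly used public datasets, the proposed method ranked first on the leaderboard at the time.
4. Visual question answering based on hierarchical relation-aware multimodal interaction. Current video question answering methods lack joint object-level and frame-level relational reasoning and do not use explicit semantic knowledge. To address this, the thesis proposes a hierarchical relation-aware multimodal interaction framework that combines object-level and frame-level relational reasoning in a hierarchical manner while mining explicit semantic knowledge to promote relation-aware multimodal interaction. To model and reason about relations more efficiently, the method designs a graph memory mechanism as its basic relational reasoning module. Experimental results show that the proposed method captures the various hierarchical relations in a video more precisely and thus improves video question answering accuracy.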
English abstract	In recent years, with the development of artificial intelligence, researchers have paid increasing attention to endowing machines with human intelligence, so that they can "see" and "read" the information around them as humans do. The visual question answering task emerged as a result. It lies at the intersection of computer vision and natural language processing, and requires machines to give answers in natural language after "seeing" image or video content and "reading" related questions. Compared with visual semantic parsing or natural language understanding alone, visual question answering requires a joint understanding of visual and textual content, which is a higher level of cognition, so it has important research value. Visual question answering can also provide important technical support for many practical applications, such as assisting visually impaired people, human-computer interaction, and navigation assistants.
Current visual question answering methods can be roughly divided into four stages: first, visual and textual feature representation; second, the attention mechanism; third, multimodal interaction; fourth, answer prediction. The second and third stages are particularly important, so most methods focus on the attention mechanism and multimodal interaction. Despite remarkable progress, these methods still have significant limitations. In terms of the attention mechanism, each question usually involves only a small part of the visual content, so relying on attention to extract the most relevant content from abundant content is the key to visual question answering. Current methods select the regions to focus on by separately calculating the similarity between visual and textual content. This is easily affected by factors such as training data, module design, and supervision loss, and cannot guarantee that the model learns a sufficiently robust and accurate attention mechanism. In terms of multimodal interaction, visual question answering involves two heterogeneous modalities, vision and text (or natural language), which naturally have a semantic gap, so accurately and effectively associating the semantics of the two modalities is both necessary and challenging. Current methods still lack more effective mechanisms for modeling multimodal interaction, resulting in insufficient and inaccurate multimodal semantic understanding and association.
In view of these limitations, this thesis focuses on multimodal interaction and the attention mechanism in visual question answering, and designs appropriate deep neural networks and algorithms to improve the learning of the attention mechanism and the modeling of multimodal interaction, thus achieving more accurate visual question answering. The main contributions are summarized as follows:
1. Visual question answering with erasing-based attention learning. Traditional attention mechanisms in visual question answering are prone to generating mixed attention distributions. To address this problem, this thesis proposes a new training strategy to improve the learning of attention mechanisms. Specifically, we first introduce an attention-guided erasing mechanism to obtain the features of attention regions and non-attention regions respectively, and then select the features of attention regions as positive samples and the features of non-attention regions as negative samples. Further, we impose a set of constraints based on metric learning losses to distinguish positive and negative samples, thereby learning a more discriminative attention distribution. Experimental results show that the proposed method effectively improves the accuracy of visual question answering. Attention visualization shows that the generated attention distribution is more reasonable and more discriminative.
2. Visual question answering with dense multimodal interactions. To address the problems of very limited inter-modality interaction and the lack of intra-modality interaction modeling in visual question answering, this thesis designs a network structure for dense multimodal interactions. Specifically, for inter-modality interaction, a connector component based on a bidirectional attention mechanism is designed to connect the features of different modalities at any hierarchical level; for intra-modality interaction, a connector component based on a unidirectional attention mechanism is designed to connect the features of the same modality at any level. By combining the two components, the multimodal features from any pair of layers can interact with each other, forming dense multimodal interactions. Experimental results show that the proposed approach achieved the best performance among contemporaneous methods on multiple public datasets.
3. Visual question answering with fine-grained and dynamic multimodal interactions. Current video question answering methods suffer from several significant limitations, including a single temporal scale and coarse, static multimodal interactions. To address these limitations, this thesis proposes a video question answering method based on fine-grained and dynamic multimodal interactions. Specifically, we construct a hierarchical temporal convolutional network to generate features at multiple temporal scales. At each layer of the network, a batch normalization-based mechanism is introduced to model fine-grained and dynamic multimodal interactions at the current scale. On two commonly used public datasets, the proposed method ranked first on the leaderboard at the time.
4. Visual question answering with hierarchical relation-aware multimodal interactions. Current video question answering methods fail to integrate object-level and frame-level relational reasoning, and also neglect explicit semantic knowledge. To address these issues, this thesis proposes a framework for hierarchical relation-aware multimodal interactions. The framework combines object-level and frame-level relational reasoning in a hierarchical manner, while mining explicit semantic knowledge to facilitate relation-aware multimodal interactions. To enable more efficient relational modeling and reasoning, a graph memory mechanism is designed as the basic unit of relational reasoning. Experimental results show that the proposed method captures the various hierarchical relationships in video more precisely, thereby improving the accuracy of video question answering.
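The erasing-based attention learning summarized in contribution 1 can be illustrated with a rough sketch. This is not the thesis's implementation: the abstract only states that attention-guided erasing yields attended and non-attended region features, used as positive and negative samples under metric learning losses. Everything else here is an assumption made for illustration, including all function names, the hard top-k split, mean pooling, cosine similarity, and the hinge-style margin loss with the question vector as anchor.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def attention_weights(regions, question):
    """Similarity-based attention: one weight per region feature."""
    return softmax([dot(r, question) for r in regions])


def erase_split(regions, weights, keep_ratio=0.5):
    """Assumed hard top-k erasing: return (attended, erased) index lists.

    Requires at least two regions so that both sets are non-empty.
    """
    n = len(regions)
    k = max(1, min(n - 1, int(n * keep_ratio)))
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


def mean_pool(regions, indices):
    """Average the selected region features into one vector."""
    dim = len(regions[0])
    return [sum(regions[i][d] for i in indices) / len(indices) for d in range(dim)]


def margin_loss(positive, negative, anchor, margin=0.2):
    """Hinge-style metric loss: zero once the positive sample is closer
    to the anchor than the negative sample by at least the margin."""
    return max(0.0, margin - cosine(positive, anchor) + cosine(negative, anchor))


def erasing_attention_loss(regions, question, keep_ratio=0.5, margin=0.2):
    """Attended features act as the positive, erased features as the negative."""
    weights = attention_weights(regions, question)
    attended, erased = erase_split(regions, weights, keep_ratio)
    positive = mean_pool(regions, attended)
    negative = mean_pool(regions, erased)
    return margin_loss(positive, negative, question, margin)
```

In this toy form the loss is zero whenever the attended regions already match the question better than the erased ones by the margin. A trainable model would need a differentiable version of the split and learned feature encoders, neither of which the abstract specifies.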
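Keywords	visual question answering; multimodal interaction; attention mechanism; relational reasoning
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48508
Collection	Graduates_Doctoral theses
Recommended citation (GB/T 7714)	刘飞. 基于多模态交互与注意力机制的视觉问答[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2022.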

Files in this item
File name/size	Document type	Version type	Access type	License
基于多模态交互与注意力机制的视觉问答.p (10058 KB)	Thesis		Restricted access	CC BY-NC-SA