Visual Question Answering Based on Deep Learning (基于深度学习的视觉问答)
Author: 方治炜
Year: 2019
Pages: 0-132
Degree Type: Doctoral
Chinese Abstract

As society moves toward an intelligent age, new demands are being placed on computer vision: it is expected to advance from traditional perception and recognition to higher-level logical reasoning. Visual question answering (VQA) was proposed against this background; it requires an algorithm to correctly answer an arbitrary question posed about an arbitrary image. With broad application prospects in assistive guidance for the blind, smart cities, human-computer interaction, autonomous driving, and other fields, VQA has attracted wide attention from the academic community.

VQA has been closely tied to deep learning theory from the very beginning. Driven by deep learning, VQA techniques have developed rapidly and the accuracy of related algorithms has improved year by year, yet many problems and challenges remain. First, most studies on multimodal fusion focus on reducing computation and parameter counts, while how to exploit the characteristics of multimodal fusion in VQA to improve fusion performance remains underexplored. Second, in answer prediction, an overly large answer space introduces more noise and interference; filtering irrelevant answers out of the answer space in advance promises to simplify answer prediction and improve its accuracy. Third, VQA models generally suffer from limited generalization ability, and in the absence of effective data augmentation strategies, introducing Dropout techniques to improve generalization and stability is highly worthwhile. Finally, the question encoder is one of the two basic input modules of VQA, and how to effectively strengthen its encoding ability remains a significant challenge. Building on a review and analysis of existing work, this dissertation investigates the above problems in depth; its main results and contributions are summarized as follows:

1. A VQA algorithm based on sparse multimodal fusion. To model the limited semantic interaction between modalities in VQA, a sparse multimodal fusion method called Block Term Decomposition Pooling (BTDP) is proposed. First, Tucker decomposition factorizes the parameter tensor of the bilinear operation into the product of a small core tensor and three projection matrices, addressing the excessive parameter and computation cost of full bilinear pooling. Then, to model the phenomenon of limited semantic interaction, sparsity is introduced into the core tensor: only a limited number of tensor blocks on the diagonal of the core tensor are retained, and all other entries are set to zero. This sparse decomposition of the core tensor is known as block term decomposition. Block term decomposition pooling is equivalent to projecting the multimodal inputs into multiple independent bilinear spaces for fusion, where the properties and the number of these bilinear spaces are directly determined by the sizes and the number of the diagonal tensor blocks. Because each tensor block is far smaller than the parameter tensor of the original bilinear operation, only simple fusion can be carried out in each bilinear space, and the final result is a combination of many such simple fusions. BTDP thus decomposes a complex multimodal fusion process into several relatively simple ones, lowering the difficulty of fusion and increasing its diversity. To analyze how the size distribution of the tensor blocks affects BTDP, several variants are designed, and multi-scale tensor blocks are found to yield better fusion. Finally, a low-rank constraint is introduced into BTDP, and the resulting low-rank BTDP achieves strong results on multiple datasets.

2. A VQA algorithm based on answer distillation. To address the overly large answer space of open-ended questions, a VQA framework based on answer distillation is proposed. The framework consists of two stages: in the first stage, the answer vocabulary is distilled to generate a small candidate answer set for each visual question, converting the open-ended question into a multiple-choice one; in the second stage, the candidate answer set is fed into the answer prediction network together with the question and the image, and the most probable candidate is retrieved as the predicted output. Specifically, two kinds of relationships in VQA datasets can be mined: the common-sense relationship and the multiple-answer relationship. The former refers to the fact that asking the same question about multiple images yields multiple answers; the latter refers to the fact that a single visual question may receive different answers when answered by multiple people. Exploiting these two relationships, the answer distillation network learns the candidate answer sets jointly in the form of a multi-task network. Experiments show that this process removes most of the irrelevant answers in the answer vocabulary and significantly shrinks the classification space of the answer prediction network. To inject the candidate answer information into the answer prediction network, an answer-guided visual attention module is further designed, which guides the network to attend to the visual regions relevant to each candidate answer and, combined with the original question, to judge which candidate is most likely correct. Final experiments show that the proposed algorithm outperforms contemporaneous methods on multiple datasets.

3. A VQA algorithm based on improved Dropout. To address the limited generalization ability and unstable outputs of VQA models, coherent Dropout and siamese Dropout mechanisms are proposed. First, an analysis of neuron co-adaptation shows that ordinary Dropout fails in multi-branch VQA models, because parallel Dropout layers conflict with one another when their masks are sampled independently. The coherent Dropout mechanism therefore forces multiple parallel Dropout layers to share the same sampled Dropout mask during training. Theoretical analysis shows that this effectively suppresses neuron co-adaptation and improves generalization. To enhance model stability, a siamese-network-based VQA model is proposed, which measures the instability caused by Dropout via the difference between the two network outputs and constrains it explicitly with a corresponding loss function. Analysis experiments show that coherent Dropout significantly improves single-model performance, while the siamese Dropout mechanism effectively reduces the gap between the training and testing phases. Comparative results show that mainstream VQA models achieve clear performance gains when equipped with the proposed method.

 

4. A VQA algorithm based on an enhanced question encoder. To address the limited representational capacity of the question encoder in VQA models, the encoder is explored along two dimensions: depth and width. For depth, an encoder architecture based on residual connections and coherent Dropout is proposed, allowing more layers to be stacked with improved performance. For width, an encoder architecture based on parallel branches is proposed, which widens the network while keeping the number of parameters roughly unchanged, improving the encoder's generalization and learning ability. Experiments show that directly widening or deepening existing question encoders degrades VQA performance, whereas the proposed enhanced question encoder yields further gains as it becomes wider and deeper.

English Abstract

As society develops toward an intelligent age, new requirements are being placed on computer vision, which is expected to advance from traditional cognition and recognition to high-level visual understanding and reasoning. It is against this background that the visual question answering (VQA) task was proposed: VQA requires an algorithm to correctly answer any question raised about any image. With great application prospects in assistive guidance for the blind, smart cities, human-computer interaction, autonomous driving, and other fields, VQA has attracted wide attention from the academic community.

 

As a cross-domain research topic spanning computer vision and natural language processing, VQA has been closely related to deep learning theory from the very beginning. Although much progress has been made in the past few years, many problems and challenges remain. First, multimodal fusion plays an essential role in VQA, but related studies mostly focus on computational complexity and parameter counts, neglecting characteristics specific to multimodal fusion in VQA. Second, for answer prediction, a large answer space usually means a lot of noise, and it is possible to narrow the answer space by exploiting the knowledge in the training data. Third, VQA models often suffer from overfitting, so it is meaningful to design suitable dropout schemes to enhance their generalization and robustness. Finally, how to enhance the question encoder remains a challenging but seldom addressed research topic. To address the above problems, this dissertation presents several effective solutions, listed in the following:

 

1. A sparse multimodal fusion method called Block Term Decomposition Pooling (BTDP) is proposed to model the sparse interactions between questions and images in VQA. To reduce the computational complexity of full bilinear pooling, we first adopt Tucker decomposition to express a tensor as the product of a small core tensor and three projection matrices. Then all entries in the core tensor except those in the diagonal tensor blocks are set to zero, which introduces sparsity into bilinear pooling. Such a decomposition is called block term decomposition, and the proposed bilinear pooling is therefore named Block Term Decomposition Pooling. BTDP projects the input features into different bilinear spaces in which multimodal fusion is conducted; it thereby decomposes the problem of multimodal fusion into several sub-problems of local fusion in multiple simple bilinear spaces, where the properties and the number of these bilinear spaces are decided by the characteristics and the number of the diagonal tensor blocks. It can be proved that MCB, MLB and MUTAN are each special cases of BTDP. In addition, we propose several variants of BTDP and find that the low-rank constrained BTDP with multi-scale diagonal tensor blocks yields better performance. Experimental results show that our method outperforms other multimodal fusion methods.
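To make the block-diagonal structure concrete, the following is a minimal PyTorch-style sketch of the BTDP idea: the projected question and image vectors are split into per-block slices, each slice pair is fused by its own small core tensor (one independent bilinear space per diagonal block), and the per-block results are concatenated. The dimensions, block sizes and initialization are illustrative assumptions, not the dissertation's exact configuration.

    # Minimal sketch of Block Term Decomposition Pooling (BTDP); dims are assumptions.
    import torch
    import torch.nn as nn

    class BTDPooling(nn.Module):
        """Fuse two modalities in several small, independent bilinear spaces.

        Each diagonal core tensor block of size (b, b, b) defines one bilinear
        space; the final fusion concatenates the per-block fusion results.
        """
        def __init__(self, q_dim=2048, v_dim=2048, block_sizes=(64, 64, 128, 256)):
            super().__init__()
            self.block_sizes = block_sizes
            total = sum(block_sizes)
            # The three projection matrices of the Tucker / block term decomposition.
            self.q_proj = nn.Linear(q_dim, total)
            self.v_proj = nn.Linear(v_dim, total)
            # One small core tensor per diagonal block (all off-block entries are zero).
            self.cores = nn.ParameterList(
                [nn.Parameter(0.01 * torch.randn(b, b, b)) for b in block_sizes]
            )

        def forward(self, q, v):
            q = self.q_proj(q)                       # (batch, total)
            v = self.v_proj(v)
            outs, start = [], 0
            for core, b in zip(self.cores, self.block_sizes):
                qb = q[:, start:start + b]           # slice belonging to this block
                vb = v[:, start:start + b]
                # Simple bilinear fusion inside one small space:
                # o[n, k] = sum_{i, j} qb[n, i] * vb[n, j] * core[i, j, k]
                outs.append(torch.einsum('ni,nj,ijk->nk', qb, vb, core))
                start += b
            return torch.cat(outs, dim=1)            # (batch, total)

    # Example: fuse a batch of question and image features.
    fused = BTDPooling()(torch.randn(8, 2048), torch.randn(8, 2048))  # (8, 512)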

 

2. A two-stage VQA framework is proposed to narrow the large answer space of open-ended questions. In the first stage, the answer vocabulary is distilled into an answer candidate set for each given visual question; equivalently, an open-ended question is converted into a multiple-choice question. In the second stage, the answer prediction network takes the question, the image and the candidate answers as inputs to predict the correct answer. Answer distillation is built on two learnable relationships: the common-sense relationship and the multiple-answers relationship. To make use of the knowledge in the answer candidates, an answer-guided visual attention mechanism is introduced into the answer prediction network. Experimental results show that our method can effectively compress the answer space and improve the accuracy of VQA.
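A minimal sketch of the two-stage idea follows: a distillation head scores the full answer vocabulary and keeps the top-K candidates, and the prediction network then restricts its logits to those candidates, effectively turning the open-ended question into a multiple-choice one. Module names, dimensions and the plain top-K selection are illustrative assumptions; the dissertation learns the candidate set jointly with a multi-task network rather than by a simple top-K cut.

    # Minimal sketch of answer distillation (stage 1) and candidate-restricted
    # prediction (stage 2); module names and sizes are assumptions.
    import torch
    import torch.nn as nn

    class AnswerDistiller(nn.Module):
        """Stage 1: score the full answer vocabulary and keep top-K candidates."""
        def __init__(self, fused_dim=1024, vocab_size=3000, k=10):
            super().__init__()
            self.scorer = nn.Linear(fused_dim, vocab_size)
            self.k = k

        def forward(self, fused_qv):                     # fused question-image feature
            scores = self.scorer(fused_qv)               # (batch, vocab_size)
            topk = scores.topk(self.k, dim=1).indices    # ids of candidate answers
            mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
            return topk, mask                            # mask marks the candidate set

    class CandidatePredictor(nn.Module):
        """Stage 2: predict only among the distilled candidates by masking logits."""
        def __init__(self, fused_dim=1024, vocab_size=3000):
            super().__init__()
            self.classifier = nn.Linear(fused_dim, vocab_size)

        def forward(self, fused_qva, candidate_mask):    # feature may also encode answers
            logits = self.classifier(fused_qva)
            # Answers outside the candidate set are excluded from prediction.
            return logits.masked_fill(candidate_mask == 0, float('-inf'))

    # Example: distil candidates, then predict among them.
    feat = torch.randn(8, 1024)
    _, mask = AnswerDistiller()(feat)
    logits = CandidatePredictor()(feat, mask)            # (8, 3000), -inf off-candidates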

 

3. Two dropout mechanisms, named coherent dropout and siamese dropout, are proposed to address the co-adaptation of neurons and the explosion of output variance in VQA models, respectively. Specifically, in coherent dropout, the relevant dropout layers in multiple paths are forced to work coherently to maximize their ability to prevent neuron co-adaptation. We show that coherent dropout is simple to implement yet very effective against overfitting. As for the explosion of output variance, we develop a siamese dropout mechanism that explicitly minimizes the difference between the two output vectors produced from the same input during the training phase. Such a mechanism reduces the gap between the training and inference phases and makes the VQA model more robust. Extensive experiments verify the effectiveness of coherent dropout and siamese dropout, and the results show that our methods bring significant improvements to state-of-the-art VQA models on the VQA-v1 and VQA-v2 datasets.
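The two mechanisms can be sketched as follows: coherent dropout samples one mask and applies it to every parallel branch, and siamese dropout runs the same sample through the network twice and penalizes the difference between the two outputs. The dropout rate, loss weight and model interface below are illustrative assumptions.

    # Minimal sketch of coherent dropout (one mask shared across parallel branches)
    # and a siamese-dropout consistency loss; rate, weight and model API are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoherentDropout(nn.Module):
        """Apply the SAME sampled dropout mask to several parallel branches."""
        def __init__(self, p=0.3):
            super().__init__()
            self.p = p

        def forward(self, branches):                 # list of tensors of identical shape
            if not self.training or self.p == 0.0:
                return branches
            keep = 1.0 - self.p
            mask = torch.bernoulli(torch.full_like(branches[0], keep)) / keep
            return [b * mask for b in branches]      # coherent: one mask for all paths

    def siamese_dropout_loss(model, question, image, labels, alpha=1.0):
        """Run the same sample twice (different dropout noise) and penalize the gap."""
        logits1 = model(question, image)
        logits2 = model(question, image)
        task = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
        consistency = F.mse_loss(logits1, logits2)   # dropout-induced output difference
        return task + alpha * consistency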

 

4. A multi-path stacked residual GRU module is proposed to enhance the question encoder by making its architecture deeper and wider. To make the question encoder deeper, residual connections and coherent dropout are used to cope with gradient vanishing and overfitting when stacking more GRUs. To obtain a wider encoder, we design a multi-path architecture so that the number of parameters does not increase too much as the encoder widens. Experiments show that both stacking more GRUs and directly increasing the number of hidden units have side effects on the VQA model, while the proposed enhanced question encoder yields better performance as it goes deeper and wider.
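A minimal sketch of a deeper and wider question encoder in this spirit: each path stacks GRU layers with residual connections (depth), and several such paths with smaller hidden sizes run in parallel and are concatenated (width), keeping the parameter count moderate. Hidden sizes, depth and the last-time-step readout are illustrative assumptions; coherent dropout is omitted here for brevity.

    # Minimal sketch of a deeper (residual-stacked) and wider (multi-path) question
    # encoder; hidden sizes, depth and readout are assumptions.
    import torch
    import torch.nn as nn

    class ResidualGRUStack(nn.Module):
        """Depth: each stacked GRU layer's output is added back to its input."""
        def __init__(self, dim=256, depth=3):
            super().__init__()
            self.layers = nn.ModuleList(
                [nn.GRU(dim, dim, batch_first=True) for _ in range(depth)]
            )

        def forward(self, x):                        # x: (batch, seq_len, dim)
            for gru in self.layers:
                out, _ = gru(x)
                x = x + out                          # residual connection eases stacking
            return x

    class MultiPathQuestionEncoder(nn.Module):
        """Width: parallel residual stacks with small hidden sizes, concatenated,
        so the encoder widens without a large increase in parameters."""
        def __init__(self, embed_dim=300, path_dim=256, n_paths=2, depth=3):
            super().__init__()
            self.proj = nn.Linear(embed_dim, path_dim)
            self.paths = nn.ModuleList(
                [ResidualGRUStack(path_dim, depth) for _ in range(n_paths)]
            )

        def forward(self, word_embeddings):          # (batch, seq_len, embed_dim)
            h = self.proj(word_embeddings)
            feats = [p(h)[:, -1, :] for p in self.paths]   # last-step state per path
            return torch.cat(feats, dim=1)           # (batch, n_paths * path_dim)

    # Example: encode a batch of 14-word questions.
    q_feat = MultiPathQuestionEncoder()(torch.randn(8, 14, 300))   # (8, 512)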

Keywords: Visual Question Answering; Deep Learning; Multimodal Fusion; Answer Distillation; Model Generalization; Question Encoding
Language: Chinese
Sub-direction Classification: Multimodal Intelligence
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23773
Collection: 紫东太初大模型研究中心_图像与视频分析
Corresponding Author: 方治炜
Recommended Citation (GB/T 7714):
方治炜. 基于深度学习的视觉问答[D]. 中国科学院大学(中国科学院自动化研究所), 2019.
Files in This Item:
File Name / Size | Document Type | Access | License
Thesis.pdf (6749 KB) | Dissertation | Open Access | CC BY-NC-SA