In recent years, with the development of artificial intelligence, researchers have paid increasing attention to endowing machines with human-like intelligence, so that they can "see" and "read" the information around them as humans do. Against this background, the visual question answering task emerged. It lies at the intersection of computer vision and natural language processing and requires machines to give natural-language answers after "seeing" image/video content and "reading" related questions. Compared with visual semantic parsing or natural language understanding alone, visual question answering requires a joint understanding of visual and textual content, a higher-level form of cognition, and therefore has important research value. At the same time, visual question answering can provide important technical support for many practical applications, such as assisting visually impaired users, human-computer interaction, and navigation assistants.
Current visual question answering methods can be roughly divided into four stages: visual and textual feature representation, attention, multimodal interaction, and answer prediction. The second and third stages are particularly important, so most methods focus on the attention mechanism and multimodal interaction. Despite remarkable progress, these methods still have significant limitations. Regarding the attention mechanism: since each question usually involves only a small part of the visual content, relying on attention to extract the most relevant content from abundant visual information is key to visual question answering. Current methods select the regions to focus on by separately computing the similarity between visual and textual content; however, this process is easily affected by factors such as training data, module design, and supervision loss, and cannot guarantee that the model learns a sufficiently robust and accurate attention mechanism. Regarding multimodal interaction: since visual question answering involves two heterogeneous modalities, vision and text (natural language), between which there is a natural semantic gap, accurately and effectively associating the semantics of the two modalities is both necessary and challenging. Current methods still lack efficient mechanisms to model multimodal interactions, resulting in insufficient and inaccurate multimodal semantic understanding and association. In view of these limitations, this thesis focuses on multimodal interaction and the attention mechanism in visual question answering, and designs appropriate deep neural networks and algorithms to improve the learning of attention and the modeling of multimodal interaction, thereby achieving more accurate visual question answering.
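To make the four-stage decomposition concrete, the following PyTorch sketch wires the stages together in their simplest form. The dimensions, the single-glimpse attention, and the concatenation-based fusion are illustrative assumptions, not any method proposed in this thesis.

```python
import torch
import torch.nn as nn

class VQAPipeline(nn.Module):
    """Minimal four-stage VQA skeleton: feature representation,
    attention, multimodal interaction (fusion), answer prediction.
    All dimensions and module choices are illustrative."""

    def __init__(self, d_img=2048, d_txt=300, d=512, n_answers=3000):
        super().__init__()
        # Stage 1: encode pre-extracted visual and textual features.
        self.img_proj = nn.Linear(d_img, d)
        self.txt_enc = nn.GRU(d_txt, d, batch_first=True)
        # Stage 2: question-guided attention over image regions.
        self.att = nn.Linear(2 * d, 1)
        # Stage 3: simple concatenation-based multimodal fusion.
        self.fuse = nn.Linear(2 * d, d)
        # Stage 4: answer classifier over a fixed answer vocabulary.
        self.clf = nn.Linear(d, n_answers)

    def forward(self, regions, words):
        # regions: (B, R, d_img); words: (B, T, d_txt)
        v = self.img_proj(regions)                      # (B, R, d)
        _, q = self.txt_enc(words)                      # (1, B, d)
        q = q.squeeze(0)                                # (B, d)
        # Score each region against the question, then pool.
        q_exp = q.unsqueeze(1).expand_as(v)             # (B, R, d)
        scores = self.att(torch.cat([v, q_exp], -1))    # (B, R, 1)
        alpha = scores.softmax(dim=1)
        v_att = (alpha * v).sum(dim=1)                  # (B, d)
        # Fuse the attended visual feature with the question feature.
        h = torch.tanh(self.fuse(torch.cat([v_att, q], -1)))
        return self.clf(h), alpha

# Usage: 36 region features per image, 14 word embeddings per question.
model = VQAPipeline()
logits, attention = model(torch.randn(2, 36, 2048), torch.randn(2, 14, 300))
```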
The main contributions are summarized as follows:
1. Visual question answering with erasing-based attention learning. Traditional attention mechanisms in visual question answering are prone to generating mixed, indiscriminate attention distributions. To address this problem, this thesis proposes a new training strategy to improve the learning of attention. Specifically, we first introduce an attention-guided erasing mechanism to obtain the features of attended and non-attended regions separately, and then take the attended-region features as positive samples and the non-attended-region features as negative samples. We further impose a set of metric-learning constraints to separate the positive and negative samples, thereby learning a more discriminative attention distribution. Experimental results show that the proposed method effectively improves the accuracy of visual question answering; attention visualizations confirm that the generated attention distributions are more reasonable and more discriminative. (A hedged implementation sketch of this training strategy is given after this list.)
2. Visual question answering with dense multimodal interactions. To address the very limited inter-modality interaction and the lack of intra-modality interaction in visual question answering, this thesis designs a novel network structure for dense multimodal interactions. Specifically, for inter-modality interaction, a connector component based on a bidirectional attention mechanism is designed to connect the features of different modalities at any hierarchical level; for intra-modality interaction, a connector component based on a unidirectional attention mechanism is designed to connect the features of the same modality across any levels. By combining the two components, multimodal features from any layers can interact with each other, forming dense multimodal interactions. Experimental results show that the proposed approach achieves the best performance on multiple public datasets. (Sketches of the two connector components follow this list.)
3. Visual question answering with fine-grained and dynamic multimodal interactions. Current video question answering methods suffer from several significant limitations, including a single temporal scale and coarse, static multimodal interactions. To address these limitations, this thesis proposes a video question answering method with fine-grained and dynamic multimodal interactions. Specifically, we construct a hierarchical temporal convolutional network to generate multiple features at different temporal scales. At each layer of the network, a batch-normalization-based mechanism is introduced to model fine-grained and dynamic multimodal interactions at the current scale. The proposed method ranks first on the leaderboards of two commonly used public datasets. (A sketch of the question-conditioned normalization idea is given after this list.)
4. Visual question answering with hierarchical relation-aware multimodal interactions. Current video question answering methods fail to integrate object-level and frame-level relational reasoning, and also neglect explicit semantic knowledge. To address these issues, this thesis proposes a novel framework for hierarchical relation-aware multimodal interactions. The framework combines object-level and frame-level relational reasoning in a hierarchical manner while mining explicit semantic knowledge to facilitate relation-aware multimodal interactions. To enable more efficient relational reasoning, a novel graph memory mechanism is designed as the basic unit of relational reasoning. Experimental results show that the proposed method captures the hierarchical relationships in videos more precisely, thereby improving the accuracy of video question answering. (A sketch of a graph-memory reasoning unit closes the set of examples below.)
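For contribution 1, here is a minimal sketch of how attention-guided erasing with a metric-learning constraint could be implemented. The top-k split, cosine distance, and margin value are assumptions for illustration; the thesis's exact erasing rule and loss may differ.

```python
import torch
import torch.nn.functional as F

def erasing_attention_loss(v, alpha, q, keep_ratio=0.5, margin=0.2):
    """Attention-guided erasing with a triplet-style constraint.

    v:     (B, R, d) region features
    alpha: (B, R, 1) attention weights over regions
    q:     (B, d)    question feature, used as the anchor
    """
    B, R, _ = v.shape
    k = max(1, int(R * keep_ratio))
    # Split regions into attended (top-k weights) and erased (the rest).
    idx = alpha.squeeze(-1).topk(k, dim=1).indices          # (B, k)
    mask = torch.zeros(B, R, 1, device=v.device)
    mask.scatter_(1, idx.unsqueeze(-1), 1.0)
    # Positive sample: attention-weighted pooling of attended regions.
    pos = (mask * alpha * v).sum(1) / (mask * alpha).sum(1).clamp_min(1e-6)
    # Negative sample: mean of the erased (non-attended) regions.
    neg = ((1 - mask) * v).sum(1) / max(R - k, 1)
    # Pull attended features toward the question, push erased ones away.
    d_pos = 1 - F.cosine_similarity(q, pos, dim=-1)
    d_neg = 1 - F.cosine_similarity(q, neg, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```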
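For contribution 2, the two connector components can be pictured as standard cross-attention and self-attention blocks. The sketch below uses `nn.MultiheadAttention` with residual connections; the head count, dimensions, and the absence of feed-forward sublayers are simplifying assumptions.

```python
import torch.nn as nn

class InterModalConnector(nn.Module):
    """Bidirectional cross-attention between two modalities: a sketch
    of the inter-modality connector, connecting features of different
    modalities from arbitrary layers."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, R, d) visual tokens; txt: (B, T, d) textual tokens.
        vis2, _ = self.t2v(vis, txt, txt)   # vision queries text
        txt2, _ = self.v2t(txt, vis, vis)   # text queries vision
        return vis + vis2, txt + txt2       # residual connections

class IntraModalConnector(nn.Module):
    """Unidirectional attention within one modality, usable to connect
    same-modality features from any two layers (self-attention when no
    earlier-layer features are supplied)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x, earlier=None):
        ctx = x if earlier is None else earlier
        out, _ = self.attn(x, ctx, ctx)
        return x + out
```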
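For contribution 3, one plausible reading of a "batch-normalization-based" interaction mechanism is conditional batch normalization, where the question predicts per-channel scale and shift for the normalized video features at each temporal scale. The sketch below pairs such a unit with a strided temporal convolution stack; the stride, layer count, and modulation form are all assumptions, not the thesis's stated design.

```python
import torch
import torch.nn as nn

class QuestionConditionedBN(nn.Module):
    """Question feature predicts per-channel scale/shift applied after
    normalizing video features (conditional batch normalization)."""
    def __init__(self, channels, d_q):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels, affine=False)
        self.gamma = nn.Linear(d_q, channels)
        self.beta = nn.Linear(d_q, channels)

    def forward(self, video, q):
        # video: (B, C, T) temporal features; q: (B, d_q)
        g = self.gamma(q).unsqueeze(-1)     # (B, C, 1)
        b = self.beta(q).unsqueeze(-1)      # (B, C, 1)
        return (1 + g) * self.bn(video) + b

class HierarchicalTemporalConv(nn.Module):
    """Stacked strided temporal convolutions: each layer halves the
    temporal resolution and applies question-conditioned normalization,
    yielding one feature map per temporal scale."""
    def __init__(self, channels=512, d_q=512, n_layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(n_layers))
        self.cbns = nn.ModuleList(
            QuestionConditionedBN(channels, d_q) for _ in range(n_layers))

    def forward(self, video, q):
        scales, x = [], video
        for conv, cbn in zip(self.convs, self.cbns):
            x = torch.relu(cbn(conv(x), q))
            scales.append(x)
        return scales
```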
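For contribution 4, a graph memory unit might maintain a per-node memory that is updated by attention-weighted message passing over object or frame nodes. The following sketch is purely illustrative of that idea; the thesis's actual graph memory mechanism is not specified here.

```python
import torch
import torch.nn as nn

class GraphMemoryUnit(nn.Module):
    """Illustrative graph-memory relational reasoning unit: nodes
    (objects or frames) exchange messages via attention over pairwise
    relation scores, and a GRU cell keeps a per-node memory across
    reasoning steps."""
    def __init__(self, d=512):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.mem = nn.GRUCell(d, d)

    def forward(self, nodes, memory, steps=2):
        # nodes, memory: (B, N, d)
        B, N, d = nodes.shape
        h = memory
        for _ in range(steps):
            # Scaled dot-product relation scores between all node pairs.
            att = torch.softmax(
                self.q_proj(h) @ self.k_proj(nodes).transpose(1, 2) / d ** 0.5,
                dim=-1)                                  # (B, N, N)
            msg = att @ self.v_proj(nodes)               # (B, N, d)
            # Write the aggregated messages into each node's memory.
            h = self.mem(msg.reshape(B * N, d),
                         h.reshape(B * N, d)).view(B, N, d)
        return h
```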
Keywords: Visual Question Answering; Multimodal Interaction; Attention Mechanism; Relational Reasoning
Citation: Liu Fei. Visual Question Answering Based on Multimodal Interaction and Attention Mechanism [D]. Institute of Automation, Chinese Academy of Sciences, 2022.