基于注意与记忆机制的视觉描述 (Visual Description Based on Attention and Memory Mechanisms)
王君波
2019-12
Pages: 132
Degree type: Doctoral
Chinese Abstract

Visual description is a comprehensive problem that combines computer vision, natural language processing, and machine learning; its goal is to have a computer generate a passage of text that describes the visual content of an image or video. With the spread of imaging devices and the Internet, visual description is in demand in scenarios such as human-computer interaction, navigation for the blind, and cross-modal retrieval. The task is very easy for humans but extremely challenging for machines. First, it requires algorithms that detect fine-grained information in the visual content, such as objects, attributes, actions, and relationships; it also requires a powerful language model to generate grammatically well-formed sentences; finally, it requires the visual information to be mapped accurately and reasonably into a semantic space that the language model can understand. Traditional visual description methods generally translate visual content into text with end-to-end deep convolutional networks and recurrent neural networks, and they do not model the mapping between visual content and textual elements well. Given the effectiveness of attention and memory mechanisms in modelling the correspondence between the visual and linguistic modalities, this dissertation starts from attention and memory mechanisms to explore more effective visual description algorithms. The main work of this dissertation is summarized as follows:

  • By introducing a modality-relevance loss and a temporal-consistency loss into the traditional encoder-decoder framework, a stacked forward and backward attention model is proposed for image captioning. By selecting with multiple stacked attention layers during sentence generation, the model gradually filters out visual signals that are irrelevant to the description and accurately captures the visual features needed for image captioning. In addition, the model constrains the visual signal determined by the generated word in the backward attention model to be consistent with the visual signal produced by the forward attention model (a minimal sketch of this consistency idea is given after this list).
  • By learning the relationships between objects in an image with a graph neural network, an image captioning model based on graph-network relationship learning is proposed. The model implicitly learns the relationships between semantic objects in the image in order to generate more accurate sentence descriptions. It also introduces a context-aware attention mechanism, so that during sentence generation the model can take into account the visual information it has attended to before.
  • By explicitly modelling the temporal information of the video and the sentence with memory, a multimodal memory model is proposed for video captioning. The model builds a memory shared by the visual and textual modalities to capture long-term visual-textual dependencies and to guide the alignment of the attended visual and textual information during caption generation.
  • By modelling textual memory, attribute memory, and visual memory at three semantic levels of the video captioning process, a hierarchical memory model is proposed for video captioning, which can effectively bridge the semantic gap between the video content and the descriptive sentences. The attribute memory in this model is built on top of the textual memory and visual memory, which respectively model the long-term dependencies of their modalities and help guide the selection of the current semantic attributes.
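
To make the forward/backward consistency constraint in the first item concrete, below is a minimal PyTorch sketch, not the dissertation's actual implementation: an additive soft-attention layer selects a visual feature from image-region features, and an MSE term penalizes disagreement between the features selected by a forward pass and a backward pass. The module names, dimensions, and the choice of MSE as the consistency term are illustrative assumptions.

```python
# Illustrative sketch only: additive soft attention over region features plus an
# MSE-style consistency loss between forward- and backward-attended features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over V region features of size D."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions: torch.Tensor, hidden: torch.Tensor):
        # regions: (B, V, D) image region features; hidden: (B, H) decoder state
        scores = self.score(torch.tanh(
            self.feat_proj(regions) + self.hidden_proj(hidden).unsqueeze(1)))
        weights = F.softmax(scores, dim=1)         # (B, V, 1) attention weights
        attended = (weights * regions).sum(dim=1)  # (B, D) selected visual feature
        return attended, weights.squeeze(-1)


def consistency_loss(forward_feat: torch.Tensor, backward_feat: torch.Tensor):
    """Penalize disagreement between forward- and backward-attended features."""
    return F.mse_loss(forward_feat, backward_feat)


if __name__ == "__main__":
    B, V, D, H = 2, 36, 512, 512
    attn_fwd, attn_bwd = SoftAttention(D, H), SoftAttention(D, H)
    regions = torch.randn(B, V, D)
    h_fwd, h_bwd = torch.randn(B, H), torch.randn(B, H)
    feat_fwd, _ = attn_fwd(regions, h_fwd)
    feat_bwd, _ = attn_bwd(regions, h_bwd)
    print(consistency_loss(feat_fwd, feat_bwd).item())
```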

The methods proposed in this dissertation address many important problems in the field of visual description and achieve strong experimental results on a number of visual description benchmark datasets. The work also identifies several key issues in applying visual description techniques to practical scenarios, providing reference directions for subsequent research.

English Abstract
Visual description is a comprehensive problem that combines computer vision, natural language processing, and machine learning. Its goal is to use a computer to generate a sentence describing the visual content of an image or video. With the wide availability of visual devices and the Internet, visual description has many applications in practical scenarios such as human-computer interaction, navigation for the blind, and cross-modal retrieval. The task is very easy for humans but very challenging for machines. First, it calls for an algorithm that detects details of the visual content such as objects, attributes, actions, and relationships. Second, it needs a powerful language model to generate sentences with reasonable grammatical structure. Finally, an algorithm must be designed to map this visual information into a semantic space that the language model can understand. Traditional visual captioning methods generally use end-to-end deep convolutional networks and recurrent neural networks to translate visual content into text, but they are not good at modelling the semantic alignment between visual content and textual elements. Considering the effectiveness of attention and memory mechanisms in modelling the semantic alignment between the visual and linguistic modalities, this dissertation explores more effective visual captioning algorithms based on attention and memory mechanisms. The overall work of this dissertation is summarized as follows:
  • By incorporating a multimodal relevance loss and a sequential consistency loss into the traditional encoder-decoder framework, a stacked forward and backward attention model is proposed for image captioning. The model employs multiple stacked attention layers to gradually filter out irrelevant contextual information and select suitable visual content during sentence generation, so it can accurately capture the visual features required for the image description. In addition, the model constrains the consistency between the attended visual feature inferred from the next word by the backward attention model and the visual feature selected by the stacked forward attention model.
  • To learn the visual relationships between the objects in an image, an image captioning model based on graph-network relationship learning is proposed. The model can implicitly learn the relationships between semantic objects in the image to generate more accurate sentence descriptions. It also introduces a context-aware attention mechanism, which enables the model to take into account the previously attended visual information during sentence generation (a simplified message-passing sketch follows this list).
  • By explicitly modelling the sequential information of the video and the sentence, a multimodal memory model is proposed for video captioning. The model builds a memory shared by the visual and textual modalities to capture long-term visual-textual dependencies and to guide visual attention to the described visual targets, aligning the visual and textual information (a simplified shared-memory read/write sketch follows this abstract).
  • By modelling textual memory, attribute memory, and visual memory at three semantic levels of the video captioning process, a hierarchical memory model is proposed for video captioning, which can effectively bridge the semantic gap between the video content and the generated sentences. The attribute memory is built on top of the textual memory and visual memory in a hierarchical way; the textual and visual memories model the long-term temporal dependencies of the sentence and video sequences and guide attention for semantic attribute selection.
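
As referenced in the second item above, the sketch below illustrates one common way to realize implicit relationship learning over detected objects: a single round of attention-based message passing on a fully connected object graph. This is an assumption-laden simplification for illustration, not the model proposed in the dissertation.

```python
# Illustrative sketch only: one attention-based message-passing step over
# detected object features, so each object is updated with relational context.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationLayer(nn.Module):
    """One message-passing step over N object nodes with feature size D."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (B, N, D) features of detected objects
        q, k, v = self.query(objects), self.key(objects), self.value(objects)
        # Soft adjacency over the fully connected object graph: (B, N, N).
        relation = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        messages = relation @ v  # (B, N, D) context aggregated from related objects
        # Each node is updated from its own feature plus the received messages.
        return torch.relu(self.update(torch.cat([objects, messages], dim=-1)))


if __name__ == "__main__":
    objs = torch.randn(2, 10, 512)        # e.g. 10 detected objects per image
    print(RelationLayer(512)(objs).shape)  # torch.Size([2, 10, 512])
```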
The methods proposed in this dissertation address several important problems in the field of visual description and achieve better experimental results than state-of-the-art methods on different visual description benchmark datasets. Moreover, this work also points out some critical issues in applying visual description to practical scenarios and provides suggestions for subsequent research in this field.
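
The shared-memory idea in the third item can be illustrated, under simplifying assumptions, by a small external memory with soft read/write addressing that both the visual and textual states can access. The slot count, scaled dot-product addressing, and NTM-style erase/add update below are illustrative choices rather than the architecture described in the dissertation.

```python
# Illustrative sketch only: a slot memory shared by visual and textual keys,
# with soft addressing for reads and an erase/add update for writes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedMemory(nn.Module):
    """Soft read/write over M memory slots of size D, keyed by either modality."""

    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.erase = nn.Linear(dim, dim)
        self.add = nn.Linear(dim, dim)

    def address(self, memory: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product addressing: (B, M) soft weights over the slots.
        return F.softmax(key @ memory.t() / memory.size(-1) ** 0.5, dim=-1)

    def read(self, memory: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
        # Read vector: weighted sum of slots, (B, D).
        return self.address(memory, key) @ memory

    def write(self, memory: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
        # Erase/add update, averaged over the batch for simplicity.
        w = self.address(memory, key).unsqueeze(-1)          # (B, M, 1)
        erase = torch.sigmoid(self.erase(key)).unsqueeze(1)  # (B, 1, D)
        add = torch.tanh(self.add(key)).unsqueeze(1)         # (B, 1, D)
        updated = memory * (1 - w * erase) + w * add         # (B, M, D)
        return updated.mean(dim=0)                           # (M, D)


if __name__ == "__main__":
    mem = SharedMemory(num_slots=8, dim=512)
    memory = mem.slots
    visual_key = torch.randn(2, 512)          # e.g. an attended video feature
    text_key = torch.randn(2, 512)            # e.g. a decoder hidden state
    memory = mem.write(memory, visual_key)    # visual side writes
    read_vec = mem.read(memory, text_key)     # language side reads
    print(read_vec.shape)                     # torch.Size([2, 512])
```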
Keywords: visual description; attention and memory mechanisms; long-sequence modeling; modality relevance; relationship learning
Language: Chinese
Research direction (sub-direction classification): Image and Video Processing and Analysis
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/28357
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
王君波. 基于注意与记忆机制的视觉描述[D]. 中国科学院自动化研究所, 2019.
Files in this item:
王君波博士毕业论文.pdf (6335 KB), 学位论文 (dissertation); access: restricted; license: CC BY-NC-SA