Image Captioning Based on Sequence Generation
郭龙腾 (Guo Longteng)
2021-05-29
Pages: 150
Degree Type: Doctoral
Chinese Abstract

       As research on deep learning deepens, computer vision and natural language processing are showing a growing trend of convergence. As a cross-modal task that connects vision and language, image captioning extends and merges the research boundaries of computer vision and natural language processing, and is therefore of significant academic value. At the same time, image captioning has broad application prospects in fields such as the analysis and retrieval of visual content, human-computer interaction, and assistance for the visually impaired.

       Image captioning aims to accurately understand the visual content of an image and then express that understanding in natural language. Its core content therefore consists of two aspects: visual content representation and natural language expression. While image captioning has made remarkable progress, it still faces great challenges. On the one hand, because of the complexity and diversity of visual content, it is relatively difficult to accurately recognize the objects, attributes, and relationships in an image, which leads to missing or inaccurate content in the generated description. On the other hand, human language has a complex and variable structure and is also social and subjective, while current algorithms still suffer from insufficient accuracy, low decoding efficiency, and monotonous expression, and thus cannot achieve efficient and vivid image captioning. Therefore, centered on image captioning and based on sequence generation models built with deep neural networks, this thesis addresses the problems of image content representation and relationship modeling in visual representation, as well as decoding efficiency and linguistic style in language expression. By designing appropriate network architectures and learning algorithms, it achieves more accurate, efficient, and vivid image captioning.

       The main work and contributions of the thesis are summarized as follows:

       1. Image captioning based on vision-language alignment. To address the inability of object-region-based image representations to model relationships between objects, we propose to represent the image with a structural graph and to exploit the alignment between vision and language to improve decoding accuracy. Specifically, the method first uses a graph to uniformly model the visual elements in the image, including objects, attributes, semantic relations, and geometric relations. Then, graph neural networks are used to learn a context-aware embedding for each visual element on the graph. Finally, during decoding, a hierarchical attention mechanism accurately attends to the visual elements relevant to the language context. Experiments show that the method improves the accuracy and richness of image captions and achieved the best contemporaneous results on multiple metrics of the public dataset.

        2. Image captioning based on improved self-attention. To address two problems of the self-attention mechanism when modeling object relationships in images, namely its inability to model geometric relations and internal covariate shift, we propose a geometry-aware self-attention module and a normalized self-attention module, respectively. The geometry-aware self-attention module explicitly introduces the relative geometric relations between objects into self-attention; the normalized self-attention module introduces normalization inside the self-attention module to alleviate internal covariate shift. The image captioning model that combines these two improvements achieved the best contemporaneous performance on the public dataset. Experiments on video captioning, visual question answering, and machine translation further verify the effectiveness of the method.

        3. Fast image captioning based on multi-agent reinforcement learning. To address the high decoding latency of conventional autoregressive models, we propose a non-autoregressive image captioning method based on multi-agent reinforcement learning. Specifically, the method is the first to formulate the non-autoregressive decoder as a multi-agent reinforcement learning system that optimizes a sentence-level objective, thereby overcoming the incoherent sentences generated by conventional non-autoregressive decoders. Furthermore, a baseline reward based on counterfactual replacement is introduced to solve the credit assignment problem in multi-agent learning. Experiments on image captioning and machine translation show that, compared with autoregressive decoding, the method greatly increases decoding speed while effectively preserving sentence coherence.

        4. Multi-style image captioning based on semi-supervised learning. To address the stylistic characteristics of language in image captioning, we propose a stylized image captioning method based on semi-supervised learning, in which a single model can generate image captions in multiple language styles. Specifically, the method does not rely on paired stylized image-caption data and only requires common factual image-caption data and an unpaired multi-style text corpus for training. On top of semi-supervised learning, three modules, namely a text discriminator, a style classifier, and a back-translation network, are designed to constrain the fluency, stylization, and accuracy of the generated captions, respectively. Experimental results show that the method can generate fluent, accurate captions in the specified style.

English Abstract

       With the rapid development of deep learning, computer vision and natural language processing are increasingly intertwined. As a cross-modal task that bridges vision and language, image captioning integrates the two fields of computer vision and natural language processing and has therefore received much attention recently. Image captioning has broad applications in many fields, such as the analysis and retrieval of visual content, human-computer interaction, and assistance for the visually impaired.

       Image captioning aims to comprehensively understand the visual content of an image and then describe it in natural language. Therefore, the core of image captioning consists of visual representation and linguistic expression. Although great progress has been made in image captioning, some challenges remain to be addressed. On the one hand, due to the complexity and diversity of visual information, it is difficult to accurately recognize the objects, attributes, and relationships in an image, resulting in missing or wrong content in the generated caption. On the other hand, human language has a very complex and flexible structure, and it is also social and subjective. Current image captioning methods still have limitations, such as inaccurate descriptions, low decoding efficiency, and monotonous sentences, which stand in the way of efficient and vivid image captioning. Therefore, this thesis focuses on image captioning based on neural sequence generation models. We address the problems of image representation and visual relationship modeling in visual representation, as well as decoding efficiency and linguistic style in linguistic expression, by designing appropriate neural networks and learning algorithms, which leads to more accurate, efficient, and vivid image captioning.

       The main contributions are summarized as follows:

       1. Image captioning with vision-language alignment. Existing image captioning methods typically regard an image as a set of isolated objects and thus fail to model the relationships between objects. To address this problem, we propose to represent the image with a structural graph and to leverage the vision-language alignment in the decoding process. Specifically, we first uniformly model the visual elements in the image, including objects, attributes, and semantic/geometric relations, with a graph. Then, graph neural networks are designed to obtain context-aware embeddings for the visual elements. Finally, in the decoding process, a hierarchical attention mechanism is introduced to accurately focus on the visual elements that are related to the language context. Experimental results show that the proposed method enhances the accuracy and richness of image descriptions and achieves the best results on multiple evaluation metrics of the public dataset. A rough sketch of this graph encoding and hierarchical attention idea is given after this paragraph.
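
The sketch below is a minimal PyTorch illustration of the two components named above, under assumptions: it shows one round of message passing over scene-graph nodes followed by a two-level (node-within-group, then group) attention. The module names (GraphEncoder, HierarchicalAttention) and the exact message/update functions are hypothetical simplifications, not the thesis' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEncoder(nn.Module):
    """One round of message passing over a scene graph whose nodes hold
    object / attribute / relation features (illustrative simplification)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message from neighbor j to node i
        self.upd = nn.GRUCell(dim, dim)      # node state update

    def forward(self, nodes, adj):
        # nodes: (N, dim) node embeddings; adj: (N, N) 0/1 adjacency matrix
        N, dim = nodes.shape
        pair = torch.cat([nodes.unsqueeze(1).expand(N, N, dim),
                          nodes.unsqueeze(0).expand(N, N, dim)], dim=-1)
        msgs = self.msg(pair) * adj.unsqueeze(-1)                 # mask non-edges
        agg = msgs.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.upd(agg, nodes)                               # context-aware embeddings

class HierarchicalAttention(nn.Module):
    """Attend over nodes within each group (objects / attributes / relations),
    then over the groups themselves, driven by the decoder language context."""
    def __init__(self, dim):
        super().__init__()
        self.node_att = nn.Bilinear(dim, dim, 1)
        self.group_att = nn.Linear(dim, 1)

    def forward(self, query, groups):
        # query: (dim,) language context; groups: list of (Ni, dim) node tensors
        ctxs = []
        for g in groups:
            a = F.softmax(self.node_att(query.expand_as(g), g).squeeze(-1), dim=0)
            ctxs.append((a.unsqueeze(-1) * g).sum(0))             # per-group context
        ctxs = torch.stack(ctxs)                                  # (num_groups, dim)
        w = F.softmax(self.group_att(ctxs).squeeze(-1), dim=0)
        return (w.unsqueeze(-1) * ctxs).sum(0)                    # fused visual context
```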

       2. Image captioning with an improved self-attention mechanism. When applying self-attention to image captioning, two problems remain: the inability to model geometric relations, and internal covariate shift. To alleviate these two problems, a Geometry-aware Self-Attention (GSA) module and a Normalized Self-Attention (NSA) module are proposed, respectively. GSA extends self-attention to explicitly and efficiently consider the relative geometric relations between objects. NSA brings the benefit of normalization inside self-attention to mitigate internal covariate shift. The two improvements are combined to build our image captioning model, which achieves the best performance on the public image captioning dataset. Experiments on video captioning, visual question answering, and machine translation further validate the effectiveness of our method.
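
The following single-head self-attention sketch illustrates, under the same hypothetical PyTorch setting, how the two ideas could enter the module: a normalization applied to the queries inside attention (NSA-like) and a bias on the attention logits computed from pairwise relative box geometry (GSA-like). The exact formulations in the thesis may differ; this is only an illustrative simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryNormalizedSelfAttention(nn.Module):
    """Illustrative single-head self-attention that (a) normalizes the queries
    inside the module and (b) adds a geometry-based attention bias."""
    def __init__(self, dim, geo_dim=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)                 # normalization inside attention
        self.geo = nn.Sequential(nn.Linear(geo_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))   # maps relative geometry to a bias
        self.scale = dim ** -0.5

    def forward(self, x, boxes):
        # x: (N, dim) region features; boxes: (N, 4) as (cx, cy, w, h)
        q = self.norm(self.q(x))                      # normalized queries (NSA-like)
        k, v = self.k(x), self.v(x)
        logits = q @ k.t() * self.scale               # (N, N) content attention

        # relative geometry between every pair of boxes (GSA-like bias)
        cx, cy, w, h = boxes.unbind(-1)
        rel = torch.stack([
            (cx[:, None] - cx[None, :]) / w[:, None],
            (cy[:, None] - cy[None, :]) / h[:, None],
            torch.log(w[None, :] / w[:, None]),
            torch.log(h[None, :] / h[:, None]),
        ], dim=-1)                                    # (N, N, 4) relative features
        logits = logits + self.geo(rel).squeeze(-1)   # add geometric bias to logits
        return F.softmax(logits, dim=-1) @ v          # attended region features
```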

       3. Fast image captioning with multi-agent reinforcement learning. Most image captioning models are autoregressive and therefore suffer from high latency during inference. To address this problem, we propose a non-autoregressive image captioning method based on multi-agent reinforcement learning. Specifically, we formulate the non-autoregressive decoder as a multi-agent reinforcement learning system that optimizes a sentence-level objective, thereby overcoming the decoding inconsistency of traditional non-autoregressive models. Furthermore, counterfactual reward baselines are introduced to address the credit assignment problem in multi-agent learning. Experimental results on image captioning and machine translation show that, compared with autoregressive models, our method significantly improves decoding speed while maintaining the coherence of the generated sentences.
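
A minimal sketch of the counterfactual-baseline idea follows, assuming PyTorch and a hypothetical sentence-level scorer reward_fn (e.g., a CIDEr evaluator) that is not defined here; the thesis' actual training procedure is more involved. Each output position acts as one agent, and replacing only that agent's word in the shared sentence isolates its contribution to the sentence-level reward.

```python
import torch
import torch.nn.functional as F

def counterfactual_pg_loss(logits, reward_fn):
    """Sentence-level policy gradient for a non-autoregressive decoder,
    treating each position as one agent (illustrative sketch).

    logits:    (T, V) per-position word distributions, produced in parallel
    reward_fn: callable mapping a list of token ids to a scalar sentence reward
               (assumed to exist elsewhere, e.g. a CIDEr scorer)
    """
    probs = F.softmax(logits, dim=-1)
    dist = torch.distributions.Categorical(probs)
    sampled = dist.sample()                       # (T,) one word per agent
    greedy = probs.argmax(dim=-1)                 # (T,) counterfactual replacements
    log_probs = dist.log_prob(sampled)            # (T,) per-agent log-probabilities

    base_reward = reward_fn(sampled.tolist())     # shared sentence-level reward
    loss = 0.0
    for t in range(sampled.size(0)):
        # counterfactual baseline: replace only agent t's word, keep the rest
        cf = sampled.clone()
        cf[t] = greedy[t]
        baseline = reward_fn(cf.tolist())
        advantage = base_reward - baseline        # credit assigned to agent t
        loss = loss - advantage * log_probs[t]
    return loss / sampled.size(0)
```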

       4. Multi-style image captioning with semi-supervised learning. Linguistic style is an essential factor of human language. We therefore propose a multi-style image captioning method whose single model can generate image descriptions in multiple language styles. Specifically, the method does not rely on paired stylized image-caption data and only requires common factual image-caption data and an unpaired multi-style text corpus for training. Under this semi-supervised setting, a caption discriminator, a style classifier, and a back-translation network are designed to ensure the fluency, stylization, and accuracy of the generated captions, respectively. Experimental results show that the method generates fluent, accurate, and stylized captions.
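
The sketch below illustrates how the three constraints could be combined into one training loss. The discriminator, style classifier, and back-translation modules are hypothetical callables with assumed interfaces and loss forms, shown only to make the division of labor concrete; they are not the thesis' actual implementation.

```python
import torch
import torch.nn.functional as F

def multi_style_losses(caption_ids, style_label, image_feat,
                       discriminator, style_classifier, back_translator):
    """Illustrative combination of the three constraints described above.
    Assumed (hypothetical) interfaces:
      discriminator(caption_ids)    -> (1,) real/fake logit          (fluency)
      style_classifier(caption_ids) -> (num_styles,) style logits    (stylization)
      back_translator(caption_ids)  -> reconstructed visual features (accuracy)
    """
    # 1. Fluency: the generated caption should fool the text discriminator.
    d_logit = discriminator(caption_ids)
    loss_fluency = F.binary_cross_entropy_with_logits(
        d_logit, torch.ones_like(d_logit))

    # 2. Stylization: the caption should be classified as the requested style.
    s_logits = style_classifier(caption_ids)
    loss_style = F.cross_entropy(s_logits.unsqueeze(0),
                                 torch.tensor([style_label]))

    # 3. Accuracy: back-translating the caption should recover the visual content.
    recon = back_translator(caption_ids)
    loss_accuracy = F.mse_loss(recon, image_feat)

    return loss_fluency + loss_style + loss_accuracy
```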

Keywords: Image Captioning; Vision and Language; Sequence Generation; Attention Mechanism; Non-Autoregressive Decoding
Language: Chinese
Sub-direction Classification: Multimodal Intelligence
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/44976
Collection: Graduates_Doctoral Dissertations
Recommended Citation
GB/T 7714
Guo Longteng. Image Captioning Based on Sequence Generation [D]. Institute of Automation, Chinese Academy of Sciences, 2021.
Files in This Item
File Name/Size: 郭龙腾-博士论文-基于序列生成的图像语义 (6291 KB)
Document Type: Dissertation
Access: Restricted
License: CC BY-NC-SA
 
