基于视觉与语言的行人理解 (Pedestrian Understanding Based on Vision and Language)
荆雅
2021-05-19
Pages: 118
Degree type: Doctoral

Abstract

Pedestrian understanding based on vision and language is a comprehensive problem that combines computer vision, natural language processing, and machine learning. Its goal is to jointly understand pedestrian images and natural language descriptions of pedestrians. With the popularization of infrastructure and the Internet, massive amounts of multi-modal data, including visual and textual data, are generated every day, and the most important part is images and texts about people. Understanding them has a wide range of applications, including behavior recognition in video surveillance, cross-modal pedestrian retrieval, and communication with machines in human-computer interaction. In real scenarios, different task requirements have given rise to different solutions for vision-and-language pedestrian understanding, but all of them require a thorough understanding of both the language and the visual content, as well as learning the semantic relevance between image and text.

The research in this thesis proceeds from shallow to deep. It first understands the activities and related objects in the entire image, and then focuses on specific pedestrians, i.e., text-based person search, where the research expands from supervised to unsupervised learning to explore the generality of the models. Finally, to achieve more natural human-computer interaction, the focused pedestrian is further segmented along its contours, i.e., text-based pedestrian segmentation. As a cross-modal task in which vision and language interact, modeling the correlation between images and texts to bridge the semantic gap is the biggest challenge; as a pedestrian-related task, handling pose variations and the subtle differences between pedestrians in real scenes is another major challenge. Although recent studies have made progress, these methods still have limitations, and there is considerable room for technical improvement. Based on these observations, this thesis starts from establishing the semantic relevance between vision and language and studies the multi-granularity correspondence between pedestrian images and texts. For visual understanding, it studies the relationships between different objects in the image; for text understanding, it explores the intrinsic semantic relevance between different descriptive words in the description. The multi-granularity semantic correspondence between the modalities can thereby be better established. In summary, the main contributions of this thesis are as follows:

1. For situation recognition, which aims to understand the environment in which a pedestrian is situated, this thesis proposes a novel relational graph neural network. It builds connections between the activity and the objects, and explicitly models the triplet relationships between the activity and pairs of objects through message passing between graph nodes. Moreover, a two-stage training strategy is proposed to optimize the model: a progressive supervised learning scheme, which gradually adds weights to the cross-entropy loss to speed up training, is first adopted to obtain initial predictions for the activity and the objects; then, to unify the training and testing procedures, the predictions are refined with a policy-gradient method that directly optimizes the non-differentiable "value-all" metric (a simplified code sketch follows this list).

2. For text-based person search, which aims to understand pedestrian appearance, extracting the visual contents corresponding to the textual description is the key to this cross-modal matching problem. This thesis proposes a coarse alignment network and a fine-grained alignment network to extract related visual contents at multiple granularities. In the fine-grained alignment network, pose information is used to guide the learning of latent semantic correspondences between visual body parts and textual noun phrases; to our knowledge, this is the first method to use pedestrian pose for this task (see the second sketch after this list).

3. In text-based person search, the structures of different sentences can vary greatly. Modeling the relationships within a text description makes it possible to judge whether different words describe the same visual object, which is usually ignored in previous work. Therefore, this thesis proposes a graph attentive relational network that learns aligned image-text representations by modeling the relationships between noun phrases (see the third sketch after this list).

4. This thesis extends supervised text-based person search to the unsupervised setting to reduce the cost of data annotation. It makes the first attempt to transfer a model to a new target domain without paired labels, i.e., domain-adaptive text-based person search, which combines the challenges of cross-modal and cross-domain person retrieval. For this new task, a moment alignment network is proposed, in which three effective moment alignments, namely domain alignment, cross-modal alignment, and exemplar alignment, are jointly learned to obtain domain-invariant and semantically aligned cross-modal representations (see the fourth sketch after this list).

5. To localize a specific pedestrian in the image more precisely, this thesis studies referring segmentation of pedestrians, which aims to segment the object described by a natural language expression. Previous methods usually focus on designing an implicit, recursive feature-interaction mechanism that fuses vision-language features to directly generate the final segmentation, without explicitly modeling the location of the referred pedestrian. To address this, the thesis views the task from another perspective and decouples it into a locate-then-segment scheme with two sequential steps: (1) predicting the position of the referred object, and (2) generating the object segmentation mask. In addition, by explicitly localizing the object, the model becomes easier to interpret (see the fifth sketch after this list).
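
The sketches below are illustrative only. For contribution 1, this first minimal sketch shows the two core ideas in simplified form: one round of message passing between an activity node and object nodes, and a REINFORCE-style policy-gradient loss for a non-differentiable reward such as the value-all metric. All module names, dimensions, and the GRU-based node update are assumptions, not the thesis's actual architecture.

    import torch
    import torch.nn as nn

    class RelationalGraphLayer(nn.Module):
        """One round of message passing between an activity node and object
        nodes; a hypothetical simplification of the relational graph network."""
        def __init__(self, dim=256):
            super().__init__()
            self.obj_to_act = nn.Linear(dim, dim)  # messages from objects to activity
            self.act_to_obj = nn.Linear(dim, dim)  # messages from activity to objects
            self.update = nn.GRUCell(dim, dim)     # shared node-state update

        def forward(self, act, objs):
            # act: (B, dim) activity node; objs: (B, N, dim) object nodes
            B, N, D = objs.shape
            m_act = self.obj_to_act(objs).mean(dim=1)  # aggregate object messages
            m_obj = self.act_to_obj(act).unsqueeze(1).expand(B, N, D)
            act = self.update(m_act, act)
            objs = self.update(m_obj.reshape(B * N, D),
                               objs.reshape(B * N, D)).reshape(B, N, D)
            return act, objs

    def policy_gradient_loss(logits, reward):
        # REINFORCE: weight the log-probability of a sampled prediction by a
        # non-differentiable reward, e.g. the value-all metric of that sample.
        dist = torch.distributions.Categorical(logits=logits)
        sample = dist.sample()
        return -(reward * dist.log_prob(sample)).mean()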
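
For contribution 2, the sketch below shows one way the fine-grained alignment could compute latent correspondences between pose-guided body-part features and noun-phrase embeddings, with each phrase attending over the body parts. The function name, shapes, and the simple dot-product attention are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def fine_grained_alignment(part_feats, phrase_feats):
        # part_feats: (K, D) visual features pooled around pose-derived body parts
        # phrase_feats: (P, D) embeddings of textual noun phrases
        part_feats = F.normalize(part_feats, dim=-1)
        phrase_feats = F.normalize(phrase_feats, dim=-1)
        sim = phrase_feats @ part_feats.t()     # (P, K) phrase-to-part similarities
        attn = sim.softmax(dim=-1)              # each phrase attends to body parts
        attended = attn @ part_feats            # (P, D) pose-guided visual context
        return (attended * phrase_feats).sum(-1).mean()  # phrase-level match score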
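
For contribution 3, the following sketch approximates relation modeling between noun phrases with standard multi-head attention over phrase nodes, so that phrases describing the same visual object can exchange information. The thesis's graph attentive relational network is more specific; this layer is a generic stand-in with assumed dimensions.

    import torch.nn as nn

    class PhraseRelationLayer(nn.Module):
        """Attention over noun-phrase nodes as a rough stand-in for graph
        attention; each phrase aggregates context from the other phrases."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, phrases):                        # phrases: (B, P, dim)
            ctx, _ = self.attn(phrases, phrases, phrases)  # phrase-phrase relations
            return self.norm(phrases + ctx)                # relation-aware features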
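
For contribution 4, the sketch below shows a CORAL-style moment-matching loss that aligns the first and second moments (mean and covariance) of two feature sets; the thesis's moment alignment network may use a different formulation, so this is a generic stand-in. In principle the same loss can serve all three alignments: domain (source vs. target), cross-modal (image vs. text), and exemplar (features of the same pedestrian).

    import torch

    def moment_alignment_loss(feats_a, feats_b):
        # feats_a, feats_b: (N, D) feature batches from the two sides to align.
        mean_a, mean_b = feats_a.mean(0), feats_b.mean(0)
        cov_a = (feats_a - mean_a).t() @ (feats_a - mean_a) / (feats_a.size(0) - 1)
        cov_b = (feats_b - mean_b).t() @ (feats_b - mean_b) / (feats_b.size(0) - 1)
        # Penalize differences in both first and second moments.
        return (mean_a - mean_b).pow(2).sum() + (cov_a - cov_b).pow(2).sum()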
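
For contribution 5, the final sketch decouples referring segmentation into the locate-then-segment scheme described above: a position heatmap for the referred pedestrian is predicted first and then conditions the mask generation, making the localization explicit and inspectable. The convolutional modules and shapes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class LocateThenSegment(nn.Module):
        """Hypothetical two-step head: predict where the referred pedestrian
        is, then generate its mask conditioned on that position."""
        def __init__(self, dim=256):
            super().__init__()
            self.locator = nn.Conv2d(dim, 1, 1)     # step 1: position heatmap
            self.segmenter = nn.Sequential(         # step 2: mask generation
                nn.Conv2d(dim + 1, dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim, 1, 1))

        def forward(self, fused):      # fused image-text features (B, dim, H, W)
            heat = self.locator(fused) # explicit, interpretable location
            mask = self.segmenter(torch.cat([fused, heat.sigmoid()], dim=1))
            return heat, mask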

In summary, the methods proposed in this thesis address the cross-modal semantic association problem in vision-and-language pedestrian understanding through relationship modeling, and achieve good experimental results on many different benchmark datasets.

Keywords: situation recognition; person search; pedestrian segmentation; multi-modal alignment; relation learning
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44872
Collection: Center for Research on Intelligent Perception and Computing (智能感知与计算研究中心)
Recommended citation (GB/T 7714):
荆雅. 基于视觉与语言的行人理解[D]. 中国科学院大学自动化研究所, 2021.
Files in this item:
File name/size | Document type | Access | License
132665079423427500.p (21773 KB) | Thesis | Open Access | CC BY-NC-SA

Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.