Research on Vision-Language Cross-Modal Matching (视觉-语言跨模态匹配研究)
牛凯 (Kai Niu)
2020-08
Pages: 142
Degree type: Doctoral
Chinese Abstract
With the rapid development of technologies such as high-speed mobile Internet and cloud computing, and the large-scale adoption of smart terminal devices, the volume of data worldwide has grown explosively. At the same time, data has become increasingly diverse: texts, images, videos, speech, and many other kinds of data are generated at every moment and spread rapidly across the network; the world has entered an era of multimedia big data. One of the main reasons multimedia big data conveys information so efficiently is that it contains many different data modalities.
Vision and language are two representative modalities among multimodal data. Overcoming the significant heterogeneity between these two modalities and matching them accurately according to their core semantic features is one of the current research hotspots in multimodal computing, and it has given rise to many application scenarios close to practical use, bringing great convenience to people's work and life. To help machines better perceive and understand visual and linguistic data, this thesis focuses on vision-language cross-modal matching and addresses the open problems in this field from two perspectives, representation learning and metric learning. The specific research contents are as follows:
1. To address the difficulty of fine-grained vision-language cross-modal matching, this thesis studies person search by natural language description with a multi-granularity attention matching model. Previous work in vision-language cross-modal matching mainly focused on extracting and matching global contextual features, but such coarse-grained solutions may lose detailed semantics. Taking person search by natural language description as an example, accurate matching and retrieval require distinguishing the fine-grained appearance of different pedestrians. We therefore propose a multi-granularity image-text cross-modal attention matching method that combines matching at several different granularities, mines fine-grained feature semantics more effectively, and thus achieves more accurate matching and retrieval.
2. In fine-grained vision-language matching, descriptive sentences are usually long, with complex structure and semantics that are difficult to model accurately, which leads to imbalanced modeling quality across the modalities; to address this, this thesis studies person search by natural language description with a textual dependency-analysis embedding model. Previous solutions to this task mainly focused on image feature extraction, while language features were generally modeled with simple sequential structures that capture only adjacency relations, leading to imbalanced feature quality between the modalities and hindering the subsequent cross-modal matching. We therefore propose a language modeling method based on textual dependency-analysis embedding, which embeds the long-distance syntactic dependencies produced by text-analysis tools into the sentence encoding, obtains sentence features with more accurate semantics for the subsequent retrieval, and achieves the best performance reported on this task.
3. To address the unavoidable cross-domain problem that arises when moving from laboratory research to real-world deployment, this thesis studies cross-domain person search by natural language description with an adaptive transfer fusion model. In research settings, the data generally come from the same acquisition conditions, i.e., they share the same domain characteristics. In practical deployments, however, systems run over long time spans, so later test data can differ greatly from the training data in acquisition conditions and style, i.e., the cross-domain problem. We therefore propose an adaptive transfer fusion method that fuses two opposite directions of adaptive transfer and swaps the order in which the two core difficulties, cross-modal and cross-domain, are handled, thereby reducing the domain gap between the data to a considerable extent. Based on this method, we won first place in an international challenge.
4. To address the difficulty of accurately measuring the similarity between cross-modal samples, this thesis studies re-ranking for image-text matching with an adaptive metric fusion model. Current research on vision-language cross-modal matching mainly focuses on representation learning, while metric learning is relatively under-explored, even though the quality of the metric is equally important for accurate matching. This work evaluates the performance of multiple metrics by considering the rankings in the two opposite retrieval directions as well as the influence of the different visual and textual feature spaces, and then adaptively fuses the metrics into a final metric that better measures the similarity between cross-modal samples. As a post-processing scheme, the proposed method generalizes well, can be conveniently applied to many kinds of cross-modal features, and further improves matching performance at almost no extra cost.
English Abstract
With the rapid development of high-speed mobile Internet and cloud computing and the widespread adoption of smart terminal devices, the amount of data has grown explosively. At the same time, data is becoming increasingly diverse: texts, images, videos, and voices are generated at every moment and spread rapidly through the Internet. In short, the world has entered an era of multimedia big data. One of the main reasons multimedia big data conveys information so efficiently is that it comprises many different data modalities.
Among the different data modalities, vision and language are two representative ones. Relating vision and language is currently one of the most active research topics, with the central goal of achieving accurate cross-modal matching by overcoming the significant modality heterogeneity. On this basis, many applications close to real-life scenarios have emerged, bringing tremendous convenience to people's lives. This thesis therefore focuses on vision-language cross-modal matching to provide machines with better perception and understanding, addressing the difficulties in this field from two perspectives: representation learning and metric learning. The detailed research contents are as follows:
1. To address the difficulty of fine-grained vision-language cross-modal matching, we propose a multi-granularity attention model for person search by language. Previous research on vision-language matching mainly focused on global context matching, but such coarse-grained solutions may miss detailed semantics. Taking person search by language as an example, the fine-grained appearance features of different pedestrians must be extracted to achieve accurate retrieval. We therefore propose a multi-granularity image-text alignment method: by combining cross-modal alignments at multiple granularities, it extracts fine-grained features more effectively and performs more accurate cross-modal matching.
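To make the idea concrete, here is a minimal PyTorch sketch of multi-granularity matching: a global image-sentence similarity is fused with an attention-based word-to-region similarity. The tensor shapes, attention form, and fusion weight are illustrative assumptions, not the architecture actually used in the thesis.

```python
# Illustrative sketch only: fusing a coarse (global) and a fine (local)
# image-text similarity. Shapes and the fusion weight alpha are hypothetical.
import torch
import torch.nn.functional as F

def global_similarity(img_global, txt_global):
    # img_global, txt_global: (B, D) pooled image/sentence features.
    return F.cosine_similarity(img_global, txt_global, dim=-1)

def local_similarity(regions, words):
    # regions: (B, R, D) region features; words: (B, W, D) word features.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    attn = torch.softmax(words @ regions.transpose(1, 2), dim=-1)  # (B, W, R)
    attended = attn @ regions                                      # (B, W, D)
    # Average each word's cosine similarity to its attended region mixture.
    return F.cosine_similarity(words, attended, dim=-1).mean(dim=-1)

def multi_granularity_similarity(img_global, txt_global, regions, words, alpha=0.5):
    # Combine the two granularities into one matching score per pair.
    return alpha * global_similarity(img_global, txt_global) + \
           (1 - alpha) * local_similarity(regions, words)

B, R, W, D = 2, 6, 8, 32
score = multi_granularity_similarity(torch.randn(B, D), torch.randn(B, D),
                                     torch.randn(B, R, D), torch.randn(B, W, D))
print(score.shape)  # torch.Size([2])
```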
2. To address the difficulty of modeling the typically long and structurally complex sentences in fine-grained vision-language matching, we propose a textual dependency embedding method for person search by language. Most previous solutions to this task focus on image feature extraction and encode sentences with simple sequential models only, which causes imbalanced feature quality between the modalities and hinders cross-modal matching. Our textual dependency embedding method injects the long-distance dependencies produced by widely used sentence-analysis tools into the sentence encoding, and it achieves state-of-the-art performance on this task.
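As an illustration of the general technique, the sketch below uses spaCy (one possible off-the-shelf text-analysis tool; not necessarily the one used in the thesis) to extract dependency edges, then mixes each token's embedding with those of its syntactic neighbors. The one-hop aggregation is a hypothetical stand-in for the thesis's actual encoder.

```python
# Illustrative sketch only: embedding long-distance syntactic dependencies
# into a sentence encoding via one round of message passing over parse edges.
import spacy
import torch

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def dependency_adjacency(sentence):
    doc = nlp(sentence)
    n = len(doc)
    adj = torch.eye(n)  # self-loops preserve each token's own embedding
    for tok in doc:
        if tok.i != tok.head.i:           # the root points to itself in spaCy
            adj[tok.i, tok.head.i] = 1.0  # child -> head edge
            adj[tok.head.i, tok.i] = 1.0  # head -> child edge (undirected here)
    return adj / adj.sum(dim=-1, keepdim=True)  # row-normalize

def dependency_enhanced_encoding(sentence, token_embeddings):
    # token_embeddings: (n_tokens, D) from any encoder, aligned to spaCy's
    # tokenization for the simplicity of this sketch.
    adj = dependency_adjacency(sentence)
    return adj @ token_embeddings  # mix each token with its syntactic neighbors

sent = "The woman in the red coat carries a black shoulder bag"
emb = torch.randn(len(nlp(sent)), 64)
print(dependency_enhanced_encoding(sent, emb).shape)  # torch.Size([n_tokens, 64])
```

The point of routing information along parse edges rather than only between adjacent tokens is that, e.g., "woman" and "carries" interact directly even with a long modifier phrase between them.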
3. To address the inevitable cross-domain problem that arises when a task is deployed from research to real-world scenarios, we propose a cross-domain adaptation method for cross-domain person search by language. In research, all data come from the same acquisition conditions and share the same domain characteristics. In practical applications, however, the acquisition conditions and domain characteristics at inference time differ greatly from those at training time, i.e., the cross-domain problem arises. To solve it, we propose a cross-domain adaptation method that combines two opposite directions of domain adaptation and swaps the order in which the cross-domain and cross-modal difficulties are addressed. Our method narrows the gap between domains and facilitates practical deployment of this task, and with it we won first place in the WIDER Face and Person Challenge 2019.
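A minimal score-level sketch of fusing the two opposite adaptation directions follows. It assumes two hypothetical models, one adapted from source to target and one from target to source, each producing a query-gallery similarity matrix; their normalized scores are then blended. The normalization and the fixed weight are illustrative assumptions, not the thesis's actual scheme.

```python
# Illustrative sketch only: fusing retrieval scores from two opposite adaptive
# transfer directions. sim_s2t / sim_t2s are hypothetical outputs of models
# adapted in the source->target and target->source directions, respectively.
import numpy as np

def fuse_opposite_transfers(sim_s2t, sim_t2s, weight=0.5):
    # sim_*: (n_queries, n_gallery) similarity matrices from the two directions.
    def norm(s):
        # Min-max normalize so the two score scales are comparable before fusion.
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return weight * norm(sim_s2t) + (1 - weight) * norm(sim_t2s)

sim_a = np.random.rand(4, 10)
sim_b = np.random.rand(4, 10)
ranks = np.argsort(-fuse_opposite_transfers(sim_a, sim_b), axis=1)
print(ranks[:, :3])  # top-3 gallery indices per query after fusion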
4. To address the difficulty of accurately measuring cross-modal similarities, we propose an adaptive metric fusion method for re-ranking image-text matching results. Most current research on vision-language cross-modal matching focuses on representation learning; metric learning, although equally important for accurate matching, is much less studied. Our method evaluates the performance of different metrics by considering the two opposite retrieval directions and the influence of the different visual and textual feature spaces, and then adaptively fuses these metrics into a final metric that better measures the cross-modal similarity between images and sentences. As a post-processing approach, it can easily be applied to arbitrary cross-modal features, i.e., it generalizes well, and it yields further significant improvements for image-text matching at almost no extra cost.
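One way to realize such adaptive fusion, sketched under assumptions: each candidate metric's similarity matrix is scored with an unsupervised proxy computed from both retrieval directions (here, the mutual-nearest-neighbor rate, a hypothetical stand-in for the thesis's actual criterion), and the metrics are fused with the resulting weights.

```python
# Illustrative sketch only: adaptively fusing several candidate metrics for
# image-text re-ranking. The quality proxy and weighting are hypothetical.
import numpy as np

def mutual_nn_rate(sim):
    # sim: (n_images, n_texts). A pair counts as a mutual nearest neighbor if
    # each side ranks the other first in its own retrieval direction.
    i2t = sim.argmax(axis=1)  # best text per image (image->text direction)
    t2i = sim.argmax(axis=0)  # best image per text (text->image direction)
    return float(np.mean(t2i[i2t] == np.arange(sim.shape[0])))

def adaptive_metric_fusion(sims):
    # sims: list of (n_images, n_texts) similarity matrices, one per metric.
    weights = np.array([mutual_nn_rate(s) for s in sims])
    weights = weights / (weights.sum() + 1e-12)
    # Standardize each matrix so differently scaled metrics are comparable.
    normed = [(s - s.mean()) / (s.std() + 1e-12) for s in sims]
    return sum(w * s for w, s in zip(weights, normed))

sims = [np.random.rand(5, 5) for _ in range(3)]
print(adaptive_metric_fusion(sims).shape)  # (5, 5)
```

In this post-processing view no retraining is needed: only the similarity matrices produced by existing models are consumed, which is what gives the approach its generality across features.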
Keywords: vision-language cross-modal matching; representation learning; metric learning; person search by natural language description; image-text matching
Language: Chinese
Sub-direction classification (of the seven major research directions): Multimodal Intelligence
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/40592
Collection: Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714):
牛凯. 视觉-语言跨模态匹配研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2020.
Files in this item:
IR_毕业论文.pdf (6805 KB) · Document type: Dissertation · Access: Restricted · License: CC BY-NC-SA