基于语义的跨模态检索研究 (Research on Semantics-Based Cross-Modal Retrieval)


	基于语义的跨模态检索研究 (Research on Semantics-Based Cross-Modal Retrieval)
	程文龙
	2022-05-21
Pages	74
Degree type	Master's
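Abstract (translated from Chinese)	With advances in information technology and hardware, large volumes of multimedia data such as images, text, speech, and video have appeared on the Internet. Quickly finding useful information in this data has become a pressing problem. Cross-modal retrieval emerged in response and has attracted wide attention from researchers. Unlike single-modal retrieval, cross-modal retrieval must contend with the modality gap; its main challenge is measuring content similarity between data from different modalities. Progress in computer vision and natural language processing has driven considerable progress in cross-modal retrieval, but several problems remain. First, it is unclear whether cross-modal retrieval techniques transfer successfully to other related fields. Second, prior cross-modal research has focused mainly on images and text and has paid little attention to speech, even though speech is more convenient than text in some scenarios. Third, prior methods do not suppress the modality gap between images and speech well. This thesis addresses these problems and makes the following contributions: 1. A retrieval-based method is proposed for the pointing problem in visual question answering; it is a successful attempt at transferring a retrieval model to the visual question answering task. The method pulls the question feature toward the correct-answer feature in a common feature space while pushing it away from incorrect-answer features. It not only handles pointing problems constrained by candidate answers but also offers a feasible approach to pointing problems without candidate answers. The method performs well on the pointing task in visual question answering. 2. A retrieval method based on semantic information and feature reconstruction is proposed for speech-image retrieval. First, semantic information corresponding to the speech data is used to introduce an auxiliary alignment between images and speech, from which a tri-modal ranking loss is derived. Second, a cycle-consistency loss based on feature reconstruction is introduced, which further suppresses the modality gap between the visual and speech modalities. Extensive experiments verify the effectiveness of the method, which performs well on the speech-image retrieval task.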
English abstract	With the development of information technology and hardware, a large amount of multimedia data, such as images, text, speech, and video, has emerged on the Internet. How to quickly find useful information in this data has become an urgent problem. To address it, cross-modal retrieval has emerged and attracted much attention from researchers. Compared with single-modal retrieval, cross-modal retrieval faces the modality gap, and its main challenge lies in measuring the content similarity between data samples of different modalities. With the development of computer vision and natural language processing, cross-modal retrieval has made great progress. However, some problems remain. The first is the transfer and application of cross-modal retrieval, that is, whether it can be successfully applied to other related fields. The second is that prior cross-modal research focuses mainly on images and text and pays little attention to speech, although speech is more convenient than text in some scenarios. The third is that prior methods cannot alleviate the modality gap between image and speech well. This thesis focuses on these problems and makes the following contributions: 1. We propose a retrieval-based method for the pointing problem in visual question answering, which is a successful attempt to transfer a retrieval model to the visual question answering task. The method pulls the question feature and the correct-answer feature close together in a common feature space while pushing the question feature away from incorrect-answer features. In addition, the method not only solves the pointing problem with candidate answers but also provides a feasible solution for the pointing problem without candidate answers. The proposed method achieves good performance on the pointing problem of visual question answering. 2. We propose a retrieval method based on semantic information and feature reconstruction for speech-image retrieval. First, we leverage semantic information corresponding to the acoustic data to introduce an auxiliary alignment between image and speech, and accordingly propose a tri-modal ranking loss. Second, we introduce a cycle-consistency loss based on feature reconstruction, which further alleviates the modality gap between the visual and acoustic modalities. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves good performance on the speech-image retrieval task.
Keywords	cross-modal retrieval; visual question answering; speech-image retrieval; tri-modal ranking loss; cycle-consistency loss
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48486
Collection	Pattern Recognition Laboratory; Graduates_Master's Theses; Graduates
Recommended citation (GB/T 7714)	程文龙. 基于语义的跨模态检索研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2022.
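
The abstract describes pulling a question feature toward the correct-answer feature and pushing it away from incorrect-answer features in a common space. A minimal sketch of that idea is given here as a hinge-style ranking loss. The cosine similarity, the margin value, the hinge form, and the names `cosine` and `ranking_loss` are assumptions for illustration; this record does not specify the loss actually used in the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(query, positive, negatives, margin=0.2):
    """Hinge ranking loss (assumed form): the positive should score
    higher than every negative by at least `margin`."""
    pos_sim = cosine(query, positive)
    return sum(max(0.0, margin - pos_sim + cosine(query, neg))
               for neg in negatives)

# Hypothetical 2-D embeddings, for illustration only.
question = [1.0, 0.0]
correct = [1.0, 0.0]
wrong_far = [0.0, 1.0]   # already well separated from the question
wrong_near = [1.0, 0.0]  # indistinguishable from the correct answer

loss_far = ranking_loss(question, correct, [wrong_far])
loss_near = ranking_loss(question, correct, [wrong_near])
```

In this sketch a well-separated incorrect answer contributes no loss, while one that sits as close to the question as the correct answer is penalized by the margin. A tri-modal variant could plausibly sum such terms over image-speech, image-text, and speech-text pairs, but that combination is likewise an assumption.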

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis_程文龙_已签名.pdf (3471 KB)	Thesis		Open access	CC BY-NC-SA