CASIA OpenIR > Graduates > Doctoral Dissertations
Research on Cross-Modal Video Retrieval Based on Structured Learning
王威
2023-05-20
Pages: 164
Degree type: Doctoral
Chinese Abstract

As society enters the mobile Internet era, everyone is a creator of digital content, and massive amounts of multimedia data are generated every day. Compared with media such as images and audio, video has become one of the most popular media forms thanks to its greater information-carrying capacity and more vivid forms of expression. Faced with this explosive growth in video data, how to efficiently and accurately retrieve content of interest, and thereby manage and exploit videos effectively, has important theoretical significance and application value for many aspects of the national economy and people's livelihood. A typical example lies in Internet video platforms and video surveillance, where cross-modal video retrieval can help serve platform users better and help safeguard public safety.

At present, the key problem in cross-modal video retrieval is how to compute the semantic relevance between cross-modal samples and achieve accurate semantic alignment, because there is a large heterogeneity gap between video and text, and fine-grained instance-level semantic correspondence annotations are lacking. Facing this problem, structured learning can help uncover the complex structural correlations among data of different modalities, thereby enabling more accurate and effective cross-modal video retrieval. Therefore, this dissertation takes structured learning as its foundation and mines structured relationships at different levels of the task. Specifically, (1) it first studies how to make better use of the semantic relevance information between samples in the common space in standard cross-modal video retrieval to aid accurate cross-modal semantic alignment, and accordingly proposes to mine structured relationships at the sample feature level and at the higher retrieval-task level; (2) it then studies how to perform finer-grained instance-level semantic alignment between video and text without introducing additional annotations, and accordingly proposes to mine structured relationships at the fine-grained intra-sample level and, by extension, at the uni-modal sample level, so as to provide additional supervision signals for accurate and robust instance-level semantic alignment; (3) finally, to ensure that the learned fine-grained instance-level semantic alignment between video and text is reliable, it studies the causal structured relationships among variables at the abstract causal level, and eliminates the spurious associations therein by means such as causal intervention.

The main contributions and innovations of the dissertation are summarized as follows:

1. Structured relationship mining in general cross-modal video retrieval.

The semantic alignment strategy that existing methods adopt in the common space is usually the classic ranking loss, which ignores the rich structured relationships among the many other samples. Therefore, at the sample feature level, this dissertation proposes a cross-modal video retrieval method based on a layer-wise refined graph neural network. The method introduces a graph neural network to efficiently model the global structural relationships among samples in the common space, and focuses on a few key structures through a layer-wise refinement strategy, thereby achieving more effective cross-modal semantic alignment. Furthermore, structured relationships among different semantic alignment tasks are explored at a higher semantic level. Accordingly, at the retrieval task level, a cross-modal video retrieval method based on knowledge transfer from auxiliary tasks is proposed. The method treats a series of other retrieval processes with varying semantic similarity as auxiliary tasks, and uses a semantically progressive auxiliary-task knowledge transfer strategy to help improve the performance of the current retrieval task.

2. Structured relationship mining in fine-grained cross-modal video retrieval.

Fine-grained cross-modal video retrieval aims to establish instance-level correspondences between video object regions and textual object words, without introducing additional fine-grained annotations. Under this weakly supervised setting, the fine structure within samples must be mined in depth to provide additional supervision signals. To this end, at the fine-grained intra-sample level, this dissertation proposes a fine-grained cross-modal video retrieval method based on stable context learning. By exploiting the previously often-overlooked fine structure within text samples, it learns stable object concept representations and thus achieves stable and accurate fine-grained retrieval. On this basis, a fine-grained cross-modal video retrieval method based on uni-modal sample relationship mining is proposed, which extends the fine structure within text samples to the uni-modal sample level for both text and video, thereby fully mining the useful information within samples of each modality under weak supervision in a unified framework.

3. Causal structural relationship analysis in cross-modal semantic alignment.

Achieving fine-grained instance-level semantic alignment from the global supervision signals of video and text is usually accompanied by considerable noise, leading to many spurious associations being learned. Under weak supervision, this problem cannot be avoided entirely. Considering that reliable fine-grained semantic alignment is usually causal, this dissertation proposes a fine-grained cross-modal video retrieval method based on causal intervention. Specifically, at the abstract causal level, the method builds a structural causal model to capture the causal relationships among the variables in the fine-grained semantic alignment process, then analyzes the spatio-temporal ambiguity problem that causes spurious associations and the causal paths corresponding to confounding effects, and finally uses causal intervention to block these paths, so that the cross-modal retrieval model can focus on real cross-modal semantic correspondences rather than observational bias.

English Abstract

As today's society enters the era of the mobile Internet, everyone is a creator of digital content, and massive amounts of multimedia data are generated every day. Compared with other media such as images and audio, video has become one of the most popular media due to its stronger information-carrying capacity and more vivid forms of expression. In the face of the explosive growth of video data, how to efficiently and accurately retrieve the content of interest, so as to effectively manage and utilize videos, has important theoretical significance and application value for all aspects of the national economy and people's livelihood. A typical example is in the fields of Internet video platforms and video surveillance, where cross-modal video retrieval can help better serve platform users and help maintain public safety.

Currently, the key issue in cross-modal video-text retrieval is how to calculate the semantic relevance between cross-modal samples and achieve accurate semantic alignment, as there is a large heterogeneity gap between videos and text and a lack of fine-grained instance-level semantic correspondence annotations. In the face of this problem, structured learning can help to uncover the complex structural correlations among data of different modalities, thus facilitating more precise and efficient cross-modal video retrieval. Therefore, this dissertation proposes a structured-learning-based approach to uncovering structural relationships at different levels of this task. Specifically, (1) it first studies how to make better use of the semantic correlation information between samples in the common space in standard cross-modal video-text retrieval, and proposes to mine structured relationships at the sample feature level and at the higher retrieval-task level. (2) It then studies how to achieve fine-grained instance-level semantic alignment between videos and text in the absence of fine-grained annotations, and proposes to mine structured relationships at the fine-grained intra-sample level and further extend them to the uni-modal sample level, providing additional supervision signals for accurate and robust instance-level semantic alignment. (3) Finally, to ensure that the learned fine-grained instance-level semantic alignment between videos and text is reliable, the causal structured relationships between variables are studied, and spurious associations are eliminated through causal intervention.

The major contributions of this dissertation are summarized as follows:

1. Structured Relationship Mining in Cross-Modal Video-Text Retrieval.

For existing methods, merely pursuing better video and text representations is not enough, because the semantic alignment strategy they adopt in the common space is still based on the classic ranking loss, which ignores the other samples in the common space. To model the structured relationships between these samples, this dissertation proposes a cross-modal video-text retrieval method based on a coarse-to-fine graph neural network at the sample level. It introduces a graph neural network to efficiently model the global structural relationships between samples in the common space, and focuses on a few key structures through a layer-wise refinement strategy so as to achieve more effective cross-modal semantic alignment. Furthermore, at a higher semantic level, structured relationships between different semantic alignment tasks can also be explored. Therefore, at the retrieval task level, this dissertation proposes a cross-modal video-text retrieval method based on knowledge transfer from auxiliary tasks. It regards other semantically-relevant retrieval processes as auxiliary tasks, and uses a general-to-specific knowledge transfer strategy to help improve the performance of the current main retrieval task.
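As a loose illustration of the general idea of refining pairwise similarities with a sparsified neighbor graph (not the dissertation's actual implementation; all function and parameter names here are hypothetical), one refinement step over a video-text similarity matrix might look like:

```python
import numpy as np

def refine_similarities(video_emb, text_emb, top_k=2, alpha=0.5):
    """One illustrative refinement step: propagate each video-text
    similarity through the top-k most related neighbors in a text
    affinity graph (a stand-in for 'focusing on a few key structures')."""
    # Cosine similarities between every video and every text sample.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T                                  # (num_videos, num_texts)

    # Text-text affinity graph, sparsified to top-k edges per node.
    tt = t @ t.T
    np.fill_diagonal(tt, -np.inf)                  # no self-loops
    mask = np.zeros_like(tt)
    idx = np.argsort(-tt, axis=1)[:, :top_k]       # k strongest neighbors
    np.put_along_axis(mask, idx, 1.0, axis=1)
    adj = mask * np.exp(tt)                        # positive edge weights
    adj = adj / adj.sum(axis=1, keepdims=True)     # row-normalize

    # Propagate: a video's score for text j borrows from texts related to j.
    return (1 - alpha) * sim + alpha * sim @ adj.T
```

With `alpha=0` this reduces to plain cosine retrieval; increasing `alpha` mixes in structure from neighboring samples, which is the kind of global relationship a ranking loss over individual pairs cannot see.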

2. Structured Relationship Mining in Fine-Grained Cross-Modal Video-Text Retrieval.

Fine-grained cross-modal video-text retrieval aims to establish instance-level correspondence between video object regions and text object words, with only global annotations of videos and text available during training. Under this weakly supervised setting, to achieve accurate and robust fine-grained semantic alignment, it is necessary to mine more fine-grained structured information within samples so as to provide additional supervision signals. Therefore, this dissertation proposes a fine-grained cross-modal video-text retrieval method based on stable context learning at the intra-sample level. By utilizing the previously neglected fine-grained structure within text samples, stable object concept representations are learned to achieve stable and accurate fine-grained retrieval. Building on this, another fine-grained cross-modal video-text retrieval method based on uni-modal sample relationship mining is proposed, which further extends the fine-grained structure within text samples to the uni-modal sample level for text and videos, thus fully mining useful information within samples of each modality under weak supervision in a unified framework.
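To make the weakly supervised setting concrete: only the global video-sentence pairing is labeled, yet a per-word attention map over regions yields latent instance-level links as a byproduct. The sketch below is a generic illustration of this principle under that assumption, not the method proposed in the dissertation; all names are placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weakly_supervised_alignment(regions, words):
    """Score one video-sentence pair from region/word features alone.
    regions: (R, d) visual region features; words: (W, d) word features.
    Only the returned global score needs supervision (paired vs. unpaired);
    the attention map is the emergent instance-level alignment."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T                       # word-to-region similarities (W, R)
    attn = softmax(sim, axis=1)         # each word attends over regions
    grounded = attn @ r                 # per-word grounded visual feature
    word_scores = np.sum(grounded * w, axis=1)
    return word_scores.mean(), attn     # global score + latent alignment
```

Training such a model with only a global ranking objective is exactly why extra structure (e.g., the fine structure within text samples) is needed: nothing in the loss directly constrains `attn` to be correct.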

3. The Structured Causal Relationship within Cross-Modal Semantic Alignment.

Achieving fine-grained instance-level semantic alignment from the global supervision of videos and text often comes with significant noise, resulting in many spurious associations. However, relying solely on coarse-grained global supervision can never completely avoid this problem. Considering that a reliable fine-grained semantic alignment is often causal, this dissertation proposes a fine-grained cross-modal video-text retrieval method based on causal intervention. Specifically, at the causal level, this method constructs a structured causal graph to model the causal relationships between variables in the process of semantic alignment, then analyzes the causal paths leading to the spatial-temporal ambiguity issue and the confounding effect, and finally uses causal intervention to block these paths, enabling the model to focus on the causal cross-modal semantic correspondences rather than the observational bias.
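The backdoor adjustment P(y | do(x)) = Σ_z P(y | x, z) P(z) that underlies this kind of intervention can be sketched in a few lines. Here the confounder dictionary, its prior, and the scoring function are all illustrative placeholders, not components of the dissertation's model:

```python
import numpy as np

def backdoor_adjusted_score(feature, confounders, prior, score_fn):
    """Approximate P(align | do(feature)) by averaging the alignment
    score over a fixed confounder dictionary z, weighted by its prior
    P(z), instead of conditioning on whatever context happened to
    co-occur with the feature in training (the source of spurious
    associations). confounders: (K, d); prior: (K,) summing to 1."""
    scores = np.array([score_fn(feature, z) for z in confounders])
    return float(prior @ scores)
```

Intuitively, averaging over all confounder values with their marginal probabilities severs the backdoor path, so frequently co-occurring but non-causal context no longer dominates the score.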

Keywords: cross-modal learning, video retrieval, fine-grained retrieval, structured relationships, weakly supervised learning
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
State Key Laboratory planned research direction: Multimodal Collaborative Cognition
Associated dataset to be deposited:
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52021
Collection: Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714):
王威. Research on Cross-Modal Video Retrieval Based on Structured Learning [D], 2023.
Files in this item:
File name/size | Document type | Version | Access | License
王威博士论文最终提交.pdf (14339 KB) | Thesis | | Restricted | CC BY-NC-SA

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.