As society enters the mobile Internet era, everyone has become a creator of digital content, and massive amounts of multimedia data are generated every day. Compared with other media such as images and audio, video has become one of the most popular because it carries more information and is more vivid in expression. Given the explosive growth in video data, efficiently and accurately retrieving content of interest, so that videos can be managed and used effectively, has important theoretical significance and practical value across the national economy and people's livelihood. Typical examples are Internet video platforms and video surveillance, where cross-modal video retrieval can help platforms serve their users better and help maintain public safety.

Currently, the key issue in cross-modal video-text retrieval is how to measure the semantic relevance between samples from different modalities and align them accurately, because there is a large heterogeneity gap between videos and text and fine-grained instance-level semantic correspondence annotations are lacking. Structured learning can help uncover the complex structural correlations among data of different modalities, thus facilitating more precise and efficient cross-modal video retrieval. This dissertation therefore proposes a structured learning-based approach that uncovers structural relationships at different levels of this task. Specifically, (1) it first studies how to make better use of the semantic correlations between samples in the common space under the standard cross-modal video-text retrieval task. (2) It then studies how to achieve fine-grained instance-level semantic alignment between videos and text in the absence of fine-grained annotations, and proposes to mine structured relationships at the fine-grained level within samples and to extend them to the uni-modal sample level, providing additional supervisory signals for accurate and robust instance-level semantic alignment. (3) Finally, to ensure that the learned fine-grained instance-level semantic alignment between videos and text is reliable, it studies the causal structured relationships between variables and eliminates spurious associations through causal intervention.

The major contributions of this dissertation are summarized as follows:

1. Structured Relationship Mining in Cross-Modal Video-Text Retrieval.

Merely pursuing better video and text representations is not enough, because the semantic alignment strategy that existing methods adopt in the common space is still based on the classic ranking loss, which ignores the other samples in the common space. To model the structured relationships between these samples, this dissertation proposes, at the sample level, a cross-modal video-text retrieval method based on a coarse-to-fine graph neural network. It introduces a graph neural network to efficiently model the global structural relationships between samples in the common space, and focuses on a few key structures through a layer-wise refinement strategy so as to achieve more effective cross-modal semantic alignment. Furthermore, at a higher semantic level, the structured relationships between different semantic alignment tasks can also be explored. Therefore, at the retrieval task level, this dissertation proposes a cross-modal video-text retrieval method based on knowledge transfer from auxiliary tasks. It regards other semantically relevant retrieval processes as auxiliary tasks, and uses a general-to-specific knowledge transfer strategy to improve the performance of the main retrieval task.
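To make the limitation of the classic ranking loss concrete, the following is a minimal sketch of a bidirectional max-margin ranking loss over a batch similarity matrix. It is an illustration of the general technique, not the dissertation's implementation; the function name `ranking_loss`, the `margin` default, and the toy similarity values are assumptions introduced here.

```python
from typing import List


def ranking_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Bidirectional max-margin ranking loss (illustrative sketch).

    sim[i][j] is the similarity between video i and text j in the common
    space; matched pairs lie on the diagonal. Names and the margin value
    are assumptions, not taken from the dissertation.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]  # similarity of the matched video-text pair
        for j in range(n):
            if j == i:
                continue
            # Video i as the query, text j as a negative.
            total += max(0.0, margin + sim[i][j] - pos)
            # Text i as the query, video j as a negative.
            total += max(0.0, margin + sim[j][i] - pos)
    return total / n


# Toy batch of two video-text pairs (hypothetical values).
toy_sim = [[0.5, 0.6],
           [0.1, 0.5]]
loss = ranking_loss(toy_sim)
```

Each term compares one matched pair with a single negative, so relationships among the remaining samples in the batch never enter the objective. That is the gap a graph over the samples is meant to fill.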

2. Structured Relationship Mining in Fine-Grained Cross-Modal Video-Text Retrieval.

Fine-grained cross-modal video-text retrieval aims to establish instance-level correspondences between object regions in videos and object words in text, with only global video-text annotations available during training. Under this weakly supervised setting, achieving accurate and robust fine-grained semantic alignment requires mining more fine-grained structured information within samples to provide additional supervision signals. Therefore, at the intra-sample level, this dissertation proposes a fine-grained cross-modal video-text retrieval method based on stable context learning. By exploiting the previously neglected fine-grained structure within text samples, it learns stable object concept representations to achieve stable and accurate fine-grained retrieval. Building on this, a second fine-grained cross-modal video-text retrieval method based on uni-modal sample relationship mining is proposed. It extends the fine-grained structure within text samples to the uni-modal sample level for both text and videos, thus fully mining the useful information within samples of each modality under weak supervision in a unified framework.
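The weakly supervised setting described above can be sketched as follows: region-word similarities are aggregated into a single video-sentence score, so that a global label can supervise instance-level matching. This is a common generic formulation and only an assumption here; the dissertation's actual scoring function is not given in the abstract, and the names `video_text_score` and `align` and the toy features are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two feature vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def video_text_score(regions: List[Sequence[float]],
                     words: List[Sequence[float]]) -> float:
    """Global score: each word takes its best-matching region, then average.

    Only this global score would be supervised by the video-text label.
    """
    best = [max(dot(w, r) for r in regions) for w in words]
    return sum(best) / len(best)


def align(regions: List[Sequence[float]],
          words: List[Sequence[float]]) -> List[int]:
    """Instance-level alignment: index of the best region for each word."""
    return [max(range(len(regions)), key=lambda k: dot(w, regions[k]))
            for w in words]


# Hypothetical 2-D features for two object regions and two object words.
toy_regions = [[1.0, 0.0], [0.0, 1.0]]
toy_words = [[0.9, 0.1], [0.2, 0.8]]
score = video_text_score(toy_regions, toy_words)
matches = align(toy_regions, toy_words)
```

Because the label constrains only the aggregated score, many different region-word assignments can explain the same supervision, which is why additional structure within samples is useful as an extra signal.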

3. The Structured Causal Relationship within Cross-Modal Semantic Alignment.

Achieving fine-grained instance-level semantic alignment from only the global supervision of videos and text often comes with significant noise, resulting in many spurious associations. Relying solely on coarse-grained global supervision can never completely avoid this problem. Considering that a reliable fine-grained semantic alignment is often causal, this dissertation proposes a fine-grained cross-modal video-text retrieval method based on causal intervention. Specifically, at the causal level, this method constructs a structured causal graph to model the causal relationships between variables in the process of semantic alignment, then analyzes the causal paths that lead to the spatial-temporal ambiguity issue and the confounding effect, and finally uses causal intervention to block these paths, enabling the model to focus on causal cross-modal semantic correspondences rather than observational bias.
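As a rough illustration of how causal intervention removes a confounding effect, the sketch below contrasts the observational estimate P(Y | X) with the interventional estimate P(Y | do(X)) obtained by backdoor adjustment over a single discrete confounder Z. Backdoor adjustment is one standard form of intervention and is assumed here for illustration; the abstract does not specify which form the dissertation uses, and the confounder values and probabilities are invented.

```python
from typing import Dict


def observational(p_y_given_xz: Dict[str, float],
                  p_z_given_x: Dict[str, float]) -> float:
    """P(Y | X) = sum_z P(Y | X, z) * P(z | X): confounder follows X."""
    return sum(p_y_given_xz[z] * p_z_given_x[z] for z in p_y_given_xz)


def interventional(p_y_given_xz: Dict[str, float],
                   p_z: Dict[str, float]) -> float:
    """P(Y | do(X)) = sum_z P(Y | X, z) * P(z): backdoor adjustment."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_y_given_xz)


# Hypothetical numbers: Z is a scene context that co-occurs with the word X.
p_y_given_xz = {"indoor": 0.9, "outdoor": 0.3}  # P(correct match | X, z)
p_z_given_x = {"indoor": 0.8, "outdoor": 0.2}   # biased co-occurrence
p_z = {"indoor": 0.5, "outdoor": 0.5}           # prior over contexts

biased = observational(p_y_given_xz, p_z_given_x)
adjusted = interventional(p_y_given_xz, p_z)
```

In this toy setting the observational estimate is inflated by the context that tends to co-occur with X, while the adjusted estimate weights each context by its prior, which cuts the path from Z into X.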

GB/T 7714
Wang Wei (王威). Research on Structured Learning-Oriented Cross-Modal Video Retrieval (面向结构化学习的跨模态视频检索研究) [D], 2023.
File name / size: 王威博士论文最终提交.pdf (14339 KB). Document type: dissertation. Access type: restricted. License: CC BY-NC-SA.
