Weakly Supervised Visual Grounding Based on Multimodal Pre-trained Models
赵宸麟
2024-05-19
Pages: 64
Subtype: Master's thesis
Abstract


Cross-modal learning aims to improve a machine's ability to understand and process information from different modalities by integrating and exploiting the correlations among multiple senses or data types. Visual grounding (cross-modal object localization) is one direction of cross-modal machine learning: given an image and a textual description of it, the task is to find the precise locations in the image of the entities mentioned in the description. Its practical value lies in improving the computer's understanding of the relationship between images and language, thereby providing richer, more intelligent, and more human-centered functionality and experiences for a variety of intelligent systems and applications. Weakly supervised visual grounding can solve this task without relying on expensive fine-grained annotations, and can in turn generate additional annotated training data. With the help of pretrained models, the visual and language branches of weakly supervised visual grounding can obtain more accurate visual and textual features, as well as deeper cross-modal interaction, and thus produce more accurate entity localization results. Therefore, research on weakly supervised visual grounding based on multimodal pretrained models has important theoretical significance and application value.

Building weakly supervised visual grounding on multimodal pretrained models must address three main issues: (1) the lack of fine-grained annotations, (2) the difficulty of modeling fine-grained local perception, and (3) the poor adaptability of manually defined prompt templates. These issues make it still quite challenging to solve visual grounding with multimodal pretrained models. Focusing on how to accomplish the grounding task in the weakly supervised setting and how to bridge the gap between the model's pretraining tasks and the downstream task, this thesis proposes two solutions: part-aware prompt tuning and adaptive prompt tuning.

The main contributions and innovations of this thesis are summarized as follows:

1. Weakly supervised visual grounding based on part-aware prompt tuning. Visual grounding builds a bridge between visual objects and linguistic entities. Although existing pretraining-based methods can, to some extent, align the multiple entities that appear in an image and its text, there are still cases in which the model ignores the peripheral parts of an entity, such as the head, arms, or legs, which leads to grounding errors. To address this problem, this thesis proposes a weakly supervised visual grounding method based on part-aware prompt tuning: by prepending suitable prompt text to the description so that it aligns with more specific fine-grained information, the model is forced to attend to the fine-grained parts of an entity that are ignored when no prompt is added. By combining the model's predictions for the peripheral parts of an entity with its predictions for the main part, the model produces a more complete attention map of the entity and achieves higher grounding accuracy. Experiments on the RefCOCO and RefCOCO+ datasets demonstrate the effectiveness of the method.

2. Weakly supervised visual grounding based on adaptive prompt tuning. Prompt tuning can play a major role in adapting pretrained models to downstream tasks, steering the model's attention so that it better meets the fine-grained perception requirements of the grounding task. Part-aware prompt tuning resolves the pretrained model's limited attention regions at inference time and successfully expands the model's attention to the full region of the entity to be grounded. However, the prompt text in part-aware prompt tuning has fixed parameters, generalizes poorly, and is prone to false-alarm predictions. To address these issues, this thesis proposes adaptive prompt tuning, which combines image and text features to adaptively generate the prompt text that indicates entity parts, so that, guided by the prompts, the model attends to entity regions more comprehensively and achieves better grounding performance. Experiments on the RefCOCO and RefCOCO+ datasets demonstrate the effectiveness of the method.

Other Abstract

The purpose of cross-modal learning is to enhance a machine's understanding and processing of information from various modalities by integrating and leveraging the correlations between different sensory channels or data types. Visual grounding is one direction of cross-modal machine learning: given an image and its textual description, the task is to accurately locate in the image the entities mentioned in the description. Its application significance lies in improving the computer's understanding of the relationship between images and language, thus providing richer, smarter, and more personalized functions and experiences for various intelligent systems and applications. Weakly supervised visual grounding can solve this task without relying on expensive fine-grained annotation information and can generate more annotated training data. By leveraging pretrained models, the visual and language branches of weakly supervised visual grounding can obtain more accurate visual and textual features, as well as deeper cross-modal interactions, resulting in more accurate entity localization. Therefore, research on weakly supervised visual grounding based on multimodal pretrained models has important theoretical significance and practical value.

Implementing weakly supervised visual grounding methods based on multimodal pretrained models needs to address three main issues: (1) lack of fine-grained annotation information, (2) difficulty in fine-grained local perception modeling, and (3) poor adaptability of manually defined prompt templates. Solving visual grounding through multimodal pretrained models therefore remains quite challenging. Focusing on how to complete the grounding task under the weakly supervised setting and how to overcome the differences between the model's pretraining tasks and the downstream task, this thesis proposes two solutions: part-aware prompt tuning and adaptive prompt tuning.

The main contributions and innovations of this thesis are summarized as follows:

1. Part-aware Prompt Tuning for Weakly Supervised Visual Grounding: Visual grounding builds a bridge between visual targets and linguistic entities. Although existing pretraining-based methods can to some extent address the alignment of multiple entities appearing in images and text, there are still cases where the model ignores the peripheral parts of entities, such as heads, arms, or legs, leading to grounding errors. To address this issue, this thesis proposes a weakly supervised visual grounding method based on part-aware prompt tuning. By adding an appropriate prompt in front of the text to align it with more specific fine-grained information, the model is forced to focus on the fine parts of entities that are ignored when no prompt is added. By combining the model's predictions for the peripheral parts of entities with its predictions for the main part, the model generates more complete attention maps for entities, improving grounding accuracy (a hedged sketch of this attention-combination step is given after this list). Experimental results on the RefCOCO and RefCOCO+ datasets demonstrate the effectiveness of the model.

2. Adaptive Prompt Tuning for Weakly Supervised Visual Grounding: Prompt tuning can play a significant role in applying pretrained models to downstream tasks by adjusting the model's attention to better suit the fine-grained perception needs of grounding. Part-aware prompt tuning addresses the limited attention regions of pretrained models during inference, successfully expanding the model's attention to the entire region of the entity to be grounded. However, in part-aware prompt tuning the prompt text has fixed parameters, generalizes poorly, and is prone to false-alarm predictions. To address these issues, this thesis proposes adaptive prompt tuning, which adaptively generates prompts indicating entity parts by combining image and text features. This allows the model, guided by the generated prompts, to attend to entity regions more comprehensively, enhancing grounding performance (a sketch of such a prompt-generation module also follows below). Experimental results on the RefCOCO and RefCOCO+ datasets demonstrate the effectiveness of the model.
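The following is a minimal, hypothetical sketch of the attention-combination idea behind part-aware prompt tuning (contribution 1), assuming a CLIP-like grounding model that exposes an attention_map(image, text) interface returning a patch-level relevance map. The interface name, the hand-written part prompts, and the max-pooling fusion are illustrative assumptions, not the thesis's actual implementation.

import torch

# Hand-written part prompts (illustrative; the thesis's actual prompt templates may differ).
PART_PROMPTS = ["the head of", "the arms of", "the legs of"]

def part_aware_attention(grounding_model, image, expression):
    """Combine the attention map of the full expression with maps obtained
    from part-indicating prompts prepended to the expression."""
    # Relevance map over image regions for the original referring expression.
    base_map = grounding_model.attention_map(image, expression)          # (H, W), assumed interface

    # Relevance maps for each part-aware prompt, e.g. "the head of <expression>".
    part_maps = [
        grounding_model.attention_map(image, f"{prompt} {expression}")
        for prompt in PART_PROMPTS
    ]

    # Element-wise maximum keeps peripheral parts (head, arms, legs)
    # that the base map alone tends to miss.
    merged = torch.stack([base_map, *part_maps], dim=0).max(dim=0).values
    return merged / merged.max()                                         # normalize to [0, 1]

Taking the element-wise maximum is only one simple way to merge the per-part maps with the main-part map; the thesis may use a different combination rule.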
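Below is a similarly hedged sketch of how an adaptive prompt generator (contribution 2) might be structured: a small network maps fused global image and text features to a sequence of soft prompt embeddings that are prepended to the expression's token embeddings before cross-modal interaction. The module name, layer sizes, and the concatenation-based fusion are assumptions for illustration, not the thesis's exact design.

import torch
import torch.nn as nn

class AdaptivePromptGenerator(nn.Module):
    """Generate soft prompt embeddings conditioned on the image-text pair (illustrative sketch)."""

    def __init__(self, feat_dim: int = 512, embed_dim: int = 512, num_prompts: int = 4):
        super().__init__()
        self.num_prompts = num_prompts
        self.embed_dim = embed_dim
        # Fuse the two global features and project them to `num_prompts` prompt embeddings.
        self.generator = nn.Sequential(
            nn.Linear(2 * feat_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_prompts * embed_dim),
        )

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor,
                text_token_embeds: torch.Tensor) -> torch.Tensor:
        # image_feat, text_feat:  (B, feat_dim) global features from the (frozen) encoders
        # text_token_embeds:      (B, L, embed_dim) token embeddings of the referring expression
        fused = torch.cat([image_feat, text_feat], dim=-1)
        prompts = self.generator(fused).view(-1, self.num_prompts, self.embed_dim)
        # Prepend the generated soft prompts so they can guide attention toward entity parts.
        return torch.cat([prompts, text_token_embeds], dim=1)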

Keywords: weak supervision; prompt tuning; visual grounding
Subject Area: Pattern Recognition
Indexed By: Other
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57454
Collection: 毕业生_硕士学位论文 (Graduates: Master's Theses)
Recommended Citation
GB/T 7714
赵宸麟. 基于多模态预训练模型的弱监督跨模态目标定位[D],2024.
Files in This Item:
File Name/Size: (最新修改版)基于多模态预训练模型的弱监 (7997 KB)
DocType: Thesis
Access: Restricted open access
License: CC BY-NC-SA
