组合性感知的弱监督视觉定位研究 (Research on Composition-Aware Weakly Supervised Visual Grounding)
曾宇楠
2024-05-17
Pages: 88
Subtype: Master's
Abstract

This thesis studies composition-aware weakly supervised visual grounding, focusing on how vision-language pre-trained models are applied to weakly supervised visual grounding and how they perform on samples that require compositional understanding; it also examines the relationship between downstream weakly supervised visual grounding models and the pre-trained models they build on. Visual grounding is a key task at the intersection of computer vision and natural language processing: a model must not only recognize the objects in an image but also localize them accurately according to a natural language description. Although advances in deep learning have pushed this field forward considerably, the complexity of real-world scenes and the diversity and ambiguity of linguistic descriptions keep the task challenging. Targeting the compositionality problems that vision-language pre-trained models and downstream weakly supervised visual grounding models face when handling complex visual scenes and language descriptions, this work proposes new research methods and technical strategies, focusing on the following three aspects.

  • Evaluating the compositionality of vision-language pre-trained models in visual grounding, in particular how they handle the attributes of a single object in an image, the complex relationships among multiple objects, and which of several objects is the primary referent. This thesis builds a new pipeline dedicated to assessing a grounding model's compositional perception, comprising a new dataset and a new inference task. The evaluation shows that existing vision-language pre-trained models have significant deficiencies on compositional reasoning samples.
  • To address these deficiencies, this thesis proposes a new composition-aware fine-tuning method. The method relies only on low-cost image-text annotations: it first builds a fine-tuning dataset through dependency parsing, then introduces a new proxy task that increases the diversity of grounding heatmaps, improving performance on visual grounding. Experiments show that the fine-tuned pre-trained models localize the objects described in natural language more accurately, especially in scenes with multiple objects and complex relationships (a minimal sketch of the dependency-parsing step follows this list).
  • Investigating the compositionality problems of downstream weakly supervised visual grounding models. These models are trained on pseudo-labels produced by pre-trained models and therefore inevitably inherit their compositional reasoning defects (a toy illustration of pseudo-label extraction is given after the abstract). For these downstream models, this thesis analyzes how the text-related and text-unrelated noise introduced by the pre-trained models affects performance, and accordingly optimizes the model structure, learning strategy, and data processing to improve performance in complex scenes. Experimental analysis shows that the proposed training objectives and learning methods effectively improve performance on complex visual scenes and language descriptions.
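
A minimal sketch of the dependency-parsing step mentioned in the second point above, assuming the spaCy library and two simple heuristics (adjectival modifiers as object attributes, prepositional attachments as pairwise relations); the thesis's actual extraction rules are not specified on this page.

    import spacy

    # Any spaCy pipeline with a dependency parser will do; the small English model is assumed here.
    nlp = spacy.load("en_core_web_sm")

    def extract_compositions(caption):
        """Return (noun, attribute) pairs and (subject, relation, object) triples
        recovered from a caption's dependency parse."""
        doc = nlp(caption)
        attributes, relations = [], []
        for token in doc:
            # Adjectival modifier attached to a noun -> attribute of that object.
            if token.dep_ == "amod" and token.head.pos_ in ("NOUN", "PROPN"):
                attributes.append((token.head.text, token.text))
            # Preposition attached to a noun -> coarse relation to its object.
            if token.dep_ == "prep" and token.head.pos_ in ("NOUN", "PROPN"):
                for child in token.children:
                    if child.dep_ == "pobj":
                        relations.append((token.head.text, token.text, child.text))
        return attributes, relations

    attrs, rels = extract_compositions("the small dog on the red sofa")
    print(attrs)  # e.g. [('dog', 'small'), ('sofa', 'red')]
    print(rels)   # e.g. [('dog', 'on', 'sofa')]

Pairs and triples of this kind, paired with the original image, could serve as composition-focused fine-tuning samples.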

The contribution of this thesis is a systematic analysis of the compositionality problems of vision-language pre-trained models in visual grounding; the new dataset and evaluation method it constructs provide a useful resource and reference for subsequent research. Building on this analysis, the thesis proposes effective improvement strategies for both the pre-trained models and the downstream visual grounding models, and verifies them experimentally. Because the proposed methods can exploit unannotated or imprecisely annotated data in a weakly supervised setting, they improve application performance in complex real-world scenes, which is of practical value for reducing manual annotation cost and improving model generalization and adaptability.
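
The downstream training in the third point above starts from pseudo-labels derived from a pre-trained model's grounding heatmaps, which is where the inherited noise enters. This page does not describe the thesis's actual pseudo-labeling procedure; the snippet below is only a hypothetical illustration of one common recipe: threshold the heatmap relative to its peak and take the tight bounding box of the activated region.

    import numpy as np

    def heatmap_to_pseudo_box(heatmap, threshold=0.5):
        """Turn a grounding heatmap (H x W, non-negative) into a single pseudo
        box (x1, y1, x2, y2) by thresholding relative to the peak activation."""
        mask = heatmap > threshold * heatmap.max()
        ys, xs = np.nonzero(mask)          # row indices, column indices
        if xs.size == 0:
            return None                    # no confident region: skip this sample
        return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # Toy heatmap with a single activated blob.
    hm = np.zeros((8, 8))
    hm[2:5, 3:6] = 1.0
    print(heatmap_to_pseudo_box(hm))       # (3, 2, 5, 4)

Any error in the heatmap, such as an attribute attached to the wrong object or activation spread over several objects, is frozen into the pseudo box, which is why the text-related and text-unrelated noise analyzed in the thesis matters for the downstream model.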

Keywords: Visual Grounding, Vision-Language Pre-trained Model, Weak Supervision, Compositionality
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57199
Collection: 毕业生_硕士学位论文 (Graduates / Master's Theses)
Recommended Citation
GB/T 7714
曾宇楠. 组合性感知的弱监督视觉定位研究[D], 2024.
Files in This Item:
File Name/Size | DocType | Version | Access | License
Thesis.pdf (7681 KB) | Thesis | | Restricted Access | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.