基于跨模态分析的图像指代分割算法研究 (Research on Referring Image Segmentation Algorithms Based on Cross-Modal Analysis)
闫熠辰
2024-05
Pages: 58
Subtype: Master's thesis
Abstract

With the development of multi-modal artificial intelligence, referring image segmentation based on cross-modal analysis has found wide application across many fields. Referring image segmentation is an important visual task that aims to accurately identify and segment the object or region referred to in an image. The task has broad applications in areas such as interactive image editing, visual navigation, and embodied AI.

Although referring image segmentation algorithms based on multi-modal fusion and alignment already exist, two key problems remain in current research. First, existing multi-modal feature fusion methods often rely on guidance from a single modality and cannot effectively integrate visual and linguistic information. To address this, this thesis proposes a new multi-modal fusion method in which vision and language guide each other bidirectionally, improving the accuracy and efficiency of referring segmentation. Second, previous multi-modal alignment methods usually segment in the same way as traditional image segmentation and do not explore explicit alignment of visual and linguistic features at the segmentation stage. To address this, this thesis proposes a new method that explicitly aligns visual and linguistic information at the segmentation stage, further improving the performance and robustness of referring image segmentation. Through the practice and in-depth study of these methods, this thesis aims to advance referring image segmentation to a higher level and to provide more efficient and precise visual understanding for multi-modal artificial intelligence. The main contributions and innovations are summarized as follows:
    On cross-modal fusion: unlike previous methods that use a single modality to guide multi-modal fusion in one direction only, this thesis proposes a referring image segmentation framework in which vision and language bidirectionally guide fusion and calibration. CLIP is first adopted to extract vision and language features; vision then guides language to synthesize vision-language features, which are subsequently calibrated under the guidance of language information. Specifically, a set of key vision features that need to be emphasized is first extracted from the visual features, and these key features guide the fusion with the language features to produce multi-modal features. Unlike traditional methods that decode the multi-modal features directly, this thesis calibrates them before decoding: the global language feature extracted by CLIP guides an adaptive calibration of the multi-modal features. The calibrated features attend more to the key information of the image and the global information of the input sentence. With these designs, the method achieves excellent segmentation results.
    On cross-modal alignment: previous methods align vision and language features only implicitly while fusing or decoding multi-modal features, so explicit alignment between the two modalities is missing. Moreover, at the final segmentation stage they follow traditional image segmentation and obtain the final mask directly with a fixed-kernel convolution, so the segmentation stage receives no guidance from the input language and cannot adapt flexibly to the input. To address this, this thesis proposes to explicitly align vision and language features at the final segmentation stage via dynamic convolution. Specifically, a series of dynamic convolution kernels is generated from the input language features; these kernels yield a series of segmentation masks, and the final result is a weighted combination of these masks. In this way the model explicitly aligns vision and language features at the segmentation stage and introduces language guidance into it, producing excellent segmentation results.

Other Abstract

With the advancement of multi-modal artificial intelligence, Referring Image Segmentation (RIS) based on cross-modal analysis has gained widespread adoption across various domains. RIS is a crucial visual task that targets the precise identification and segmentation of the objects or regions referenced in an image. The task finds extensive application in interactive image editing, visual navigation, and embodied AI.

Although there are many RIS methods based on multi-modal fusion and alignment, current research faces two primary challenges. First, prevailing multi-modal feature fusion methods tend to rely on guidance from a single modality and lack effective integration of visual and linguistic information. To enhance accuracy and efficiency, this paper introduces a novel cross-modal fusion approach with bidirectional guidance from both the vision and language modalities. Second, in previous methods the fused features are fed directly into a decoder and passed through convolutions with a fixed kernel to obtain the result, following the same pattern as traditional image segmentation; these methods do not explicitly align language and vision features in the segmentation stage. To address this limitation, this paper proposes a method that explicitly aligns vision and language features, aiming to further enhance the performance and robustness of RIS. Through the application and exploration of these methods, we aim to foster the advancement of RIS and to provide more efficient and accurate visual understanding technology for multi-modal artificial intelligence. The main contributions are summarized as follows:

    An efficient RIS framework based on cross-modal fusion. In contrast to previous methods, which use only a single modality to guide multi-modal fusion, we propose a bidirectional vision-language guided framework for referring image segmentation. Specifically, we use CLIP as the backbone to extract vision and language features, extract a set of key vision features from them, and let these key features guide the fusion with the language features to obtain multi-modal features. After fusion, a language-guided calibration module uses the global language feature extracted by CLIP to adaptively calibrate the multi-modal features, ensuring that they stay consistent with the context of the input sentence. The calibrated multi-modal features focus more on the key information of the image and the global information of the input language. With these designs, the framework achieves excellent segmentation results.
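As a concrete illustration, below is a minimal PyTorch sketch of this bidirectional scheme, assuming CLIP-style patch and word features: learnable queries select key vision features, those features guide cross-attention fusion with the word features, and a gate computed from the global sentence feature calibrates the fused result. The module names, dimensions, and attention layout are illustrative assumptions, not the thesis's exact architecture.

import torch
import torch.nn as nn

class VisionGuidedFusion(nn.Module):
    """Select key vision features, then let them guide fusion with language."""
    def __init__(self, dim: int, num_keys: int = 16, heads: int = 8):
        super().__init__()
        # Learnable probes that attend over patch features to pick out the
        # key vision features to be emphasized (a hypothetical design choice).
        self.key_queries = nn.Parameter(torch.randn(num_keys, dim))
        self.select = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, HW, D) patch features; lang: (B, L, D) word features,
        # both assumed to come from a CLIP-style backbone.
        q = self.key_queries.unsqueeze(0).expand(vis.size(0), -1, -1)
        key_vis, _ = self.select(q, vis, vis)      # key vision features
        fused, _ = self.fuse(key_vis, lang, lang)  # vision-guided fusion
        return fused                               # (B, num_keys, D)

class LanguageGuidedCalibration(nn.Module):
    """Gate the fused features with the global sentence feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, fused: torch.Tensor, lang_global: torch.Tensor) -> torch.Tensor:
        # lang_global: (B, D), e.g. CLIP's sentence-level text embedding;
        # channels consistent with the sentence context are kept, others damped.
        return fused * self.gate(lang_global).unsqueeze(1)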

    An efficient RIS framework based on cross-modal alignment. In previous methods, the final mask is obtained by convolutions with a fixed kernel, following the same pattern as traditional image segmentation; such methods lack explicit alignment between language and vision features in the segmentation stage. In this paper, we propose a method that explicitly aligns the vision and language features in the segmentation stage. Specifically, we generate a series of dynamic convolution kernels from the input language features; each kernel yields a segmentation mask, and the final result is the weighted sum of all these masks. In this way we explicitly align the vision and language features in the segmentation stage and achieve accurate results.
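The dynamic-kernel segmentation head can likewise be sketched in a few lines of PyTorch. The kernel count, the 1x1 kernel form, and the softmax weighting are assumptions made for illustration; the sketch only shows how language-generated kernels can produce candidate masks that are then blended into a final prediction.

import torch
import torch.nn as nn

class DynamicSegHead(nn.Module):
    """Generate N 1x1 conv kernels from the sentence feature; blend the N masks."""
    def __init__(self, dim: int, num_kernels: int = 8):
        super().__init__()
        self.num_kernels = num_kernels
        self.to_kernels = nn.Linear(dim, num_kernels * dim)  # language -> kernels
        self.to_weights = nn.Linear(dim, num_kernels)        # language -> mask weights

    def forward(self, feat: torch.Tensor, lang_global: torch.Tensor) -> torch.Tensor:
        # feat: (B, D, H, W) decoded multi-modal features; lang_global: (B, D).
        B, D, H, W = feat.shape
        kernels = self.to_kernels(lang_global).view(B, self.num_kernels, D)
        # Each dynamic kernel acts as a 1x1 convolution over the feature map,
        # producing one candidate mask per kernel.
        masks = torch.einsum('bnd,bdhw->bnhw', kernels, feat)   # (B, N, H, W)
        weights = self.to_weights(lang_global).softmax(dim=-1)  # (B, N)
        # Final mask logits: weighted sum of the N candidate masks.
        return torch.einsum('bn,bnhw->bhw', weights, masks).unsqueeze(1)

In a full model, feat would come from the multi-modal decoder, and the returned logits would be upsampled to the input resolution before thresholding.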

Keywords: cross-modal analysis, referring image segmentation, vision-language models
Subject Area: Computer Science and Technology
MOST Discipline Catalogue: Engineering
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57197
Collection: 毕业生_硕士学位论文 (Graduates / Master's Theses)
Recommended Citation
GB/T 7714
闫熠辰. 基于跨模态分析的图像指代分割算法研究[D]. 2024.
Files in This Item:
File Name (Size) | DocType | Access | License
毕业论文_Final.pdf (5636 KB) | Thesis (学位论文) | Restricted access (限制开放) | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.