Research on Image Semantic Segmentation Methods Based on Multi-Scale Feature Fusion (基于多尺度特征融合的图像语义分割方法研究)
朱袁兵
2024-05-16
Pages: 88
Degree type: Master

Abstract

Humans can easily perceive and understand visual scenes, and constructing agents with human-level visual perception and understanding has long been a goal of computer vision researchers. In recent years, significant progress has been made in computer vision with the development of modern machine learning, especially deep learning. However, on many downstream tasks in current computer vision research, the capabilities of computers still lag significantly behind human levels.

Image semantic segmentation, one of the key steps in image understanding, is a fundamental algorithm for many computer vision systems. Its goal is to partition an image into regions according to semantic categories, giving the image explicit meaning and making it easier to analyze. Distinguishing the category of each pixel is challenging because images contain a large variety of scenes and even rare object categories. At the same time, models must handle differences in object size, texture, edge detail, and other variations. Therefore, this thesis explores the core problems in building effective image semantic segmentation systems and designs algorithms to enhance their performance.

This thesis focuses on two problems in image semantic segmentation: the spatial scale of visual features and open-vocabulary semantic segmentation. By the nature of the task, semantic segmentation relies heavily on small-scale features to capture contextual semantics and long-range dependencies, while also requiring large-scale features to preserve spatial structure and segmentation detail. In addition, classic semantic segmentation methods can only segment objects from a category set pre-defined by the training dataset and generalize poorly to unseen classes; the goal of open-vocabulary semantic segmentation is to segment pixels belonging to arbitrary classes beyond the pre-defined categories and datasets, with the help of pretrained vision-language models. The thesis is divided into the following two parts.
    1. Analyzing, through uncertainty estimation, the behavior and problems of features at different scales under different levels of supervision in real-time semantic segmentation networks. In this part, the thesis analyzes the predictive uncertainty of the semantic segmentation task and the edge detection auxiliary task, revealing the insufficient fusion of spatial and contextual features that is common in real-time semantic segmentation, and proposes an edge attention fusion module to address this issue. Additionally, to reduce the uncertainty of real-time semantic segmentation models and thereby improve their performance, the thesis estimates uncertainty from prediction variance and uses the estimates to regularize model training.
    2. A multi-resolution open-vocabulary semantic segmentation framework. In this part, the thesis finds that current open-vocabulary semantic segmentation algorithms are limited by the input size used to train vision-language models, preventing them from generating and exploiting large-scale, high-resolution vision-language features. The thesis therefore proposes a multi-scale, multi-resolution training framework for exploiting multi-scale vision-language features. Specifically, during feature extraction, multi-scale vision-language features are used to enhance mask prediction; during mask classification, a multi-scale semantic aggregation module is designed to fully exploit the region-level semantics produced by feeding multi-scale inputs to the vision-language model, improving performance by aggregating local and global multimodal semantics.
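The uncertainty-driven regularization in part 1 can be sketched as follows. This is a minimal PyTorch illustration, not the thesis's actual module: it assumes a common Kendall-and-Gal-style formulation in which the network predicts a per-pixel log-variance map alongside its class logits, and all names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyRegularizedLoss(nn.Module):
    """Cross-entropy attenuated by a predicted per-pixel log-variance.

    The network predicts class logits and a log-variance map s. Pixels with
    high predicted variance have their loss down-weighted by exp(-s), while
    the +s term penalizes the model for claiming unbounded uncertainty.
    """

    def forward(self, logits, log_var, target):
        ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
        s = log_var.squeeze(1)                                   # (B, H, W)
        return (torch.exp(-s) * ce + s).mean()

# Toy usage: 2 classes on a 4x4 "image".
loss_fn = UncertaintyRegularizedLoss()
logits = torch.randn(1, 2, 4, 4)
log_var = torch.zeros(1, 1, 4, 4)  # log-variance 0 reduces to plain mean CE
target = torch.zeros(1, 4, 4, dtype=torch.long)
loss = loss_fn(logits, log_var, target)
```

With the log-variance fixed at zero the loss reduces exactly to the ordinary mean cross-entropy, which makes the attenuation term easy to sanity-check.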

These two works share a focus on the effective use of features at different scales in semantic segmentation. The first study uses auxiliary tasks to strengthen the fusion of contextual features at different scales in real-time semantic segmentation algorithms. The second study addresses the input-resolution limitation of vision-language models through multi-resolution training, exploiting multi-scale, multi-resolution multimodal features to improve mask prediction and classification.
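As a concrete illustration of the second study's mask-classification step, the sketch below pools region embeddings from vision-language feature maps extracted at several input resolutions, averages them, and matches the result against class text embeddings. This is a toy sketch under stated assumptions — the function name, shapes, and the simple averaging are all hypothetical stand-ins, not the thesis's actual multi-scale semantic aggregation module.

```python
import torch
import torch.nn.functional as F

def aggregate_multiscale_semantics(feature_maps, mask, text_embeds):
    """Score one predicted mask against class texts across feature scales.

    feature_maps: list of (C, H_i, W_i) vision-language feature maps from the
        same image at different input resolutions.
    mask: (H, W) binary mask for one predicted region.
    text_embeds: (K, C) embeddings of K candidate class names.
    Returns (K,) cosine-similarity scores for the region.
    """
    region_embeds = []
    for feat in feature_maps:
        # Resize the mask to this scale, then average-pool features inside it.
        m = F.interpolate(mask[None, None].float(), size=feat.shape[1:],
                          mode="nearest")[0, 0]
        pooled = (feat * m).sum(dim=(1, 2)) / m.sum().clamp(min=1.0)
        region_embeds.append(F.normalize(pooled, dim=0))
    # Simple aggregation: average the per-scale region embeddings,
    # renormalize, and match against the normalized text embeddings.
    region = F.normalize(torch.stack(region_embeds).mean(0), dim=0)
    return region @ F.normalize(text_embeds, dim=1).T

# Toy usage: 8-dim features at two scales, 2 candidate classes.
feats = [torch.randn(8, 16, 16), torch.randn(8, 32, 32)]
mask = torch.zeros(16, 16)
mask[4:12, 4:12] = 1.0
scores = aggregate_multiscale_semantics(feats, mask, torch.randn(2, 8))
```

Averaging normalized per-scale embeddings is the simplest way to combine local (high-resolution) and global (low-resolution) semantics; a learned aggregation module, as in the thesis, would replace the plain mean.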

Finally, this thesis summarizes and discusses future work and directions for image semantic segmentation from a broader and more general perspective.

Keywords: image semantic segmentation; real-time semantic segmentation; open-vocabulary semantic segmentation; vision-language model
Subject areas: pattern recognition; computer perception; computer neural networks
Discipline: Engineering :: Computer Science and Technology (Engineering or Science degree may be conferred)
Language: Chinese
Sub-direction classification: image and video processing and analysis
State Key Laboratory planned research direction: visual information processing
Document type: Thesis (Master's)
Identifier: http://ir.ia.ac.cn/handle/173211/57642
Collection: Graduates — Master's theses
Recommended citation (GB/T 7714):
朱袁兵. 基于多尺度特征融合的图像语义分割方法研究[D],2024.
Files in this item:
朱袁兵毕业论文.pdf (29615 KB) — thesis; access: restricted; license: CC BY-NC-SA