CASIA OpenIR > Graduates (毕业生) > Doctoral Dissertations (博士学位论文)
Title: Research on Visual Scene Parsing Methods under Weak Supervision (弱监督条件下的视觉场景解析方法研究)
Author: 樊峻菘 (Junsong Fan)
Date: 2022-05
Pages: 136
Subtype: Doctoral dissertation
Abstract

Visual scene parsing is of great significance in computer vision research. It aims to make precise pixel-level predictions on visual images, endowing computers with a fine-grained, image-based perception and understanding of the real world, and it has important applications in autonomous driving, robotic visual navigation, remote-sensing image analysis, and other fields. In recent years, the rapid development of deep learning has produced a large number of models for visual scene parsing. However, to achieve reliable performance in practice, these models typically must be trained with large amounts of carefully produced pixel-level annotations tailored to each application scenario. This demand for dense, precise annotation creates a heavy reliance on professional human labor, incurs high costs in annotation time and money, and hinders the rapid deployment of deep-learning-based scene parsing models in new scenarios. To relieve the excessive annotation cost, researchers have proposed training visual scene parsing models with weakly-supervised annotations. Typical weak annotations include image-level class labels, bounding boxes, and sparse point or scribble labels. Compared with precise pixel-level annotation, these coarse weak labels are much easier to obtain and can effectively reduce the labeling cost of acquiring training samples. At the same time, however, because precise supervision is missing, weakly-supervised methods face many challenges in pixel-level scene parsing, such as partially missing targets and category confusion. To address these issues, this thesis develops its research progressively along the following four directions, exploring how to better mine and exploit the information in the data under weak supervision and how to improve the ability of weakly-supervised models on visual scene parsing tasks. The main contributions are:


- Proposes a weakly-supervised visual scene parsing method based on intra-class discrimination within a single image, which effectively alleviates the incomplete-segmentation problem of image-label-based weakly-supervised methods. Weakly-supervised semantic segmentation with image labels usually relies on a classifier's class saliency to extract the position and scale of targets. Because classification attends only to the differences between images and between classes, such methods typically segment only the most discriminative local parts of a target. This thesis therefore proposes an intra-class discriminator that focuses on the differences between pixels inside a single image, removing the interference of inter-class saliency, thereby alleviating the partial-activation problem and yielding more complete segmentation results.
- Proposes a weakly-supervised visual scene parsing method based on cross-image information transfer among multiple images, which mines the latent relations between samples to compensate for the scarcity of weak supervision. This work is the first to propose exploiting inter-image relations to assist the training of weakly-supervised semantic segmentation models. By modeling the affinity between pixels of different images, it transfers and shares information across images during training, collaboratively exploits multiple images to obtain more consistent feature representations, and improves the training of weakly-supervised models.
- Proposes a weakly-supervised visual scene parsing method based on the integration of multiple targets, which combines multiple methods and models to mine latent supervision so that information can be discovered and exploited more fully under weak supervision. This work analyzes the non-uniqueness of weakly-supervised pseudo-label estimation, finds that different target estimates carry complementary information, and proposes to train weakly-supervised models with these multiple target estimates jointly. Using the robustness of deep models and a noise-adaptation strategy, it effectively extracts the complementary information from the multiple targets and achieves significantly better results than training with any single target.
- Proposes a visual scene parsing method for multiple types of weak supervision, studying how to combine weak annotations of object category and spatial position to parse complex scenes. This part uses point labels as the carrier of supervision, jointly handles semantic and instance discrimination, and trains well-performing panoptic segmentation models under weak supervision. The method proposes a framework based on transition costs: by modeling the transition cost between adjacent pixels, it handles the semantic and instance discrimination problems of scene parsing in a unified way, effectively trains panoptic segmentation models under weak supervision, and achieves leading results on large-scale datasets.

 

In summary, for weakly-supervised visual scene parsing, this thesis first studies the mechanism for exploiting weak supervision within a single image. It then studies how to mine latent supervision from the perspectives of data and models, proposing the cross-image information transfer and the multi-target integration methods, respectively. Finally, it investigates how to combine multiple types of weak supervision, covering category and spatial position, to accomplish complex scene parsing and realize weakly-supervised panoptic segmentation. Compared with concurrent work, the proposed methods all deliver clear performance gains, reach leading results on the standard benchmarks of the field, and effectively alleviate problems such as missing segmentation targets and category confusion in weakly-supervised visual scene parsing; they have both academic novelty and practical value.

Other Abstract

Visual scene parsing plays an important role in the field of computer vision. It aims to produce pixel-level predictions for visual images, which endows computers with the ability to accurately perceive and understand the real world through images, and it has important application value in autonomous driving systems, robotic visual navigation, remote sensing analysis, and other fields. In recent years, with the rapid development of deep learning technologies, many visual scene parsing models have emerged. However, to achieve reliable performance in practical applications, these models generally require large amounts of pixel-level annotations for training, which must be produced by human annotators for each target application scenario. This demand for large quantities of precise annotations relies heavily on professional annotators, imposes high time and monetary costs on data acquisition, and hinders the rapid generalization and deployment of deep visual scene parsing models in new applications. To alleviate this excessive annotation burden, researchers have proposed a learning paradigm that employs weakly-supervised annotations to train visual scene parsing models. Typical weakly-supervised annotations include image-level class annotations, bounding-box annotations, and sparse point or scribble annotations. Compared with accurate pixel-level annotations, these coarse weak labels are much easier to obtain and can effectively reduce the annotation burden of acquiring training samples. However, due to the lack of accurate supervision, weakly-supervised methods face many challenges in pixel-level visual scene parsing, such as missing partial targets and category confusion.
To this end, this thesis conducts research progressively along the following four directions, exploring how to more effectively mine and utilize the information in the data under weak supervision and how to improve the ability of weakly-supervised models on visual scene parsing tasks. The main contributions include:


- proposes a single-image, intra-class information discrimination method for weakly-supervised visual scene parsing, which effectively alleviates the incomplete-target problem of image-label-based weakly-supervised methods. Image-label-based semantic segmentation generally relies on a classifier's class-wise saliency maps to extract the position and scale of targets. However, because classifiers attend only to the differences across images and across classes, these approaches recover only the most recognizable parts of a target. This thesis therefore proposes an intra-class discriminator, which focuses on the differences between pixels within the same image. It thereby eliminates the interference of inter-class saliency, effectively alleviates the partial-activation problem, and derives more complete target segmentation results.

- proposes a multi-image, cross-image information transfer method for weakly-supervised visual scene parsing, which mines the latent relations between samples to compensate for the information scarcity of weak supervision. This work is the first to propose using the relations between different images to assist the training of weakly-supervised semantic segmentation models. It models the correlations among pixels of different images, transmits and shares information across images during training, obtains more consistent representations by collaboratively leveraging multiple images, and thereby improves the learning of weakly-supervised models.

- proposes a multi-target integration method for weakly-supervised visual scene parsing, which synthesizes multiple methods and models to mine potential supervision so that information can be discovered and utilized more fully in weakly-supervised scenarios. This work analyzes the non-uniqueness of weakly-supervised pseudo-labels, finds that different pseudo-label estimates carry complementary information, and proposes an approach that jointly uses these multiple targets to train weakly-supervised models. The approach leverages the robustness of deep models and a noise-adaptation strategy to effectively extract the complementary information from the multiple targets, and it demonstrates significant improvements over single-target-based approaches.

- proposes a visual scene parsing method for multi-type weak supervision, studying how to utilize the class and spatial information in weakly-supervised labels to accomplish visual parsing tasks in complex scenes. This work uses point labels as the carrier of supervision, cooperatively handles semantic and instance discrimination tasks, and trains well-performing panoptic segmentation models with weak supervision. It proposes a transition-cost-based framework, which models the transition costs between adjacent pixels and uniformly handles the semantic and instance discrimination problems of visual scene parsing. The proposed approach effectively trains panoptic segmentation models with weak supervision and achieves leading performance on large-scale datasets.
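The first contribution separates foreground from background pixels within one image, rather than ranking pixels against other images and classes. The sketch below is our own simplification, not the thesis method: it replaces the learned intra-class discriminator with a plain 2-means split over the per-pixel class scores of a single image, and all names are invented for illustration.

```python
def intra_class_split(pixel_scores, iters=10):
    """Split the pixels of ONE image into foreground/background by 2-means
    clustering on their per-pixel class scores, so the decision boundary is
    intra-image rather than inherited from a cross-image saliency threshold.
    Returns a boolean mask (True = foreground)."""
    lo, hi = min(pixel_scores), max(pixel_scores)
    c_bg, c_fg = lo, hi  # initialise the two centroids at the extremes
    for _ in range(iters):
        fg = [s for s in pixel_scores if abs(s - c_fg) <= abs(s - c_bg)]
        bg = [s for s in pixel_scores if abs(s - c_fg) > abs(s - c_bg)]
        if fg:
            c_fg = sum(fg) / len(fg)
        if bg:
            c_bg = sum(bg) / len(bg)
    return [abs(s - c_fg) <= abs(s - c_bg) for s in pixel_scores]
```

Because both centroids are estimated from the current image only, weakly activated object parts can still fall on the foreground side, which is the intuition behind the more complete masks reported above.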
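The second contribution models pixel affinities across images and uses them to share information during training. A minimal stand-in for one such message-passing step is sketched below, assuming per-pixel feature vectors and soft labels for the partner image; the softmax-affinity form (in the spirit of attention) and all names are our assumptions, not the thesis's exact formulation.

```python
import math

def cross_image_message(feat_a, feat_b, labels_b, tau=1.0):
    """For each pixel feature in image A, compute softmax affinities to all
    pixel features in image B, then return the affinity-weighted average of
    B's soft labels: a cross-image estimate of A's labels."""
    out = []
    for fa in feat_a:
        # dot-product similarity to every pixel of image B
        sims = [sum(x * y for x, y in zip(fa, fb)) / tau for fb in feat_b]
        m = max(sims)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in sims]
        z = sum(w)
        w = [x / z for x in w]
        out.append([sum(wi * lb[k] for wi, lb in zip(w, labels_b))
                    for k in range(len(labels_b[0]))])
    return out
```

Averaging messages obtained from several partner images is what pushes the representations of same-class pixels toward consistency across the dataset.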
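The third contribution trains with several complementary pseudo-label estimates at once, using noise-adaptive strategies. As a much simpler illustration of why multiple estimates help, the sketch below fuses several pseudo-label maps by per-pixel voting and marks disputed pixels with an ignore value so a segmentation loss would skip them; the thesis's actual integration strategy is more elaborate than this.

```python
from collections import Counter

IGNORE = 255  # conventional "ignore" index in segmentation losses

def fuse_pseudo_labels(label_maps, min_agree=2):
    """Fuse several flattened pseudo-label maps pixel by pixel: keep the
    majority label where at least `min_agree` sources agree, otherwise mark
    the pixel as IGNORE so training is not misled by conflicting estimates."""
    fused = []
    for votes in zip(*label_maps):
        lab, n = Counter(votes).most_common(1)[0]
        fused.append(lab if n >= min_agree else IGNORE)
    return fused
```

Pixels where the estimates disagree are exactly where single-target training injects label noise; down-weighting or ignoring them is one way to exploit the complementarity noted above.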
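The fourth contribution unifies semantic and instance decisions by modeling transition costs between adjacent pixels, starting from labelled points. One hypothetical reading of that idea is a shortest-path assignment: each pixel takes the label of the point seed reachable with the smallest accumulated transition cost. The Dijkstra-style sketch below uses absolute intensity difference as the cost, a choice made here purely for illustration.

```python
import heapq

def assign_by_transition_cost(image, seeds):
    """Grow labelled point seeds over a 2-D intensity image: every pixel is
    assigned to the seed reachable with the lowest accumulated transition
    cost, where stepping between 4-adjacent pixels costs their intensity
    difference. `seeds` maps (row, col) -> semantic/instance id."""
    h, w = len(image), len(image[0])
    best = [[float("inf")] * w for _ in range(h)]
    label = [[None] * w for _ in range(h)]
    pq = []
    for (r, c), lab in seeds.items():
        best[r][c] = 0.0
        label[r][c] = lab
        heapq.heappush(pq, (0.0, r, c, lab))
    while pq:
        d, r, c, lab = heapq.heappop(pq)
        if d > best[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + abs(image[nr][nc] - image[r][c])
                if nd < best[nr][nc]:
                    best[nr][nc] = nd
                    label[nr][nc] = lab
                    heapq.heappush(pq, (nd, nr, nc, lab))
    return label
```

Because each seed carries both a class and an instance identity, the same cost competition decides semantic and instance membership at once, which mirrors the unified treatment described in the bullet above.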

 

In summary, for the problem of weakly-supervised visual scene parsing, this thesis first studies the information-utilization mechanism within single images. It then mines potential supervision from the perspectives of data and models, proposing the multi-image information transfer and the multi-target integration methods for weakly-supervised visual scene parsing, respectively. Finally, it studies how to utilize multiple types of weak supervision, incorporating class and spatial information, to accomplish complex visual scene parsing tasks and realize weakly-supervised panoptic segmentation. Compared with concurrent works, the proposed approaches achieve significant improvements and reach leading performance on the standard datasets of the field. They effectively alleviate the problems of missing segmentation targets and category confusion in weakly-supervised visual parsing, and they have both academic novelty and practical application value.

Keywords: weakly-supervised learning (弱监督学习); visual scene parsing (视觉场景解析); semantic segmentation (语义分割); panoptic segmentation (全景分割)
Language: Chinese
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/48926
Collection: 毕业生_博士学位论文 (Graduates: Doctoral Dissertations)
Recommended Citation
GB/T 7714
樊峻菘. 弱监督条件下的视觉场景解析方法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2022.
Files in This Item:
File Name/Size: 弱监督条件下的视觉场景解析方法研究.pdf (21274 KB) · DocType: dissertation (学位论文) · Access: restricted (限制开放) · License: CC BY-NC-SA
 
