Research on Instance Feature Modeling in Scene Semantic Understanding

	Research on Instance Feature Modeling in Scene Semantic Understanding
	Gao Naiyu (高乃钰)
	2022-05-25
Pages	136
Degree type	Doctoral
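Chinese abstract	Scene semantic understanding aims to perceive the appearance, semantic, geometric, and motion characteristics of the different elements in a scene. It is a fundamental and challenging problem in computer vision, with significant value in practical applications such as intelligent driving and security surveillance. After a long period of development, and especially since the introduction of deep learning, the field has made considerable progress. However, instance-level scene semantic understanding tasks such as instance segmentation and panoptic segmentation still suffer from low algorithmic efficiency and inconsistent feature modeling approaches. The core problem of these tasks is how to model instance features in a reasonable way, since the modeling approach directly affects the accuracy, efficiency, and generalization of the algorithm. This thesis explores efficient and unified instance feature modeling methods for scene semantic understanding and progressively studies the modeling of appearance and geometric information for foreground and background instances in a scene. First, targeting the multi-scale nature of foreground objects, it studies the efficient modeling of instance masks as part of object appearance. Next, with the instance mask as the link, it explores the joint modeling of foreground objects and background regions, which simplifies the algorithm pipeline and improves computational efficiency. Finally, it investigates the joint modeling of instance masks and depth information, achieving a complete perception of the appearance and geometric information of scene elements. The main work and contributions of the thesis are summarized as follows:
1. Efficient modeling of foreground object instance masks. Efficient modeling of foreground object instance masks is a core problem of scene semantic understanding. Because object scales vary widely in real scenes, scale decoupling and normalization of multi-scale instance masks are key to improving algorithm performance and efficiency. To this end, this work proposes the affinity pyramid feature, which efficiently models multi-scale instance mask information by decoupling pixel-pair affinities at different distances. Correspondingly, this work proposes a joint learning framework for affinity and semantic segmentation based on multi-scale decoder features, so that the two feature learning tasks benefit each other. Finally, exploiting the hierarchical nature of the affinity pyramid, this work optimizes the post-processing step that generates instance masks from affinities and proposes a more efficient cascaded graph partition post-processing module. This module segments instance masks progressively from coarse to fine, improving accuracy and speed at the same time. Experimental results on instance segmentation and panoptic segmentation tasks show that the proposed affinity-pyramid-based method models instance masks more efficiently and improves both performance and efficiency on multiple datasets.
2. Joint modeling of foreground object and background region instance masks. Besides foreground objects, the perception of background regions is also an important part of scene semantic understanding. However, because foreground and background instance masks are modeled in inconsistent ways, existing methods usually contain several modules, such as semantic segmentation, instance segmentation, and prediction fusion, which limits algorithm efficiency. It is therefore necessary to explore the joint modeling of foreground object and background region instance masks and to propose more concise and efficient models. To this end, this work proposes category- and instance-aware embedding features and builds the corresponding feature space through deep metric learning. In this feature space, pixel embeddings belonging to different instances are distinguishable, and each pixel embedding lies in the subspace of its category. The feature encodes pixel-level category and instance information in a unified way, which enables the joint modeling of foreground and background instance masks. Experimental results on the panoptic segmentation task show that the model based on category- and instance-aware embeddings achieves a better accuracy-speed balance, which verifies the effectiveness of jointly modeling foreground object and background region instance masks.
3. Joint modeling of instance masks and depth information. Beyond appearance information such as instance masks, recovering the geometric depth of a scene is an important step toward more comprehensive scene semantic understanding. Previous methods usually address this problem by directly adding a pixel-wise dense-prediction depth estimation module to the model. However, this approach considers only pixel-level low-level features in depth estimation and does not mine or use instance-level geometric information, and the inconsistency between the instance segmentation and depth estimation approaches also leads to insufficient information exchange between the tasks. It is therefore necessary to explore the joint modeling of instance masks and depth information. To this end, this work proposes a joint modeling scheme for instance masks and depth information based on instance-wise depth estimation, which mines and uses instance-level geometric information. Experimental results on the depth-aware panoptic segmentation task show that the proposed method outperforms the baseline on multiple datasets and, in particular, significantly improves mask segmentation and depth estimation on foreground objects, which verifies the effectiveness of jointly modeling instance masks and depth information.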
English abstract	Scene semantic understanding aims to perceive the appearance, semantics, geometry, motion, and other characteristics of the elements in a scene. It is a fundamental and challenging problem in computer vision and has important value in practical applications such as intelligent driving and security monitoring. After long-term development, and especially since the introduction of deep learning, the field of scene semantic understanding has made great progress. However, instance-level scene semantic understanding tasks such as instance segmentation and panoptic segmentation still suffer from problems such as low efficiency and inconsistent feature modeling. The core problem of these tasks is how to model instance features reasonably, because the modeling method directly affects the accuracy, efficiency, and generalization of the algorithm. This thesis explores efficient and unified instance feature modeling methods for scene semantic understanding and progressively studies the modeling of appearance and geometric information for foreground and background instances in a scene. First, targeting the multi-scale characteristics of foreground objects, the efficient modeling of object instance masks is studied. Then, using the instance mask as a link, the joint modeling of foreground objects and background regions is explored, which simplifies the algorithm pipeline and improves computational efficiency. Finally, the joint modeling of instance masks and depth is explored, enabling a complete perception of the appearance and geometric information of scene elements. The main work and innovations of the thesis are summarized as follows:
1. Efficient modeling of foreground instance masks. The efficient modeling of foreground instance masks is a core problem of scene semantic understanding. Because object scales vary widely in real scenes, scale decoupling and normalization for multi-scale masks are the keys to improving algorithm performance and efficiency. To this end, this work proposes an affinity pyramid that efficiently models multi-scale mask information by decoupling pixel-pair affinities at different distances. Correspondingly, this work proposes a joint learning framework for affinity and semantic segmentation based on multi-scale decoder features, in which the two tasks benefit each other. Finally, exploiting the hierarchical structure of the affinity pyramid, this work optimizes the post-processing step that generates instance masks from affinities and proposes a more efficient post-processing module in which the instance mask is segmented in a coarse-to-fine manner, improving both accuracy and efficiency. Experimental results on instance segmentation and panoptic segmentation tasks show that the proposed method models instance masks more efficiently and improves both accuracy and efficiency on multiple datasets.
2. Joint modeling of object and background instance masks. Besides foreground objects, the perception of background regions is also an important part of scene semantic understanding. Because foreground and background instance masks are modeled inconsistently, existing methods contain multiple modules, such as semantic segmentation, instance segmentation, and prediction fusion, which limits the efficiency of the whole system. It is therefore necessary to explore the joint modeling of object and background instance masks and to propose more concise and efficient models. To this end, this work proposes category- and instance-aware embedding features, which establish a specific feature space through metric learning. In this feature space, pixel embeddings from different instances are distinguishable, and each pixel embedding lies in the corresponding category subspace. By simultaneously encoding pixel-level category and instance information, joint modeling of object and background instance masks is achieved. Experimental results on the panoptic segmentation task show that the model based on category- and instance-aware embeddings achieves a better accuracy-speed balance, which demonstrates the effectiveness of jointly modeling object and background instance masks.
3. Joint modeling of instance masks and depth information. In addition to appearance information such as instance masks, the recovery of scene depth is an important step toward a more comprehensive scene semantic understanding. Previous methods tend to address this problem by directly adding a pixel-wise dense-prediction depth estimation module to the model. However, this approach considers only pixel-level features in depth estimation and does not mine or use instance-level geometric information, and the inconsistency between instance segmentation and depth estimation also leads to insufficient information interaction between the tasks. It is therefore necessary to explore the joint modeling of instance masks and depth information. To this end, this work proposes a joint modeling scheme for instance masks and depth based on instance-wise depth estimation, which enables the mining and use of instance-level geometric cues. Experimental results on the depth-aware panoptic segmentation task show that the proposed method outperforms the baseline on multiple datasets and, in particular, significantly improves mask segmentation and depth estimation on foreground objects, which demonstrates the effectiveness of jointly modeling instance masks and depth information.
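Keywords	scene semantic understanding; instance segmentation; panoptic segmentation; monocular depth estimation
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48650
Collection	Laboratory of Complex Systems Cognition and Decision-Making_Intelligent Systems and Engineering Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Gao Naiyu. Research on Instance Feature Modeling in Scene Semantic Understanding[D]. Institute of Automation, Chinese Academy of Sciences. Institute of Automation, Chinese Academy of Sciences, 2022.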

Files in this item
File name/size	Document type	Version type	Access type	License
场景语义理解中的实例特征建模研究_高乃钰 (28840 KB)	Thesis		Open access	CC BY-NC-SA
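
The abstract describes an affinity pyramid that decouples pixel-pair affinities at different distances. One way to picture this is sketched below. It is an illustration only, not the thesis's implementation, and it rests on several assumptions: affinities are derived from a ground-truth instance-id map, the window is a fixed 3x3 neighbourhood, and coarser pyramid levels are obtained by nearest-neighbour downsampling. The function names `downsample`, `window_affinity`, and `affinity_pyramid` are hypothetical.

```python
def downsample(labels, factor):
    """Nearest-neighbour downsampling of a 2D instance-id map (list of lists)."""
    return [row[::factor] for row in labels[::factor]]


def window_affinity(labels, radius=1):
    """For every pixel, record which neighbours in a (2r+1)x(2r+1) window
    carry the same instance id (1) or not (0). Out-of-image neighbours get 0."""
    h, w = len(labels), len(labels[0])
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    affinity = {}
    for y in range(h):
        for x in range(w):
            row = []
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < h and 0 <= nx < w
                same = inside and labels[ny][nx] == labels[y][x]
                row.append(1 if same else 0)
            affinity[(y, x)] = row
    return offsets, affinity


def affinity_pyramid(labels, factors=(1, 2, 4), radius=1):
    """Assumed construction: the same small window is applied at several
    resolutions, so coarser levels describe pixel pairs that are farther
    apart in the original image."""
    return {f: window_affinity(downsample(labels, f), radius) for f in factors}


# Toy 4x4 instance-id map with three instances (ids 1, 2, 3).
labels = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 2, 2],
    [3, 3, 2, 2],
]
pyramid = affinity_pyramid(labels, factors=(1, 2), radius=1)
offsets, fine = pyramid[1]    # short-range affinities at full resolution
_, coarse = pyramid[2]        # affinities between pixels two apart originally
```

In this toy setup, `pyramid[1]` holds affinities between directly adjacent pixels, while `pyramid[2]` holds affinities between pixels that are two positions apart in the original map, which is the sense in which each level covers a different distance.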