RGB-D Scene Understanding Based on Feature Learning and Fusion (基于特征学习和融合的 RGB-D 场景理解)
李亚蓓
2020-05-31
Pages: 142
Degree type: Doctoral
Abstract (Chinese)

The central question computer vision seeks to answer is what objects are in an image and where they are located; the keywords "what" and "where" highlight that a key task of computer vision is to understand the content of a scene and its location. Scene understanding covers recognition problems at different granularities, including scene classification, object detection, and semantic segmentation. The diversity of visual scenes and the complexity of the recognition tasks make scene understanding a challenging problem. Traditional computer vision mainly uses two-dimensional RGB images for scene understanding, and great progress has been made at both the theoretical and application levels. However, because information is lost in the mapping from the three-dimensional scene to the two-dimensional image, RGB-based scene understanding suffers from problems that are hard to resolve, such as sensitivity to illumination and color, lack of robustness to scale variation, and difficulty in handling occluded objects. Recently, with the advent of consumer-grade depth sensors, scene depth data can be acquired and RGB-D data can be used for scene understanding. Depth images provide geometric shape information that is robust to illumination and complements the color and texture information provided by RGB images. Multi-view RGB-D images can further be used to reconstruct 3D scenes, alleviating scale variation and occlusion. Taking RGB-D images and 3D point clouds as input, this thesis focuses on indoor scene classification and scene semantic segmentation, mainly including:

(1) We study indoor scene classification based on global features of RGB-D images. To address the large intra-class variation and small inter-class variation of global features in indoor scenes, we propose a multi-task learning network that simultaneously optimizes a metric-learning-based structured loss and a classification cross-entropy loss, improving the discriminability of the single-modality scene representation. For feature fusion, to obtain a discriminative multi-modal scene representation, we propose a discriminative feature fusion network that learns both the modality-specific and the correlated relationships among the modality representations. Experiments show that the proposed framework achieves excellent results on RGB-D indoor scene classification.

(2) We study indoor scene classification based on local representations of RGB-D images. Local information in a scene, such as objects and the relationships between them, plays an important role in scene understanding. To better describe the high variability of object spatial positions and to remove noisy local information, this thesis proposes an attention-based intra-modality pooling model that selects and aggregates local regions helpful for scene classification. To better fuse multi-modal local information, we extend it to an attention-based cross-modality pooling model that adaptively adjusts the contribution of each modality's local information during fusion. Experiments show that the proposed framework achieves excellent performance on RGB-D scene classification, and visualization further explains the basis of the model's decisions.

(3) We study indoor scene semantic segmentation based on RGB-D images, moving to fine-grained, pixel-level scene understanding. Depth images provide important edge and shape information for scene semantic segmentation. However, when RGB and depth features are fused only at high levels, much of the low-level information in the two modalities is lost. To address this problem, we propose a multi-level multi-modal fusion network. To better combine the low-level RGB and depth features, whose distributions differ considerably, we design a semantics-guided fusion module, which is supervised during training by the residual between the segmentation predicted from high-level features and the ground-truth annotation. By cascading fusion modules at different levels from top to bottom, finer full-resolution segmentation results are obtained. Experiments show that the proposed semantics-guided fusion method achieves excellent results on RGB-D scene segmentation.

(4) We study indoor scene semantic segmentation based on 3D point clouds. Compared with RGB-D images, scene understanding on 3D point clouds can effectively overcome scale variation, viewpoint change, and occlusion. When learning feature representations of scene-level 3D point clouds, the large volume of point cloud data is a prominent problem, and existing methods have difficulty capturing context information over the whole scene. To address this, the thesis proposes a 3D scene semantic segmentation method based on knowledge distillation and feature fusion. To represent information at different granularities, we design a two-stream point cloud network with inputs of different resolutions, where the dense local stream carries detailed information and the sparse global stream carries contextual semantic information. We further propose a distillation module and a fusion module to exchange detail and global context information between the two streams. Experiments verify the effectiveness of the proposed framework on 3D point cloud scene semantic segmentation.

Abstract (English)

Computer vision aims to discover what is in a real scene and where it is. The keywords "what" and "where" highlight that the main task of computer vision is to understand the content of a scene and the location of that content. Scene understanding includes recognition tasks at different levels, including scene classification, object detection, and semantic segmentation. It is a challenging problem because of the diversity of scenes and the complexity of the recognition tasks. Traditional methods mainly use two-dimensional RGB images for scene understanding, and considerable progress has been made at both the theoretical and application levels. However, when a three-dimensional scene is projected onto a two-dimensional RGB image, information is lost, so RGB-based scene understanding suffers from intrinsic problems such as sensitivity to lighting, scale variations, and occlusions. Recently, with the release of consumer-grade depth sensors, people can obtain depth data of a scene and use RGB-D data to understand it. Depth images provide geometric information that is robust to lighting changes and complements the color and texture information provided by RGB images. Multi-view RGB-D images can further be used to reconstruct 3D scenes, alleviating scale variations and occlusions. This thesis focuses on indoor scene classification and scene semantic segmentation with RGB-D images or 3D point clouds as input, mainly including:

(1) We study indoor scene classification based on the global features of RGB-D images. Global features of indoor scenes exhibit large intra-class variations and small inter-class variations. To learn discriminative representations in each modality, we construct a deep multi-task network that simultaneously minimizes a metric-learning-based structured loss and the classification cross-entropy loss. In the feature fusion stage, to obtain a discriminative multi-modal representation, we design a discriminative feature fusion network that learns the correlative features shared by the modalities and the distinctive features of each modality. Experiments validate that the proposed framework achieves excellent results on RGB-D indoor scene classification.
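To make the joint objective in (1) concrete, here is a minimal PyTorch sketch of a classification loss combined with a contrastive-style metric-learning term. It is an illustration only: the class name, the margin and weight hyperparameters, and the exact form of the structured loss are assumptions, not the thesis's implementation. The same objective would be applied to each modality branch (RGB and depth) before fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationMetricLoss(nn.Module):
    """Cross-entropy plus a contrastive-style structured term that pulls
    same-class embeddings together and pushes different-class embeddings
    apart (hypothetical form; the thesis's structured loss may differ)."""

    def __init__(self, margin=0.5, alpha=1.0):
        super().__init__()
        self.margin = margin   # desired gap for negative pairs
        self.alpha = alpha     # weight of the metric-learning term

    def forward(self, logits, embeddings, labels):
        ce = F.cross_entropy(logits, labels)

        # Pairwise Euclidean distances between L2-normalised embeddings.
        emb = F.normalize(embeddings, dim=1)
        dist = torch.cdist(emb, emb)                               # (B, B)

        same = labels.unsqueeze(0) == labels.unsqueeze(1)          # same-class mask
        eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
        pos_mask, neg_mask = same & ~eye, ~same                    # exclude self-pairs

        # Shrink positive-pair distances; enforce a margin on negative pairs.
        pos = (dist[pos_mask] ** 2).mean() if pos_mask.any() else dist.new_zeros(())
        neg = (F.relu(self.margin - dist[neg_mask]) ** 2).mean() if neg_mask.any() else dist.new_zeros(())

        return ce + self.alpha * (pos + neg)
```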

(2) We study indoor scene classification based on the local representations of RGB-D images. Local information, such as objects and the relationships between them, plays an important role in understanding a scene. To better handle the high spatial variability of objects and to remove noisy local information, we propose an intra-modality attentive pooling block that selects and aggregates informative local regions. To better fuse multi-modal local cues, we extend it to a cross-modality attentive pooling block that adaptively adjusts the contributions of local regions from the RGB and depth modalities. Experiments show that the proposed framework achieves excellent performance on RGB-D scene classification. The proposed model is also interpretable, which helps to explain the mechanisms of both scene classification and multi-modal fusion.
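As a rough illustration of attention-based pooling over local regions, the following PyTorch sketch scores each region, normalises the scores with a softmax, and returns a weighted sum; the cross-modality variant concatenates RGB and depth regions so that a joint softmax also balances the two modalities. The module names, tensor shapes, and single-linear-layer scorer are assumptions for illustration, not the thesis's blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalityAttentivePooling(nn.Module):
    """Scores each local region, softmax-normalises the scores, and returns an
    attention-weighted sum (illustrative; the thesis block may differ)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)              # per-region relevance score

    def forward(self, regions):                     # regions: (B, N, D)
        weights = F.softmax(self.score(regions), dim=1)   # (B, N, 1)
        pooled = (weights * regions).sum(dim=1)           # (B, D)
        return pooled, weights                      # weights can be visualised

class CrossModalityAttentivePooling(nn.Module):
    """Scores RGB and depth regions jointly, so the shared softmax also decides
    how much each modality contributes to the fused representation."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, rgb_regions, depth_regions):  # each (B, N, D)
        regions = torch.cat([rgb_regions, depth_regions], dim=1)   # (B, 2N, D)
        weights = F.softmax(self.score(regions), dim=1)
        return (weights * regions).sum(dim=1), weights              # fused (B, D)
```

Returning the attention weights alongside the pooled feature is what makes this kind of model inspectable: the weights indicate which regions, and which modality, drove the prediction.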

(3) We study indoor scene semantic segmentation based on RGB-D images, moving to the finer, pixel-level scene understanding task. Depth images provide important geometry-based edge information for scene semantic segmentation. However, when multi-modal fusion is performed only at high levels, the low-level information in the RGB and depth modalities is severely lost. To solve this problem, we propose a multi-level multi-modal fusion network. To better combine the low-level RGB and depth features, which have a modality gap, we design a semantics-guided fusion block. It uses the residual between the segmentation results predicted from high-level features and the ground truth as supervision during learning. By cascading semantics-guided fusion blocks from top to bottom, we obtain finer full-resolution segmentation results. Experiments show that the proposed fusion method achieves excellent results on RGB-D scene segmentation.
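The following sketch illustrates one plausible form of such a semantics-guided fusion step: low-level RGB and depth features are fused under the guidance of upsampled high-level features, and the block predicts a residual correction that is added to the coarse segmentation logits from the level above. All layer choices and names are assumptions; the thesis's block may be structured differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsGuidedFusion(nn.Module):
    """Hypothetical semantics-guided fusion step (illustration, not the thesis code).
    Fuses low-level RGB/depth features under high-level guidance and predicts a
    residual correction for the coarse segmentation logits."""

    def __init__(self, low_ch, high_ch, num_classes):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * low_ch + high_ch, high_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(high_ch),
            nn.ReLU(inplace=True),
        )
        self.residual_head = nn.Conv2d(high_ch, num_classes, kernel_size=1)

    def forward(self, rgb_low, depth_low, high_feat, coarse_logits):
        # Bring high-level features and logits up to the low-level resolution.
        size = rgb_low.shape[-2:]
        high_up = F.interpolate(high_feat, size=size, mode='bilinear', align_corners=False)
        logits_up = F.interpolate(coarse_logits, size=size, mode='bilinear', align_corners=False)

        fused = self.fuse(torch.cat([rgb_low, depth_low, high_up], dim=1))
        refined_logits = logits_up + self.residual_head(fused)   # residual correction
        return fused, refined_logits

# Training sketch: supervise the refined logits at every level with the ground
# truth, so each residual head learns to correct the errors of the coarser level:
#   loss = sum(F.cross_entropy(refined_logits_l, gt_l) for each level l)
```

Cascading several such blocks from the deepest level back to the input resolution yields the full-resolution prediction described above.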

(4) We study indoor scene semantic segmentation based on 3D point cloud data. Compared with RGB-D images, scene understanding based on 3D point clouds can effectively overcome scale variations, viewpoint changes, and occlusions. One of the main challenges in learning representations for 3D point clouds is their large scale, which makes it difficult for existing methods to obtain context information over the global scene. To solve this problem, we propose a 3D point cloud segmentation method based on knowledge distillation and feature fusion. We construct a two-stream point cloud network with inputs of different resolutions to process information at different levels: the dense local stream carries detailed information, and the sparse global stream carries global context cues. Meanwhile, we propose a distillation module and a fusion module to transfer detail and global context information between the two streams. Experiments verify the effectiveness of the proposed framework on 3D point cloud scene semantic segmentation.
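The sketch below illustrates the overall two-stream idea only: a dense local stream and a sparse global stream encode the scene at different resolutions, a nearest-neighbour propagation step aligns the global features with the dense points, a distillation-style loss pulls the two streams' features together, and the prediction head fuses them by concatenation. All module names, the per-point MLP backbones, and the MSE distillation term are illustrative assumptions, not the thesis's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Per-point shared MLP (PointNet-style); stands in for either stream's backbone."""
    def __init__(self, in_dim=3, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, xyz):                        # (B, N, 3)
        return self.mlp(xyz)                       # (B, N, feat_dim)

def propagate_to_dense(sparse_xyz, sparse_feat, dense_xyz):
    """Copy each sparse (global-stream) point's feature to its nearest dense
    (local-stream) point so the two streams can be compared and fused."""
    dist = torch.cdist(dense_xyz, sparse_xyz)      # (B, Nd, Ns)
    idx = dist.argmin(dim=-1)                      # nearest sparse point per dense point
    return torch.gather(sparse_feat, 1,
                        idx.unsqueeze(-1).expand(-1, -1, sparse_feat.size(-1)))

class TwoStreamSegSketch(nn.Module):
    """Dense local stream + sparse global stream, fused by concatenation, with a
    distillation loss that aligns their features (hypothetical sketch)."""
    def __init__(self, num_classes, feat_dim=64):
        super().__init__()
        self.local_stream = PointEncoder(feat_dim=feat_dim)
        self.global_stream = PointEncoder(feat_dim=feat_dim)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, dense_xyz, sparse_xyz):
        local_feat = self.local_stream(dense_xyz)                # fine detail
        global_feat = self.global_stream(sparse_xyz)             # scene context
        global_on_dense = propagate_to_dense(sparse_xyz, global_feat, dense_xyz)

        distill_loss = F.mse_loss(local_feat, global_on_dense)   # cross-stream distillation
        logits = self.head(torch.cat([local_feat, global_on_dense], dim=-1))  # fusion
        return logits, distill_loss
```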

Keywords: RGB-D, 3D point cloud, scene classification, scene semantic segmentation
Language: Chinese
Sub-direction classification (seven major directions): Image and Video Processing and Analysis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39732
Collection: Intelligent Perception and Computing
Recommended citation (GB/T 7714):
李亚蓓. 基于特征学习和融合的 RGB-D 场景理解[D]. 中国科学院自动化研究所智能化大厦1610. 中国科学院自动化研究所,2020.
Files in this item:
File name/size: 基于特征学习和融合的RGB-D场景理解- (10465 KB)
Document type: Thesis
Access: Open Access
License: CC BY-NC-SA