RGB-D Object Recognition Based on Multimodal Feature Representation and Fusion
程衍华
2017-05-25
Chinese Abstract
Object recognition is one of the most fundamental and central tasks in computer vision, covering both image-level object recognition and finer-grained pixel-level object recognition (i.e., scene semantic segmentation). In recent years, with the development of depth sensing technology such as the Microsoft Kinect, we can synchronously acquire high-resolution RGB images and high-quality depth images (i.e., RGB-D data) that describe the multimodal information of the same object. How to exploit the rich color and texture information provided by RGB images and the pure shape and geometry information provided by depth images to further improve object recognition accuracy, and to overcome the sensitivity of traditional single-modality RGB-based recognition algorithms to changes in viewpoint, scale, pose and illumination, has become a research topic of intense interest to both academia and industry. This thesis starts from four key components of RGB-D object recognition, namely feature representation, metric learning, classifier learning and RGB-D multimodal fusion, and, with full consideration of the cost of manually labeling large-scale training data, carries out the following research:
(1) We study unsupervised feature representation learning for RGB-D objects given large-scale unlabeled data. Handcrafted RGB and depth features are often complex to design and limited in representational power, while current deep learning features rely on large-scale manually labeled data for supervised training, which requires substantial human labor, resources and time. This thesis therefore explores how to automatically mine discriminative appearance and shape features of objects from large quantities of inexpensive unlabeled RGB-D data. We combine convolution with Fisher kernel encoding (the CFK feature), and combine convolution, spatial pyramid matching and recursive neural networks (the CNN-SPM-RNN feature), constructing two unsupervised hierarchical feature learners that effectively characterize each modality of RGB-D objects.
(2) We study joint co-learning of features and classifiers for RGB-D objects given a small amount of labeled data together with large-scale unlabeled data. Although unsupervised feature learning does not depend on sample labels, a concrete object recognition task still requires large-scale manually labeled category labels to train a classification model such as an SVM. To further reduce the dependence of the whole RGB-D object recognition pipeline on large-scale manual labeling, we explore how to exploit a small number of labeled samples and large-scale unlabeled data to obtain high-accuracy RGB-D object recognition. Benefiting from the complementarity of the RGB and depth modalities, we propose a co-training-based semi-supervised framework for joint feature and classifier learning that, using only 5% of the labels, achieves recognition performance comparable to the best fully supervised methods of the time.
(3) We study scale- and viewpoint-invariant multimodal fusion learning for RGB-D objects given large-scale labeled data. Effectively fusing the complementary RGB and depth modalities can further improve the accuracy and robustness of RGB-D object recognition. Existing fusion strategies generally rely on simple feature concatenation or classifier score summation; such strategies are easily disturbed by changes in object scale and viewpoint, and cannot adapt to the differing contributions of RGB and depth information when recognizing different objects. To address these problems, we first propose a dense matching strategy that maps objects into a common scale and viewpoint space, and then define a multimodal fusion learning strategy in that space to dynamically weigh the importance of each modality of an RGB-D object. Experiments show that, compared with the mainstream methods of the time, our approach achieves better classification accuracy on standard RGB-D object recognition benchmarks.
(4) We study RGB-D scene semantic segmentation given large-scale labeled data. Compared with the image-level RGB-D object recognition tasks above, pixel-level RGB-D object recognition, i.e., scene semantic segmentation, is more difficult: it requires predicting the category label of every pixel in an image, and thus involves both classification and localization. Building on fully convolutional neural networks, we propose a locality-sensitive deconvolutional network to improve segmentation along object boundaries, together with a gated fusion strategy that learns how the weights of the RGB and depth modalities vary when describing different objects in different scenes, further improving classification accuracy. Experiments show that, compared with the mainstream methods of the time, our approach achieves better segmentation results on standard RGB-D scene semantic segmentation benchmarks.
English Abstract
Object recognition is one of the most fundamental problems in computer vision, and includes both image-level object recognition and pixel-level object recognition (i.e., scene semantic segmentation). Recently, with the rapid development of commodity depth cameras such as the Microsoft Kinect, we can acquire high-quality, synchronized RGB and depth information (RGB-D data) to depict the multimodal characteristics of an object. Specifically, the RGB modality captures rich colors and textures, while the depth modality provides pure geometry and shape cues that are robust to illumination and color variations. It is therefore worthwhile to explore how RGB-D data can further improve the accuracy of object recognition, as well as the robustness of computer vision systems to variations in viewpoint, scale, pose and illumination, compared with traditional RGB-based methods. In this thesis, we first analyse the key components of the RGB-D object recognition framework, namely feature representation, metric learning, classifier learning and multimodal fusion, conditioned on different scales of manually labeled training data, and then make the following contributions to the RGB-D object recognition task.
(1) We investigate unsupervised feature learning methods to represent RGB-D objects given only massive unlabeled data. Both popular handcrafted features and deep learning methods have obvious drawbacks. Handcrafted features capture only specific, limited cues of objects, and designing new handcrafted features is often very hard. Although deep learning methods such as convolutional neural networks can learn powerful object features, they demand large-scale manually labeled datasets for supervised training, which is expensive and time-consuming. It is therefore meaningful to investigate powerful unsupervised feature learning methods that discover discriminative appearance and shape features of RGB-D objects from inexpensive unlabeled data. In this thesis, we propose two such methods: one combines single-layer convolutional neural networks with Fisher kernel encoding (CFK), and the other combines single-layer convolutional neural networks, spatial pyramid matching and recursive neural networks (CNN-SPM-RNN). Both methods automatically learn powerful features from large-scale unlabeled RGB-D data.
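As a rough illustration of the CFK idea, the sketch below pairs a single convolutional layer (with filters learned from unlabeled patches by k-means, a common unsupervised surrogate for single-layer filter learning) with a simplified first-order Fisher-vector encoding. The patch sizes, filter counts and the k-means step are assumptions chosen for brevity, not the thesis implementation.

# Minimal sketch of an unsupervised "convolution + Fisher kernel encoding" pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def learn_filters(patches, num_filters=64):
    """Learn convolutional filters from unlabeled patches with k-means."""
    patches = (patches - patches.mean(1, keepdims=True)) / (patches.std(1, keepdims=True) + 1e-8)
    return KMeans(n_clusters=num_filters, n_init=4).fit(patches).cluster_centers_

def convolve_dense(image_patches, filters):
    """Single-layer 'convolution': dot products between dense patches and filters,
    followed by a ReLU nonlinearity; returns one local descriptor per patch."""
    return np.maximum(image_patches @ filters.T, 0.0)

def fisher_vector(descriptors, gmm):
    """Simplified Fisher-vector encoding (first-order statistics only)."""
    q = gmm.predict_proba(descriptors)                    # soft assignments, (N, K)
    diff = descriptors[:, None, :] - gmm.means_[None]     # (N, K, D)
    fv = (q[..., None] * diff / np.sqrt(gmm.covariances_)[None]).sum(0)
    fv = (fv / descriptors.shape[0]).ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)              # L2 normalization

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unlabeled_patches = rng.normal(size=(5000, 75))       # e.g. 5x5x3 RGB patches
    filters = learn_filters(unlabeled_patches)
    gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
    gmm.fit(convolve_dense(unlabeled_patches, filters))
    image = rng.normal(size=(400, 75))                    # dense patches of one image
    print(fisher_vector(convolve_dense(image, filters), gmm).shape)  # (1024,)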
(2) We investigate semi-supervised joint feature and classifier learning for RGB-D object recognition given limited labeled and massive unlabeled data. Although the aforementioned unsupervised feature learning methods do not depend on any manual labels, the subsequent object recognition task still requires enough manually labeled examples to learn a classifier such as an SVM. To reduce the human labeling effort, we investigate how to leverage limited labeled and massive unlabeled data for high-performance RGB-D object recognition. Benefiting from the complementary cues of RGB and depth, we propose a semi-supervised multimodal deep learning model for RGB-D object recognition based on the co-training algorithm. Experiments on the benchmark RGB-D datasets demonstrate that, with only 5% labeled training data, our approach achieves recognition performance competitive with the state-of-the-art results reported by fully supervised methods.
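The loop below is a minimal sketch of the co-training idea: two per-modality classifiers label the shared unlabeled pool and hand their most confident predictions to each other as extra training data. Plain logistic-regression classifiers on synthetic feature views stand in for the multimodal deep model, and all sizes and thresholds are illustrative assumptions.

# Minimal co-training sketch for two complementary views (RGB vs. depth features).
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_rgb, X_d, y, labeled_mask, rounds=5, add_per_round=20):
    """Each modality's classifier labels the shared unlabeled pool; its most
    confident predictions become extra training data for the other modality."""
    views = {"rgb": X_rgb, "depth": X_d}
    labeled = {"rgb": labeled_mask.copy(), "depth": labeled_mask.copy()}
    pseudo = np.where(labeled_mask, y, -1)               # -1 marks "not yet labeled"
    clf = {m: LogisticRegression(max_iter=1000) for m in views}
    for _ in range(rounds):
        for src, dst in (("rgb", "depth"), ("depth", "rgb")):
            clf[src].fit(views[src][labeled[src]], pseudo[labeled[src]])
            pool = np.flatnonzero(~labeled[dst])
            if pool.size == 0:
                continue
            prob = clf[src].predict_proba(views[src][pool])
            order = np.argsort(-prob.max(axis=1))[:add_per_round]
            picks = pool[order]
            pseudo[picks] = prob[order].argmax(axis=1)   # pseudo-labels from src view
            labeled[dst][picks] = True                   # grow dst's training set
    return clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 3, size=300)
    X_rgb = rng.normal(size=(300, 32)) + y[:, None]          # toy "RGB" view
    X_d = rng.normal(size=(300, 16)) + 0.5 * y[:, None]      # toy "depth" view
    labeled_mask = np.zeros(300, dtype=bool)
    labeled_mask[rng.choice(300, 15, replace=False)] = True  # ~5% labeled
    classifiers = co_train(X_rgb, X_d, y, labeled_mask)
    print(classifiers["rgb"].score(X_rgb, y), classifiers["depth"].score(X_d, y))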
(3) We investigate scale- and viewpoint-invariant multimodal fusion for RGB-D object recognition given enough labeled data. An effective fusion of the RGB and depth cues can further improve the accuracy and robustness of RGB-D object recognition. Existing methods generally combine RGB and depth through feature concatenation or score fusion, which is susceptible to object pose and viewpoint variations and can hardly adapt to the varying contributions of RGB and depth for distinguishing different categories in different scenes. To address these problems, we first propose a new similarity measure based on dense matching, through which objects in comparison are warped and aligned to better tolerate these variations. We then introduce a learning-to-combine scheme that fuses a group of dense matchers equipped with different fusion weights for final RGB-D object recognition. The proposed approach achieves new state-of-the-art results on several public RGB-D object recognition benchmarks.
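A toy sketch of the two ingredients follows: a cell-wise matching similarity that tolerates small spatial shifts, and per-modality fusion weights learned from data. The grid sizes, matching window and the logistic-regression weight learner are assumptions made for brevity, not the thesis formulation.

# Toy sketch of "dense matching similarity + learned RGB/depth fusion weights".
import numpy as np
from sklearn.linear_model import LogisticRegression

def dense_match_sim(A, B, radius=1):
    """A, B: (H, W, D) feature grids. For every cell of A, find the most similar
    cell of B within a (2*radius+1)^2 spatial window and average those scores."""
    H, W, _ = A.shape
    total = 0.0
    for i in range(H):
        for j in range(W):
            best = -np.inf
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        a, b = A[i, j], B[ii, jj]
                        s = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
                        best = max(best, s)
            total += best
    return total / (H * W)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def grid():
        return rng.normal(size=(4, 4, 8))
    # Build toy pairs: label 1 = same category (similar grids), 0 = different.
    feats, labels = [], []
    for _ in range(100):
        base_rgb, base_d = grid(), grid()
        same = rng.integers(0, 2)
        other_rgb = base_rgb + 0.1 * rng.normal(size=base_rgb.shape) if same else grid()
        other_d = base_d + 0.1 * rng.normal(size=base_d.shape) if same else grid()
        feats.append([dense_match_sim(base_rgb, other_rgb),
                      dense_match_sim(base_d, other_d)])
        labels.append(int(same))
    fusion = LogisticRegression().fit(feats, labels)      # learned per-modality weights
    print("fusion weights (rgb, depth):", fusion.coef_[0])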
(4) We investigate RGB-D scene segmentation given enough labeled data. Compared with image-level RGB-D object recognition, RGB-D scene semantic segmentation is a more difficult problem, since it requires predicting the category label of every pixel in the image. Based on the popular fully convolutional neural networks, we propose locality-sensitive deconvolution networks with gated fusion (LSD-GF) for RGB-D indoor semantic segmentation. LSD-GF refines object boundary segmentation over each modality, whilst adjusting the contributions of RGB and depth over each pixel for high-performance recognition. Experiments on the benchmark RGB-D scene datasets show that our approach achieves state-of-the-art results for RGB-D indoor semantic segmentation.
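The PyTorch sketch below illustrates only the gated-fusion component: a per-pixel gate predicted from both modalities weighs RGB against depth features before pixel-wise classification. The shallow encoders, channel sizes and 1x1 gate convolution are placeholder assumptions; the locality-sensitive deconvolution part of LSD-GF is not reproduced here.

# Minimal sketch of pixel-wise gated fusion of RGB and depth feature maps.
import torch
import torch.nn as nn

class GatedFusionSeg(nn.Module):
    def __init__(self, in_channels=64, num_classes=40):
        super().__init__()
        # One shallow "encoder" per modality (stand-ins for the real backbones).
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, in_channels, 3, padding=1), nn.ReLU())
        self.dep_enc = nn.Sequential(nn.Conv2d(1, in_channels, 3, padding=1), nn.ReLU())
        # Gate: a per-pixel weight in [0, 1] predicted from both modalities.
        self.gate = nn.Sequential(nn.Conv2d(2 * in_channels, 1, 1), nn.Sigmoid())
        self.classifier = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, rgb, depth):
        f_rgb = self.rgb_enc(rgb)
        f_dep = self.dep_enc(depth)
        g = self.gate(torch.cat([f_rgb, f_dep], dim=1))    # (N, 1, H, W)
        fused = g * f_rgb + (1 - g) * f_dep                # pixel-wise weighting
        return self.classifier(fused)                      # per-pixel class scores

if __name__ == "__main__":
    model = GatedFusionSeg()
    rgb = torch.randn(2, 3, 64, 64)
    depth = torch.randn(2, 1, 64, 64)
    print(model(rgb, depth).shape)                         # torch.Size([2, 40, 64, 64])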
Keywords: RGB-D, Object Recognition, Feature Representation, Metric Learning, Multimodal Fusion
Language: Chinese
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/14668
Collection: Graduates / Doctoral Dissertations
Recommended Citation (GB/T 7714):
程衍华. 基于多模态特征表达与融合的RGB-D物体识别[D]. 北京: 中国科学院研究生院, 2017.
Files in This Item:
File Name/Size: yhc_thesis_实名.pdf (10502 KB)
Document Type: Dissertation
Access: Restricted
License: CC BY-NC-SA