基于深度信息的同时定位与稠密建图方法研究 (Research on Simultaneous Localization and Dense Mapping Based on Depth Information)
邢晓霞
Subtype: Doctoral dissertation (博士)
Thesis Advisor: 杨一平; 蔡莹皓
2022-05-28
Degree Grantor: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Place of Conferral: 中国科学院自动化研究所
Degree Discipline: Computer Application Technology (计算机应用技术)
Keywords: local feature descriptor; monocular depth estimation; camera pose estimation; octree map; semantic map
Abstract

In recent years, Simultaneous Localization and Mapping (SLAM), a key technology enabling autonomous navigation for mobile robots, has attracted wide attention. Monocular SLAM has become a major research focus within visual SLAM thanks to its simple structure, low cost, flexibility, and strong extensibility. However, monocular SLAM typically builds only sparse or semi-dense maps of the scene and cannot provide the dense maps needed for applications such as robot navigation and obstacle avoidance. This thesis studies simultaneous localization and dense semantic mapping based on depth estimation. Its main contributions are summarized as follows:
1. To address the difficulty of establishing robust feature matching under large viewpoint changes and severe illumination changes, we propose a learned local feature descriptor that fuses 2D image and 3D geometric information, using non-linear feature fusion to fully exploit the complementary information between the two modalities. Experiments on keypoint matching and pairwise registration show that the proposed descriptor outperforms descriptors built from a single modality or by direct concatenation.
2. To address the difficulty current monocular depth estimation methods have in producing depth maps with sharp details, we propose a monocular depth estimation method that combines a global self-attention mechanism with dynamic filtering. First, the self-attention mechanism aggregates the semantic features of all image locations by weighting, yielding rich image context. Then, dynamic filtering uses high-resolution image features to guide the upsampling of the coarse depth map, producing a depth image with sharp details. Experiments on the indoor NYU dataset and the outdoor KITTI dataset show that the method obtains accurate depth maps with clear details.
3. To further improve the accuracy of depth estimation and camera pose estimation, we propose a self-iterative optimization method for depth and pose computation that combines traditional geometry with self-supervised deep learning. On one hand, pseudo RGB-D SLAM is run with the depth maps predicted by the network; with the pseudo-depth information, SLAM's robust optimization, and loop closure detection, it yields camera poses and sparse map points more reliable than those of monocular SLAM. On the other hand, the sparse map points created by monocular SLAM guide image depth estimation and improve its quality. Experiments on the TUM RGB-D and KITTI datasets demonstrate the effectiveness of the proposed method.
4. We propose a dense semantic mapping method based on depth estimation. It first obtains 3D object semantic information through 2D object detection and 3D point cloud segmentation, then establishes data association between 3D objects according to the overlap between their point clouds, building a 3D object semantic map. At the same time, a dense octree map of the environment is built from the per-frame dense point clouds and pose estimates. Experiments show that depth estimation can help monocular SLAM build 3D object semantic maps and dense octree maps, supporting applications such as robot navigation and obstacle avoidance.

Other Abstract

Simultaneous Localization and Mapping (SLAM), as a key component for the autonomous navigation of intelligent mobile robots, has attracted great attention in recent years. Among the various types of SLAM, monocular SLAM has become a popular topic in visual SLAM due to its simple structure, low cost, flexibility, and strong scalability. However, monocular SLAM can only build sparse or semi-dense maps of the environment, which cannot support applications such as robot navigation and obstacle avoidance. This thesis focuses on simultaneous localization and dense semantic mapping based on depth estimation. Its main contributions are summarized as follows:
1. Since it is difficult to establish robust feature matching under challenging conditions such as large viewpoint variations and severe lighting changes, we propose a local feature descriptor that fuses 2D image and 3D geometric information. The non-linear feature fusion fully captures the complementary information between the two modalities. Experimental results on keypoint matching and pairwise registration tasks show that the proposed descriptor clearly outperforms both single-modality descriptors and a direct fusion baseline that simply concatenates the two features.
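As an illustration of the fusion idea, the following minimal sketch contrasts direct concatenation with a non-linear fusion layer. The descriptor dimensions are hypothetical, and a fixed random projection stands in for the learned fusion network described in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-keypoint descriptors: a 2D image descriptor (128-D,
# SIFT-like) and a 3D geometric descriptor (32-D) for the same 5 keypoints.
d2d = rng.standard_normal((5, 128)).astype(np.float32)
d3d = rng.standard_normal((5, 32)).astype(np.float32)

# Baseline fusion: direct concatenation of the two modalities.
concat = np.concatenate([d2d, d3d], axis=1)           # shape (5, 160)

# Non-linear fusion sketch: one hidden layer mixes the modalities, so the
# output can encode cross-modal interactions that concatenation cannot.
W = rng.standard_normal((160, 64)).astype(np.float32) * 0.1
fused = np.tanh(concat @ W)                           # shape (5, 64)

# L2-normalize so descriptors can be compared by cosine/Euclidean distance.
fused /= np.linalg.norm(fused, axis=1, keepdims=True)
```

In practice the fusion weights would be trained with a metric-learning loss so that matching keypoints end up close in descriptor space.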
2. Since current monocular depth estimation methods struggle to generate depth maps with clear and sharp details, we combine a global self-attention mechanism and dynamic guided upsampling for monocular depth estimation. On one hand, the self-attention mechanism captures long-range dependencies by computing the representation of each image location as a weighted sum of the features at all image locations. On the other hand, a dynamic guided upsampling module employs a dynamically generated kernel, conditioned on low-level features, to guide the upsampling of the coarse depth map. Experimental results show that the proposed method generates visually pleasing and highly accurate depth maps on the indoor NYU dataset and the outdoor KITTI dataset.
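The weighted-sum formulation of self-attention described above can be sketched as follows; the learned query/key/value projections and multi-head details of the actual network are omitted, so this is only the core aggregation step:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over flattened image locations.

    x: (N, C) feature map with one row per pixel location. Each output row
    is a weighted sum of the features at ALL locations, giving every
    position access to global image context.
    """
    n, c = x.shape
    scores = x @ x.T / np.sqrt(c)                  # (N, N) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax: each row sums to 1
    return attn @ x                                # weighted sum over locations

# A toy 4x4 feature map with 8 channels, flattened to 16 locations.
x = np.random.default_rng(1).standard_normal((16, 8))
y = self_attention(x)
```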
3. To further improve the accuracy of depth prediction and camera pose estimation for monocular videos, we propose a method that iteratively updates the predicted depths and camera poses by combining the respective advantages of self-supervised monocular depth estimation and monocular SLAM. On one hand, pseudo RGB-D SLAM with CNN-predicted depth achieves camera pose estimation superior to monocular SLAM by incorporating pseudo-depth information, a robust optimization algorithm, and loop closure detection. On the other hand, the sparse map obtained from monocular SLAM guides the depth estimation network, improving its performance. Experimental results on the TUM RGB-D and KITTI datasets demonstrate the effectiveness of the proposed method.
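One concrete ingredient of coupling learned depth with monocular SLAM is resolving the scale ambiguity between network predictions and the (up-to-scale) sparse map. A common, minimal approach is median-ratio alignment against sparse depths; this is shown here as an illustrative assumption, not necessarily the thesis's exact procedure:

```python
import numpy as np

def align_scale(pred_depth, sparse_depth, mask):
    """Align predicted depth to sparse reference depths by a single scale.

    pred_depth:   (H, W) network-predicted depth map.
    sparse_depth: (H, W) depths of projected sparse map points; only
                  entries where mask is True are valid.
    Returns the prediction rescaled by the median depth ratio at the
    valid sparse points, which is robust to outlier map points.
    """
    scale = np.median(sparse_depth[mask] / pred_depth[mask])
    return scale * pred_depth

# Toy example: the prediction is correct up to a global 3x scale error,
# and only two pixels carry sparse SLAM depths.
rng = np.random.default_rng(2)
gt = rng.uniform(1.0, 5.0, (4, 4))
pred = gt / 3.0
mask = np.zeros((4, 4), dtype=bool)
mask[0, :2] = True
aligned = align_scale(pred, gt, mask)
```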
4. We further build a dense semantic map based on the predicted depth. The method first obtains 3D object semantic information through 2D object detection and 3D point cloud segmentation. Next, the association between objects is established according to the overlap ratio between object point clouds, building the 3D object semantic map. Simultaneously, a dense octomap of the environment is built from the dense point clouds and camera poses. Experimental results show that depth estimation can assist monocular SLAM in building a dense octomap and a 3D object semantic map, supporting applications such as robot navigation and obstacle avoidance.
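The overlap-based data association between object point clouds can be sketched as below; the 0.05 m neighbor radius and the brute-force nearest-neighbor search are illustrative assumptions (a real system would use a k-d tree and a tuned threshold):

```python
import numpy as np

def overlap_ratio(pc_a, pc_b, radius=0.05):
    """Fraction of points in pc_a with a neighbor in pc_b within `radius`.

    Sketch of overlap-based association: if the ratio between a newly
    segmented object cloud and an existing map object is high, the two
    are treated as the same 3D object and merged.
    """
    if len(pc_a) == 0 or len(pc_b) == 0:
        return 0.0
    # Pairwise distances (|A| x |B|), then nearest neighbor per point of A.
    d = np.linalg.norm(pc_a[:, None, :] - pc_b[None, :, :], axis=-1).min(axis=1)
    return float(np.mean(d <= radius))

a = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0]])
b = a + np.array([0.01, 0.0, 0.0])   # nearly the same cloud, slightly shifted
c = a + np.array([10.0, 0.0, 0.0])   # a far-away cloud: a different object
```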

Pages: 133
Language: Chinese (中文)
Document Type: Doctoral thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/48787
Collection: 综合信息系统研究中心_视知觉融合及其应用; 毕业生_博士学位论文
Recommended Citation (GB/T 7714):
邢晓霞. 基于深度信息的同时定位与稠密建图方法研究[D]. 中国科学院自动化研究所, 2022.
Files in This Item:
CASIA Thesis -邢晓霞.pd (7942 KB) · DocType: thesis (学位论文) · Access: Open Access (开放获取) · License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.