Abstract

Simultaneous Localization and Mapping (SLAM), as a key component of autonomous navigation for intelligent mobile robots, has attracted great attention in recent years. Among the various types of SLAM, monocular SLAM has become a popular topic in visual SLAM due to its simple structure, low cost, flexibility, and strong scalability. However, monocular SLAM can only build sparse or semi-dense maps of the environment, which are insufficient to support applications such as robot navigation and obstacle avoidance. This thesis therefore focuses on simultaneous localization and dense semantic mapping based on depth estimation. The main contributions of the thesis are summarized as follows:
1. Since it is difficult to establish robust feature matching under challenging conditions such as large viewpoint variations and severe lighting changes, we propose a local feature descriptor that combines 2D image and 3D geometric information. The nonlinear feature fusion fully captures the complementary information between the two modalities. Experimental results on keypoint matching and pairwise registration tasks show that the proposed local feature descriptor substantially outperforms both single-modality feature descriptors and direct fusion by simply concatenating the different features.
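The abstract does not specify the fusion architecture; as an illustrative sketch only, nonlinear fusion of the two modalities can be contrasted with plain concatenation by passing the concatenated descriptors through a small MLP so the modalities interact. All dimensions and weights below are hypothetical placeholders, not the thesis's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_descriptors(f2d, f3d, w1, b1, w2, b2):
    """Nonlinear fusion of a 2D image descriptor and a 3D geometric
    descriptor via a tiny two-layer MLP (illustrative only)."""
    x = np.concatenate([f2d, f3d])             # the naive baseline: plain concatenation
    h = np.maximum(0.0, w1 @ x + b1)           # ReLU hidden layer lets the modalities interact
    out = w2 @ h + b2
    return out / (np.linalg.norm(out) + 1e-8)  # L2-normalize, as is common for descriptors

# Hypothetical sizes: a 128-D image descriptor and a 33-D geometric descriptor.
f2d = rng.standard_normal(128)
f3d = rng.standard_normal(33)
w1 = rng.standard_normal((256, 161)); b1 = np.zeros(256)
w2 = rng.standard_normal((128, 256)); b2 = np.zeros(128)

fused = fuse_descriptors(f2d, f3d, w1, b1, w2, b2)
```

In practice the MLP weights would be learned end-to-end on a matching objective; the point of the sketch is only that the fused descriptor is a nonlinear function of both inputs rather than a side-by-side stack.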
2. Since current monocular depth estimation methods struggle to generate depth maps with clear and sharp details, we combine a global self-attention mechanism with dynamic guided upsampling to learn monocular depth estimation. On the one hand, the self-attention mechanism captures long-range dependencies by computing the representation of each image location as a weighted sum of the features at all image locations. On the other hand, a dynamic guided upsampling module is designed that employs a dynamically generated kernel, conditioned on low-level features, to guide the upsampling of the coarse depth map. Experimental results show that the proposed method generates visually pleasing and highly accurate depth maps on the indoor NYU dataset and the outdoor KITTI dataset.
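The weighted-sum formulation of global self-attention can be sketched directly: each of the N = H×W locations attends to every other location via softmax-normalized similarity scores. The sketch below uses numpy with hypothetical feature and projection sizes; the thesis's actual attention variant may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_self_attention(x, wq, wk, wv):
    """Global self-attention over flattened spatial locations.

    x: (N, C) features for N = H*W image locations. Each output row is
    a weighted sum of the value vectors at ALL locations, so long-range
    dependencies are captured in a single step.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, N) scaled pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return attn @ v                                # weighted sum over all locations

# Hypothetical sizes: an 8x8 feature map with 32 channels, 16-D queries/keys.
x = rng.standard_normal((64, 32))
wq = rng.standard_normal((32, 16))
wk = rng.standard_normal((32, 16))
wv = rng.standard_normal((32, 32))
out = spatial_self_attention(x, wq, wk, wv)
```

The quadratic (N, N) score matrix is what gives self-attention its global receptive field, in contrast to the local receptive fields of stacked convolutions.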
3. To further improve the accuracy of depth prediction and camera pose estimation for monocular videos, we propose a method that iteratively updates the predicted depths and camera poses by combining the respective advantages of self-supervised monocular depth estimation and monocular SLAM. On the one hand, pseudo RGB-D SLAM with CNN-predicted depth achieves camera pose estimation superior to monocular SLAM by incorporating pseudo-depth information, a robust optimization algorithm, and loop closure detection. On the other hand, the sparse map obtained from monocular SLAM guides the depth estimation network toward improved performance. Experimental results on the TUM RGB-D and KITTI datasets demonstrate the effectiveness of the proposed method.
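The abstract does not detail how the sparse map guides the network. As one illustrative step, assuming nothing beyond the text: a common way to reconcile CNN-predicted depth with sparse SLAM points before running pseudo RGB-D SLAM is to align the prediction with a single robust (median) scale factor at the sparse keypoints. The function and its inputs below are hypothetical, not the thesis's exact scheme.

```python
import numpy as np

def align_depth_to_sparse_map(pred_depth, sparse_uv, sparse_depth):
    """Align a CNN-predicted depth map to sparse SLAM depths using one
    median scale factor (an illustrative pre-step for pseudo RGB-D
    SLAM; not necessarily the thesis's exact scheme).

    pred_depth:   (H, W) predicted depth map.
    sparse_uv:    (K, 2) integer pixel coordinates (u, v) of map points.
    sparse_depth: (K,)   depths of those points from the SLAM sparse map.
    """
    u, v = sparse_uv[:, 0], sparse_uv[:, 1]
    ratios = sparse_depth / pred_depth[v, u]  # per-point scale estimates
    s = np.median(ratios)                     # median is robust to outlier matches
    return s * pred_depth, s

# Toy example: the prediction is off by a global factor of 2.
pred = np.full((4, 4), 1.5)
uv = np.array([[0, 0], [1, 2], [3, 3]])
d_slam = np.array([3.0, 3.0, 3.0])
aligned, scale = align_depth_to_sparse_map(pred, uv, d_slam)  # scale -> 2.0
```

In an iterative pipeline, the aligned depth would feed pose estimation, and the resulting sparse map would in turn supervise or refine the depth network on the next round.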
4. We further build a dense semantic map based on the predicted depth. The method first obtains 3D object semantic information through 2D image object detection and 3D point cloud segmentation. Next, the association between objects is established according to the overlap ratio between object point clouds, yielding the 3D object semantic map. Simultaneously, a dense octomap of the environment is built from the dense point cloud and the camera poses. Experimental results show that depth estimation enables monocular SLAM to build dense octomaps and 3D object semantic maps that support applications such as robot navigation and obstacle avoidance.
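The overlap-ratio association can be sketched as follows, under assumptions the abstract leaves open: here overlap is measured on a shared voxel grid and normalized by the smaller cloud, and association is a simple thresholded greedy match. The voxel size and threshold are hypothetical values, not the thesis's.

```python
import numpy as np

def overlap_ratio(pc_a, pc_b, voxel=0.05):
    """Overlap between two object point clouds: shared occupied voxels
    divided by the smaller cloud's voxel count (one reasonable reading
    of the overlap criterion; details assumed)."""
    va = {tuple(p) for p in np.floor(pc_a / voxel).astype(int)}
    vb = {tuple(p) for p in np.floor(pc_b / voxel).astype(int)}
    return len(va & vb) / max(1, min(len(va), len(vb)))

def associate(objs_prev, objs_curr, thresh=0.5):
    """Associate objects across frames when their overlap ratio exceeds
    a threshold (hypothetical value); returns (prev_idx, curr_idx) pairs."""
    matches = []
    for i, a in enumerate(objs_prev):
        for j, b in enumerate(objs_curr):
            if overlap_ratio(a, b) > thresh:
                matches.append((i, j))
    return matches

# Toy check: an identical cloud overlaps fully; a far-away copy does not.
cloud = np.random.default_rng(0).uniform(0.0, 1.0, (200, 3))
far = cloud + 10.0
```

Associated point clouds would then be fused into per-object instances, while the full dense cloud is inserted into the octomap using the estimated camera poses.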