Research on Monocular Visual SLAM for Planar Scenes (面向平面场景的单目视觉SLAM研究)

	Research on Monocular Visual SLAM for Planar Scenes
	杜思聪
	2021-05-26
Pages	98
Degree type	Master's
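Chinese abstract	Visual simultaneous localization and mapping (SLAM) is a core problem in 3D computer vision and a key technology for mobile robots, augmented reality, and related fields. By analyzing input images, visual SLAM estimates the camera trajectory and builds a map of the environment while the camera moves. In recent years, monocular visual SLAM has attracted wide attention from industry because of its low cost and ease of use. Monocular visual SLAM relies on an initialization step to compute the initial camera poses and scene structure, which serve as initial values for the rest of the pipeline, so the quality of initialization determines the performance of the system. In the real scenes where monocular visual SLAM runs, planes are a common and important structural feature. According to the number of planes observed by the camera, this thesis divides planar scenes into two classes: single-plane scenes, in which the camera's field of view contains only one plane, and multi-plane scenes, in which it contains several planar structures. Existing monocular visual SLAM methods do not make full use of planar constraints in such scenes and suffer from slow initialization, poor robustness, and dependence on the camera motion pattern. To address these problems, this thesis studies monocular visual SLAM in planar scenes in depth. The specific contributions are as follows: 1. For single-plane scenes, this thesis proposes a global plane optimization method for monocular visual SLAM initialization. The method first computes the homography matrix between each pair of frames from several consecutive frames, then solves for the camera poses and the plane normal by global plane optimization. The objective is to minimize the homography-based projection error, and the optimization variables are the camera poses and the plane normal. Finally, map points are reconstructed from the plane equation and the camera projection equation. Compared with existing approaches, global plane optimization avoids noise-sensitive steps such as homography decomposition and triangulation and greatly reduces the number of optimization variables. Experiments show that the method converges quickly and computes camera poses and map points accurately. 2. For multi-plane scenes, this thesis introduces a plane instance segmentation method on top of global plane optimization to complete monocular visual SLAM initialization. The segmentation is implemented with a deep neural network whose input is a single image. An encoder maps every pixel of the image into a high-dimensional embedding space in which pixels lying on the same plane are as close together as possible, and clustering in this space then partitions the planar pixels. To speed up inference, fine-grained pruning is applied during training to make the model parameters sparser. Based on the segmentation result, feature points are detected and tracked on the plane containing the most pixels, and inter-frame homography matrices are computed from the feature correspondences. Global plane optimization is then used to solve for the camera poses and the plane normal, and map points are finally recovered from the plane equation and the camera projection equation. Experiments show that the method segments plane instances fairly accurately and improves the initialization accuracy of monocular visual SLAM in multi-plane scenes. 3. This thesis builds a monocular visual SLAM system based on sliding-window optimization. After initialization, the system uses a sliding window to keep computing camera poses and reconstructing map points during tracking. The window follows a first-in-first-out policy: when a new image is added to the sequence, the image that entered the window earliest is removed. The camera pose and map points for each newly added image are computed by optimization. Throughout operation the window stores only a small number of images, which balances accuracy against computational cost.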
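The abstract's final reconstruction step, recovering map points from the plane equation and the camera projection equation, can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a plane written as n · X = d in the reference camera frame; the function name and this parameterization are hypothetical. In a monocular setup d is known only up to scale, so a caller would typically fix it to an arbitrary positive value.

```python
def backproject_to_plane(u, v, fx, fy, cx, cy, normal, d):
    """Intersect the viewing ray of pixel (u, v) with the plane normal . X = d.

    All names are illustrative assumptions: (fx, fy, cx, cy) are pinhole
    intrinsics, `normal` is a 3-tuple, and `d` is the plane offset expressed
    in the same camera frame as the returned point.
    """
    # Direction of the viewing ray through the pixel, scaled so that z = 1.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)

    # Substituting X = depth * ray into normal . X = d gives
    # depth = d / (normal . ray).
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-12:
        raise ValueError("viewing ray is parallel to the plane")

    depth = d / denom
    if depth <= 0.0:
        raise ValueError("intersection lies behind the camera")

    # 3D point on the plane, in camera coordinates.
    return tuple(depth * r for r in ray)


if __name__ == "__main__":
    # A fronto-parallel plane z = 2 seen by a camera with made-up intrinsics.
    point = backproject_to_plane(820.0, 240.0, 500.0, 500.0, 320.0, 240.0,
                                 (0.0, 0.0, 1.0), 2.0)
    print(point)
```

Because each point is obtained in closed form from a single observation and the plane parameters, no two-view triangulation is needed in this sketch, which matches the motivation stated in the abstract.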
English abstract	Visual Simultaneous Localization and Mapping (SLAM) is a core problem in 3D computer vision and is important for mobile robots and augmented reality. It estimates the camera trajectory and a map of the surroundings from visual images. Recently, monocular SLAM has received widespread attention due to its low cost and convenience. During initialization it calculates the initial poses and map points, which serve as initial values for the rest of the monocular SLAM system. The initialization quality determines the performance of a monocular SLAM system. In real scenes, planar structures are common. According to the number of planes observed by the camera, this thesis divides planar scenes into two categories: single-plane scenes and multi-plane scenes. In a single-plane scene there is only one plane in the camera's view; in a multi-plane scene there is more than one. However, existing monocular SLAM methods do not fully use planar constraints in planar scenes: their initialization is slow, their robustness is poor, and their results depend on the camera motion. To address these weaknesses, this thesis studies monocular SLAM in planar scenes. The main work of this thesis includes: 1. For single-plane scenes, a global plane optimization method is proposed for monocular SLAM initialization. The method first calculates the homography matrices across multiple frames from corresponding points. The global plane optimization then computes the camera poses and the plane normal by minimizing the projection error induced by the homography matrices; the optimization variables are the camera poses and the plane normal. After that, the map points are calculated from planar constraints instead of triangulation. Compared to existing methods, the proposed method avoids homography decomposition and triangulation and greatly reduces the number of optimization variables. Experimental results show that the method computes camera poses and map points quickly and accurately. 2. For multi-plane scenes, a plane instance segmentation method is introduced so that SLAM initialization can be performed with the global plane optimization. A deep neural network segments plane instances from a single image. It first maps each pixel to an embedding space where pixels from the same plane instance have similar embeddings; the plane instances are then obtained by grouping the embedding vectors. To reduce inference time, a fine-grained pruning (network slimming) method is applied during network training. From the segmentation results, the dominant plane, the one with the largest number of pixels, is selected. Keypoints on this plane are detected and tracked, and the homography matrices are computed from the corresponding points. The global plane optimization is then applied to calculate the camera poses and the plane normal, and the map points are recovered from planar constraints. Experimental results show that this approach improves the accuracy of monocular SLAM initialization in multi-plane scenes. 3. A monocular SLAM system is built on sliding-window optimization. The system first runs the initialization described above and then uses a sliding window to compute the camera poses and map points. The sliding window follows a first-in-first-out strategy: the oldest frame is marginalized when a new frame is added to the sequence. An optimization method calculates the camera poses and map points for the new frames. The sliding window keeps only a few frames in order to balance accuracy and efficiency.
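The first-in-first-out bookkeeping of the sliding window described in item 3 can be sketched as follows. This is a minimal illustration under assumptions, not the thesis's system: the class name `SlidingWindow` and its `add` method are hypothetical, frames are treated as opaque objects, and the sketch only evicts the oldest frame. It does not perform the marginalization or the pose and map-point optimization that the abstract mentions.

```python
from collections import deque


class SlidingWindow:
    """Fixed-size FIFO container for frames (hypothetical helper)."""

    def __init__(self, max_frames):
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        self.max_frames = max_frames
        self.frames = deque()

    def add(self, frame):
        """Append a new frame; return the evicted oldest frame, or None."""
        evicted = None
        if len(self.frames) == self.max_frames:
            # Window is full: the frame that entered first leaves first.
            evicted = self.frames.popleft()
        self.frames.append(frame)
        return evicted


if __name__ == "__main__":
    window = SlidingWindow(max_frames=3)
    for frame_id in range(5):
        dropped = window.add(frame_id)
        print(frame_id, dropped, list(window.frames))
```

Returning the evicted frame lets a caller decide what to do with its information, for example folding it into a prior before discarding it, while the window itself stays bounded in size.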
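Keywords	visual SLAM, SLAM initialization, plane instance segmentation, sliding window
Language	Chinese
Seven major directions: sub-direction category	3D vision
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/44693
Collection	Graduates_Master's theses
Recommended citation (GB/T 7714)	杜思聪. 面向平面场景的单目视觉SLAM研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2021.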

Files in this item
File name/size	Document type	Version type	Access type	License
面向平面场景的单目视觉SLAM研究.pd (15094KB)	Thesis		Restricted access	CC BY-NC-SA