Research on Robot Visual Localization Based on Deep Features


	Research on Robot Visual Localization Based on Deep Features
	Guan Peiyu (管培育)
	2022-05-20
Pages	138
Degree type	Doctoral
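Chinese abstract	A reliable localization capability is a prerequisite for the autonomous motion of service robots. Because visual sensors are widely applicable, information-rich, and inexpensive, visual localization has attracted broad attention. Representing visual image information with deep networks supports robust localization of service robots under viewpoint and illumination changes and in complex environments, which is of significant theoretical interest and has broad application prospects. Considering both unknown environments and known maps, this thesis studies robot visual localization based on deep features. The main contents are as follows. First, the background and significance of visual localization based on deep features are introduced, the state of the art is reviewed from two aspects, visual SLAM in unknown environments and visual localization based on known maps, and the content and structure of the thesis are outlined. Second, to address the sensitivity of hand-crafted feature points to environmental changes in traditional visual SLAM in unknown environments, a visual simultaneous localization and mapping method based on point features and object-level semantic features, PO-SLAM, is proposed. A convolutional neural network extracts object-level semantic features from the image, and the depth image is used to perform fast geometric segmentation of the object detection boxes, which separates objects from the background and facilitates point-object feature association within the same frame. Together with inter-frame point-point and object-object feature associations, a point-object constraint is imposed during optimization so that matched points carry the same semantic information, which improves the accuracy of data association, and a relative position invariance constraint between objects is imposed to improve localization accuracy. In addition, feature points associated with dynamic objects are removed according to the point-object association results, which improves the robustness of visual localization in dynamic environments. Experiments on datasets and in real scenes demonstrate the effectiveness of the proposed method. Third, a scene coordinate regression network based on spatial feature transformation, SFT-CR, is proposed, which achieves camera relocalization through an implicit representation of the robot's environment. On the one hand, because the standard convolution operation lacks intrinsic invariance to the geometric image transformations caused by viewpoint changes, a spatial feature transformation network is designed to transform the convolutional features explicitly, which improves the robustness of the deep features to geometric transformations and thus the accuracy of coordinate estimation. On the other hand, a loss function based on maximum likelihood is constructed and the uncertainty of coordinate estimation is introduced, so that the scene coordinate regression network outputs not only the 3D coordinates corresponding to 2D image pixels but also the uncertainty of each coordinate. The resulting 2D-3D correspondences are filtered by coordinate uncertainty, and the 6D camera pose is computed with the PnP algorithm, which improves the accuracy and efficiency of localization. In addition, the CoordConv operation is introduced into feature extraction to improve the distinguishability of features in weakly textured regions. The effectiveness of the proposed method is verified on datasets. Fourth, a visual place recognition method based on multi-task learning, MTA, is proposed. Training based on the triplet ranking task in existing place recognition methods ignores the compactness of the global feature distribution of images from adjacent positions, which leads to poor generalization. To address this, a new binary classification task is introduced in which query-positive pairs form the positive class and query-negative pairs form the negative class. A binary classification loss constrains the feature distances of all positive pairs to be smaller than those of all negative pairs, and the global feature extraction network is trained jointly with the existing triplet ranking task. This strengthens the intra-place compactness and inter-place separability of the global image features and thereby improves the generalization ability of the model. An attention module is also embedded in the global feature extraction network so that the network focuses more on regions useful for place recognition during feature aggregation, which improves the discriminability of the global image features. The proposed method is validated experimentally on datasets and in real scenes. Fifth, a localization software architecture for service robots based on an offline hybrid map is proposed, and the proposed PO-SLAM is integrated with the relocalization methods SFT-CR and MTA under the ROS framework. According to the size of the environment and the task requirements, and based on the adaptability of MTA to large-scale scenes and the localization accuracy of SFT-CR, a hybrid map combining explicit and implicit maps is constructed. The explicit map contains the keyframe local feature points and corresponding poses extracted by PO-SLAM, the 3D map points, and the keyframe global features extracted by MTA, while the implicit map is built by training the scene coordinate regression network on important areas that require high localization accuracy. When local feature tracking fails in PO-SLAM, the robot recovers its pose using the MTA and SFT-CR methods, achieving reliable and stable localization. The effectiveness of this approach is verified by navigation experiments in an indoor office environment. Finally, the work of this thesis is summarized and directions for further research are pointed out. Keywords: visual localization, semantic SLAM, relocalization, scene coordinate regression network, visual place recognition, offline hybrid map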
English abstract	The localization ability of service robots is a prerequisite for autonomous motion. Because visual sensors are widely applicable, information-rich, and low in cost, visual localization has received much attention. Using deep networks to represent visual image information supports robust localization of service robots under viewpoint variation, illumination change, and complex environments, which is significant for both research and applications. Considering the two situations of unknown environments and known maps, this thesis studies robot visual localization based on deep features. The main contents are as follows. Firstly, the research background and significance of this thesis are given, the development of visual SLAM in unknown environments and of visual localization based on known maps is reviewed, and the content and structure of the thesis are introduced. Secondly, to address the problem that the hand-crafted feature points of traditional visual SLAM methods in unknown environments are sensitive to environmental change, a visual simultaneous localization and mapping method based on point features and object-level semantic features, termed PO-SLAM, is proposed. It uses a convolutional neural network to extract object-level semantic features from the image, and then applies a fast geometric segmentation algorithm to the object detection boxes, with the help of the depth image, to separate objects from the background, which facilitates point-object feature association within the same frame. Furthermore, point-point and object-object feature associations between frames are constructed. A point-object constraint is imposed on the optimization process so that matched points share the same semantic information, which improves the accuracy of data association. A relative position invariance constraint between objects is also imposed to improve localization accuracy.
In addition, the feature points associated with dynamic objects are removed according to the point-object association results, which improves the robustness of visual localization in dynamic environments. Experiments on datasets and in real scenes demonstrate the effectiveness of the proposed method. Thirdly, a scene coordinate regression network based on spatial feature transformation, SFT-CR, is proposed, which achieves camera relocalization through an implicit scene representation. On the one hand, because the standard convolution operation lacks intrinsic invariance to the geometric image transformations caused by viewpoint change, a spatial feature transformation network is designed to transform the convolutional features explicitly. This improves the robustness of the deep features to geometric transformation and, as a result, the accuracy of coordinate estimation. On the other hand, a loss function based on maximum likelihood is constructed and the uncertainty of coordinate estimation is introduced. The scene coordinate regression network thus provides not only the 3D coordinates corresponding to the 2D pixels in the image but also the uncertainty of each coordinate. The obtained 2D-3D correspondences are filtered by coordinate uncertainty before the 6D camera pose is estimated with the PnP algorithm, which improves the accuracy and efficiency of localization. In addition, the CoordConv operation is introduced into feature extraction to make features in weakly textured areas more distinguishable. The effectiveness of the proposed method is verified by relocalization experiments on datasets. Fourthly, a visual place recognition method based on multi-task learning, termed MTA, is proposed. Training based on the triplet ranking task in existing methods ignores the compactness of the global features of images from adjacent positions, which leads to weak generalization.
To address this problem, a new binary classification task is introduced, in which all query-positive pairs form the positive class and all query-negative pairs form the negative class. A binary classification loss is designed to constrain the feature distances of all positive pairs to be smaller than those of all negative pairs. The global feature extraction network is trained by combining the binary classification task with the existing triplet ranking task, which enhances intra-place compactness and inter-place separability of the global features and therefore improves the generalization of the model. Moreover, an attention module is embedded into the global feature extraction network so that the network pays more attention to regions useful for place recognition during feature aggregation, which makes the global image feature more discriminative. The proposed method is validated experimentally on datasets and in a real environment. Fifthly, a localization software architecture for service robots based on an offline hybrid map is proposed, which integrates the proposed PO-SLAM with the relocalization methods SFT-CR and MTA under the ROS framework. According to the size of the environment and the task requirements, a hybrid map combining explicit and implicit maps is constructed, based on the adaptability of the MTA method to large-scale environments and the relocalization accuracy of the SFT-CR method. The explicit map contains the local feature points and poses of keyframes from PO-SLAM, the 3D map points, and the global features of keyframes extracted by MTA. The implicit map is established by training the scene coordinate regression network on important areas that require high localization accuracy. When local feature tracking fails in PO-SLAM, the robot recovers its pose using the MTA and SFT-CR methods.
Reliable and stable localization is thereby achieved, as verified by navigation experiments in an indoor office environment. Finally, conclusions are given and future work is discussed. Keywords: visual localization, semantic SLAM, relocalization, scene coordinate regression network, visual place recognition, offline hybrid map
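The abstract describes a maximum-likelihood loss that makes the scene coordinate regression network output an uncertainty for each 3D coordinate, followed by uncertainty-based filtering of the 2D-3D correspondences before PnP. The record does not give the exact formulation, so the following is only a minimal sketch under stated assumptions: an isotropic Gaussian noise model with one predicted log standard deviation per point, and a fixed threshold for filtering. The names `gaussian_nll`, `select_confident`, and `max_sigma` are hypothetical and do not come from the thesis.

```python
import math


def gaussian_nll(pred, target, log_sigma):
    """Negative log-likelihood of one 3D scene coordinate.

    Assumption (not from the thesis): isotropic Gaussian noise with a single
    predicted log standard deviation per point. Constant terms are dropped.
    """
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    sigma_sq = math.exp(2.0 * log_sigma)
    # The error term shrinks as sigma grows; the 3 * log_sigma term
    # (one per coordinate axis) penalizes claiming high uncertainty.
    return sq_err / (2.0 * sigma_sq) + 3.0 * log_sigma


def select_confident(pixels, coords, log_sigmas, max_sigma):
    """Keep the 2D-3D pairs whose predicted sigma is at most max_sigma."""
    keep = [i for i, ls in enumerate(log_sigmas) if math.exp(ls) <= max_sigma]
    return [pixels[i] for i in keep], [coords[i] for i in keep]


# Hypothetical values: three pixels, their predicted 3D coordinates in metres,
# and predicted standard deviations of 0.02, 0.50 and 0.05.
pixels = [(10, 20), (30, 40), (50, 60)]
coords = [(0.1, 0.2, 1.0), (0.4, 0.1, 2.0), (0.3, 0.3, 1.5)]
log_sigmas = [math.log(0.02), math.log(0.50), math.log(0.05)]

kept_px, kept_xyz = select_confident(pixels, coords, log_sigmas, max_sigma=0.1)
print(kept_px)  # the middle point is dropped as too uncertain
```

In a full pipeline the retained pairs would be passed to a PnP solver to estimate the 6D camera pose. Filtering first reduces the number of outliers the solver has to reject, which is consistent with the accuracy and efficiency gains the abstract reports.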
Keywords	Visual localization; Semantic SLAM; Relocalization
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48775
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Guan Peiyu. Research on Robot Visual Localization Based on Deep Features[D]. Institute of Automation, Chinese Academy of Sciences, 2022.

Files in this item
File name/size	Document type	Version type	Access type	License
管培育_学位论文_最终版0610.pdf (35838KB)	Thesis		Restricted access	CC BY-NC-SA