Research on Robot Visual 3D Environment Perception Based on Deep Learning

刘洁锐

2024-05

Pages

124

Degree Type

Doctoral

Chinese Abstract

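A robot's ability to perceive its 3D environment well is an important prerequisite and guarantee for providing high-quality services. Among robot perception modalities, visual perception has attracted wide attention because of its broad applicability, rich information, and low deployment cost. Robot operation in real unstructured environments often involves tasks at different levels, such as large-scale navigation and object manipulation in local scenes, which requires the robot to have hierarchical 3D environment perception capability. Focusing on deep learning-based robot visual 3D environment perception, this dissertation conducts in-depth research at two levels, large-scale spatial structure and locally refined object semantics, to provide perception support for intelligent autonomous robot services in complex environments. The work has theoretical significance and broad application prospects. The main contents are as follows: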

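First, the research background and significance of robot visual 3D environment perception are introduced, the state of the art is reviewed from two aspects, large-scale spatial structure perception and locally refined object semantic perception, and the content and structure of the dissertation are outlined.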

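Second, a temporal monocular depth estimation network based on geometric priors and pixel-level sensitivity is proposed to perceive large-scale spatial stereo information. To address the insufficient accuracy of camera pose estimation in monocular depth estimation methods, geometric constraints are introduced through temporal optical flow information and combined with a prior depth map to achieve accurate camera pose estimation. To improve the generalization of depth encoding, a prior feature consistency regularization is built from the geometric priors of temporal frames to help optimize the depth encoder. A sensitivity-adaptive depth decoder is also constructed: by modeling the sensitivity of the depth prediction to different pixels, it selectively adjusts the short-connection path between the cost volume and the final depth prediction, which alleviates the effect of sensitivity differences among pixels of the input frame on depth prediction and improves depth estimation accuracy. On this basis, visual odometry is incorporated to provide localization information for the robot. Experiments on datasets demonstrate the effectiveness of the proposed method.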

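Third, a bird's-eye-view semantic segmentation network based on two-stream compact depth transformation and feature rectification is designed to obtain the spatial semantic distribution, which together with the large-scale spatial stereo information constitutes the large-scale spatial structure information. On the one hand, a two-stream compact depth transformation network is built that selectively combines different temporal frames to balance depth prediction and bird's-eye-view feature integration, and that optimizes the computation of the cost volume in depth prediction by differentially reducing the resolution of the temporal frame features to be matched. On the other hand, a segmentation network based on feature rectification is designed. It exploits the adaptively variable receptive fields of deformable convolutional neurons to make the segmentation network sample the features of the corresponding spatial regions by itself, thereby rectifying the input feature maps, while semantic contrastive learning is used to train the segmentation network better. In addition, virtual camera intrinsic parameters are introduced to improve the compatibility of 3D perception task learning with 2D image data augmentation. The proposed method is validated on datasets.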

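Fourth, for refined semantic perception in local scenes, a category-level object pose estimation network based on a structural auto-encoder and reasoning attention is proposed to obtain semantic information such as object category and pose. A reconstruction loss across different instances of the same category is designed to mine the structural features shared by images of objects of that category, and a structural auto-encoder is learned accordingly, which effectively improves the category generalization of color image features. Furthermore, a reasoning attention decoder is constructed that combines image and depth point cloud features to implicitly model category attributes and to reason over the features, yielding a more valuable feature representation. Because the many parameter interactions in the decoder make learning difficult, a gradient decoupling strategy is also adopted to decouple the reasoning attention decoder from the point cloud feature extractor, which accelerates convergence of the overall network. The proposed method is validated experimentally on datasets and in real scenes.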

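Fifth, refined perception of local scenes should also meet the need to extend the semantics of object categories. To this end, a few-shot object detection network based on generalized feature learning is proposed. Features extracted by existing few-shot object detection backbones are biased toward base classes, which leads to weak adaptability to novel classes. To address this, a two-branch backbone is constructed in which a new parallel branch is added to the original backbone to retain more class-agnostic features, and a feature aggregation method combining probabilistic path selection and source-based channel dropout is designed to integrate the features of the two branches effectively. The integrated generalized features remain compatible with novel-class objects while also being more discriminative. A loose contrastive loss is also constructed to assist network training; by mining information from hard sample pairs it provides additional supervision and promotes the learning of the detection network. The effectiveness of the proposed method is validated on datasets.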

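Sixth, the above perception methods at the two levels of spatial structure and refined object semantics are integrated under the ROS framework to build a hierarchical robot 3D environment perception system. A software architecture is designed that comprises a large-scale spatial structure perception layer, a locally refined object semantic perception layer, a perception fusion layer, and a perception control layer. The large-scale spatial structure information is used to generate a metric map of the environment, and semantic information such as the category, position, and pose of objects of interest in local scenes is embedded into it to form a metric global object semantic map, which provides the perception basis needed for autonomous robot operation. Experiments on a real robot verify the effectiveness of the proposed perception system.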

Finally, the work of this dissertation is summarized, and research that needs to be carried out further is pointed out.

English Abstract

The ability to perceive the 3D environment is an important prerequisite and guarantee for robots to provide high-quality services. Visual perception has attracted wide attention in robotics because of its broad applicability, rich information, and low deployment cost. Robots working in real unstructured environments often face tasks at different levels, such as large-scale navigation and local object manipulation, which requires them to possess hierarchical 3D environment perception ability. Focusing on deep learning-based robotic 3D visual environment perception, this dissertation conducts research at two levels, large-scale spatial structure and locally refined object semantics, and thereby provides perception support for intelligent and autonomous robot services in complex environments. The work is of theoretical significance and has broad application prospects. The main contents of the dissertation are as follows:

Firstly, the research background and significance of robotic 3D visual environment perception are introduced, and the state of research is reviewed from the aspects of large-scale spatial structure perception and locally refined object semantic perception. The content and structure of the dissertation are also outlined.

Secondly, a temporal monocular depth estimation network based on geometric priors and pixel-level sensitivity is proposed to perceive large-scale spatial stereo information. To address the insufficient accuracy of camera pose estimation in monocular depth estimation methods, geometric constraints are introduced using temporal optical flow information and combined with the prior depth map to achieve accurate camera pose estimation. To improve the generalization of the depth encoder, a prior feature consistency regularization is constructed from the geometric priors of the temporal frames to assist the training of the depth encoder. By modeling the sensitivity of the depth prediction to different pixels, a sensitivity-adaptive depth decoder is constructed to selectively adjust the short-connection path between the cost volume and the final depth prediction. As a result, the impact of sensitivity differences among pixels of the input frame on the depth prediction is alleviated, and the accuracy of depth estimation is improved. On this basis, visual odometry is incorporated to localize the robot. Experiments on datasets show the effectiveness of the proposed method.
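The link between depth, camera pose, and temporal frames mentioned above can be illustrated with the standard pinhole reprojection that multi-frame depth methods commonly use when matching pixels across frames. The following is a minimal sketch under stated assumptions, not the dissertation's network: the function name `reproject`, the intrinsic matrix values, and the pose are all hypothetical and chosen only for illustration.

```python
import numpy as np

def reproject(pixel_uv, depth, K, R, t):
    """Map a pixel from a reference frame into a neighboring frame.

    pixel_uv: (u, v) pixel coordinates in the reference image.
    depth:    depth of that pixel along the reference camera's z-axis.
    K:        3x3 pinhole intrinsic matrix (assumed identical for both frames).
    R, t:     rotation (3x3) and translation (3,) that take points from
              reference-camera coordinates to neighbor-camera coordinates.
    """
    u, v = pixel_uv
    # Back-project the pixel to a 3D point in the reference camera frame.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point_ref = depth * ray
    # Move the point into the neighboring camera frame using the relative pose.
    point_nbr = R @ point_ref + t
    # Project onto the neighboring image plane.
    proj = K @ point_nbr
    return proj[:2] / proj[2]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# A point 2 m in front of the camera, with a 0.1 m shift along x between frames.
uv_nbr = reproject((320.0, 240.0), 2.0, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(uv_nbr)
```

In this relation an error in either the depth or the relative pose shifts the reprojected pixel, which is one way to see why accurate pose estimation matters for temporal depth estimation.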

Thirdly, a bird's-eye-view semantic segmentation network based on two-stream compact depth transformation and feature rectification is designed to obtain the spatial semantic distribution, which is combined with the large-scale spatial stereo information to form the large-scale spatial structure information. On the one hand, a two-stream compact depth transformation network is built that selectively combines different temporal frames to balance depth prediction and bird's-eye-view feature integration. By differentially reducing the resolution of the features of the temporal frames to be matched, the computation of the cost volume in depth prediction is optimized. On the other hand, a segmentation network based on feature rectification is designed, in which deformable convolutional neurons with adaptively variable receptive fields lead the segmentation network to sample the features of the corresponding spatial regions by itself, thereby rectifying the input feature maps. Furthermore, the segmentation network is trained more effectively through semantic contrastive learning. In addition, virtual camera intrinsic parameters are adopted to improve the compatibility of 3D perception learning with 2D image data augmentation. The proposed method is verified on datasets.
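As a rough illustration of how depth-lifted image features can be gathered into a bird's-eye-view grid, the sketch below sum-pools per-point features into ground-plane cells. It is a generic pooling step assumed for illustration only, not the two-stream compact depth transformation itself; the function name `splat_to_bev`, the grid ranges, and the axis convention (x forward, y left) are hypothetical.

```python
import numpy as np

def splat_to_bev(points_xyz, features, x_range, y_range, cell_size):
    """Sum-pool per-point features into a bird's-eye-view grid.

    points_xyz: (N, 3) points in the ego frame (assumed x forward, y left).
    features:   (N, C) feature vector attached to each point.
    x_range, y_range: (min, max) extent of the grid in meters.
    cell_size:  edge length of one square grid cell in meters.
    Returns an (H, W, C) array, with H along x and W along y.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    h = int(round((x_max - x_min) / cell_size))
    w = int(round((y_max - y_min) / cell_size))
    # Convert metric coordinates to integer cell indices (height is ignored).
    ix = np.floor((points_xyz[:, 0] - x_min) / cell_size).astype(int)
    iy = np.floor((points_xyz[:, 1] - y_min) / cell_size).astype(int)
    # Drop points that fall outside the grid.
    inside = (ix >= 0) & (ix < h) & (iy >= 0) & (iy < w)
    bev = np.zeros((h, w, features.shape[1]), dtype=features.dtype)
    # np.add.at accumulates correctly when several points share one cell.
    np.add.at(bev, (ix[inside], iy[inside]), features[inside])
    return bev

points = np.array([[0.5, 0.5, 0.0], [0.6, 0.4, 1.0], [5.0, 5.0, 0.0]])
feats = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
bev = splat_to_bev(points, feats, (0.0, 2.0), (0.0, 2.0), 1.0)
print(bev.shape)
```

Sum pooling is used here only because it is the simplest aggregation; a learned network would typically weight features by predicted depth confidence before pooling.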

Fourthly, for refined semantic perception in local scenes, a category-level object pose estimation network based on a structural auto-encoder and reasoning attention is proposed to obtain semantic information including object category and pose. A reconstruction loss across different instances of the same category is designed to mine the structural features shared within that category. The structural auto-encoder is learned with this loss, which effectively improves the category generalization of color image features. Further, a reasoning attention decoder is constructed, which combines the image and depth point cloud features to implicitly model the category attributes and then reasons over the features to obtain a more valuable feature representation. Because the large number of parameter interactions in the decoder increases the difficulty of learning, a gradient decoupling strategy is used to decouple the reasoning attention decoder from the point cloud feature extractor, which accelerates network convergence. The proposed method is verified on datasets and in real scenes.

Fifthly, considering that the refined local scene perception of robots should also meet the need to extend the semantics of object categories, a few-shot object detection network based on generalized feature learning is proposed. To tackle the problem that the features extracted by existing few-shot object detection backbones are biased toward base classes and adapt weakly to novel classes, a two-branch backbone is constructed. A new parallel branch is added to the original backbone to retain more class-agnostic features. Moreover, a feature aggregation method is designed with probabilistic path selection and source-based channel dropout to effectively integrate the two-branch features. The resulting generalized features not only remain compatible with novel classes but are also more discriminative. In addition, a loose contrastive loss is constructed to assist network training. By mining hard sample pairs, it provides additional supervision and promotes the learning of the detection network. The effectiveness of the proposed method is verified on datasets.

Sixthly, the above perception methods at the spatial structure and refined object semantic levels are integrated under the ROS framework to form a hierarchical robotic 3D environment perception system. Its software architecture consists of a large-scale spatial structure perception layer, a locally refined object semantic perception layer, a perception fusion layer, and a perception control layer. An environment metric map is generated from the large-scale spatial structure information, and the semantic information of objects of interest in local scenes, including category, position, and pose, is embedded into it to form a metric global object semantic map. The proposed perception system provides the perception basis needed for autonomous robot services, and its effectiveness is verified through experiments on a real robot.
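Embedding an object's pose into a global map, as described above, usually comes down to chaining coordinate transforms from the camera frame to the map frame. The sketch below shows that step with plain homogeneous matrices. It is an assumption-laden illustration rather than the system's actual ROS implementation: the frame names, the helper functions `planar_pose_to_matrix` and `object_pose_in_map`, and the dictionary used as a semantic map are all hypothetical.

```python
import numpy as np

def planar_pose_to_matrix(x, y, yaw):
    """Convert a planar robot pose in the map frame to a 4x4 transform."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[0, 3] = x
    T[1, 3] = y
    return T

def object_pose_in_map(T_map_robot, T_robot_camera, T_camera_object):
    """Chain transforms so the object pose is expressed in the map frame."""
    return T_map_robot @ T_robot_camera @ T_camera_object

# Hypothetical inputs: robot at (1, 2) facing +y in the map frame; the camera
# frame is assumed to coincide with the robot frame for simplicity.
T_map_robot = planar_pose_to_matrix(1.0, 2.0, np.pi / 2)
T_robot_camera = np.eye(4)
# Object detected 1 m ahead of the camera along its x-axis.
T_camera_object = np.eye(4)
T_camera_object[0, 3] = 1.0

T_map_object = object_pose_in_map(T_map_robot, T_robot_camera, T_camera_object)

# A toy semantic map: category label mapped to the object's pose in the map frame.
semantic_map = {"cup": T_map_object}
print(semantic_map["cup"][:3, 3])
```

In a real ROS system the robot-to-camera transform would come from the robot's calibrated frame tree rather than an identity matrix, and it would also account for the camera's optical-axis convention.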

Finally, the dissertation is summarized, and directions for future work are outlined.

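Keywords

robot visual perception; large-scale spatial structure; locally refined object semantics; hierarchical 3D environment perception system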

Subject Area

Electronics, Communication and Automatic Control Technology

Discipline Category

Engineering

Language

Chinese

Representative Paper

Yes

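Seven Major Directions: Sub-direction Classification

3D Vision

State Key Laboratory Planned Direction Classification

Multi-dimensional Environment Perception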

Associated Dataset Requiring Deposit

No

Document Type

Dissertation

Item Identifier

http://ir.ia.ac.cn/handle/173211/57079

Collection

Graduates_Doctoral Dissertations

Recommended Citation
GB/T 7714

刘洁锐. 基于深度学习的机器人视觉三维环境感知研究[D],2024.

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
毕业论文-刘洁锐-20240520.pd (28661KB)	Dissertation		Restricted access	CC BY-NC-SA