Research on Robotic 3D Visual Environment Perception Based on Deep Learning (基于深度学习的机器人视觉三维环境感知研究)
刘洁锐 (Liu Jierui)
2024-05
Pages | 124
Degree type | Doctoral
English abstract | The ability to perceive the 3D environment is an important prerequisite for robots to provide high-quality services. Robotic visual perception draws wide attention because of its broad applicability, rich information, and low cost. Robots work in real, unstructured environments, which often involve tasks at different levels, such as large-scale navigation and local object manipulation; this requires robots to possess hierarchical 3D environment perception abilities. Focusing on deep-learning-based robotic 3D visual environment perception, this dissertation conducts research at two levels, large-scale spatial structure and locally refined object semantics, providing perception support for intelligent, autonomous robot services in complex environments. The work is theoretically significant and has broad application prospects. The main contents are as follows:

Firstly, the research background and significance of robotic 3D visual environment perception are introduced, and the research development is summarized from the aspects of large-scale spatial structure perception and locally refined object semantic perception. The content and structure of the dissertation are also outlined.

Secondly, a temporal monocular depth estimation network based on geometric priors and pixel-level sensitivity is proposed to perceive large-scale spatial stereo information. To address the challenge of accurate camera pose estimation in monocular depth estimation methods, a geometric constraint is introduced using temporal optical-flow information, which is then combined with the prior depth map to achieve accurate camera pose estimation. To improve the generalization of the depth encoder, a prior feature consistency regularization is constructed from the geometric priors of the temporal frames to assist the training of the depth encoder.
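The prior feature consistency regularization described above can be illustrated with a minimal sketch. The function name, the use of a plain mean-squared penalty, and the assumption that the prior frame's features have already been warped into alignment with the current frame are all illustrative choices, not the dissertation's actual formulation:

```python
import numpy as np

def prior_feature_consistency_loss(feat_cur, feat_prior_warped):
    """Hypothetical sketch: penalize disagreement between the current
    frame's encoder features and the geometrically warped features of a
    prior temporal frame, regularizing the depth encoder's training."""
    assert feat_cur.shape == feat_prior_warped.shape
    return float(np.mean((feat_cur - feat_prior_warped) ** 2))

# Toy check: perfectly consistent features incur zero regularization.
f = np.ones((8, 16, 16))  # (channels, height, width)
print(prior_feature_consistency_loss(f, f))  # 0.0
```

In practice such a term would be added to the photometric and smoothness losses with a small weight, so it guides the encoder without dominating training.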
By modeling the sensitivity of the depth prediction to different pixels, a sensitivity-adaptive depth decoder is constructed that selectively adjusts the short connection path between the cost volume and the final depth prediction. This alleviates the impact of per-pixel sensitivity differences in the input frame on the depth prediction and improves depth estimation accuracy. On this basis, visual odometry is incorporated to localize the robot. Experiments on datasets show the effectiveness of the proposed method.

Thirdly, a bird's-eye-view (BEV) semantic segmentation network based on a two-stream compact depth transformation and feature rectification is designed to obtain the spatial semantic distribution, which is combined with the large-scale spatial stereo information to form the large-scale spatial structure information. On the one hand, a two-stream compact depth transformation network uses different temporal frames to balance depth prediction and BEV feature assembly; by differentially reducing the feature resolution of the temporal frames, the cost computation in depth prediction is optimized. On the other hand, a segmentation network based on feature rectification is designed: deformable convolutions with adaptive receptive fields encourage the network to autonomously sample features from the corresponding spatial regions, rectifying the mapping of input features, and the segmentation network is further improved through semantic contrastive learning. In addition, virtual camera intrinsic parameters are adopted to make 3D perception learning compatible with 2D image data augmentation. The proposed method is verified on datasets.
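The virtual-intrinsics idea rests on a standard piece of pinhole-camera bookkeeping: when an image is resized or cropped for 2D augmentation, the intrinsic matrix must be updated in step, otherwise the 3D geometry the network learns from becomes inconsistent. The sketch below shows only that bookkeeping; the function name and augmentation parameters are illustrative, not the dissertation's API:

```python
import numpy as np

def adjust_intrinsics(K, scale_x, scale_y, crop_x, crop_y):
    """Update a 3x3 pinhole intrinsic matrix after resizing the image
    by (scale_x, scale_y) and then cropping (crop_x, crop_y) pixels
    from the top-left corner, keeping 2D augmentation consistent with
    3D geometry."""
    K = K.copy().astype(float)
    K[0, 0] *= scale_x                     # focal length fx
    K[1, 1] *= scale_y                     # focal length fy
    K[0, 2] = K[0, 2] * scale_x - crop_x   # principal point cx
    K[1, 2] = K[1, 2] * scale_y - crop_y   # principal point cy
    return K

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
print(adjust_intrinsics(K, 0.5, 0.5, 32, 16))
```

Mapping all augmented views onto one shared "virtual" intrinsic matrix in this way lets a 3D perception network treat geometrically different crops as if they came from a single camera.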
Fourthly, from the perspective of refined semantic perception in local scenes, a category-level object pose estimation network based on a structural auto-encoder and reasoning attention is proposed to obtain semantic information including object category and pose. A reconstruction loss across different instances of the same category is designed to mine the structural features shared within a category; the structural auto-encoder learned in this way effectively improves the category-level generalization of color-image features. Further, a reasoning attention decoder is constructed that combines image and point features to implicitly model category attributes and then reasons over features to obtain more valuable representations. Since the large number of parameter interactions in the decoder increases the difficulty of learning, a gradient decoupling strategy separates the reasoning attention decoder from the point feature extractor, accelerating network convergence. The proposed method is verified on datasets and in real scenes.

Fifthly, considering that refined local scene perception should also meet the need to extend object-category semantics, a few-shot object detection network based on generalized feature learning is proposed. To tackle the problem that features extracted by existing few-shot detection backbones are biased toward the base classes and adapt poorly to novel classes, a two-branch backbone is constructed: a new parallel branch is added to the original backbone to retain more class-agnostic features. Moreover, feature aggregation with probabilistic path selection and source-based channel dropout is designed to effectively integrate the two branches' features. The resulting generalized features retain compatibility with novel classes while enhancing feature discriminability.
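One plausible reading of source-based channel dropout is a per-channel choice of which branch a feature comes from, which discourages over-reliance on the base-class-biased branch. The sketch below implements that reading; the function name, the per-channel selection rule, and the probability parameter are assumptions, not the dissertation's exact design:

```python
import numpy as np

def channelwise_branch_aggregate(feat_base, feat_agnostic, drop_p=0.3, rng=None):
    """Illustrative source-based channel dropout: for each channel,
    randomly take the feature map from either the original (base-class)
    branch or the parallel class-agnostic branch."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_channels = feat_base.shape[0]
    take_agnostic = rng.random(n_channels) < drop_p  # per-channel source mask
    mask = take_agnostic[:, None, None]              # broadcast over H, W
    return np.where(mask, feat_agnostic, feat_base)

# Toy check: with constant inputs, every output channel comes wholly
# from one branch (all 0s) or the other (all 1s).
a = np.zeros((4, 2, 2))
b = np.ones((4, 2, 2))
out = channelwise_branch_aggregate(a, b, drop_p=0.5)
```

At inference time such stochastic selection would typically be replaced by a deterministic blend, mirroring how standard dropout is disabled at test time.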
In addition, a loose contrastive loss is constructed to assist network training: by mining hard samples, it provides additional supervision that promotes the learning of the detection network. The effectiveness of the proposed method is verified on datasets.

Sixthly, the above perception methods at the spatial-structure and refined-object-semantic levels are integrated under the ROS framework to form a hierarchical robotic 3D environment perception system. Its software architecture is organized into a large-scale spatial structure perception layer, a locally refined object semantic perception layer, a perception fusion layer, and a perception control layer. An environment metric map is generated from the large-scale spatial structure information, and local object semantic information, including category, position, and pose, is embedded into it to form a global object semantic map. The proposed perception system provides the necessary perceptual basis for autonomous robot services, and its effectiveness is verified through real-robot experiments.

Finally, conclusions are drawn and future work is discussed.
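The map-embedding step in the hierarchical system, placing a locally perceived object pose into the global metric map, amounts to composing homogeneous transforms. This minimal sketch assumes 4x4 matrices and illustrative frame names; it is not the system's actual ROS interface:

```python
import numpy as np

def to_global(T_map_robot, T_robot_object):
    """Embed a locally perceived object pose into the global map by
    composing homogeneous transforms: map<-robot, then robot<-object."""
    return T_map_robot @ T_robot_object

# Robot at (2, 0, 0) in the map frame; object 1 m ahead of the robot.
T_map_robot = np.eye(4)
T_map_robot[0, 3] = 2.0
T_robot_obj = np.eye(4)
T_robot_obj[0, 3] = 1.0

T_map_obj = to_global(T_map_robot, T_robot_obj)
print(T_map_obj[:3, 3])  # object position in the map: [3. 0. 0.]
```

In a ROS-based system this composition is usually delegated to the tf2 transform tree rather than done by hand, but the underlying arithmetic is the same.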
Keywords | Robotic visual perception; large-scale spatial structure; locally refined object semantics; hierarchical 3D environment perception system
Subject area | Electronics, communication, and automatic control technology
Discipline category | Engineering
Language | Chinese
Representative dissertation | Yes
Sub-direction (of the seven major directions) | 3D vision
State Key Laboratory planning direction | Multi-dimensional environment perception
Associated dataset to deposit | No
Document type | Dissertation
Identifier | http://ir.ia.ac.cn/handle/173211/57079
Collection | Graduates: doctoral dissertations
Recommended citation (GB/T 7714) | 刘洁锐. 基于深度学习的机器人视觉三维环境感知研究[D], 2024.
Files in this item
File name / size | Document type | Version | Access | License
毕业论文-刘洁锐-20240520.pd (28661KB) | Dissertation | | Restricted access | CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.