CASIA OpenIR > Graduates > Master's Theses





Other Abstract

With the rapid advancement of computer vision, machine learning, and cloud computing, the new generation of information technology is increasingly being integrated with robotics. Visual representation learning for robotic tasks aims to extract task-relevant visual features from vast amounts of visual data; an efficient feature representation reduces the difficulty of subsequent policy learning. Visual representation learning for robotic tasks therefore holds significant theoretical and practical importance. This thesis explores self-supervised methods for visual representation learning in robotic tasks by leveraging the intrinsic structure and statistical properties of the data, focusing specifically on multi-view contrastive learning of object representations and frame-level video representation learning. The main contents of the thesis are summarized as follows:

(1) To address the performance degradation of object segmentation models in complex, cluttered scenes and under varying observation viewpoints, a multi-view consistent self-supervised object segmentation method is proposed. It establishes pixel-level and object-level correspondences through inter-frame camera relative pose estimation and a cross-frame object association strategy. The initial object segmentation network is then optimized in a self-supervised manner from multi-view observations, ensuring consistent segmentation results and similar feature representations for the same object across different viewpoints, while pushing the representations of different objects apart. Experimental results demonstrate that this approach effectively improves the performance of the initial segmentation network in complex robotic manipulation scenes.
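The cross-view objective described above — pulling together features of the same object seen from different viewpoints while pushing apart features of different objects — can be sketched as a standard InfoNCE-style contrastive loss. The sketch below is illustrative only: the function names, temperature value, and toy feature vectors are assumptions for demonstration, not the thesis's actual implementation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: pull the anchor toward its
    positive (the same object observed from another viewpoint) and
    push it away from negatives (features of other objects)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # positive similarity sits at index 0, negatives follow
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# Toy example: the same object's features are similar across views.
rng = np.random.default_rng(0)
obj_view_a = rng.normal(size=8)
obj_view_b = obj_view_a + 0.05 * rng.normal(size=8)  # corresponding object
others = [rng.normal(size=8) for _ in range(4)]      # different objects
loss = info_nce(obj_view_a, obj_view_b, others)      # small loss expected
```

Minimizing this loss over many cross-view correspondences is what drives the segmentation network's features toward multi-view consistency.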

(2) To address the difficulty that fixed-interval sampling strategies in video representation learning have in capturing the details of different manipulation actions, a self-supervised frame-level video representation learning method based on period prediction is proposed. By combining temporal contrastive learning with a period prediction task, the method learns frame-level video representations with both semantic continuity and fine-grained discriminability from demonstration videos of robot manipulation tasks. Reinforcement learning is then used to learn specific robotic manipulation policies. Experiments show that the proposed method significantly improves the efficiency and success rate of robot task learning, demonstrating strong generalizability and practical applicability.
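A minimal sketch of the two ingredients named above — temporal contrastive pair construction and period prediction — assuming a simple positive-window rule for pairing frames and an autocorrelation-based period estimate as a stand-in for the thesis's learned prediction head. The window size and function names are illustrative assumptions.

```python
import numpy as np

def time_contrastive_pairs(embeddings, window=2):
    """Form training pairs for temporal contrastive learning:
    frames within `window` steps of each other are positives (label 1),
    all other frame pairs are negatives (label 0). The window size is
    an assumed hyperparameter, not the thesis's value."""
    T = len(embeddings)
    pairs = []
    for i in range(T):
        for j in range(T):
            if i == j:
                continue
            label = 1 if abs(i - j) <= window else 0
            pairs.append((i, j, label))
    return pairs

def predict_period(signal):
    """Naive period estimate via the first autocorrelation peak after
    lag 0 — a hand-crafted stand-in for a learned period-prediction head."""
    s = signal - signal.mean()
    ac = np.correlate(s, s, mode="full")[len(s) - 1:]  # lags 0..T-1
    for lag in range(1, len(ac) - 1):
        if ac[lag] > ac[lag - 1] and ac[lag] > ac[lag + 1]:
            return lag
    return None

# Synthetic periodic motion signal with a period of 8 frames.
t = np.arange(40)
sig = np.sin(2 * np.pi * t / 8)
period = predict_period(sig)
```

In the thesis's setting, the contrastive pairs shape a frame embedding that is smooth in time, while the period task forces the embedding to resolve where within a repeated action each frame lies.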

(3) To validate self-supervised visual representation learning on a real robot platform, the proposed multi-view consistent object segmentation method and the period-prediction-based frame-level video representation learning method are applied to real-world robotic manipulation scenarios and evaluated on four typical robotic tasks. Experimental results demonstrate that self-supervised learning can effectively exploit the structural information inherent in video data: with appropriately designed self-supervised pretext tasks, it can learn visual feature representations for manipulation tasks without manual labels, showing promise for real-world robotic applications.

Keywords: self-supervised learning (自监督学习); robot manipulation task learning (机器人操作任务学习); visual representation learning (视觉表征学习); embodied visual perception (具身视觉感知)
Document Type: Thesis (学位论文)
Recommended Citation (GB/T 7714):
马文轩. 自监督机器人操作任务视觉表征学习方法研究 [D], 2024.
Files in This Item:
硕士学位论文马文轩最终版.pdf (16914 KB) — DocType: Thesis; Access: Restricted (限制开放); License: CC BY-NC-SA
