CASIA OpenIR > Graduates > Master's Theses
Research on Self-Supervised Visual Representation Learning Methods for Robotic Manipulation Tasks
马文轩
2024-05
Pages: 84
Degree type: Master's
Chinese Abstract

With the rapid development of computer vision, machine learning, and cloud computing, a new generation of information technology is increasingly merging with robotics. Visual representation learning for robot vision aims to extract manipulation-task-relevant visual features from large amounts of visual data. A good task feature representation reduces the burden on subsequent robot policy learning. Visual representation learning for robotic manipulation tasks therefore has important theoretical significance and practical application value. This thesis studies self-supervised visual representation learning methods for robotic manipulation tasks, exploiting the structure and statistical properties inherent in the data itself, and carries out research and experimental validation along two lines: multi-view object representation contrastive learning and frame-level video representation learning. The main work of the thesis is as follows:

1. To address unsatisfactory object segmentation caused by occlusion and differences in observation viewpoint in complex stacked scenes, a multi-view consistent self-supervised object segmentation method is proposed. Based on inter-frame relative pose estimation and a cross-frame object association strategy, the method establishes pixel-level and object-level correspondences. The initial object segmentation network is optimized in a self-supervised manner from multi-view observations, so that the same object yields consistent segmentation results and similar feature representations across viewpoints, while the feature representations of different objects are pushed apart. Experimental results show that the multi-view consistent self-supervised object segmentation method effectively improves the performance of the initial segmentation network in complex robotic manipulation scenes.

2. To address the difficulty that fixed-time-interval sample construction in video representation learning has in capturing the details of different manipulation actions, a self-supervised frame-level video representation learning method based on period prediction is proposed. Combining temporal contrastive learning with a period-prediction task, the method learns frame-level video representations with both semantic continuity and fine-grained discriminability from demonstration videos of robotic manipulation tasks. Concrete robot manipulation policies are then learned via reinforcement learning. Experimental results show that the proposed self-supervised frame-level video representation learning method effectively improves the efficiency and success rate of robotic manipulation task learning, with good generalization and practicality.

3. Experimental validation of self-supervised visual representation learning on a robot platform: the proposed multi-view consistent object segmentation method and the period-prediction-based frame-level video representation learning method are applied to real robotic manipulation scenes and validated on four typical robotic manipulation tasks. The results show that, with appropriately designed self-supervised pretext tasks, self-supervised learning can exploit the structural information of the video data itself to learn visual feature representations for manipulation tasks without manual labels, and holds promise for real-world robotic applications.

English Abstract

With the rapid advancement of computer vision, machine learning, and cloud computing, the new generation of information technology is increasingly integrating with robotics. Visual representation learning for robotic tasks aims to extract task-relevant visual features from vast amounts of visual data. An effective feature representation can reduce the burden on subsequent robot policy learning. Therefore, visual representation learning for robotic tasks holds significant theoretical and practical importance. This thesis explores self-supervised learning methods for visual representation learning in robotic tasks by leveraging the intrinsic structure and statistical properties of the data, specifically focusing on multi-view object representation contrastive learning and frame-level video representation learning. The main contents of the thesis are summarized as follows:

(1) To address the performance degradation of object segmentation models in complex, cluttered scenes and under varying observation viewpoints, a multi-view consistent self-supervised object segmentation method is proposed. It establishes pixel-level and object-level correspondences through inter-frame camera relative pose estimation and a cross-frame object association strategy. The initial object segmentation network is optimized in a self-supervised manner through multi-view observations, ensuring consistent segmentation results and similar feature representations for the same object across different viewpoints, while keeping the representations of different objects apart. Experimental results demonstrate that this approach effectively enhances the performance of the initial segmentation network in complex robotic manipulation scenarios.
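The "same object pulled together, different objects pushed apart" objective described above can be written as an InfoNCE-style contrastive loss over associated object features. The function below is a minimal sketch, not the thesis's actual implementation: it assumes cross-frame association has already paired object i in view A with object i in view B, and all names are hypothetical.

```python
import numpy as np

def multiview_object_contrastive_loss(feats_a, feats_b, tau=0.1):
    """InfoNCE-style loss over object features from two viewpoints.

    feats_a, feats_b: (N, D) arrays of object feature vectors; row i in
    both views is assumed to be the same physical object, as produced by
    a cross-frame object association step.
    """
    # L2-normalize so dot products become cosine similarities.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / tau  # (N, N) cross-view similarity matrix
    # Diagonal entries are the positive (same-object) pairs; all other
    # entries in a row act as negatives (different objects).
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Minimizing this loss drives matched object features toward high cross-view similarity while suppressing similarity to every other object in the scene; when the pairing is shuffled, the loss rises sharply.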

(2) To address the challenge that fixed-interval sampling strategies in video representation learning struggle to capture the details of different manipulation actions, a self-supervised frame-level video representation learning method based on period prediction is proposed. By integrating temporal contrastive learning with a period-prediction task, this method learns frame-level video representations with both semantic continuity and fine-grained discriminability from demonstration videos of robot manipulation tasks. Subsequently, reinforcement learning is used to learn specific robotic manipulation policies. Experiments show that the proposed method significantly improves the efficiency and success rate of robot task learning, demonstrating strong generalizability and practical applicability.
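As a toy illustration of the period-prediction idea (not the learned prediction head used in the thesis), the period of a repetitive frame-embedding sequence can be read off as the lag that maximizes the average similarity between frames that lag apart; the function name and interface below are hypothetical.

```python
import numpy as np

def estimate_period(frame_embs, max_period=None):
    """Estimate the repetition period of a frame-embedding sequence.

    frame_embs: (T, D) array, one embedding per video frame. Returns the
    lag p whose mean frame-to-frame similarity is highest, i.e. a simple
    autocorrelation-based stand-in for a period-prediction task.
    """
    x = frame_embs - frame_embs.mean(axis=0, keepdims=True)  # center
    n_frames = len(x)
    max_period = max_period or n_frames // 2
    scores = []
    for p in range(1, max_period + 1):
        # Mean similarity between all frame pairs exactly p steps apart.
        scores.append(float(np.sum(x[:-p] * x[p:]) / (n_frames - p)))
    return int(np.argmax(scores)) + 1
```

A sequence of embeddings tracing a cycle of length 8 (e.g. sine/cosine features) yields its maximum score at lag 8, so the estimator recovers the action's period; in the thesis's setting, predicting this period is the auxiliary signal that encourages fine-grained, temporally structured frame representations.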

(3) Experimental validation of self-supervised visual representation learning on a robot platform: the proposed multi-view consistent object segmentation method and the period-prediction-based frame-level video representation learning method are applied to real-world robotic manipulation scenarios and validated on four typical robotic tasks. Experimental results demonstrate that self-supervised learning can effectively leverage the structural information inherent in video data. With appropriately designed self-supervised pretext tasks, it can learn visual feature representations for manipulation tasks without the need for manual labels, showing promise for real-world robotic applications.

Keywords: self-supervised learning; robotic manipulation task learning; visual representation learning; embodied visual perception
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57201
Collection: Graduates_Master's Theses
State Key Laboratory of Multimodal Artificial Intelligence Systems_Intelligent Robot Systems Research
Recommended citation (GB/T 7714):
马文轩. 自监督机器人操作任务视觉表征学习方法研究[D], 2024.
Files in this item
File name/size | Document type | Access | License
硕士学位论文马文轩最终版.pdf (16914 KB) | Thesis | Restricted | CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.