Representation and Learning of Robot Manipulation Policies Based on Reinforcement Learning (基于强化学习的机器人操作策略表征与学习)
杨依明
2024-05-19
Pages: 126
Subtype: Doctoral
Abstract

In recent years, reinforcement learning has attracted wide attention in robotics research and applications. Unlike other robot policy learning methods such as imitation learning, reinforcement learning allows a robot agent to learn and optimize its policy through autonomous interaction with the environment, continuously adjusting and improving its behavioral decisions from past experience without explicit supervisory signals. This learning paradigm lets robots keep improving in unstructured environments and complete tasks better and faster under complex, changing conditions. The representation and learning of robot manipulation policies, a key link in the transition from perceptual intelligence to cognitive intelligence, is in turn a natural fit for the autonomous, closed-loop learning character of reinforcement learning.

However, existing research on reinforcement-learning-based robot manipulation policies is often not combined with what makes robot tasks unique. Robot manipulation derives from human manipulation: the robot body and the manipulation tasks themselves are often distinctly biomimetic, and most of the corresponding tasks can also be performed by humans. If the design of a robot manipulation policy therefore draws on how the human brain processes spatio-temporally coupled observations during manipulation, how multiple brain regions cooperate, and how the nervous system controls the human hand when completing a task, the policy can be equipped with more reasonable inductive biases and achieve much more with much less effort. In view of this, this thesis takes reinforcement learning as its main method, draws extensively on the cognitive and behavioral mechanisms humans exhibit when performing manipulation tasks, and gives full weight to the inherent properties of robot tasks, exploring and refining robot policy representations and learning algorithms across diverse task environments so as to improve the overall performance of robot manipulation policies.
The main research content of this thesis is as follows:

  • Efficient and robust controllers are the prerequisite for a robot to carry out any subsequent manipulation task. Existing Transformer-based policy representations for reinforcement learning cannot simultaneously extract the temporal and the spatial information in observation sequences, and therefore ignore the coupling between the temporal and spatial dimensions of the observations, which biases policy learning and wastes data. This thesis proposes a spatio-temporal Transformer network that alternates temporal-attention and spatial-attention layers within the Transformer architecture, strengthening the control policy's ability to model and represent correlations in spatio-temporal observations. In addition, a correlation encoding is added to the multi-head attention, providing an effective inductive bias for processing the robot's spatio-temporal observations. Experiments in several simulated robot environments show that the proposed method significantly outperforms existing Transformer-based reinforcement learning methods in both policy performance and data efficiency, and that the spatio-temporal attention and the correlation encoding act with clear synergy. A minimal sketch of the alternating attention scheme is given after this list.
  • The spatio-temporally coupled information a robot observes is equally important in manipulation tasks. Drawing on how multiple brain regions of the biological nervous system cooperate to process spatio-temporal information during precise manipulation, this thesis models the manipulation policy representation and proposes a spiking-neural-network (SNN) method that mimics multi-brain-region collaboration for precise manipulation. First, following the neuronal connectivity of the human hippocampus, cerebellum, and prefrontal cortex, a distinct network structure is designed for each brain-inspired module, emulating the memory function of the hippocampus, the motor-control function of the cerebellum, and the cognitive-planning function of the prefrontal cortex. A collaboration scheme for the three networks is then designed after the multi-region coordination humans exhibit in precise manipulation, realizing an efficient and precise peg-in-hole assembly policy on a real robot. This brain-inspired representation strengthens the policy network's spatio-temporal information processing and decision-making and provides an effective inductive bias for its supervised and reinforcement learning; experiments in simulation and on real robots demonstrate the method's practicality and efficiency for precise robotic manipulation. A sketch of the three-module coordination follows this list.
  • Besides an advanced brain, the other reason humans complete manipulation tasks so efficiently is their dexterous hands, yet the high degrees of freedom and strong joint coupling of dexterous robotic hands make policy learning for manipulation tasks difficult. This thesis therefore proposes bionic structural action graphs to represent the action space of a biomimetic hand: by borrowing the neural control mechanisms of the human hand and fully accounting for the kinematic constraints and manipulation characteristics of the hand's joints, the policy representation becomes more natural and effective. The thesis further uses expert policies extracted from full-state observations as privileged information to guide an agent that only receives partial observations, proposing a privileged-expert-guided reinforcement learning algorithm that overcomes the training difficulties caused by unstructured visual observations. The effectiveness of this bionic policy representation and learning algorithm is also derived and proved theoretically. Simulation experiments on a series of biomimetic-hand manipulation tasks, including grasping, tool use, and opening and closing doors, confirm the algorithm's superiority in learning efficiency and manipulation performance: it significantly accelerates learning for dual biomimetic dexterous-hand manipulation while achieving higher task success rates. A sketch of the privileged-expert guidance step closes the set of examples below.
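
The abstract does not give the spatio-temporal Transformer's exact layout, so the following is only a minimal PyTorch sketch of the alternating temporal/spatial attention idea; the correlation encoding is approximated here as a learned additive bias on the attention logits, and all class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One block that alternates attention over time and over space."""
    def __init__(self, dim, heads, t_len, s_len):
        super().__init__()
        self.t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # learned pairwise biases standing in for the correlation encoding
        self.t_bias = nn.Parameter(torch.zeros(t_len, t_len))
        self.s_bias = nn.Parameter(torch.zeros(s_len, s_len))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, S, D)
        b, t, s, d = x.shape
        # temporal attention: each spatial token attends across timesteps
        h = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        h = h + self.t_attn(h, h, h, attn_mask=self.t_bias)[0]
        h = self.norm1(h).reshape(b, s, t, d).permute(0, 2, 1, 3)
        # spatial attention: each timestep attends across spatial tokens
        g = h.reshape(b * t, s, d)
        g = g + self.s_attn(g, g, g, attn_mask=self.s_bias)[0]
        return self.norm2(g).reshape(b, t, s, d)

x = torch.randn(2, 8, 6, 32)                       # 8 steps, 6 joint tokens
print(SpatioTemporalBlock(32, 4, 8, 6)(x).shape)   # torch.Size([2, 8, 6, 32])
```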
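As a rough illustration of the multi-brain-region scheme, the sketch below wires three simple leaky integrate-and-fire (LIF) modules into the roles the abstract names for the hippocampus, cerebellum, and prefrontal cortex. The per-region structures, spiking dynamics, and coordination in the thesis are far more elaborate; the wiring and module names here are assumptions.

```python
import torch
import torch.nn as nn

class LIF(nn.Module):
    """Leaky integrate-and-fire layer: leak the membrane potential,
    integrate input current, spike on threshold, reset where fired.
    (Training would need a surrogate gradient for the threshold.)"""
    def __init__(self, n_in, n_out, tau=0.9, thresh=0.5):
        super().__init__()
        self.fc, self.tau, self.thresh = nn.Linear(n_in, n_out), tau, thresh

    def forward(self, x, v):
        v = self.tau * v + self.fc(x)              # leaky integration
        spikes = (v >= self.thresh).float()        # fire
        return spikes, v * (1.0 - spikes)          # reset fired neurons

class MultiRegionPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hid=64):
        super().__init__()
        self.pfc = LIF(obs_dim, hid)               # cognitive planning
        self.hippocampus = LIF(obs_dim + hid, hid) # recurrent memory
        self.cerebellum = LIF(hid + hid, hid)      # fine motor control
        self.readout = nn.Linear(hid, act_dim)

    def forward(self, obs_seq):                    # obs_seq: (T, obs_dim)
        hid = self.readout.in_features
        v_p, v_h, v_c = (torch.zeros(hid) for _ in range(3))
        mem = motor = torch.zeros(hid)
        for obs in obs_seq:                        # step the SNN over time
            plan, v_p = self.pfc(obs, v_p)
            mem, v_h = self.hippocampus(torch.cat([obs, mem]), v_h)
            motor, v_c = self.cerebellum(torch.cat([plan, mem]), v_c)
        return self.readout(motor)                 # action from motor spikes

policy = MultiRegionPolicy(obs_dim=12, act_dim=6)
print(policy(torch.randn(20, 12)).shape)           # torch.Size([6])
```

In practice, training such spiking modules requires surrogate gradients for the non-differentiable threshold, alongside the supervised and reinforcement objectives the abstract mentions.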
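Finally, the privileged-expert guidance can be pictured as a teacher-student update: a teacher with full-state access supplies action targets for a student that only sees partial observations. The MSE distillation term and the weight `beta` below are illustrative stand-ins, not the thesis's exact objective.

```python
import torch
import torch.nn as nn

state_dim, obs_dim, act_dim = 32, 16, 8

# teacher: assumed already trained with full (privileged) state access
teacher = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                        nn.Linear(64, act_dim))
# student: restricted to partial observations at deployment time
student = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                        nn.Linear(64, act_dim))
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

def student_update(full_state, partial_obs, rl_loss, beta=0.5):
    """One gradient step: the agent's RL objective plus distillation
    toward the privileged teacher's action on the same state."""
    with torch.no_grad():
        expert_action = teacher(full_state)        # privileged target
    guide = nn.functional.mse_loss(student(partial_obs), expert_action)
    loss = rl_loss + beta * guide
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# dummy batch; the rl_loss term would come from the RL algorithm in use
s, o = torch.randn(64, state_dim), torch.randn(64, obs_dim)
print(student_update(s, o, rl_loss=torch.tensor(0.0)))
```
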
Keyword: Reinforcement Learning; Robot Manipulation; Robot Control; Policy Representation
Subject Area: Control Theory
MOST Discipline Catalogue: Engineering::Control Science and Engineering
Language: Chinese
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/56638
Collection: Doctoral Dissertations (毕业生_博士学位论文)
Recommended Citation
GB/T 7714
杨依明. 基于强化学习的机器人操作策略表征与学习[D], 2024.
Files in This Item:
File Name/Size | DocType | Version | Access | License
博士毕业论文-杨依明-20240528. (19731 KB) | Dissertation | | Restricted Access | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.