Research on Continual Policy Learning Algorithms for Robots
Author: 熊方舟
Date: 2020-05
Pages: 140
Degree Type: Doctoral
Chinese Abstract

In recent years, with rapid socioeconomic development, artificial intelligence has been applied widely across industries and has drawn increasing attention. Artificial intelligence aims to realize intelligent systems with human-level intelligence by simulating human thinking and behavior. Artificial neural networks, owing to their strong ability to learn rules, are widely used in artificial intelligence research. However, when the external environment changes, existing artificial neural networks can hardly adapt and learn autonomously as quickly as humans do. In fact, when an artificial neural network learns different tasks in sequence, it forgets the tasks it has already mastered, a phenomenon known as “catastrophic forgetting”. The same phenomenon occurs when a robot learns policies for different tasks. This thesis studies the policies of robot tasks with continual learning methods, so that a robot can learn new tasks without forgetting earlier ones during sequential multi-task learning, giving it a certain degree of continual learning ability. Based on different continual learning scenarios, this thesis studies continual policy learning for robots from three aspects: multi-episode task learning, multi-condition task learning, and multi-task learning. The main work and contributions are summarized as follows:

1. A Bayesian Q-learning algorithm incorporating state-pool updates is proposed.
Multi-task continual learning involves learning each individual task, so improving the continual learning return on a single task is particularly important. This thesis uses reinforcement learning to study robot policy learning in discrete state-action spaces and introduces a Bayesian Q-learning algorithm to learn the distribution of the robot's state-action value function. Over multiple episodes of task learning, the distribution is updated in a Bayesian posterior manner from the robot's interaction data with the environment. To improve the continual learning return, a Bayesian Q-learning algorithm incorporating state-pool updates is proposed: the distribution of the state-action value function is updated according to the robot's exploration of the environment, so that the robot completes the task more quickly and obtains a higher cumulative return during multi-episode continual learning.

2. A multi-condition continual learning algorithm for robots incorporating elastic weight consolidation is proposed.
While learning a task, a robot encounters different task conditions or settings; without any constraint, catastrophic forgetting occurs in the continual learning scenario. Neuroscience studies suggest that the mammalian neocortex protects knowledge through synaptic consolidation, preserving memories by protecting the neural connection weights that were important for previous task conditions. Inspired by this, this thesis proposes a multi-condition continual learning algorithm for robots incorporating elastic weight consolidation: a synaptic consolidation mechanism is added to the optimization of the artificial neural network to constrain weight learning, thereby protecting the robot's performance on previous task conditions. Theoretical analysis and experiments demonstrate the effectiveness of the algorithm.

3. A continual learning algorithm based on state primitives and policy learning is proposed.
When a robot learns multiple tasks in sequence, two key issues are involved: learning new tasks and protecting previous ones. Studies have shown that, when each task is learned, networks of neurons in the brain generate new activity patterns, and these patterns occupy low-dimensional manifold subspaces. Drawing on this learning mechanism from neuroscience, this thesis performs representation learning on the state information of robot tasks and generates “state primitives” to describe the low-dimensional features corresponding to those observed in neuroscience. To protect task performance, a separate policy is learned for each task in turn on top of the state primitives, leading to a continual learning algorithm based on state primitives and policy learning that avoids catastrophic forgetting in multi-task policy learning for robots.

4. A multi-task continual learning framework based on state primitive learning is proposed.
When multiple tasks are learned continually, the way the robot's state primitives are generated can either stay fixed or change with the learning tasks. Based on these two forms of constraint on state primitives, this thesis summarizes and proposes a multi-task continual learning framework based on state primitive learning under “hard constraints” and “soft constraints”. In addition, to address the differences between tasks, an encoding constraint built from previous tasks is added to the learning of the robot's state primitives; under the soft-constraint framework, this leads to a continual learning algorithm based on encoded state primitives, which keeps the robot from forgetting previously learned tasks in multi-task continual learning scenarios with larger task differences and achieves a higher average task success rate.

English Abstract

In recent years, with rapid socioeconomic development, artificial intelligence has been widely applied in various industries and has received increasing attention. Artificial intelligence aims to realize intelligent systems with human-level intelligence by simulating human thinking and behavior. Artificial neural networks are widely used in artificial intelligence research due to their strong ability to learn rules. However, when the external environment changes, it is difficult for an artificial neural network to adapt and learn autonomously as quickly as humans. When learning different tasks sequentially, such a network forgets the previously learned tasks. This phenomenon is called “catastrophic forgetting”, and it also occurs when a robot learns policies for different tasks sequentially. In this thesis, the policies of robot tasks are studied with continual learning methods, so that the robot can learn new tasks without forgetting previous tasks when learning multiple tasks sequentially. Based on different continual learning scenarios, we study continual policy learning for robots from three aspects: multi-episode task learning, multi-condition task learning, and multi-task learning. The main work and innovations are summarized as follows:

1. A Bayesian Q-learning Algorithm Incorporating State Pool Updates
Multi-task continual learning involves learning each individual task, so increasing the return obtained on a single task is particularly important. In this thesis, we use reinforcement learning to study robot policy learning in discrete state and action spaces, and introduce a Bayesian Q-learning algorithm to learn the distribution of the state-action value function. During multi-episode task learning, the posterior distribution is updated based on the robot's interaction with the environment. To increase the task return, a Bayesian Q-learning algorithm incorporating state pool updates is proposed: the distribution of the state-action value function is updated according to the robot's exploration of the environment, so that the robot can complete the task more quickly. As a result, the robot obtains higher returns in multi-episode task learning.
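To make the mechanism concrete, below is a minimal Python sketch of Bayesian Q-learning with a pool of explored states. The Gaussian posterior over each Q(s, a), the Thompson-style action selection, and the extra update sweep over pooled states are illustrative assumptions for this summary; the thesis's exact state-pool update rule may differ.

    import numpy as np

    class BayesianQAgent:
        """Sketch: Gaussian posterior over Q(s, a) plus a pool of explored states.
        The pool-based sweep is a hypothetical reading of the state-pool idea."""
        def __init__(self, n_states, n_actions, gamma=0.95, obs_var=1.0):
            self.mu = np.zeros((n_states, n_actions))         # posterior means of Q
            self.var = np.ones((n_states, n_actions)) * 10.0  # posterior variances
            self.gamma, self.obs_var = gamma, obs_var
            self.pool = set()                                 # states visited so far

        def act(self, s):
            # Thompson sampling: draw one Q sample per action, act greedily on it
            samples = np.random.normal(self.mu[s], np.sqrt(self.var[s]))
            return int(np.argmax(samples))

        def update(self, s, a, r, s_next, done):
            # Bellman target from the posterior mean of the next state's best action
            target = r if done else r + self.gamma * self.mu[s_next].max()
            # Conjugate Gaussian posterior update for (s, a)
            prec = 1.0 / self.var[s, a] + 1.0 / self.obs_var
            self.mu[s, a] = (self.mu[s, a] / self.var[s, a] + target / self.obs_var) / prec
            self.var[s, a] = 1.0 / prec
            self.pool.add(s)  # remember explored states for later sweeps

        def sweep_pool(self, model):
            # Replay Bellman updates over pooled states using a learned model
            # mapping (s, a) -> (reward, next state), to speed up convergence.
            for s in list(self.pool):
                for a in range(self.mu.shape[1]):
                    if (s, a) in model:
                        r, s_next = model[(s, a)]
                        self.update(s, a, r, s_next, done=False)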

2. A Multi-condition Continual Learning Algorithm for Robots with Elastic Weight Consolidation
A robot encounters different conditions or settings when learning a task. Without constraints, this leads to catastrophic forgetting in the continual learning scenario. Neuroscience studies suggest that the mammalian neocortex protects knowledge through synaptic consolidation, which retains knowledge by protecting the neural network weights important for previous tasks. Inspired by these studies, we propose a multi-condition continual learning algorithm for robots with elastic weight consolidation. Specifically, a synaptic consolidation mechanism is added to the optimization of the artificial neural network weights, so that the robot's performance on previous task conditions is protected. The effectiveness of the algorithm is demonstrated by theoretical analysis and experiments.
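As a rough illustration, a PyTorch-style sketch of the standard elastic weight consolidation ingredients is given below: a diagonal Fisher estimate on the previous condition's data and a quadratic penalty toward the previous weights. The regularization weight lam and the way the penalty is combined with the task loss are assumptions for this sketch, not the thesis's exact formulation.

    import torch

    def estimate_fisher(model, data_loader, loss_fn):
        # Diagonal Fisher information approximated by averaged squared gradients
        # computed on data from the previous task condition.
        fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        model.eval()
        n_batches = 0
        for x, y in data_loader:
            model.zero_grad()
            loss_fn(model(x), y).backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.detach() ** 2
            n_batches += 1
        return {n: f / max(n_batches, 1) for n, f in fisher.items()}

    def ewc_penalty(model, fisher, old_params, lam=1000.0):
        # Quadratic pull toward the weights that solved the previous condition,
        # scaled by how important each weight was (its Fisher value).
        penalty = 0.0
        for n, p in model.named_parameters():
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
        return 0.5 * lam * penalty

    # Training on a new condition then minimizes:
    #     loss = task_loss + ewc_penalty(model, fisher, old_params)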

3. A Continual Learning Algorithm Based on State Primitives and Policy Learning
When faced with multiple tasks, continual learning for the robot mainly involves two key issues: learning new tasks and protecting previous tasks. Some studies have shown that learning requires networks of neurons to generate new activity patterns, and that these patterns occupy a low-dimensional manifold subspace. We apply this learning mechanism to represent and learn the state information in robot tasks, and generate “state primitives” to describe the low-dimensional features corresponding to those found in neuroscience. To protect the performance of previous tasks, we learn the policy for each task separately based on the state primitives, and finally propose a continual learning algorithm based on state primitives and policy learning that helps the robot avoid catastrophic forgetting in multi-task sequential learning.
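The following PyTorch sketch shows one plausible reading of this idea, assuming the “state primitives” are the outputs of a shared low-dimensional encoder and that each task gets its own policy head trained on top of it. The encoder architecture, dimensions, and the freezing strategy are illustrative assumptions rather than the thesis's exact design.

    import torch
    import torch.nn as nn

    class StatePrimitiveEncoder(nn.Module):
        """Hypothetical state-primitive encoder: maps raw robot states into a
        low-dimensional subspace, loosely mirroring the low-dimensional neural
        activity manifolds cited above."""
        def __init__(self, state_dim, primitive_dim=8):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, primitive_dim),
            )
        def forward(self, state):
            return self.encoder(state)

    class PerTaskPolicy(nn.Module):
        """One small policy head per task, trained on the shared primitives so
        that learning a new task does not overwrite earlier tasks' policies."""
        def __init__(self, primitive_dim, action_dim):
            super().__init__()
            self.head = nn.Linear(primitive_dim, action_dim)
        def forward(self, primitives):
            return self.head(primitives)

    # Sequential learning sketch: fit the encoder on the first task, then keep it
    # fixed and add a new policy head for each subsequent task.
    # encoder = StatePrimitiveEncoder(state_dim=12)
    # policies = [PerTaskPolicy(primitive_dim=8, action_dim=4) for _ in range(num_tasks)]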

4. A Multi-task Continual Learning Framework Based on State Primitive Learning
When learning multiple tasks continually, the way state primitives are generated in robot tasks can either remain fixed or change with different tasks. Building on these two forms, a “hard constraint” and a “soft constraint” on state primitives, we summarize and propose a multi-task continual learning framework based on state primitive learning. In addition, considering the differences between tasks, we introduce encoding constraints derived from previous tasks into the learning of state primitives under the soft-constraint version of the framework, and further propose a continual learning algorithm based on encoded state primitives. Extensive experiments demonstrate that our method remembers previously learned tasks in multi-task continual learning scenarios with larger task differences, and achieves a higher average success rate across tasks.
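A minimal sketch of the two constraint styles, under the same assumed encoder as above: the “hard constraint” freezes the primitive encoder once learned, while the “soft constraint” lets it keep adapting but penalizes drift of the primitives on states remembered from earlier tasks. The stored-state memory, the drift penalty, and the weight beta are illustrative assumptions, not the thesis's exact encoding constraint.

    import torch

    def soft_constraint_loss(encoder, old_encoder, memory_states, beta=1.0):
        # Soft constraint with an encoding term: when the primitive encoder is
        # allowed to change for a new task, keep its outputs on states stored
        # from earlier tasks close to what the old encoder produced, so earlier
        # policies still receive the primitives they were trained on.
        with torch.no_grad():
            old_codes = old_encoder(memory_states)   # primitives old tasks rely on
        new_codes = encoder(memory_states)
        return beta * torch.mean((new_codes - old_codes) ** 2)

    # Hard constraint: simply freeze the encoder after the first task is learned.
    # for p in encoder.parameters():
    #     p.requires_grad_(False)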

Keywords: Continual Learning; Policy Learning; Robot; Catastrophic Forgetting; State Primitives
Language: Chinese
Sub-direction Classification (Seven Major Directions): Reinforcement and Evolutionary Learning
Document Type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39077
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems - Robot Theory and Applications
Recommended Citation (GB/T 7714):
熊方舟. Research on Continual Policy Learning Algorithms for Robots [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.
Files in This Item:
File Name/Size    Document Type    Version    Access    License
Thesis.pdf (3642 KB)    Doctoral thesis    Open Access    CC BY-NC-SA