Research on the Problem of Environment Modeling Errors in Reinforcement Learning
黄文振
2021-11
Pages: 134
Degree Type: Doctoral
Chinese Abstract

Reinforcement learning (RL) aims to obtain optimal policies for sequential decision-making problems through trial and error. Early RL methods typically recorded the values of different states in tabular form or approximated state values with linear functions, which largely restricted their applications to low-dimensional problems. Combining RL with powerful function approximators such as deep neural networks has greatly broadened its range of applications, from the control of simulated robots, to board games such as Go, to complex computer games played from visual input. However, common RL methods suffer from low sample efficiency: they need a large number of samples to learn a good policy, which means the agent must interact with the environment extensively. When such algorithms are applied to real-world scenarios, this leads to problems such as excessive wear on equipment and overly long training times.

To improve the sample efficiency of RL, this thesis focuses on model-based reinforcement learning. Model-based methods are generally considered to be sample-efficient. They learn a dynamics model to simulate the real environment and use that model to generate imaginary data, perform online planning, or carry out policy search, thereby reducing the number of real samples required. However, the accuracy of the learned dynamics model strongly affects the performance of these model-based RL algorithms; model errors can lead to suboptimal performance or even divergence. This thesis studies the model error problem from two angles: first, adjusting the policy with which the agent interacts with the environment, i.e., collecting appropriate training samples to reduce the prediction error of the dynamics model; second, adjusting how generated samples are used, i.e., limiting the participation of generated samples with large prediction errors in policy training.

The main contributions are summarized as follows:

1. To collect appropriate training samples and thereby reduce the prediction error of the dynamics model, this work proposes an RL method that applies bootstrapping at every depth of the planning tree. The method quantifies the dynamics model's uncertainty when predicting different state-action pairs and increases exploration of states with high uncertainty, thereby reducing potential model errors. In addition, when updating the action-value function, the target values are obtained by sampling from the bootstrap distribution, which better connects future and current uncertainties. A prior mechanism is also introduced to improve exploration efficiency. Experimental results show that the proposed method effectively reduces model errors and outperforms other model-based methods on multiple Atari games.

2. To limit the participation of generated samples carrying prediction errors in policy training, this work proposes an adaptive reweighting mechanism for generated samples. Specifically, the effect of a generated sample on training is evaluated by first using it to update the action-value and policy networks, and then computing, on real samples, the difference between the losses of the two networks before and after the update. To reweight each sample efficiently, a weight prediction network is constructed, and a meta-learning algorithm based on the above criterion is designed to train it. The overall procedure is as follows: use the weighted generated samples to update the action-value and policy networks, compute the difference between the losses before and after the update, obtain the gradient of this difference with respect to the weight network's parameters via the chain rule, and use it to update the weight network. Experimental results show that the proposed method outperforms existing model-based and model-free RL algorithms on multiple control tasks. Visualizations of the weight changes further confirm the soundness of the weighting scheme.

3. The second piece of work (see the previous paragraph) weights samples by minimizing the negative effects of generated samples, but it still faces the challenge of weight underestimation. To address this challenge, this thesis proposes two solutions. In the first, this work extends the idea of the second piece of work: starting from the effect of generated samples on the parameters of the policy or value network during training, it still trains the weight prediction function with meta-learning, but additionally uses real samples as references to adjust the weight prediction function so that the weights of generated samples are not underestimated. In the second, starting from the effect of generated samples on the outputs of the value network during training, this work builds a sample weighting mechanism that directly compares the target values computed on real and generated samples and uses the difference to train the weight prediction network in a supervised manner, which avoids weight underestimation while reducing the computational cost. Experimental results show that, on multiple high-dimensional continuous control tasks, the algorithms designed with these mechanisms outperform not only state-of-the-art model-based and model-free RL algorithms but also the algorithm proposed in the second piece of work.

In summary, to address the model error problem, this thesis proposes two lines of attack: collecting state-action pairs with high uncertainty to reduce potential model errors, and mitigating the negative effects of generated samples that carry prediction errors; the research presented here follows these two lines.

English Abstract

The goal of reinforcement learning (RL) is to discover optimal policies for sequential decision-making problems through trial and error. Early RL methods usually used tabular representations to record the values of different states or approximated them with linear functions, which limited these methods to low-dimensional problems. The combination of RL with powerful function approximators such as deep neural networks has greatly extended the application scenarios of RL, from continuous control of simulated humanoids, to mastering board games including Go, to playing a variety of complex computer games from pixels. However, common RL approaches have low sample efficiency; that is, they need a large number of samples to learn a good policy. This implies massive interaction with the environment and leads to excessive equipment wear and long training wall-clock times when these approaches are applied to real-world settings.

To improve the sample efficiency of reinforcement learning, this thesis focuses on model-based reinforcement learning (MBRL). MBRL is commonly considered to be sample-efficient: it learns a dynamics model to simulate the real environment and uses the model to generate imaginary samples, perform online planning, or carry out policy search, thereby reducing the number of real samples required. However, the accuracy of the learned dynamics model has a great impact on the performance of MBRL algorithms, and model error may lead to suboptimal performance or even divergence. This thesis tackles the model error problem from two perspectives: first, adjusting the behavior policy of the agent, i.e., collecting appropriate training samples to reduce the prediction error of the dynamics model; second, adjusting the usage of imaginary samples, i.e., limiting the use of samples with large prediction errors when optimizing the policy.
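To make the general MBRL workflow concrete, below is a minimal Dyna-style sketch: fit a dynamics model on real transitions, roll it out to generate imaginary samples, and train the policy on the mixture. It is an illustrative toy under stated assumptions (a linear-Gaussian model and placeholder `env_step`, `policy`, and `update_policy` callables), not the algorithms studied in the thesis.

```python
# Minimal Dyna-style MBRL loop (illustrative sketch, not the thesis method).
import random

import numpy as np


class GaussianDynamicsModel:
    """Toy linear-Gaussian model: s' ~ W [s; a] + noise, fit by least squares."""

    def __init__(self, state_dim, action_dim):
        self.W = np.zeros((state_dim, state_dim + action_dim))

    def fit(self, states, actions, next_states):
        X = np.hstack([states, actions])                      # (N, s+a)
        # Ridge-regularised least squares for the transition matrix.
        self.W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]),
                                 X.T @ next_states).T

    def predict(self, state, action):
        x = np.concatenate([state, action])
        return self.W @ x + 0.01 * np.random.randn(self.W.shape[0])


def mbrl_loop(env_step, policy, update_policy, state_dim, action_dim,
              iterations=10, real_steps=200, imaginary_steps=1000):
    """Alternate between collecting real data, fitting the model, and
    training the policy on model-generated ("imaginary") rollouts."""
    model = GaussianDynamicsModel(state_dim, action_dim)
    real_buffer = []                                          # (s, a, r, s')
    state = np.zeros(state_dim)
    for _ in range(iterations):
        # 1) Interact with the real environment.
        for _ in range(real_steps):
            action = policy(state)
            next_state, reward = env_step(state, action)
            real_buffer.append((state, action, reward, next_state))
            state = next_state
        # 2) Fit the dynamics model on all real transitions.
        s, a, _, s2 = map(np.array, zip(*real_buffer))
        model.fit(s, a, s2)
        # 3) Generate imaginary samples by branching from real states.
        imaginary = []
        for _ in range(imaginary_steps):
            s0, a0, r0, _ = random.choice(real_buffer)
            imaginary.append((s0, a0, r0, model.predict(s0, a0)))
        # 4) Train the policy on the mixture of imaginary and real data.
        update_policy(imaginary + real_buffer)
    return model
```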

The main contributions of this thesis are summarized as follows:

1. To collect appropriate training samples, this work proposes a bootstrapped model-based RL method that bootstraps the modules at each depth of the planning tree. The method quantifies the dynamics model's uncertainty on different state-action pairs and leads the agent to explore pairs with higher uncertainty, reducing potential model errors. Moreover, we sample target values from their bootstrap distribution to connect the uncertainties at the current and subsequent time steps, and we introduce a prior mechanism to improve exploration efficiency. Experimental results demonstrate that our method effectively decreases model error and outperforms TreeQN and other state-of-the-art methods on multiple Atari games.
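The sketch below illustrates the uncertainty-quantification idea behind this item under stated assumptions: an ensemble of bootstrapped predictors, each carrying a fixed additive prior, whose disagreement on a state-action pair serves as an exploration bonus. The linear heads, the masking-based bootstrap, and the bonus form are illustrative simplifications, not the thesis's planning-tree architecture.

```python
# Bootstrapped ensemble with fixed priors; disagreement drives exploration.
import numpy as np


class BootstrappedHead:
    """One ensemble member: a trainable linear predictor plus a fixed prior."""

    def __init__(self, in_dim, out_dim, prior_scale=1.0, rng=None):
        rng = rng or np.random.default_rng()
        self.W = np.zeros((out_dim, in_dim))                    # trainable part
        self.prior = prior_scale * rng.standard_normal((out_dim, in_dim))

    def predict(self, x):
        return (self.W + self.prior) @ x

    def train_step(self, x, target, lr=1e-2):
        # Plain SGD on squared error; the prior stays fixed.
        self.W += lr * np.outer(target - self.predict(x), x)


class BootstrappedEnsemble:
    def __init__(self, k, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = [BootstrappedHead(in_dim, out_dim, rng=rng) for _ in range(k)]
        self.rng = rng

    def train(self, xs, targets, mask_prob=0.5):
        # Bootstrapping via random masking: each head sees roughly half the data.
        for x, t in zip(xs, targets):
            for head in self.heads:
                if self.rng.random() < mask_prob:
                    head.train_step(x, t)

    def uncertainty(self, x):
        # Disagreement (mean std across output dims) between ensemble members.
        preds = np.stack([h.predict(x) for h in self.heads])
        return preds.std(axis=0).mean()


def exploration_bonus(ensemble, state, action, beta=1.0):
    """Bonus added to a state-action value so the agent prefers pairs the
    learned dynamics model is uncertain about, reducing future model error."""
    return beta * ensemble.uncertainty(np.concatenate([state, action]))
```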

2. To limit the use of samples with large prediction errors when optimizing the policy, this work proposes to adaptively reweight the imaginary samples. Specifically, we evaluate the effect of an imaginary sample by measuring the change in the loss computed on real samples when the sample is used to update the action-value and policy networks. To obtain the weight of each sample efficiently, we build a weight prediction network and design a meta-learning algorithm to train it: first, use the weighted imaginary samples to update the action-value and policy networks and compute the difference between the losses before and after the update; then, compute the gradient of this difference with respect to the weight network's parameters via the chain rule and update the weight network with that gradient. Extensive experiments demonstrate that our method outperforms state-of-the-art model-based and model-free RL algorithms on multiple tasks. Visualization of the learned weights further validates the soundness of the reweighting scheme.
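A minimal sketch of the described meta-learning step follows, assuming a toy linear action-value function so that the inner update can be written functionally and differentiated through with PyTorch autograd. The weight network architecture, input features, and learning rates are illustrative assumptions, not the thesis implementation.

```python
# One meta step: weight imaginary samples, take a virtual Q update, and push
# the weight network to make the real-data loss decrease (illustrative sketch).
import torch


def q_values(w, states, actions):
    """Toy linear action-value function Q(s, a) = w . [s; a]."""
    return torch.cat([states, actions], dim=1) @ w


def td_loss(w, batch, weights=None):
    states, actions, targets = batch
    errors = (q_values(w, states, actions) - targets) ** 2
    return (weights * errors).mean() if weights is not None else errors.mean()


def reweight_update(w, weight_net, imaginary, real, inner_lr=1e-2, meta_lr=1e-3):
    s_i, a_i, t_i = imaginary
    sample_weights = weight_net(torch.cat([s_i, a_i, t_i[:, None]], dim=1)).squeeze(1)

    # Virtual (inner) update of Q on the weighted imaginary batch, kept
    # differentiable w.r.t. the weight network via create_graph=True.
    inner_loss = td_loss(w, imaginary, sample_weights)
    grad_w = torch.autograd.grad(inner_loss, w, create_graph=True)[0]
    w_virtual = w - inner_lr * grad_w

    # Outer objective: change of the real-data loss caused by that update.
    delta = td_loss(w_virtual, real) - td_loss(w, real).detach()
    meta_grads = torch.autograd.grad(delta, list(weight_net.parameters()))
    with torch.no_grad():
        for p, g in zip(weight_net.parameters(), meta_grads):
            p -= meta_lr * g
    return sample_weights.detach()


# Minimal usage with random tensors (dimensions are arbitrary).
torch.manual_seed(0)
w = torch.zeros(6, requires_grad=True)                        # Q parameters
weight_net = torch.nn.Sequential(torch.nn.Linear(7, 16), torch.nn.ReLU(),
                                 torch.nn.Linear(16, 1), torch.nn.Sigmoid())
fake = (torch.randn(32, 4), torch.randn(32, 2), torch.randn(32))
real = (torch.randn(32, 4), torch.randn(32, 2), torch.randn(32))
weights = reweight_update(w, weight_net, imaginary=fake, real=real)
```

In actual training, the weights returned by such a step would then be applied in the real (non-virtual) update of the action-value and policy networks.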

3. The second piece of work (see the previous paragraph) reweights the imaginary samples by minimizing their negative effects, but it still faces the challenge of weight underestimation. This work proposes two solutions to this challenge. The first solution extends the second piece of work and focuses on the effect of generated samples on the policy or value network parameters during training. It still uses meta-learning to train the weight prediction function, but additionally takes real samples as references to adjust the function and avoid underestimation. The second solution focuses on the effect of generated samples on the outputs of the value network during training. It builds a weighting mechanism that directly compares the target values computed on real and imaginary samples and trains the weight prediction network with supervised learning based on the differences between these target values; this makes fuller use of the dynamics model and reduces the computational cost while avoiding weight underestimation. Experimental results show that, on multiple high-dimensional continuous control tasks, the algorithms designed with these mechanisms outperform not only state-of-the-art model-based and model-free reinforcement learning algorithms but also the algorithm proposed in the second piece of work.
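The second solution's supervision signal can be sketched as below, assuming a toy discrete-action Q-network: a TD target is computed from a real transition and from the model's imagined counterpart for the same state-action pair, and the weight prediction network is regressed toward a label derived from their gap. The exp(-kappa * |gap|) mapping, the networks, and all dimensions are illustrative assumptions, not the thesis implementation.

```python
# Supervise the weight network with the real-vs-imagined TD-target gap (sketch).
import torch


def td_target(reward, next_state, q_net, gamma=0.99):
    """One-step TD target r + gamma * max_a' Q(s', a') for a toy discrete Q."""
    with torch.no_grad():
        return reward + gamma * q_net(next_state).max(dim=1).values


def weight_targets(real_batch, imagined_batch, q_net, kappa=1.0):
    """Map the real/imagined target-value gap to a supervision label in (0, 1]."""
    r_real, s2_real = real_batch
    r_imag, s2_imag = imagined_batch            # model predictions for the same (s, a)
    gap = (td_target(r_real, s2_real, q_net) - td_target(r_imag, s2_imag, q_net)).abs()
    return torch.exp(-kappa * gap)


def train_weight_net(weight_net, q_net, states, actions, real_batch, imagined_batch,
                     lr=1e-3):
    """Single supervised regression step for the weight prediction network."""
    labels = weight_targets(real_batch, imagined_batch, q_net)
    inputs = torch.cat([states, actions], dim=1)
    loss = torch.nn.functional.mse_loss(weight_net(inputs).squeeze(1), labels)
    loss.backward()
    with torch.no_grad():
        for p in weight_net.parameters():
            p -= lr * p.grad
            p.grad = None
    return loss.item()


# Minimal usage with random tensors (state dim 4, 2 discrete actions).
torch.manual_seed(0)
q_net = torch.nn.Linear(4, 2)                   # toy Q(s, .)
weight_net = torch.nn.Sequential(torch.nn.Linear(6, 16), torch.nn.ReLU(),
                                 torch.nn.Linear(16, 1), torch.nn.Sigmoid())
states, actions = torch.randn(32, 4), torch.randn(32, 2)
real = (torch.randn(32), torch.randn(32, 4))    # (reward, next_state)
imagined = (torch.randn(32), torch.randn(32, 4))
print(train_weight_net(weight_net, q_net, states, actions, real, imagined))
```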

In summary, to address the problem of model errors, this thesis proposes two ideas: collecting state-action pairs with higher uncertainty to reduce potential model errors, and reducing the negative effects of imaginary samples that carry prediction errors; the research works described above follow these ideas.

Keywords: Model-Based Reinforcement Learning, Deep Reinforcement Learning, Meta-Learning
Language: Chinese
Sub-direction Classification (Seven Major Directions): Theory and Methods of Decision Intelligence
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/46601
Collection: Center for Research on Intelligent Perception and Computing
Recommended Citation (GB/T 7714):
黄文振. Research on the Problem of Environment Modeling Errors in Reinforcement Learning[D]. School of Artificial Intelligence, University of Chinese Academy of Sciences, 2021.
Files in This Item:
Thesis.pdf (69564 KB); Document Type: Thesis; Access: Open Access; License: CC BY-NC-SA