面向连续控制任务的深度强化学习值函数估计研究
何强
2022-05
Pages: 146
Degree type: Master's
Abstract (Chinese)

Over the past decade, the rapid development of deep reinforcement learning (DRL) has had an enormous impact on social, economic, and cultural domains, with applications such as protein structure prediction, games and entertainment, energy scheduling, transportation decision making, and robot control.

 

However, applying DRL in the real world still faces a series of thorny challenges, such as bias in value function estimation, difficulty in generalizing to offline settings, conflicts between the properties of neural networks and reinforcement learning, and excessive consumption of computational resources. These difficulties hinder the deployment of DRL in real-world scenarios. Value function estimation is a core problem of modern DRL algorithms. Action spaces in real-world settings are often continuous, and for reinforcement learning problems with continuous action spaces the optimization objective typically requires an integral over the action space, which is difficult to evaluate.

 

This thesis focuses on value function estimation in continuous action spaces. Starting from a theoretical perspective and guided by concrete problems, we carry out an in-depth study of this problem from the following angles. First, how to improve the accuracy of value function estimation and thereby improve existing DRL algorithms for continuous control or propose new ones. Second, by analyzing the necessary conditions for convergence of the Bellman equation, we derive the properties a good value function should possess and use them to improve the algorithm. Third, we obtain an accurate value function estimate from the perspective of ensemble learning and reduce the resource consumption of ensemble reinforcement learning. The main contributions of this thesis are as follows.

 

1. An algorithm is proposed to alleviate underestimation of the value function in the continuous action space setting. In dynamic programming, the value function is updated with estimates of subsequent values, which leads to an accumulation of errors. Because of overestimation, any state may be assigned a relatively high value, including bad or rarely visited states, leading to suboptimal policies or learning failure. The TD3 algorithm shows that overestimation frequently occurs in algorithms that use only a single critic; it therefore maintains a pair of critics and takes the minimum of the two as the target. We prove, under loose assumptions, that the way TD3 constructs its target value causes underestimation. To alleviate this underestimation, the thesis proposes the Weighted Delayed Deep Deterministic Policy Gradient (WD3) algorithm. Experiments confirm the existence of the underestimation problem, and WD3 is evaluated on continuous control tasks: by mitigating underestimation, WD3 achieves the best performance.

 

 

2. An algorithm is proposed to alleviate the value function estimation gap in the offline reinforcement learning setting. Offline RL studies how to learn an optimal policy from a given static dataset without interacting with the environment. We first analyze, from a theoretical standpoint, the estimation gap of the value function in the offline setting: when the value function is evaluated via the Bellman equation, the agent's inability to interact with the environment prevents it from eliminating the estimation gap through the Bellman equation. This gap can lead to catastrophically wrong value estimates for actions that do not appear in the dataset. The thesis then proposes a new offline policy optimization method, Pessimistic Offline Policy Optimization (POPO), which uses a pessimistic distributional value function to alleviate the estimation gap and thereby learn a robust policy. We validate the effectiveness of POPO on the D4RL benchmark and compare it with state-of-the-art offline RL methods; POPO outperforms all tested algorithms.

 

3. An algorithm is proposed to induce good value function representations in deep reinforcement learning. We discuss, in the neural network setting, the representation learning properties of the value function when the policy evaluation process converges. The thesis first defines the representation of the value function in the neural network setting and then discusses the properties this representation has after the Bellman equation converges. We prove theoretically that, once the value function has converged, a representation gap should exist between the representations of the learned value function and the target value function, owing to the properties of the pseudo-inverse and the lazy training regime of neural networks. In practice, however, the two representations grow increasingly similar as training proceeds, which leads to the representation overlap problem. To address it, we propose the Policy Optimization from Preventing Representation Overlaps (POPRO) framework. We validate the effectiveness of POPRO on PyBullet continuous control environments, where it outperforms all tested algorithms.

 

4. A general ensemble reinforcement learning framework, the Minimalist Ensemble Policy Gradient (MEPG) framework, is proposed; it is simple and easy to implement. MEPG introduces no additional loss function and preserves the ensemble learning property by maintaining two synchronized deep Gaussian processes, thereby alleviating the heavy computational cost of ensemble RL methods. The thesis then proves that the policy evaluation process of MEPG maintains two synchronized deep Gaussian processes. Finally, by combining the MEPG framework with the DDPG and SAC algorithms, the ME-DDPG and ME-SAC algorithms are derived. The effectiveness of MEPG is demonstrated on PyBullet continuous control tasks: MEPG matches or exceeds state-of-the-art ensemble DRL algorithms and model-free DRL methods while using only 14%-27% of the computational resources of ensemble RL. Moreover, the MEPG framework can be combined with any DRL algorithm.

Abstract (English)

Over the past decade, the rapid development of deep reinforcement learning (DRL) has had a huge impact on social, economic, and cultural fields, with applications such as protein structure prediction, game entertainment, energy scheduling, transportation decision making, and robot control.



However, a series of thorny challenges must be addressed before DRL can be applied to real-world problems, such as bias in value function estimation, difficulty in generalizing to offline scenarios, conflicts between the properties of neural networks and RL, and massive consumption of computational resources. These difficulties prevent DRL from being deployed in real-world scenarios. Value function estimation is the core problem of modern DRL algorithms. The action space in real-world settings is often continuous, and for RL problems with continuous action spaces the optimization objective usually requires solving an integral over the action space, which is considerably difficult.
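To make the difficulty concrete, consider the standard Bellman expectation equation for a stochastic policy $\pi$ (standard notation, not quoted from the thesis):

$$Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ \int_{\mathcal{A}} \pi(a' \mid s')\, Q^{\pi}(s',a')\, \mathrm{d}a' \right].$$

When the action set $\mathcal{A}$ is finite the inner term is a sum, but for continuous control it is an integral, and the greedy improvement step $\max_{a'} Q(s',a')$ likewise becomes a continuous optimization problem; this is why actor-critic methods approximate the maximizer with a parametric policy.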



In this thesis, we focus on the problem of estimating value functions under continuous action spaces. Starting from a theoretical perspective and guided by concrete problems, we conduct an in-depth study of value function estimation in continuous action spaces from the following angles. First, how to improve the accuracy of value function estimation and thereby improve existing DRL algorithms for continuous control or propose new ones. Second, by analyzing the necessary conditions for the convergence of the Bellman equation, we derive the properties a good value function should have and use them to improve the algorithm. Third, we obtain an accurate estimate of the value function from the perspective of ensemble learning and reduce the resource consumption of ensemble RL. The main contributions of this thesis are as follows.





1. The WD3 algorithm is proposed to alleviate underestimation of the value function in the continuous action space setting. In dynamic programming, the value function is updated with estimates of subsequent values, which can cause errors to accumulate. Due to overestimation, any state may be assigned a relatively high value, including bad or rarely visited states, leading to suboptimal policies or learning failure. The TD3 algorithm shows that overestimation often occurs in algorithms that use only a single critic; it therefore maintains a pair of critics and takes their minimum as the target. We show that this approach leads to underestimation under loose assumptions. To alleviate the underestimation issue, we propose the Weighted Delayed Deep Deterministic Policy Gradient (WD3) algorithm. The existence of the underestimation problem is also verified experimentally. We evaluate WD3 on PyBullet continuous control tasks, where it achieves superior performance by mitigating underestimation.
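To illustrate the mechanism, the following is a minimal PyTorch-style sketch of how such a weighted target could be formed; the blending weight beta and the choice of averaging are illustrative assumptions on our part, not the exact rule used in the thesis.

import torch

def td3_target(reward, not_done, q1_t, q2_t, gamma=0.99):
    # Clipped double-Q target used by TD3: the hard minimum over the two
    # target critics avoids overestimation but can underestimate.
    return reward + gamma * not_done * torch.min(q1_t, q2_t)

def wd3_style_target(reward, not_done, q1_t, q2_t, gamma=0.99, beta=0.75):
    # Hypothetical weighted target: blend the pessimistic minimum with the
    # mean of the twin critics. beta is an illustrative hyperparameter.
    q_min = torch.min(q1_t, q2_t)
    q_mean = 0.5 * (q1_t + q2_t)
    return reward + gamma * not_done * (beta * q_min + (1.0 - beta) * q_mean)

Here q1_t and q2_t denote the two target critics evaluated at the next state and the (noise-perturbed) target action, as in TD3.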





2. The POPO algorithm is proposed to alleviate the estimation gap in offline RL. Offline RL studies how to learn optimal policies from a given static dataset without any interaction with the environment. We first analyze theoretically the estimation gap of the value function in the offline setting: the agent's inability to interact with the environment prevents it from eliminating the estimation gap through the Bellman equation. This gap can lead to catastrophically wrong value estimates for actions that do not appear in the dataset. To tackle this issue, we propose Pessimistic Offline Policy Optimization (POPO), an algorithm that learns a robust policy using a pessimistic distributional value function to alleviate the estimation gap. We validate the effectiveness of POPO on the D4RL benchmark and compare it with state-of-the-art (SOTA) offline RL methods. The experimental results show that POPO outperforms the tested SOTA algorithms.
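As a rough sketch of what a pessimistic distributional value function can look like in code (our own illustrative reading, assuming a quantile-based critic; the thesis may instantiate pessimism differently), the policy can be scored by a lower-tail statistic of the predicted return distribution rather than its mean:

import torch

def pessimistic_value(quantiles: torch.Tensor, tau: float = 0.25) -> torch.Tensor:
    # quantiles: (batch, n_quantiles) return-distribution estimates from a
    # distributional critic. Averaging only the lowest `tau` fraction of the
    # sorted quantiles yields a conservative scalar value; tau is illustrative.
    n = quantiles.shape[-1]
    k = max(1, int(tau * n))
    sorted_q, _ = torch.sort(quantiles, dim=-1)
    return sorted_q[..., :k].mean(dim=-1)

# The actor is then updated to maximize this pessimistic value, so actions the
# dataset does not support, whose return estimates are uncertain, are scored low.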



3. The POPRO framework is proposed to induce good value function representations. We discuss, in the neural network setting, the properties of the value function representation when the policy evaluation process converges. We first define the representation of the value function under the neural network setting and then discuss the properties this representation should have after the Bellman equation converges. We show theoretically that, when the value function converges, a representation gap should exist between the representations of the learned value function and the target value function, owing to the nature of the pseudo-inverse and the lazy training mechanism. In practice, however, the representations grow increasingly similar as training proceeds, which leads to the representation overlap problem. To tackle this issue, we propose the Policy Optimization from Preventing Representation Overlaps (POPRO) framework. We validate the effectiveness of POPRO on PyBullet continuous control tasks, where it outperforms the tested SOTA methods.
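One illustrative way to quantify and discourage representation overlap (our own sketch, treating the penultimate-layer features of the online and target critics as their "representations"; the regularizer actually used by POPRO may differ) is:

import torch
import torch.nn.functional as F

def representation_overlap(phi_online: torch.Tensor, phi_target: torch.Tensor) -> torch.Tensor:
    # Mean cosine similarity between penultimate-layer features of the learned
    # and target value networks; values close to 1 indicate overlap.
    return F.cosine_similarity(phi_online, phi_target, dim=-1).mean()

def regularized_critic_loss(td_loss, phi_online, phi_target, lambda_rep=0.1):
    # Hypothetical combined objective: the usual TD loss plus a penalty that
    # keeps the online representation from collapsing onto the target's.
    # lambda_rep is an illustrative coefficient, not taken from the thesis.
    return td_loss + lambda_rep * representation_overlap(phi_online, phi_target.detach())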



4. A general ensemble RL framework, the Minimalist Ensemble Policy Gradient (MEPG) framework, is proposed; it is simple and easy to implement. MEPG introduces no additional loss function and preserves the ensemble property by maintaining two synchronized deep Gaussian processes, thereby alleviating the heavy computational cost of ensemble RL methods. We show that the policy evaluation phase of MEPG maintains two synchronized deep Gaussian processes. Finally, we derive the ME-DDPG and ME-SAC algorithms by combining the MEPG framework with the DDPG and SAC algorithms. The effectiveness of MEPG is demonstrated on PyBullet continuous control tasks. The empirical results show that MEPG matches or exceeds SOTA ensemble DRL algorithms and model-free DRL methods while using only 14%-27% of the computational resources of ensemble RL. Moreover, the MEPG framework can be combined with any DRL algorithm.
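The phrase "two synchronized deep Gaussian processes" suggests a dropout-based mechanism in the spirit of MC-dropout; the sketch below reflects that reading and is our own illustration, not the thesis implementation: keeping dropout active in both the online and target critics during the Bellman update turns a single pair of networks into an implicit ensemble with no extra loss terms.

import torch
import torch.nn as nn

class DropoutCritic(nn.Module):
    # A critic whose hidden layers apply dropout; sampling different masks at
    # evaluation time yields an implicit ensemble of value estimates.
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# During policy evaluation, both the online and target critics are kept in
# train() mode so their dropout masks remain stochastic; under this reading,
# the two stochastic networks stand in for an explicit ensemble.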

Keywords: deep reinforcement learning; value function estimation; value function representation; ensemble reinforcement learning
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48781
Collection: Graduates - Master's Theses
State Key Laboratory of Multimodal Artificial Intelligence Systems - Brain-Machine Fusion and Cognitive Assessment
Recommended citation (GB/T 7714):
何强. 面向连续控制任务的深度强化学习值函数估计研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2022.
Files in this item:
132980257514376250.p (4687 KB) | Document type: Thesis | Access: Restricted | License: CC BY-NC-SA