Knowledge Commons of Institute of Automation, CAS
基于混合更新Q值的深度强化学习方法研究 (Research on a Deep Reinforcement Learning Method Based on Mixing Updates of Q-Values)
李主南
2020-05-21
Pages | 88
Degree type | Master's
Abstract (Chinese) | In recent years, with the explosive growth of computing power and data, a wave of research and application has swept artificial intelligence and related fields, and deep reinforcement learning has accordingly become a research hotspot. In deep reinforcement learning, both value-based methods and policy-gradient methods involve the problem of estimating and updating Q-values. At present, the vast majority of methods update the target in the Q-learning manner, but this approach produces an overestimation problem. It is therefore necessary to propose a new Q-value update method to extend existing methods.
Abstract (English) | In recent years, the explosive growth of computing power and data has set off a wave of research and application in artificial intelligence, and deep reinforcement learning has become a research hotspot as a result. In deep reinforcement learning, both value-based methods and policy-gradient methods involve estimating and updating Q-values. At present, most methods update the target in the Q-learning manner, which introduces an overestimation problem, so a new way of updating the Q-value is needed to extend existing methods. Overestimation bias is a well-known property of the Q-learning algorithm and can lead to suboptimal policies; deep reinforcement learning methods that update the Q-value in the Q-learning manner, including Actor-Critic algorithms, generally inherit this problem. This thesis focuses on the overestimation problem in reinforcement learning, and its goal is to propose a method that mitigates overestimation while effectively limiting the negative effects of underestimation. We first analyze overestimation and find that its main cause is the noise introduced by the function approximator. Second, since existing remedies for overestimation also introduce an underestimation bias, this thesis draws on the concept of the convex combination from convex geometry and proposes a mixing update method; both theoretical analysis and experiments verify that the method reduces variance and effectively improves algorithm performance. Finally, we combine the method with three well-known deep reinforcement learning algorithms, DQN, DDPG, and TD3, propose the corresponding improved algorithms, and conduct experiments on the OpenAI Gym platform. The experimental results show that the improved algorithms outperform the originals in most cases, which again validates the effectiveness of the proposed method. The contributions of this thesis are twofold. First, a mixing update method inspired by convex geometry is proposed, and its effectiveness is verified in theory and in experiments. Second, the method is combined with three typical deep reinforcement learning algorithms to yield three corresponding improved algorithms; evaluation on the suite of OpenAI Gym tasks shows that, in most cases, the method is an effective way to alleviate the overestimation problem.
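The abstract names the mixing update only at a high level, as a convex combination that trades overestimation against underestimation. The sketch below illustrates one plausible reading in a DQN-style setting: the TD target blends the standard Q-learning target (which tends to overestimate) with a double-Q-style target (which tends to underestimate). The coefficient beta and all function and variable names are illustrative assumptions, not the thesis's actual notation or algorithm.

```python
import numpy as np

def mixing_update_target(reward, q_online_next, q_target_next,
                         gamma=0.99, beta=0.5):
    """Hypothetical blended TD target for one transition.

    q_online_next, q_target_next: arrays of Q(s', a) over all actions a,
    from the online and target networks respectively. beta in [0, 1] is
    the convex-combination coefficient (an assumed parameterization).
    """
    # Q-learning target: max over the target network (tends to overestimate).
    overestimate = np.max(q_target_next)
    # Double-Q-style target: pick the action with the online network,
    # evaluate it with the target network (tends to underestimate).
    a_star = int(np.argmax(q_online_next))
    underestimate = q_target_next[a_star]
    # Convex combination of the two estimates trades the biases off.
    blended = beta * overestimate + (1.0 - beta) * underestimate
    return reward + gamma * blended

# Toy usage: one transition with three actions.
q_online_next = np.array([1.0, 2.5, 2.0])
q_target_next = np.array([1.2, 2.0, 2.6])
print(mixing_update_target(reward=0.1,
                           q_online_next=q_online_next,
                           q_target_next=q_target_next,
                           beta=0.5))
```

Under this reading, beta = 1 recovers the plain Q-learning target and beta = 0 the double-Q-style target; the convex combination lets the estimator sit anywhere between the two biases.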
Keywords | Deep reinforcement learning; Q-learning; Overestimation; Underestimation; Actor-Critic; Convex combination; Mixing update
Subject area | Artificial intelligence
Discipline category | Engineering
Language | Chinese
Sub-direction classification (seven major directions) | Reinforcement and evolutionary learning
Document type | Thesis
Item identifier | http://ir.ia.ac.cn/handle/173211/39162
Collection | 复杂系统认知与决策实验室_智能系统与工程
Recommended citation (GB/T 7714) | 李主南. 基于混合更新Q值的深度强化学习方法研究[D]. 中国科学院自动化研究所, 中国科学院大学, 2020.
Files in this item
File name/size | Document type | Version type | Open type | License
李主南毕业论文.pdf (3839KB) | Thesis | | Open Access | CC BY-NC-SA