基于内在动机的深度强化学习探索策略研究 (Research on Exploration Strategies Based on Intrinsic Motivation in Deep Reinforcement Learning)
陈忠鹏
2024-05-14
Pages: 82
Subtype: Master's
Abstract

Designing exploration strategies is a key focus and a difficult problem in deep reinforcement learning, especially in sparse-reward environments. Current deep reinforcement learning algorithms typically rely on classic exploration strategies such as epsilon-greedy to guide the agent, but these classic strategies struggle with the hard-exploration problem that agents face when rewards are sparse. Exploration strategies based on intrinsic motivation are currently an effective way to address this problem. Intrinsic motivation refers to the tendency of higher organisms, during learning, to spontaneously explore unfamiliar and unknown environments without any external stimulus in order to improve their adaptability. Inspired by this, intrinsically motivated exploration strategies use certain indicators to formalize the agent's intrinsic motivation as an intrinsic reward signal that drives exploration. The quality of the designed intrinsic reward directly affects the agent's performance: a poor intrinsic reward not only fails to provide a useful exploration direction but can become a noise signal that hinders exploration. The core of intrinsically motivated exploration therefore lies in designing a reasonable intrinsic reward. This thesis studies the intrinsic reward design problem in such exploration strategies and proposes the following three intrinsic reward design methods:
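
The abstract does not spell out how the intrinsic reward enters policy optimization; a common formulation, assumed here only for illustration, simply adds the weighted intrinsic bonus to the sparse extrinsic reward, $r_t = r_t^{e} + \beta\, r_t^{i}$, where $r_t^{e}$ is the extrinsic environment reward, $r_t^{i}$ is the designed intrinsic reward, and $\beta$ is a weighting coefficient; the thesis may combine the two signals differently.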

1. An intrinsic reward design method based on counting random features (CRF, Count by Random Feature). The method first passes the original high-dimensional state through a random network to obtain a low-dimensional random feature vector, then discretizes this vector into a binary code, and finally counts state visits using the binary code. This makes visit counting feasible in high-dimensional state spaces, where states cannot be counted directly, and the method is simple to implement, requiring no training of additional deep neural networks. (A minimal sketch of this counting scheme appears after this list.)

2. An intrinsic reward design method based on a conditional generative adversarial network (CGAN-ICM, Conditional Generative Adversarial Network-Inverse Curiosity Module). The method uses a conditional GAN to model the state-transition dynamics of the reinforcement learning environment and takes the average prediction error over multiple different next-state predictions produced by the generator as the intrinsic reward signal that drives exploration. Because a single generator and a single discriminator can play the role of an ensemble of forward models, CGAN-ICM avoids the large amount of computation and training time that Disagreement$^{\citep{pathak2019self}}$ spends on training multiple forward models. CGAN-ICM can also be viewed as an ensemble-learning version of ICM$^{\citep{pathak2017curiosity}}$: measuring state novelty by the average prediction error of multiple predictions yields more statistically meaningful estimates than a single prediction. (The second sketch after this list illustrates the reward computation.)

3. An intrinsic reward design method based on both a global and a local perspective (CEMP, Continuous Exploration via Multiple Perspectives). Most work on intrinsically motivated exploration measures state novelty, and hence derives each state's intrinsic reward, from a single perspective, either global or local. Measuring novelty from a global perspective has a clear drawback: the novelty of states and the corresponding intrinsic rewards decay over time, and the decayed rewards can no longer drive the agent to keep exploring. Conversely, measuring novelty only from a local perspective blindly encourages the agent to visit unknown states frequently, which hinders policy convergence during learning. CEMP drives exploration by combining the intrinsic rewards computed from both perspectives: the locally computed reward compensates for the gradual decay of the globally computed reward, and it also guides the agent to discover more novel trajectories in the environment, increasing the likelihood of learning an optimal policy. (The third sketch after this list shows one way to combine the two perspectives.)
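
The first sketch illustrates the CRF idea of counting states through a fixed random projection and a binary code. The feature dimension, the sign-based binarization, and the 1/sqrt(n) count bonus are illustrative assumptions, not necessarily the exact choices made in the thesis.

```python
import numpy as np


class RandomFeatureCounter:
    """Count-based intrinsic reward via random features (CRF-style sketch).

    A fixed random projection stands in for the thesis's random network.
    """

    def __init__(self, state_dim: int, code_bits: int = 32, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Fixed (untrained) random projection: high-dim state -> low-dim feature.
        self.projection = rng.normal(size=(code_bits, state_dim))
        self.counts = {}

    def intrinsic_reward(self, state: np.ndarray) -> float:
        features = self.projection @ state.ravel()          # low-dimensional random features
        code = (features > 0).astype(np.uint8).tobytes()    # discretize into a binary code
        self.counts[code] = self.counts.get(code, 0) + 1
        return 1.0 / np.sqrt(self.counts[code])             # count-based exploration bonus


if __name__ == "__main__":
    counter = RandomFeatureCounter(state_dim=7 * 7 * 3)     # e.g. a flattened MiniGrid observation
    state = np.random.rand(7 * 7 * 3)
    print(counter.intrinsic_reward(state))  # 1.0 on the first visit
    print(counter.intrinsic_reward(state))  # ~0.707 on the second visit
```

Because the projection is never trained, the only state carried between steps is the visit-count table, which matches the abstract's claim that no additional deep network needs to be trained.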
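The second sketch shows how a single conditional generator can emulate an ensemble of forward models for the CGAN-ICM reward: each noise sample yields a different next-state prediction, and the mean prediction error against the observed next state becomes the intrinsic reward. Network sizes, the noise dimension, the number of samples k, and predicting raw states rather than learned features are assumptions; adversarial training of the generator and discriminator is omitted here.

```python
import torch
import torch.nn as nn


class ConditionalGenerator(nn.Module):
    """Generator conditioned on (state, action) plus a noise vector."""

    def __init__(self, state_dim: int, action_dim: int, noise_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + noise_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predicted next state
        )

    def forward(self, state, action, noise):
        return self.net(torch.cat([state, action, noise], dim=-1))


@torch.no_grad()
def intrinsic_reward(gen, state, action, next_state, k: int = 8) -> torch.Tensor:
    """Average prediction error over k generator samples (one 'forward model' per noise draw)."""
    errors = []
    for _ in range(k):
        noise = torch.randn(state.shape[0], gen.noise_dim, device=state.device)
        pred = gen(state, action, noise)
        errors.append(((pred - next_state) ** 2).mean(dim=-1))
    return torch.stack(errors).mean(dim=0)  # one reward per transition in the batch


if __name__ == "__main__":
    gen = ConditionalGenerator(state_dim=147, action_dim=7)  # e.g. flattened MiniGrid obs, 7 actions
    s = torch.rand(4, 147)
    a = torch.nn.functional.one_hot(torch.randint(0, 7, (4,)), 7).float()
    s_next = torch.rand(4, 147)
    print(intrinsic_reward(gen, s, a, s_next))
```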
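The third sketch combines a global (lifelong) and a local (episodic) novelty signal in the spirit of CEMP. The abstract does not specify how each perspective measures novelty or how the two are combined; hash-based counts for both perspectives and an additive combination with weight lam are assumptions made for illustration.

```python
import numpy as np


class MultiPerspectiveBonus:
    """Global + local novelty bonus (CEMP-style sketch)."""

    def __init__(self, lam: float = 0.5):
        self.global_counts = {}    # persists for the whole training run
        self.episodic_counts = {}  # reset at the start of every episode
        self.lam = lam

    def new_episode(self):
        # Only the local perspective forgets between episodes.
        self.episodic_counts.clear()

    def intrinsic_reward(self, state: np.ndarray) -> float:
        key = np.ascontiguousarray(state).tobytes()
        self.global_counts[key] = self.global_counts.get(key, 0) + 1
        self.episodic_counts[key] = self.episodic_counts.get(key, 0) + 1
        r_global = 1.0 / np.sqrt(self.global_counts[key])    # decays over the whole run
        r_local = 1.0 / np.sqrt(self.episodic_counts[key])   # refreshed every episode
        return r_global + self.lam * r_local


if __name__ == "__main__":
    bonus = MultiPerspectiveBonus()
    s = np.zeros((7, 7, 3))
    print(bonus.intrinsic_reward(s))   # first visit: both perspectives report high novelty
    print(bonus.intrinsic_reward(s))   # both bonuses shrink within the episode
    bonus.new_episode()
    print(bonus.intrinsic_reward(s))   # local bonus is restored, global bonus keeps decaying
```

The global bonus keeps shrinking over the whole run, while resetting the episodic table restores the local bonus each episode, so the combined signal does not vanish as training progresses.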

Experimental results on multiple sparse-reward environments in MiniGrid show that all three proposed intrinsic reward design methods are reasonable and effective and outperform the baseline algorithms.


Keywords: deep reinforcement learning, intrinsic motivation, exploration strategy, sparse reward
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57174
Collection: 毕业生_硕士学位论文 (Graduates, Master's theses)
Recommended Citation (GB/T 7714):
陈忠鹏. 基于内在动机的深度强化学习探索策略研究[D], 2024.
Files in This Item:
File Name/Size: 基于内在动机的深度强化学习探索策略研究.(5803KB); DocType: 学位论文 (thesis); Access: 限制开放 (restricted); License: CC BY-NC-SA
