分层强化学习的子目标生成与探索策略 (Subgoal Generation and Exploration Strategies in Hierarchical Reinforcement Learning)
王开申
2024-05-13
Pages: 64
Subtype: Master's thesis
Abstract

Hierarchical reinforcement learning is an important research direction within reinforcement learning. Its core idea is to model reinforcement learning problems hierarchically through temporal abstraction and to decompose the target task into several simpler subtasks to be solved. Traditional reinforcement learning still faces challenges when handling complex tasks, including long-horizon decision-making, sparse rewards, and weak transferability; hierarchical reinforcement learning can effectively overcome these difficulties through a divide-and-conquer approach. With the rapid development of artificial intelligence, hierarchical reinforcement learning has become a research hotspot and has been widely applied in real-world domains such as visual navigation, natural language processing, and robot control.
In hierarchical reinforcement learning, a subgoal is a subtask that the upper-level policy issues to the lower-level policy after goal decomposition; it effectively guides the learning and exploration of the lower-level policy. However, subgoal generation by the upper-level policy still has problems to be solved. On one hand, existing methods focus mainly on generating subgoals that carry more learning information, which makes the upper-level policy overly complex; subgoal generation then consumes considerable computational and storage resources and reduces the learning efficiency of the agent. On the other hand, in long-horizon sparse-reward tasks, the lower-level policy needs effective guidance from subgoals to explore well. Existing exploration methods not only require carefully designed metrics to evaluate the states the agent has visited, but also force the agent to spend considerable time during the early exploration phase on states that are useless for policy learning, resulting in low exploration efficiency.
Taking the learning efficiency and exploration efficiency of the agent as its starting point, this paper studies subgoals from two aspects: their generation process and their exploration strategy. The main research content and contributions are summarized as follows:
• Subgoal generation based on mutual information constraints. This paper proposes a method that uses mutual information to reduce the subgoal space. The method employs contrastive learning to map subgoals into a mutual information metric space and computes the mutual information distance between them. Using this distance, two constraints are imposed on the subgoals generated by the upper-level policy: one shrinks the mutual information distance between the current state and the subgoal, so that the subgoal can be achieved by the lower-level policy; the other shrinks the mutual information distance between the subgoal and the final goal, so that the final goal can also be achieved by the lower-level policy once the subgoal is reached. Together, these constraints let the subgoal act as a key waypoint between the current state and the final goal, effectively guiding the learning of the lower-level policy. Experimental results show that the proposed method improves the learning efficiency of the agent and that the training time is unaffected by the size of the state and action spaces.
• Subgoal exploration based on the diffusion model. Inspired by the physical phenomenon of diffusion, this paper treats the movement of molecules from high-concentration regions to low-concentration regions as an exploration process and models it as a random walk. To realize such a random walk for the agent, a diffusion model is adopted as the upper-level policy, and its ability to fit multi-modal distributions is used to perform temporal abstraction over the lower-level policy. After abstracting the subgoals achieved by the lower-level policy, the upper-level policy applies appropriate noise to the abstracted policy to guide the lower-level policy in simulating a random walk, so that subgoals diffuse across the entire state space and the final goal is eventually explored. Experimental results show that the diffusion-model-based random walk improves the exploration capability of the agent. In addition, this paper discusses the problems that arise when the diffusion model is combined with online reinforcement learning and experimentally verifies the influence of the reward type on the agent's exploration performance.
In summary, this paper introduces mutual information and the diffusion model into hierarchical reinforcement learning, improving the efficiency of subgoal generation and exploration, reducing the training time of the agent, and offering new perspectives for applying hierarchical reinforcement learning in real-world environments.

Other Abstract

Hierarchical reinforcement learning is a significant research direction in reinforcement learning. It employs temporal abstraction to model reinforcement learning problems hierarchically and decomposes a complex target task into several simpler subtasks. This divide-and-conquer approach effectively addresses challenges that traditional reinforcement learning faces in complex environments, such as long-horizon decision-making, sparse rewards, and weak transferability. With the rapid development of artificial intelligence, hierarchical reinforcement learning has attracted significant attention and has been widely applied in real-world domains such as visual navigation, natural language processing, and robot control.

In hierarchical reinforcement learning, a subgoal is a subtask that the upper-level policy issues to the lower-level policy after goal decomposition, and it effectively guides the lower-level policy's learning and exploration. However, several issues in how the upper-level policy generates subgoals remain to be addressed. On one hand, current research focuses primarily on generating subgoals that carry more learning information. This often leads to an overly complex upper-level policy, so that subgoal generation incurs high computational and storage costs and in turn reduces the learning efficiency of the agent. On the other hand, in long-horizon sparse-reward tasks, the lower-level policy requires effective guidance from subgoals to achieve better exploration. Existing exploration methods not only need carefully designed metrics to evaluate the states the agent has visited, but also force the agent to spend considerable time during the early exploration phase on states that are useless for policy learning, which results in low exploration efficiency.

This paper focuses on the learning efficiency and exploration efficiency of the agent and studies subgoals from two aspects: their generation process and the exploration strategy built on them. The main research content and contributions are summarized as follows:

• Subgoal generation based on the mutual information constraint. This paper proposes a method that utilizes mutual information to reduce the subgoal space. The method employs contrastive learning to map subgoals into a mutual information metric space and calculates the mutual information distance between them. Using this distance, two constraints are imposed on the subgoals generated by the upper-level policy. One constraint reduces the mutual information distance between the current state and the subgoal, enabling the subgoal to be achieved by the lower-level policy. The other constraint reduces the mutual information distance between the subgoal and the desired goal, ensuring that the desired goal can still be achieved by the lower-level policy after the subgoal is accomplished. These two constraints allow subgoals to serve as crucial waypoints between the current state and the desired goal, effectively guiding the learning of the lower-level policy (a minimal illustrative sketch follows below). Experimental results demonstrate that the proposed method improves the learning efficiency of the agent and that the training time is unaffected by the size of the state and action spaces.
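To make the two constraints concrete, here is a minimal, hypothetical sketch rather than the thesis implementation: PyTorch is assumed, and all names (MIEncoder, mi_distance, subgoal_constraint_loss) are illustrative. It trains a contrastive encoder with a standard InfoNCE objective and uses the resulting embedding distance as a stand-in for the mutual information distance, penalizing subgoals that stray too far from either the current state or the desired goal.

```python
# Hypothetical sketch (not the thesis code). A contrastive encoder maps states and
# subgoals into a shared metric space; the cosine distance between embeddings is used
# as a proxy for the "mutual information distance" that constrains subgoal generation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIEncoder(nn.Module):
    """Maps raw states/subgoals into an embedding space for distance computation."""
    def __init__(self, obs_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def mi_distance(enc: MIEncoder, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Distance proxy: smaller when the encoder judges a and b mutually predictive."""
    return 1.0 - (enc(a) * enc(b)).sum(dim=-1)

def infonce_loss(enc: MIEncoder, anchor: torch.Tensor, positive: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE objective for training the encoder contrastively,
    using in-batch negatives (positives would be temporally adjacent states)."""
    za, zp = enc(anchor), enc(positive)            # (B, D)
    logits = za @ zp.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, labels)

def subgoal_constraint_loss(enc, state, subgoal, final_goal, lam1=1.0, lam2=1.0):
    """Two penalties added to the upper-level policy loss:
    (1) keep the subgoal reachable from the current state,
    (2) keep the final goal reachable once the subgoal is achieved."""
    reachable = mi_distance(enc, state, subgoal).mean()
    progress = mi_distance(enc, subgoal, final_goal).mean()
    return lam1 * reachable + lam2 * progress
```

The cosine distance here is only a convenient proxy for illustration; the thesis computes the distance in its own mutual information metric space learned via contrastive training.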

• Subgoal exploration based on the diffusion model. Inspired by the phenomenon of diffusion, this paper treats the movement of molecules from high-concentration regions to low-concentration regions as an exploration process and models it as a random walk. To realize such a random walk for the agent, the diffusion model is used as the upper-level policy, and its ability to fit multi-modal distributions is exploited to perform temporal abstraction over the lower-level policy. After abstracting the subgoals achieved by the lower-level policy, the upper-level policy applies appropriate noise to the abstracted policy to guide the lower-level policy in simulating a random walk, so that subgoals diffuse throughout the entire state space and the desired goal is eventually discovered (a toy sketch of this random-walk intuition follows below). Experimental results demonstrate that the random walk based on the diffusion model enhances the exploration capability of the agent. Furthermore, this paper discusses the challenges of combining the diffusion model with online reinforcement learning and experimentally verifies the influence of the reward type on the exploration performance of the agent.
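As a rough illustration of the random-walk intuition only, the following toy loop is a sketch, not the thesis method: it does not implement the diffusion-model policy, and the gym-style environment API, propose_subgoal, and all other names are assumptions. Every k steps it perturbs the last subgoal with Gaussian noise so that proposed subgoals gradually diffuse from visited regions toward unvisited ones, while the lower-level policy tries to reach each proposed subgoal.

```python
# Toy sketch of random-walk subgoal exploration (illustrative assumptions throughout).
import numpy as np

def propose_subgoal(achieved_subgoal: np.ndarray, sigma: float, bounds) -> np.ndarray:
    """One random-walk step in subgoal space: add Gaussian noise, then clip to the valid range."""
    low, high = bounds
    noise = sigma * np.random.randn(*achieved_subgoal.shape)
    return np.clip(achieved_subgoal + noise, low, high)

def explore(env, low_level_policy, horizon: int, k: int, sigma: float, bounds):
    """Every k steps the upper level proposes a noisy subgoal and the lower-level
    policy acts toward it; over time the proposed subgoals spread (diffuse)
    across the state space."""
    obs = env.reset()                                  # assumed gym-style API
    subgoal = np.asarray(obs, dtype=float).copy()      # start the walk at the initial state
    visited = [subgoal.copy()]
    for t in range(horizon):
        if t % k == 0:                                 # temporal abstraction: re-propose every k steps
            subgoal = propose_subgoal(subgoal, sigma, bounds)
            visited.append(subgoal.copy())
        action = low_level_policy(obs, subgoal)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()
    return visited                                     # the trail of subgoals diffused so far
```

In the thesis, the role of injecting this noise is played by the diffusion model acting as the upper-level policy; the Gaussian perturbation above merely mimics the resulting random-walk behaviour.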

In summary, this paper introduces mutual information and diffusion models into hierarchical reinforcement learning, improving the efficiency of subgoal generation and exploration, reducing the training time of the agent, and providing new perspectives for the application of hierarchical reinforcement learning in real-world environments.

Keywords: hierarchical reinforcement learning, subgoal generation, mutual information, diffusion model
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56502
Collection: 毕业生_硕士学位论文 (Graduates, Master's theses)
Recommended Citation
GB/T 7714
王开申. 分层强化学习的子目标生成与探索策略[D],2024.
Files in This Item:
File Name/Size: 毕业论文_签名版.pdf (8335 KB)
DocType: 学位论文 (thesis)
Access: 限制开放 (restricted access)
License: CC BY-NC-SA
