基于目标条件强化学习的无监督技能发现方法研究 (Research on Unsupervised Skill Discovery Methods Based on Goal-Conditioned Reinforcement Learning)
张天
Date: 2024-05-16
Pages: 80
Degree type: Master's
Chinese Abstract

Deep reinforcement learning faces the difficulty of reward function design when solving concrete problems. To address this, researchers have proposed unsupervised skill discovery methods, which pre-train an agent without environment (task) reward signals so that it acquires skill policies. Compared with designing a complex, task-specific reward function to obtain a control policy for one particular task, this unsupervised pre-training not only avoids the difficulty of reward function design, but also yields general-purpose skill policies that adapt to most tasks in the same or similar environments, effectively reducing training cost and improving policy generalization. However, existing unsupervised skill discovery methods based on mutual information theory generally suffer from weak exploration, and their skill policies fail to cover the state space effectively. This is because skill policy training relies on intrinsic rewards, while the unsupervised exploration process faces dynamically changing reward distributions and sparse rewards, which leaves the skill policies lacking diversity and generality. Considering that goal-conditioned reinforcement learning can exploit reasonable generalization between goals to improve exploration, and that goal-conditioned policies are naturally similar to skill-conditioned policies, this thesis applies goal-conditioned reinforcement learning to improve the exploration process of unsupervised skill discovery. In addition, for tasks with definable goals, goal-conditioned reinforcement learning simplifies the reward function to a binary signal indicating whether the goal has been reached. This greatly reduces the complexity of the reward function, but also introduces sparse rewards. Through goal relabeling, subgoal planning, and related techniques, goal-conditioned reinforcement learning can alleviate the training difficulty caused by sparse rewards to some extent. However, existing goal-conditioned reinforcement learning methods still suffer from low sample utilization and weak generalization of goal-conditioned policies.
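For reference, the two objectives mentioned above are commonly written as follows. This is a minimal sketch in standard notation rather than the thesis's own formulation; the goal-space mapping \phi and tolerance \epsilon are illustrative assumptions.

r(s_{t+1}, g) = \begin{cases} 1, & \lVert \phi(s_{t+1}) - g \rVert \le \epsilon \\ 0, & \text{otherwise} \end{cases}

I(S; Z) = H(Z) - H(Z \mid S) \ge \mathbb{E}_{z \sim p(z),\, s \sim \pi(\cdot \mid z)} \left[ \log q_\theta(z \mid s) - \log p(z) \right]

The first expression is the binary goal-reaching reward that makes goal-conditioned reinforcement learning sparse; the second is the standard variational lower bound on the skill-state mutual information that intrinsic-reward skill discovery methods maximize.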

To address these problems, this thesis starts from improving the sample utilization of goal-conditioned reinforcement learning and proposes a goal-conditioned reinforcement learning method based on virtual transition experience. It then applies goal-conditioned reinforcement learning to improve the exploration process of unsupervised skill discovery, yielding a goal-guided unsupervised skill discovery method. The main research work and contributions of this thesis are summarized as follows:

1. To address the sparse rewards and low sample utilization of goal-conditioned reinforcement learning, this thesis proposes a goal-conditioned reinforcement learning method based on virtual transition experience. By extending the range of goal relabeling, the method introduces two relabeling schemes: within-trajectory relabeling and across-trajectory relabeling (a minimal sketch of both schemes follows this list). The former constructs real transition experience from actual interaction trajectories, while the latter constructs virtual transition experience from virtual goals sampled across trajectories; together they provide the agent's policy learning with abundant historical data. Since virtual transition experience can easily destabilize policy learning, this thesis further proposes a subgoal-guided policy improvement method based on the optimal-substructure property of goal-conditioned reinforcement learning. A subgoal prediction model plans potential paths from the current state to an arbitrary goal; the subgoals it proposes take reachability into account and guide the current policy toward the relabeled goals. The effectiveness of the method is verified in navigation and robotic arm manipulation experiments. Comparisons with baseline algorithms show that the method significantly improves both the average task success rate and sample utilization.

2. To address the weak exploration and poor policy learning of existing unsupervised skill discovery methods based on mutual information theory, this thesis proposes a goal-guided unsupervised skill discovery method. First, the thesis examines the limitations of previous work in three respects: 1) maximizing mutual information does not encourage exploration; 2) state indistinguishability leads to skill degradation; 3) bottleneck states restrict skill exploration. These limitations are attributed mainly to the fact that previous methods run exploration and learning in parallel. The proposed method therefore realizes unsupervised skill discovery as a two-stage process in which exploration and learning are decoupled. The thesis then points out the natural similarity between goal-conditioned policies and skill-conditioned policies: goal-conditioned reinforcement learning can be used to improve the exploration stage of unsupervised skill discovery, and skill policy learning can be accelerated by fine-tuning the exploration policy. The proposed method overcomes the dynamically changing reward distributions that are common in prior work. Its effectiveness is verified on maze maps with bottleneck states. Comparisons with baseline algorithms show that the two-stage method explores the state space more thoroughly, breaks through bottleneck restrictions, and substantially improves skill policy learning.
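The following is a minimal, hypothetical sketch of the two relabeling schemes named in point 1 above, written as plain Python. The transition layout, the "achieved_goal" field, and the helper names are illustrative assumptions and not taken from the thesis; within-trajectory relabeling follows the familiar hindsight (HER-style) "future" strategy, while across-trajectory relabeling draws virtual goals from other stored trajectories.

import random

def goal_reached(achieved, goal, eps=0.05):
    """Binary sparse reward: 1 if the achieved goal is within eps of the target."""
    dist = sum((x - y) ** 2 for x, y in zip(achieved, goal)) ** 0.5
    return dist <= eps

def within_trajectory_relabel(trajectory, k=4):
    """Relabel each transition with goals actually achieved later in the SAME
    trajectory, producing real (reachable) transition experience."""
    relabeled = []
    for t, (s, a, s_next, _g) in enumerate(trajectory):
        future_steps = random.sample(range(t, len(trajectory)),
                                     k=min(k, len(trajectory) - t))
        for f in future_steps:
            new_g = trajectory[f][2]["achieved_goal"]   # goal reached at step f
            r = float(goal_reached(s_next["achieved_goal"], new_g))
            relabeled.append((s, a, s_next, new_g, r))
    return relabeled

def across_trajectory_relabel(trajectory, replay_buffer, k=4):
    """Relabel each transition with goals achieved in OTHER trajectories,
    producing virtual transition experience that may not be reachable
    from the current trajectory."""
    relabeled = []
    for (s, a, s_next, _g) in trajectory:
        for _ in range(k):
            other = random.choice(replay_buffer)            # another stored trajectory
            new_g = random.choice(other)[2]["achieved_goal"]
            r = float(goal_reached(s_next["achieved_goal"], new_g))
            relabeled.append((s, a, s_next, new_g, r))
    return relabeled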

English Abstract

Deep reinforcement learning faces the difficulty of designing reward functions when solving specific problems. To address this, researchers have proposed unsupervised skill discovery, which obtains skill policies through unsupervised pre-training in the absence of environmental (task) reward signals. Compared to designing a complex, specialized reward function to obtain the control policy for one specific task, this unsupervised pre-training approach not only bypasses complex reward function design, but also yields generalizable skill policies adapted to most tasks in the same or similar environments, effectively reducing training cost and improving policy generalization. However, existing unsupervised skill discovery methods based on mutual information theory generally suffer from weak exploration, and their skill policies cannot effectively cover the state space. This is because the training of skill policies relies on intrinsic rewards, and the unsupervised exploration process faces dynamically changing reward distributions as well as sparse rewards, which leads to skill policies that lack diversity and generality. Considering that goal-conditioned reinforcement learning can exploit reasonable generalization between goals to improve exploration, and that goal-conditioned policies are naturally similar to skill-conditioned policies, this thesis uses goal-conditioned reinforcement learning to improve the exploration process of unsupervised skill discovery. In addition, for tasks with definable goals, goal-conditioned reinforcement learning simplifies the reward function into a binary signal determined by whether the goal is achieved. This greatly reduces the complexity of the reward function, but also introduces sparse rewards. Through methods such as goal relabeling and subgoal planning, goal-conditioned reinforcement learning can alleviate the training difficulties caused by sparse rewards to some extent. However, existing goal-conditioned reinforcement learning methods still suffer from low sample utilization and weak generalization of goal-conditioned policies.
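For concreteness, the mutual-information objective used by the prior skill discovery methods discussed above is usually optimized through a learned skill discriminator, as in DIAYN. The PyTorch-style sketch below illustrates that family of intrinsic rewards only; the network sizes and helper names are assumptions, and this is not the thesis's own method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminator(nn.Module):
    """q_theta(z | s): predicts which skill z produced state s."""
    def __init__(self, state_dim, n_skills, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills

def intrinsic_reward(discriminator, state, skill_id, n_skills):
    """DIAYN-style reward log q(z|s) - log p(z), with a uniform skill prior.
    The skill policy maximizes this signal instead of a task reward, which is
    why the reward distribution shifts as the discriminator is trained."""
    with torch.no_grad():
        log_q = F.log_softmax(discriminator(state), dim=-1)
    log_p_z = -torch.log(torch.tensor(float(n_skills)))
    return log_q[..., skill_id] - log_p_z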

To address these problems, this thesis proposes a goal-conditioned reinforcement learning method based on virtual transition experience from the perspective of improving sample utilization, and applies goal-conditioned reinforcement learning to improve the exploration process of unsupervised skill discovery, resulting in a goal-guided unsupervised skill discovery method. The main research work and innovations of this thesis are summarized as follows:

1. To address the sparse rewards and low sample utilization of goal-conditioned reinforcement learning, this thesis proposes a goal-conditioned reinforcement learning method based on virtual transition experience. By extending the goal relabeling range, the method introduces two relabeling schemes: within-trajectory relabeling and across-trajectory relabeling. The former constructs real transition experience from real interaction trajectories, while the latter constructs virtual transition experience from virtual goals sampled across trajectories; together they provide rich historical data for the agent's policy learning. Since virtual transition experience can easily destabilize policy learning, this thesis also proposes a subgoal-guided policy improvement method based on the optimal-substructure property of goal-conditioned reinforcement learning. It uses a subgoal prediction model to plan potential paths from the current state to an arbitrary goal; the proposed subgoals take reachability into account and guide the current policy toward the relabeled goals. The effectiveness of the method is verified in experiments on ant navigation and robotic arm manipulation tasks. Comparisons with baseline algorithms show that the proposed method significantly improves both the average task success rate and sample utilization.

2. To address the weak exploration and poor policy learning of existing unsupervised skill discovery methods based on mutual information theory, this thesis proposes a goal-guided unsupervised skill discovery method. First, the thesis examines the limitations of previous work in three respects: 1) maximizing mutual information does not encourage exploration; 2) state indistinguishability leads to skill degradation; 3) bottleneck states limit skill exploration. These limitations are attributed mainly to the way previous methods run exploration and learning in parallel. The proposed method therefore adopts a two-stage process that decouples exploration from learning to achieve unsupervised skill discovery. The thesis then points out the natural similarity between goal-conditioned policies and skill-conditioned policies: goal-conditioned reinforcement learning can be used to improve the exploration phase of unsupervised skill discovery, and skill policy learning can be accelerated by fine-tuning the exploration policy. The proposed method overcomes the dynamically changing reward distributions common in previous work. Its effectiveness is verified on maze maps with bottleneck states. Comparisons with baseline algorithms show that the proposed two-stage method explores the state space more fully, breaks through bottleneck restrictions, and significantly improves skill policy learning.
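To make the "natural similarity" between goal-conditioned and skill-conditioned policies concrete: both are policies conditioned on an extra input vector, so a single network interface can serve the exploration stage (conditioned on goals) and the skill-learning stage (conditioned on skill embeddings). The sketch below is only an illustrative assumption about such a shared architecture; the layer sizes, the Tanh action head, and the warm-starting comment are not taken from the thesis.

import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    """pi(a | s, c): the same network acts as a goal-conditioned policy when
    c is a goal g, and as a skill-conditioned policy when c is a skill
    embedding z, which is what makes reusing a goal-conditioned exploration
    policy for skill learning natural."""
    def __init__(self, state_dim, cond_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, condition):
        return self.net(torch.cat([state, condition], dim=-1))

# Stage 1 (exploration): train the policy with goals g as the condition.
# Stage 2 (skill learning): reuse the same weights and fine-tune with skill
# embeddings z as the condition.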

Keywords: goals, sparse rewards, unsupervised reinforcement learning, exploration, skill policies
Language: Chinese
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56907
Collection: Graduates - Master's theses
Recommended citation (GB/T 7714):
张天. 基于目标条件强化学习的无监督技能发现方法研究[D], 2024.
Files in this item:
基于目标条件强化学习的无监督技能发现方法 (13799 KB); document type: degree thesis; access: restricted; license: CC BY-NC-SA