CASIA OpenIR > Graduates > Master's Theses
连续控制任务中集成策略的多样性探索研究 (Research on Diverse Exploration with Ensemble Policies in Continuous Control Tasks)
李超
2024-05
Pages: 78
Subtype: Master's
Abstract

Reinforcement learning is a machine learning method that learns from environmental feedback; its effectiveness depends on the agent interacting sufficiently with the environment to collect feedback signals. However, existing reinforcement learning algorithms suffer from insufficient sample efficiency, which greatly limits their real-world application. This thesis studies continuous control problems in reinforcement learning, focusing on the agent's exploration process. It analyzes the exploration behaviors that affect an agent's sample efficiency from two aspects, the effectiveness of exploration and the capability of exploration, and summarizes them as biased exploration, blind exploration, and repetitive exploration.

The effectiveness of exploration is a prerequisite for improving exploration capability. On the effectiveness side, this thesis focuses on the biased exploration caused by value estimation bias. On the capability side, it mitigates the blind and repetitive exploration caused by simplistic exploration policies from the perspective of diverse policy exploration. The contributions of this thesis are summarized as follows.

First, the thesis studies biased exploration caused by value estimation bias. The Actor-Critic framework is an effective reinforcement learning framework for continuous control problems. In this framework, the actor (policy) selects actions guided by the value estimates of the critic (value function), so the accuracy of the value function underpins the effectiveness of policy exploration. Researchers have pointed out that value functions in the Actor-Critic framework exhibit significant estimation bias, which leads to biased exploration behavior and impairs the agent's exploration. For this problem, the thesis proposes an adaptive bias-adjustment method that integrates multiple value models to tighten the bounds on estimation bias and thus achieve more accurate value estimates. Extensive experiments show that the proposed method achieves more accurate value estimation in multiple environments and improves algorithm performance.

Second, the thesis studies the blind exploration problem of policies. When the environment provides only weak feedback, the agent falls into blind, inefficient exploration. The thesis therefore proposes a heuristic exploration method named CCEP (Centralized and Cooperative Exploration Policy), which uses the bias in the value function to guide the agent toward directed exploration. To further improve exploration capability, CCEP integrates the exploration results of multiple policies and trains them in a centralized manner, achieving diverse exploration of the environment and information exchange among policies. The algorithm is evaluated along multiple dimensions: performance, exploration diversity, and exploration capability. Experiments show that CCEP explores the environment diversely, achieves more efficient exploration, and outperforms state-of-the-art algorithms in multiple environments.

Third, the thesis studies the repetitive exploration problem of policies. When an agent repeatedly explores similar regions without actively discovering new, valuable information, over-exploration results and learning stagnates. To reduce repetitive exploration and promote diverse exploration, the thesis proposes a trajectory-aware ensemble exploration method (Trajectories-awarE Ensemble exploratioN, TEEN). TEEN trains the agent to maximize the discounted cumulative return while also maximizing the entropy of the distribution of information it acquires. Theoretical analysis shows that TEEN effectively achieves diverse exploration of the environment, and reveals that the sample efficiency of previous ensemble reinforcement learning algorithms may be limited by insufficiently diverse sub-policies. In the tested environments, TEEN exhibits stronger exploration capability and better performance than state-of-the-art maximum-entropy, ensemble, and heuristic exploration methods.

Other Abstract

Reinforcement learning is a machine learning method that learns from feedback, and its effectiveness relies on thorough interaction between agents and the environment to collect that feedback. However, existing reinforcement learning algorithms suffer from insufficient sample efficiency, greatly limiting their practical applications. This thesis investigates continuous control problems in reinforcement learning, focusing on the exploration process of agents. It analyzes how exploration behavior affects agents' sample efficiency from two aspects, the effectiveness of exploration and the exploration capability, summarizing the problematic behaviors as biased exploration, blind exploration, and repetitive exploration.

The effectiveness of exploration is a prerequisite for improving exploration capability. In terms of exploration effectiveness, this thesis focuses on biased exploration caused by value estimation bias. Regarding exploration capability, it addresses the blind exploration and repetitive exploration that stem from simplistic exploration policies, from the perspective of diverse policy exploration. The contributions of this thesis are summarized as follows.

The first contribution is the investigation of biased exploration caused by value estimation bias. The Actor-Critic framework is an effective reinforcement learning framework for addressing continuous control problems. In this framework, the actor (policy) selects actions based on value estimates from the critic (value function), making the accuracy of the value function essential for effective policy exploration. Researchers have pointed out significant value estimation bias in the Actor-Critic framework, which leads to biased exploration behavior and impairs the agent's exploration. To address this, the thesis proposes an adaptive estimation-bias adjustment method that integrates multiple value models to tighten the bounds on estimation bias and achieve more accurate value estimation. Extensive experiments demonstrate that the proposed method achieves more accurate value estimation and improves algorithm performance across multiple environments.
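The abstract does not spell out the aggregation rule; as one illustrative sketch (not the thesis's actual method), an ensemble of critics can tighten the bias band by interpolating between its minimum (pessimistic, prone to underestimation) and its mean (optimistic, prone to overestimation). The mixing weight `beta` below is a hypothetical stand-in for the adaptive adjustment:

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_target(q_values, beta=0.5):
    """Combine an ensemble of Q estimates into one learning target.

    Interpolating between the ensemble minimum and the ensemble mean
    keeps the target inside a narrower band than either extreme alone.
    `beta` is a hypothetical knob; the thesis adapts it automatically.
    """
    q_values = np.asarray(q_values)
    return beta * q_values.min(axis=0) + (1.0 - beta) * q_values.mean(axis=0)

# Three critics evaluating the same batch of 4 state-action pairs.
ensemble = rng.normal(loc=10.0, scale=1.0, size=(3, 4))
target = ensemble_target(ensemble, beta=0.5)

# The combined target sits between the ensemble min and mean elementwise.
assert np.all(ensemble.min(axis=0) <= target)
assert np.all(target <= ensemble.mean(axis=0))
```

With `beta=1` this reduces to the familiar pessimistic minimum over critics; with `beta=0` it becomes the ensemble average, so the knob trades underestimation against overestimation.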

The second contribution is the investigation of the blind exploration problem of policies. When the environment provides only weak feedback, agents tend to fall into inefficient, blind exploration. To address this, the thesis proposes a heuristic exploration approach called CCEP (Centralized and Cooperative Exploration Policy), which exploits biases in the value function to guide agents toward directed exploration. To further enhance exploration capability, CCEP integrates the exploration results of multiple directed policies and trains them centrally, achieving diverse exploration of the environment and information exchange among policies. The algorithm is evaluated along multiple dimensions, including performance, exploration diversity, and exploration capability. Experimental results demonstrate that CCEP explores environments diversely, achieves more efficient exploration, and outperforms state-of-the-art algorithms in multiple environments.
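As a toy illustration of value-bias-directed exploration (the specific policy styles below are hypothetical, not CCEP's actual heads): with two critics, different mixtures of their estimates score candidate actions with different optimism/pessimism flavours, so each policy head is nudged toward a different part of the action space:

```python
import numpy as np

rng = np.random.default_rng(1)

def directed_scores(q1, q2, style):
    """Score candidate actions under one exploration 'direction'.

    Each style combines the two critics differently, giving the
    corresponding policy head a distinct exploration bias. These four
    styles are illustrative stand-ins, not the thesis's definitions.
    """
    if style == "pessimistic":
        return np.minimum(q1, q2)          # trust the lower estimate
    if style == "optimistic":
        return np.maximum(q1, q2)          # trust the higher estimate
    if style == "mean":
        return 0.5 * (q1 + q2)             # balanced view
    if style == "disagreement":
        return np.abs(q1 - q2)             # seek actions the critics dispute
    raise ValueError(style)

# Two critics scoring 5 candidate actions for one state.
q1 = rng.normal(size=5)
q2 = rng.normal(size=5)
chosen = {s: int(np.argmax(directed_scores(q1, q2, s)))
          for s in ["pessimistic", "optimistic", "mean", "disagreement"]}
print(chosen)  # different styles can select different actions
```

Centralized training would then let all heads learn from the union of the transitions these differently-biased selectors collect.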

The third contribution is the investigation of the repetitive exploration problem of policies. Agents that repeatedly explore similar regions without actively discovering new, valuable information suffer from over-exploration and learning stagnation. To mitigate repetitive exploration and promote diverse exploration, the thesis proposes a Trajectories-awarE Ensemble exploratioN approach (TEEN). TEEN trains agents to maximize both the discounted cumulative return and the entropy of the information distribution obtained from trajectories. Theoretical analysis demonstrates that TEEN can effectively achieve diverse exploration of the environment, and reveals that the sample efficiency of previous ensemble reinforcement learning algorithms may be limited by insufficiently diverse sub-policies. Experimental results show that, compared with state-of-the-art maximum-entropy, ensemble, and heuristic exploration methods, TEEN exhibits stronger exploration capability and superior performance.
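A small numerical sketch of why sub-policy diversity matters for entropy-based objectives (the region counts are invented for illustration): if ensemble members keep visiting the same regions, the pooled visitation distribution has low entropy, whereas diverse members raise it, which is the quantity a TEEN-style objective would reward alongside return:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (nats) of a visitation distribution given raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(nz * np.log(nz)).sum())

# Visit counts over 4 state regions for two ensembles of 3 sub-policies.
similar = np.array([[9, 1, 0, 0],      # all members crowd region 0
                    [8, 2, 0, 0],
                    [9, 0, 1, 0]])
diverse = np.array([[9, 1, 0, 0],      # members spread over regions
                    [0, 8, 2, 0],
                    [0, 1, 2, 7]])

h_similar = entropy(similar.sum(axis=0))
h_diverse = entropy(diverse.sum(axis=0))

# The diverse ensemble covers the state space far more evenly.
assert h_diverse > h_similar
```

This mirrors the abstract's claim that insufficiently diverse sub-policies cap what an ensemble can gain from exploration: no matter how many members it has, overlapping members add little new information.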

Keywords: Reinforcement Learning; Ensemble Learning; Value Estimation; Diverse Exploration
Indexed By: Other
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56640
Collection: Graduates / Master's Theses
Recommended Citation
GB/T 7714
李超. 连续控制任务中集成策略的多样性探索研究[D], 2024.
Files in This Item:
File Name/Size: 李超-硕士毕业论文.pdf (7255 KB) | DocType: Thesis | Access: Restricted | License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.