Research on Policy-Invariant Reward Shaping Algorithms for Multi-Agent Systems
Yang Chen (杨晨)
2024-05
Pages: 78
Degree type: Master
Abstract (Chinese)

Multi-agent systems can be found everywhere in real life, and their study has attracted widespread attention from researchers in recent years; reinforcement learning is a common method for handling decision-making problems in multi-agent systems. Compared with single-agent reinforcement learning, multi-agent reinforcement learning faces more technical challenges, one of which is the multi-agent credit assignment problem. Credit assignment concerns how, while multiple agents collaborate to solve a task, to distinguish each agent's contribution to the team and to use this distinction to differentiate the agents' update processes, ultimately improving their learning efficiency and achieving efficient cooperation among them. At present, reward shaping is an important means of addressing the multi-agent credit assignment problem. However, common multi-agent reward shaping algorithms still have many shortcomings, a key one being that they cannot guarantee policy invariance. Policy invariance requires that, after a reward shaping algorithm introduces additional rewards, the optimal policy to which agents converge when trained on those additional rewards be identical to the optimal policy obtained when training on the environment rewards of the original problem. Current multi-agent reward shaping algorithms often neglect this guarantee and therefore converge only to suboptimal policies in many scenarios.
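Stated as a sketch (the abstract itself fixes no notation, so the symbols below are assumptions), policy invariance requires that adding the shaping reward F to the environment reward R leaves the set of optimal joint policies unchanged:

\arg\max_{\pi}\,\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\bigl(R(s_t,\mathbf{a}_t)+F(s_t,\mathbf{a}_t,s_{t+1})\bigr)\right]
  \;=\; \arg\max_{\pi}\,\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_t,\mathbf{a}_t)\right],

where \pi denotes the joint policy, \mathbf{a}_t the joint action, and \gamma the discount factor.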

To address these problems, this thesis designs a potential-based reward shaping method and theoretically proves that it guarantees policy invariance. On this basis, the thesis proposes two instantiations of the potential function: a state-based potential reward shaping method and a state-action-based potential reward shaping method. The core idea of the designed method is to introduce an individual potential function for each agent in the multi-agent system and to formulate each agent's intrinsic reward as the discounted difference of its individual potential; theoretical analysis shows that intrinsic rewards of this form satisfy policy invariance. Regarding the design of the potential function, the state-based variant generates the individual potential from the state of the environment, whereas the state-action-based variant generates it from the state of the environment together with the agent's action; the latter can produce finer-grained reward signals and provide richer feedback to the agents during training. For the implementation, the thesis combines the potential-based reward shaping method with the classic Actor-Critic framework and uses a bi-level optimization technique to align the update direction of the potential functions with the final objective of multi-agent reinforcement learning. The resulting method requires no expert knowledge to design extra update objectives, supports end-to-end training and updating of the whole system, and generalizes well across different multi-agent environments.
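As an illustrative sketch (notation assumed, not taken from the thesis), the discounted-difference form described above corresponds, for each agent i with individual potential \Phi_i, to:

F_i(s_t, s_{t+1}) = \gamma\,\Phi_i(s_{t+1}) - \Phi_i(s_t) \quad \text{(state-based potential)},
F_i(s_t, a^i_t, s_{t+1}, a^i_{t+1}) = \gamma\,\Phi_i(s_{t+1}, a^i_{t+1}) - \Phi_i(s_t, a^i_t) \quad \text{(state-action-based potential)},

where a^i_t is agent i's action at time t and \gamma the discount factor; the shaped reward an agent trains on is the environment reward plus this intrinsic term.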

In the experimental section, this thesis compares the proposed algorithms with currently popular multi-agent reinforcement learning algorithms on multiple tasks in the Predator-Prey and StarCraft environments. The proposed algorithms generally achieve the best performance across these tasks, and the state-action-based potential reward shaping method shows better results and greater potential than the state-based one. In addition, a series of ablation experiments and visualization analyses verify that generating intrinsic rewards from individual potentials is effective in improving agents' learning efficiency and achieving efficient coordination and cooperation.

Abstract (English)

Multi-agent systems can be seen everywhere in real life, and their study has received widespread attention from researchers in recent years. Reinforcement learning is a common method for dealing with decision-making problems in multi-agent systems. Compared to single-agent reinforcement learning, multi-agent reinforcement learning faces greater technical challenges, one of which is the multi-agent credit assignment problem. The credit assignment problem focuses on how to distinguish the contributions of different agents to the team while they collaborate to solve a task, and on how to use this distinction to differentiate the agents' update processes, ultimately improving their learning efficiency and achieving efficient collaboration among them. Currently, reward shaping is an important method for solving the multi-agent credit assignment problem. However, common multi-agent reward shaping algorithms still have notable shortcomings, a key one being that they cannot guarantee policy invariance. Policy invariance means that the optimal policy to which agents converge under the additional rewards introduced by reward shaping should be consistent with the optimal policy to which they converge under the environmental rewards of the original problem. Most current multi-agent reward shaping algorithms ignore this guarantee and consequently converge to suboptimal policies in many scenarios.

In response to the above problems, this thesis designs a potential-based reward shaping method and theoretically proves that it guarantees policy invariance for multi-agent reinforcement learning. On this basis, the thesis further proposes a state-based and a state-action-based potential reward shaping method, which differ in the form of their potential functions. The core idea of the designed method is to introduce an individual potential function for each agent in the multi-agent system and to formulate each agent's intrinsic reward as the discounted difference of its individual potential. Theoretical analysis shows that intrinsic rewards of this form guarantee policy invariance. Concerning the design of the potential function, the state-based form generates the individual potential from the state of the environment, while the state-action-based form generates it from the state of the environment together with the agent's action. When generating intrinsic rewards, the state-action-based potential function can produce more precise reward signals and provide richer feedback to the agents during training. For the implementation, the thesis combines the potential-based reward shaping with the classic Actor-Critic framework and adopts a bi-level optimization technique to align the updates of the potential functions with the final objective of multi-agent reinforcement learning. This yields end-to-end training of the whole system without requiring expert knowledge to design extra update objectives, and the method generalizes well across various multi-agent environments.
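To make the mechanism concrete, the following is a minimal PyTorch sketch of how a learned per-agent, state-action potential could turn the environment reward into the shaped reward consumed by an Actor-Critic learner. All names (PotentialNet, shaped_reward, the dimensions) are hypothetical assumptions, not the thesis implementation.

# Minimal sketch (not the thesis code): per-agent potential-based reward
# shaping feeding an Actor-Critic-style target. Names are hypothetical.
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor, assumed

class PotentialNet(nn.Module):
    """Individual potential Phi_i, conditioned on the state and the agent's action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

def shaped_reward(phi, s_t, a_t, s_tp1, a_tp1, env_reward):
    """env_reward + gamma * Phi(s', a') - Phi(s, a): the discounted-difference
    form that leaves the optimal policy of the original problem unchanged."""
    with torch.no_grad():
        intrinsic = GAMMA * phi(s_tp1, a_tp1) - phi(s_t, a_t)
    return env_reward + intrinsic

if __name__ == "__main__":
    # Toy batch for a single agent; real training would loop over all agents.
    state_dim, action_dim, batch = 8, 4, 32
    phi = PotentialNet(state_dim, action_dim)
    s_t, s_tp1 = torch.randn(batch, state_dim), torch.randn(batch, state_dim)
    a_t = torch.eye(action_dim)[torch.randint(action_dim, (batch,))]
    a_tp1 = torch.eye(action_dim)[torch.randint(action_dim, (batch,))]
    r_shaped = shaped_reward(phi, s_t, a_t, s_tp1, a_tp1, torch.randn(batch))
    # r_shaped would replace the environment reward in the critic's TD target;
    # the potential parameters themselves would be trained in an outer loop
    # (the bi-level optimization mentioned above), which is omitted here.
    print(r_shaped.shape)  # torch.Size([32])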

For experimental verification, this thesis compares the performance of the proposed methods with prevailing multi-agent reinforcement learning methods on multiple tasks in the Predator-Prey and StarCraft environments. Empirical results show that the proposed methods generally achieve the best performance across these tasks, and that the state-action-based potential reward shaping method converges faster and performs better than the state-based one. In addition, a series of ablation experiments and visualization studies verify the effectiveness of generating intrinsic rewards from individual potentials in improving the learning efficiency of agents and achieving efficient coordination and cooperation.

Keywords: Multi-agent systems; Deep reinforcement learning; Credit assignment; Reward shaping
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56511
Collection: Graduates, Master's theses
Recommended citation (GB/T 7714):
Yang Chen. Research on Policy-Invariant Reward Shaping Algorithms for Multi-Agent Systems [D], 2024.
Files in this item:
Filename/Size | Document type | Version type | Access type | License
master_thesis.pdf (6011 KB) | Thesis | | Restricted | CC BY-NC-SA