Multi-agent systems are ubiquitous in real-world applications, and their study has attracted widespread attention from researchers in recent years. Reinforcement learning is a common approach to decision-making problems in multi-agent systems. Compared with single-agent reinforcement learning, multi-agent reinforcement learning faces greater technical challenges, one of which is the multi-agent credit assignment problem. Credit assignment concerns how to distinguish the contributions of individual agents to the team when multiple agents collaborate on a task, and how to use those contributions to differentiate the agents' update processes, ultimately improving learning efficiency and achieving efficient collaboration. Reward shaping is currently an important approach to multi-agent credit assignment. However, common multi-agent reward shaping algorithms still have many shortcomings, a key one being that they cannot guarantee policy invariance. Policy invariance means that the optimal policy to which agents converge under the additional rewards introduced by reward shaping is consistent with the optimal policy to which they converge under the environmental rewards of the original problem. Most current multi-agent reward shaping algorithms do not guarantee policy invariance, and as a result they converge to suboptimal policies in many scenarios.

To address these problems, this thesis designs a potential-based reward shaping method and proves theoretically that it guarantees policy invariance for multi-agent reinforcement learning. On this basis, the thesis further proposes two variants that differ in the form of the potential function: a state-based potential reward shaping method and a state-action pair-based potential reward shaping method. The core idea is to introduce an individual potential function for each agent in the multi-agent system and to formulate each agent's intrinsic reward as a discounted difference of its individual potential. Theoretical analysis shows that this form of intrinsic reward guarantees policy invariance. The state-based potential function generates the individual potential from the state of the environment, while the state-action pair-based potential function generates it from the state of the environment and the action of the agent. When generating intrinsic rewards, the state-action pair-based potential function can express more precise reward signals and provide more frequent feedback to the agents during training. For the implementation, this thesis combines potential-based reward shaping with the classic Actor-Critic framework and adopts a bi-level optimization technique to align the update of the potential functions with the final objective of multi-agent reinforcement learning. The whole system can therefore be trained end to end, without expert knowledge to design an extra update objective, and it generalizes better across various multi-agent environments.
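The "discounted difference" form of the intrinsic reward can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the code below assumes the standard potential-based shaping form, F = gamma * Phi(next) - Phi(current), for both the state-based and the state-action pair-based variants. All function names are hypothetical, and the hand-written toy potentials stand in for the learned per-agent potential functions described in the thesis.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


def state_shaping_reward(
    phi: Callable[[State], float], s: State, s_next: State, gamma: float
) -> float:
    """State-based intrinsic reward (assumed form): gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def state_action_shaping_reward(
    phi: Callable[[State, Action], float],
    s: State,
    a: Action,
    s_next: State,
    a_next: Action,
    gamma: float,
) -> float:
    """State-action intrinsic reward (assumed form): gamma * phi(s', a') - phi(s, a)."""
    return gamma * phi(s_next, a_next) - phi(s, a)


def discounted_shaping_return(
    phi: Callable[[State], float], states: Sequence[State], gamma: float
) -> float:
    """Discounted sum of state-based shaping rewards along one trajectory."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma**t * state_shaping_reward(phi, states[t], states[t + 1], gamma)
    return total


if __name__ == "__main__":
    # Toy potential for a single agent; a hypothetical stand-in for a learned one.
    toy_phi = {0: 1.0, 1: 3.0, 2: -2.0, 3: 0.5}.get
    gamma = 0.9
    trajectory = [0, 1, 2, 3]
    shaped = discounted_shaping_return(toy_phi, trajectory, gamma)
    # The sum telescopes to gamma^T * phi(s_T) - phi(s_0).
    closed_form = gamma**3 * toy_phi(3) - toy_phi(0)
    print(abs(shaped - closed_form) < 1e-9)
```

Under this assumed form, the shaping terms telescope along any trajectory, so their discounted sum depends only on the first and last states. This is the intuition behind why a potential-based intrinsic reward leaves the ranking of policies unchanged; the thesis's own proof for the multi-agent case is not reproduced here.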

In the experimental evaluation, this thesis compares the proposed method with prevailing multi-agent reinforcement learning methods on multiple tasks in the Predator-Prey environment and the StarCraft environment. Empirical results show that the proposed method generally achieves the best performance across tasks, and that the state-action pair-based potential reward shaping method converges faster and performs better than the state-based potential reward shaping method. In addition, this thesis designs a series of ablation experiments and visualization studies to verify that generating intrinsic rewards from individual potentials improves the learning efficiency of agents and helps them achieve efficient coordination and cooperation.

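Keywords: multi-agent systems; deep reinforcement learning; credit assignment; reward shaping
Citation (GB/T 7714):
Yang Chen (杨晨). Research on Policy-Invariant Reward Shaping Algorithms for Multi-Agent Systems [D], 2024.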
File: master_thesis.pdf (6011 KB). Document type: thesis. Access: restricted. License: CC BY-NC-SA.