English Abstract | In real life, multi-agent systems are ubiquitous, and reinforcement learning is a common approach to multi-agent problems. However, compared with single-agent reinforcement learning, multi-agent reinforcement learning faces unique challenges such as the non-stationarity of the environment, credit assignment, and ad-hoc cooperation. To address these challenges, multi-agent reinforcement learning has achieved a series of breakthroughs in recent years, represented by the centralized training with decentralized execution (CTDE) framework. Within this framework, value function factorization methods combine the local Q-values of the agents into a global Q-value through a credit assignment network, which better evaluates the contribution of each agent, enables better cooperation, and achieves strong results on many challenging tasks.
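The factorization idea described above can be sketched as follows. This is a minimal illustration in the style of QMIX-like monotonic mixing, not the thesis's exact architecture; all function names, shapes, and the linear hypernetwork are assumptions made for the example.

```python
import numpy as np

def monotonic_mix(local_qs, state, w_params, b_params):
    """Combine per-agent Q-values into a global Q-value.

    A state-conditioned (here: linear) hypernetwork produces the mixing
    weights; taking their absolute value keeps dQ_tot/dQ_i >= 0, so the
    greedy joint action decomposes into each agent's local argmax.
    """
    w = np.abs(w_params @ state)   # nonnegative mixing weight per agent
    b = b_params @ state           # state-dependent bias
    return float(w @ local_qs + b.sum())

rng = np.random.default_rng(0)
n_agents, state_dim = 3, 4
state = rng.normal(size=state_dim)
local_qs = np.array([1.0, 2.0, 3.0])          # one Q-value per agent
w_params = rng.normal(size=(n_agents, state_dim))
b_params = rng.normal(size=(1, state_dim))
q_tot = monotonic_mix(local_qs, state, w_params, b_params)
```

The nonnegative weights are what make the credit assignment consistent: raising any agent's local Q-value can never lower the global Q-value.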
However, many real-world scenarios require the agent to generalize at test time to adversarial scenarios unseen during training; that is, the algorithm must be capable of policy generalization under adversarial scenarios. Existing value function factorization methods lack this capability. To improve the policy generalization ability of the agent in adversarial scenarios, this thesis studies policy generalization from three aspects, namely making full use of the global state, the agents' local observations, and the relationships between agents, and proposes three algorithms that each address the policy generalization problem in adversarial scenarios. The main work and contributions of this thesis are summarized as follows.
1. A multi-agent credit assignment method based on ensemble learning is proposed to make full use of the global state.
First, multiple credit assignment subnetworks are constructed so that each can focus on a different subspace of the global state space, yielding subnetworks that are "good but different". Their outputs are then combined by ensembling, which balances the credit assignments of the different subnetworks. In this way, the algorithm makes full use of the global state information and attends to multiple subspaces of the global state space rather than overfitting to any single subspace, thus improving the policy generalization ability of the agent.
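A minimal sketch of the ensemble step, under the assumption that each credit assignment subnetwork is a linear monotonic mixer conditioned on a different random projection of the global state; the projections, shapes, and averaging rule are illustrative choices, not the thesis's specification.

```python
import numpy as np

def ensemble_q_tot(local_qs, state, mixers):
    """Average the outputs of several credit assignment (mixing) subnetworks.

    Each mixer sees its own projection of the global state, so the
    ensemble members can specialize on different state subspaces while
    the average balances their credit assignments.
    """
    q_vals = []
    for proj, w_params in mixers:
        sub_state = proj @ state              # this mixer's state subspace
        w = np.abs(w_params @ sub_state)      # nonnegative weight per agent
        q_vals.append(float(w @ local_qs))
    return sum(q_vals) / len(q_vals)

rng = np.random.default_rng(1)
n_agents, state_dim, sub_dim = 3, 6, 2
mixers = [
    (rng.normal(size=(sub_dim, state_dim)),   # random state projection
     rng.normal(size=(n_agents, sub_dim)))    # mixer weight hypernetwork
    for _ in range(4)
]
state = rng.normal(size=state_dim)
local_qs = np.array([0.5, 1.0, 1.5])
q_tot = ensemble_q_tot(local_qs, state, mixers)
```

Because every member is individually monotonic in the local Q-values, the ensemble average preserves the monotonic credit assignment property.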
2. A local observation reconstruction method for ad-hoc cooperation is proposed to make full use of the agents' local observations.
First, the local observation of each agent is decomposed into three parts, and an attention mechanism is used to handle the variable-length input, making the algorithm insensitive to changes in the length of the agent's input. Second, local observation abstraction is implemented with sampling networks, which allows the algorithm to make full use of the high-dimensional state representations in different situations. Finally, the local observation of each agent is reconstructed, which enables the algorithm to perform zero-shot generalization in ad-hoc cooperation scenarios, thus improving the policy generalization ability of the agent.
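The variable-length attention step might look like the sketch below: a set of per-entity observation features of any length is pooled into a fixed-size vector by dot-product attention. `attention_pool`, its weight matrices, and the single learned query are hypothetical names chosen for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(entities, query, W_k, W_v):
    """Pool a variable number of entity features into one fixed-size vector.

    entities: (n, d) array, where n (number of observed entities) may
    change between episodes; the output size depends only on W_v.
    """
    keys = entities @ W_k.T                     # (n, h)
    vals = entities @ W_v.T                     # (n, h)
    scores = keys @ query / np.sqrt(query.size) # scaled dot-product scores
    attn = softmax(scores)                      # (n,) attention weights
    return attn @ vals                          # (h,) regardless of n

rng = np.random.default_rng(0)
d, h = 5, 8
W_k, W_v = rng.normal(size=(h, d)), rng.normal(size=(h, d))
query = rng.normal(size=h)
pooled_3 = attention_pool(rng.normal(size=(3, d)), query, W_k, W_v)
pooled_7 = attention_pool(rng.normal(size=(7, d)), query, W_k, W_v)
```

The same parameters process 3 or 7 entities and always emit an `h`-dimensional vector, which is what makes the downstream network insensitive to the input length.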
3. A neighborhood relationship learning method for ad-hoc cooperation is proposed to make full use of the relationships among the agents.
First, the relationship information among the agents is encoded into the value decomposition network by a graph-based relationship encoder. Meanwhile, to handle the changing number of agents in ad-hoc cooperation, an attention-based local observation abstraction mechanism is used. The algorithm not only makes full use of the topology among the agents but also achieves zero-shot generalization in ad-hoc cooperation scenarios without retraining, thus improving the policy generalization ability of the agent.
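A graph-based relationship encoder could be sketched as one round of message passing over the agent adjacency graph, as below; the mean aggregation, tanh nonlinearity, and all names are illustrative assumptions, not the thesis's concrete encoder.

```python
import numpy as np

def relation_encode(features, adjacency, W):
    """One round of graph message passing over the agent graph.

    Each agent averages its neighbors' features (row-normalized
    adjacency) and projects the result, so the same parameters W
    apply to any number of agents and any topology.
    """
    deg = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.maximum(deg, 1)   # mean over neighbors
    return np.tanh(norm_adj @ features @ W.T)   # (n, out_dim) embeddings

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                     # shared across agent counts
feats = rng.normal(size=(5, 3))                 # 5 agents, 3 features each
adj = (rng.random((5, 5)) < 0.5).astype(float)  # agent relationship graph
emb = relation_encode(feats, adj, W)
```

Because the parameters are shared over nodes, the encoder trained with one team size can be applied unchanged to a different number of agents, matching the zero-shot ad-hoc cooperation setting.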