English Abstract

Intelligent decision-making is a crucial capability that reflects the level of artificial intelligence. With the development of deep reinforcement learning in recent years, our understanding of intelligent decision-making in single-agent scenarios has improved significantly. However, most real-world problems involve more than one decision-making agent; such scenarios are usually modeled as multi-agent systems. Decision-making problems in network communication, energy supply, financial markets, multi-node robot control, and military wargaming are all multi-agent decision-making problems. According to the relationship between the agents, multi-agent decision-making problems can be divided into three types: fully cooperative, fully competitive, and mixed. A large number of team collaboration problems in real life can be abstracted as fully cooperative multi-agent decision-making problems. Therefore, coordinating the behavior of each agent in a team through artificial intelligence to achieve the optimal team decision is a significant real-world challenge.
Multi-agent collaborative decision-making focuses on fully cooperative scenarios, where agents solve complex decision-making problems cooperatively in decentralized partially observable Markov decision processes (Dec-POMDPs). As in team collaboration in human society, it is critical to organize the team quickly and effectively so that every member maintains a high level of collaboration awareness, which raises many practical issues. Based on a comprehensive account of the key problems encountered in the practical implementation of multi-agent collaborative decision-making, this paper proposes solutions to these problems, thereby improving the universality and efficiency of multi-agent reinforcement learning algorithms. The main work and innovations of this paper include the following:
1. A method for reconstructing the global state from multiple local observations
To address the unobservability of global information, this paper studies the reconstruction of global state information. In multi-agent settings, partial observability arises both among the agents and between each agent and the environment itself, which makes the problem more complex than in the single-agent case. It is therefore necessary to explore methods that reconstruct global information from current and historical local information. First, this paper studies how to reconstruct global information that benefits the reinforcement learning process from multiple different local observations. Specifically, a probabilistic graphical model is built between each agent's local observation and the global state, and an abstract representation of the global state in the latent space is derived through variational inference. Second, a graph neural network is used to extract the underlying topological information of the multi-agent system, and the relationships between agents are encoded into the network to assist in reconstructing the global state. Experimental results show that this method allows multi-agent reinforcement learning algorithms operating without access to the global state to approach, and even exceed, the performance achieved when the global state is known.
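The variational idea above can be illustrated with a minimal pure-Python sketch: a local observation is mapped to a Gaussian latent, and a KL regularizer ties the inferred representation to a prior. The linear "encoder" and all function names here are illustrative assumptions, not the thesis's actual model.

```python
import math
import random

def encode(obs, w_mu, w_logvar):
    """Toy linear 'encoder': local observation -> latent mean / log-variance."""
    mu = sum(o * w for o, w in zip(obs, w_mu))
    logvar = sum(o * w for o, w in zip(obs, w_logvar))
    return mu, logvar

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ): the ELBO regularizer that keeps
    the inferred global-state representation close to the prior."""
    return 0.5 * (math.exp(logvar) + mu * mu - 1.0 - logvar)
```

In the actual method, each agent's latent would additionally be combined with the topology extracted by the graph neural network before serving as the reconstructed global state.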
2. A communication-free method for distributed consensus inference
For the consensus representation problem, this paper studies how to formalize consensus and proposes an explicit collaborative algorithm framework. Communication-based multi-agent reinforcement learning algorithms reach consensus by passing essential information between agents, but this incurs additional communication overhead. Tacit understanding can be regarded as the highest level of collaborative behavior in human social activities: consensus is reached without any communication. Inspired by the concept of tacit understanding and by viewpoint invariance in computer vision, this paper studies the essence of the consensus that drives collaborative behavior and explores how to formalize consensus representation mathematically. First, using contrastive learning, different local observations corresponding to the same global state are aligned and mapped to a discrete space during the centralized training stage; this discrete representation is defined as the consensus signal between the agents. Second, during decentralized execution, the consensus signal serves as the basis for each agent's decision-making, ensuring that all agents choose actions under the guidance of the same signal. Experimental results show that this method allows agents to infer the explicit consensus of the entire team in a distributed manner from the information they possess, without any communication. Moreover, with a minimal increase in model parameters, its performance far exceeds that of more complex algorithms.
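The discrete consensus signal can be sketched with a toy nearest-neighbor quantizer: after contrastive alignment, noisy local views of the same global state should land on the same codebook entry, and that shared index is the consensus signal. The codebook values and function names below are illustrative assumptions.

```python
def nearest_code(embedding, codebook):
    """Return the index of the codebook entry closest to the embedding
    (squared L2 distance). The index acts as the discrete consensus signal."""
    def dist2(code):
        return sum((e - c) ** 2 for e, c in zip(embedding, code))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

# Toy 2-D codebook with three discrete consensus signals:
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]

# Two noisy local views of the same global state map to the same code:
signal_a = nearest_code((0.9, 1.1), codebook)
signal_b = nearest_code((1.1, 0.8), codebook)
```

Because every agent computes the signal from its own observation, no message passing is needed at execution time; the alignment is learned once during centralized training.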
3. A hierarchical multi-agent reinforcement learning framework with a dual coordination mechanism
For the temporal abstraction problem, this paper studies a multi-agent reinforcement learning framework with a dual coordination mechanism across agents and across levels. Most existing hierarchical reinforcement learning algorithms either do not apply to multi-agent cooperative scenarios or require expert knowledge to design the macro strategies. This paper proposes a universal hierarchical multi-agent reinforcement learning framework for cooperative scenarios that improves the sample efficiency of base algorithms. First, by using the advantage function of the macro policy as the intrinsic reward of the micro policy, a connection between the two levels is established so that the joint policy improves monotonically when either the macro or the micro policy is optimized. Carefully constructed loss functions for the two levels keep the entire training process end-to-end and free of expert knowledge. Second, to improve the interpretability of the decision-making algorithm, the proposed hierarchical method can relate the two levels of decision networks and provide real-time feedback on the agent's current goal. Experimental results show that this method significantly improves sample efficiency and applies to a variety of tasks and base algorithms.
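The coupling between levels can be sketched as follows: the macro policy's advantage for the currently active goal is mixed into the environment reward as the micro policy's intrinsic reward. The function name and the mixing weight `beta` are illustrative assumptions, not the thesis's exact formulation.

```python
def micro_reward(r_env, q_macro_goal, v_macro, beta=0.5):
    """Micro-level reward: environment reward plus the macro policy's
    advantage for the active goal. A positive advantage steers the micro
    policy toward goals the macro policy rates above average."""
    advantage = q_macro_goal - v_macro
    return r_env + beta * advantage
```

Intuitively, the intrinsic term vanishes for goals the macro policy considers average, so optimizing either level pushes the joint policy in a consistent direction, which is the idea behind the monotonic-improvement property stated above.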
4. A novel multi-agent collaborative generalization model based on a new relationship between the individual and the whole
To address the model expressiveness and generalization problem, this paper relaxes the assumption that restricts the expressive power of neural network models and constructs a universal value decomposition framework. Multi-agent reinforcement learning methods based on value decomposition alleviate the credit assignment problem to some extent, but they rely on the strong Individual-Global-Max (IGM) assumption, which severely restricts the expressive power of the model. This paper therefore constructs a new relationship between the individual and the whole that frees the model's expressive power. First, by decoupling the two roles of a single agent, the original individual model is split into two models responsible for evaluating and for searching its own policy. The evaluation model focuses solely on assessing the agent's contribution in the global context and is updated with conventional temporal-difference methods; the search model finds the optimal action by sampling a subset of the action space and is updated under the supervised learning paradigm. Second, to keep the algorithm from falling into local optima, this paper also proposes an exploration method that balances the complexity and final performance of the approach. Experimental results show that this method completely abandons the IGM assumption and enhances the expressiveness and generalization ability of the original algorithm.
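A minimal sketch of the evaluate/search split (the names and the toy action space are illustrative assumptions): the search model proposes a candidate subset of actions, and the evaluation model scores only those candidates, so no global argmax structure of the kind the IGM assumption enforces is imposed on the value function.

```python
def select_action(q_eval, proposals):
    """Score each candidate proposed by the search model with the
    evaluation model and return the best one."""
    return max(proposals, key=q_eval)

# Toy evaluation function over a discrete action space {0..9}, peaked at 6:
q = {a: -(a - 6) ** 2 for a in range(10)}

# The search model only needs to place the true optimum in its proposal
# set; the evaluation model then picks it out:
best = select_action(q.get, proposals=[2, 5, 6, 8])
```

In the actual method, the proposal set comes from a learned search model trained by supervised learning, and the exploration mechanism guards against proposal sets that never cover the global optimum.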