Multi-agent cooperation is the key problem in team confrontation and has attracted extensive attention from researchers in recent years. By combining game theory with deep reinforcement learning, researchers model multi-agent cooperative tasks as decentralized partially observable Markov decision processes and have proposed a series of important methods under the centralized training with decentralized execution (CTDE) learning paradigm. Among them, the value decomposition framework is a representative method of this paradigm and provides an important basis for solving the credit assignment problem in multi-agent cooperation. However, the current value decomposition framework still has shortcomings: it ignores exploration of the credit assignment strategy space and lacks an uncertainty representation of credit assignment. In addition, the lack of information caused by partial observability makes agents' action-value estimates highly uncertain, and the current value decomposition framework ignores the treatment of these uncertainties. As a result, the framework can obtain only sub-optimal strategies in many scenarios. To this end, building on the value decomposition framework, this thesis further studies two key problems in multi-agent cooperation: credit assignment and partially observable constraints. For the credit assignment problem, this thesis proposes a stochastic credit assignment method and an uncertainty-based multi-agent credit assignment method; for partially observable constraints, it proposes a multi-agent uncertainty sharing method.
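To make the value decomposition framework concrete, the following is a minimal numpy sketch of a monotonic mixing step in the style of QMIX-like methods. It is an illustration only, not the thesis's implementation; the linear weight generators `w_gen` and `b_gen` are stand-ins for the state-conditioned networks used in practice. The key property shown is that non-negative mixing weights keep each agent's local greedy action consistent with the joint value.

```python
import numpy as np

def mix(agent_qs, state, w_gen, b_gen):
    """Monotonic mixing: Q_tot = |w(s)| . q + b(s).
    Taking the absolute value keeps the weights non-negative, so
    dQ_tot/dq_i >= 0 and each agent's local argmax agrees with Q_tot."""
    w = np.abs(w_gen @ state)   # non-negative, state-dependent mixing weights
    b = b_gen @ state           # state-dependent bias
    return float(w @ agent_qs + b)

# Toy setup: 3 agents, a 4-dimensional global state.
rng = np.random.default_rng(0)
n_agents, state_dim = 3, 4
w_gen = rng.normal(size=(n_agents, state_dim))
b_gen = rng.normal(size=(state_dim,))
state = rng.normal(size=(state_dim,))
agent_qs = np.array([0.2, -0.1, 0.5])  # local action-values Q_i(o_i, a_i)
q_tot = mix(agent_qs, state, w_gen, b_gen)
```

Because the mixing weights are non-negative, raising any single agent's local action-value can never lower the joint value, which is what makes decentralized greedy execution consistent with centralized training.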
The three works of this thesis can be summarized as follows:
2. The uncertainty-based multi-agent credit assignment method. The value decomposition framework realizes credit assignment by using a mixing network to decompose the joint action-value function into the local observation action-value functions of multiple agents, and it performs well on many problems. However, these methods obtain the mixing-network parameters through single-point estimation; lacking an uncertainty representation of credit assignment, they struggle to cope with random factors in the environment and therefore converge only to sub-optimal strategies. Starting from uncertainty, this thesis performs a Bayesian analysis of the mixing network and proposes an uncertainty-based multi-agent credit assignment method that guides credit assignment by explicitly quantifying the uncertainty of the mixing-network parameters. Since the mixing network determines the credit assignment, the uncertainty of credit assignment can be expressed by quantifying the uncertainty of its parameters. At the same time, considering the complexity of the interaction behaviors among agents, this thesis uses a Bayesian hypernetwork to implicitly model the complex posterior distribution of the mixing network, so as to avoid the local optima that can result from specifying the distribution type a priori. Methodologically, the first work formally defines the mixing-network parameter space as the credit assignment strategy space: sampling a credit assignment strategy from a unimodal Gaussian distribution is equivalent to sampling the mixing-network parameters, which in essence models their posterior as a Gaussian. In contrast, this work models the complex posterior distribution of the mixing-network parameters through a Bayesian hypernetwork, breaking the limitation on prior distribution types; it is thus a deepening and generalization of the first work.
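The Bayesian hypernetwork idea above can be sketched in a few lines of numpy. This is a toy illustration under assumed shapes, not the thesis's architecture: a fixed nonlinear map pushes Gaussian noise `z` forward into mixing-network parameters, and because the map is nonlinear, the induced distribution over parameters need not be Gaussian, which is exactly the freedom that a fixed prior distribution type would forbid.

```python
import numpy as np

class BayesianHypernet:
    """Implicit posterior over mixing-network parameters.

    Each call draws noise z ~ N(0, I) and maps it through a small
    nonlinear network to one parameter vector theta, i.e. one sampled
    credit assignment strategy. The pushforward of a Gaussian through
    tanh layers is generally non-Gaussian."""

    def __init__(self, noise_dim, n_params, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(32, noise_dim))   # hidden layer (toy sizes)
        self.W2 = rng.normal(size=(n_params, 32))    # output: mixing-net params
        self.rng = rng
        self.noise_dim = noise_dim

    def sample_params(self):
        z = self.rng.normal(size=(self.noise_dim,))  # fresh noise each sample
        h = np.tanh(self.W1 @ z)                     # nonlinearity breaks Gaussianity
        return self.W2 @ h                           # one credit-assignment strategy

# Two draws give two different credit-assignment strategies.
hyper = BayesianHypernet(noise_dim=8, n_params=3)
theta_a = hyper.sample_params()
theta_b = hyper.sample_params()
```

In the method proper, such sampled parameters would parameterize the mixing network for a training step, so the spread of sampled strategies reflects the quantified uncertainty of credit assignment.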
3. The multi-agent uncertainty sharing method. Under partially observable conditions, each agent cannot obtain the global state of the environment or information about other agents, and can only make decisions based on local observations. This lack of information makes the agent's action-value estimates highly uncertain. The current value decomposition framework learns policies from single-point estimates of the action-value function; it ignores the treatment of these uncertainties, inhibits the agents' exploration of the action space, and leads the algorithm to converge to local optima. What is more complicated, these uncertainties are inconsistent across agents, which greatly hinders their collaborative exploration. Therefore, this thesis proposes a multi-agent uncertainty sharing method, which uses a Bayesian neural network to explicitly quantify the uncertainty of every agent's action-value estimates and combines Thompson sampling to select actions for interacting with the environment and other agents. In addition, in order to stabilize training and coordinate agent behaviors for more efficient exploration, and aiming at the uncertainty differences between agents, this thesis further introduces an uncertainty sharing mechanism to ensure that all agents maintain the same uncertainty when estimating the value of the same action.
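The two ingredients of this method, Thompson sampling over uncertain action-values and an uncertainty sharing step, can be sketched as follows. This is a simplified illustration, not the thesis's implementation: Gaussian posteriors stand in for the Bayesian neural network, and the sharing mechanism shown (averaging per-action standard deviations across agents) is one hypothetical way to make all agents hold the same uncertainty about the same action.

```python
import numpy as np

def thompson_action(mu, sigma, rng):
    """Thompson sampling: draw one Q-value per action from its
    Gaussian posterior and act greedily on the sampled values."""
    return int(np.argmax(rng.normal(mu, sigma)))

def share_uncertainty(sigmas):
    """Toy sharing step: replace each agent's per-action standard
    deviation with the mean across agents, so every agent holds the
    same uncertainty about the value of the same action."""
    return np.tile(sigmas.mean(axis=0), (sigmas.shape[0], 1))

rng = np.random.default_rng(1)
n_agents, n_actions = 2, 4
mu = rng.normal(size=(n_agents, n_actions))            # posterior means
sigma = rng.uniform(0.1, 1.0, size=(n_agents, n_actions))  # posterior stds
shared = share_uncertainty(sigma)                      # equalize uncertainty
actions = [thompson_action(mu[i], shared[i], rng) for i in range(n_agents)]
```

Sampling from the posterior rather than acting on a single-point estimate makes exploration proportional to uncertainty, and equalizing the uncertainties keeps the agents' exploration behavior coordinated.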
|MOST Discipline Catalogue||Engineering::Control Science and Engineering|
|Funding Project||National Natural Science Foundation of China|
|Yang Guangkai (杨光开). Research on Multi-Agent Cooperation Algorithms Based on the Value Decomposition Framework in Adversarial Environments [D]. Institute of Automation, Chinese Academy of Sciences, 2022.|