CASIA OpenIR
Research on Multi-Agent Cooperation Algorithms Based on the Value Decomposition Framework in Adversarial Environments
杨光开
2022-05
Pages: 90
Subtype: Master's thesis
Abstract

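Multi-agent cooperation is a key problem in team-versus-team adversarial tasks and has attracted wide attention in recent years. Combining game theory with deep reinforcement learning, researchers model multi-agent cooperative tasks as decentralized partially observable Markov decision processes and have proposed a series of important works under the centralized training with decentralized execution paradigm. The value decomposition framework is a representative method of this paradigm and provides important support for solving the credit assignment problem in multi-agent cooperation. However, the current value decomposition framework still has shortcomings: it ignores exploration of the credit assignment strategy space and lacks a representation of credit assignment uncertainty. In addition, the information loss caused by partial observability makes agents' action-value estimates highly uncertain, and the current framework does not handle this uncertainty. As a result, value decomposition methods reach only suboptimal policies in many scenarios. This thesis therefore builds on the value decomposition framework to study two key problems in multi-agent cooperation: credit assignment and partial observability. For credit assignment, it proposes a stochastic credit assignment method and an uncertainty-based multi-agent credit assignment method; for partial observability, it proposes a multi-agent uncertainty sharing method.

The three contributions of this thesis are summarized as follows:

1. Stochastic credit assignment. In many difficult multi-agent cooperative tasks, the interactions among agents are highly complex, and a good joint policy requires sophisticated cooperative behavior. Credit assignment largely determines how well agents can cooperate, so exploring better credit assignment strategies, and thereby avoiding local optima, is key to improving cooperation. The current value decomposition framework performs credit assignment deterministically; it does not explore the credit assignment strategy space and therefore cannot reach better joint policies. To address this, the thesis proposes a stochastic credit assignment method and formally defines the credit assignment strategy space. During training, a credit assignment strategy is sampled, with a certain probability, from a learnable Gaussian distribution, and this randomness triggers exploration of the strategy space. The Gaussian distribution is learned with the reparameterization trick and optimized by standard stochastic gradient descent. Entropy regularization controls the range of exploration and prevents the unstable learning caused by excessive exploration, so that the credit assignment strategy space is explored effectively.
2. Uncertainty-based multi-agent credit assignment. The value decomposition framework uses a mixing network to decompose the joint state-action value function into the local observation-action value functions of the individual agents, which realizes credit assignment and performs well on many problems. However, these methods obtain the mixing network parameters as a single point estimate. Lacking a representation of credit assignment uncertainty, they cope poorly with random factors in the environment and converge only to suboptimal policies. The thesis therefore starts from uncertainty, performs a Bayesian analysis of the mixing network, and proposes an uncertainty-based multi-agent credit assignment method that guides credit assignment by explicitly quantifying the uncertainty of the mixing network parameters. Because the mixing network determines credit assignment, the uncertainty of credit assignment can be represented by the uncertainty of its parameters. Given the complexity of interactions among agents, the thesis uses a Bayesian hypernetwork to implicitly model the complex posterior distribution of the mixing network parameters, which avoids becoming trapped in local optima through specifying the distribution type a priori. Methodologically, the first work formally defines the mixing network parameter space as the credit assignment strategy space; sampling a credit assignment strategy from a unimodal Gaussian is equivalent to sampling mixing network parameters, so the Gaussian in effect models the posterior distribution of those parameters. In contrast, this work uses a Bayesian hypernetwork to model a complex posterior over the mixing network parameters, removing the restriction to a pre-specified distribution type. It deepens and generalizes the first work.
3. Multi-agent uncertainty sharing. Under partial observability, an agent cannot access the global state of the environment or the information of other agents and must decide based on local observations alone. This missing information makes the agent's action-value estimates highly uncertain. The current value decomposition framework learns policies from a single point estimate of the action-value function; it ignores this uncertainty, suppresses exploration of the action space, and ultimately converges to local optima. To complicate matters, the uncertainties of different agents are inconsistent, and this inconsistency severely hinders coordinated exploration. The thesis therefore proposes a multi-agent uncertainty sharing method that uses Bayesian neural networks to explicitly quantify the uncertainty of every agent's action-value estimates and selects actions by Thompson sampling when interacting with the environment and other agents. In addition, to stabilize training and coordinate agents' behavior for more efficient exploration, the thesis introduces an uncertainty sharing mechanism that addresses the differences in uncertainty among agents by ensuring that all agents hold the same uncertainty about the value of the same action.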
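
The sampling step of the first contribution can be illustrated with a minimal sketch. It is not the thesis's implementation: the linear mixer, the use of absolute values to keep the mixing monotonic, and every name below (`sample_mixing_weights`, `gaussian_entropy`, `mix`, `sample_prob`) are assumptions made for illustration. It shows a reparameterized draw `w = mu + sigma * eps` taken with a given probability, the closed-form entropy of a diagonal Gaussian that an entropy regularizer could use, and a weighted sum of per-agent values.

```python
import math
import random


def sample_mixing_weights(mu, log_sigma, rng, sample_prob=0.5):
    """With probability sample_prob, draw w = mu + sigma * eps (eps ~ N(0, 1)).

    Otherwise return the mean, i.e. a deterministic credit assignment.
    mu and log_sigma stand in for learnable parameters (hypothetical names).
    """
    if rng.random() >= sample_prob:
        return list(mu)
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]


def gaussian_entropy(log_sigma):
    """Entropy of a diagonal Gaussian: sum_i (0.5 * log(2*pi*e) + log_sigma_i)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e) + ls for ls in log_sigma)


def mix(agent_qs, weights, bias=0.0):
    """Assumed linear monotonic mixer: Q_tot = sum_i |w_i| * Q_i + bias."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


rng = random.Random(0)
weights = sample_mixing_weights([0.3, 0.7], [-1.0, -1.0], rng, sample_prob=1.0)
q_tot = mix([1.0, 2.0], weights)
```

In an actual training loop the mean and log standard deviation would be optimized by gradient descent through the sampled weights, with the entropy term added to the loss; the sketch only shows the forward computation.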
 

Other Abstract

Multi-agent cooperation is a key problem in team confrontation and has attracted extensive attention from researchers in recent years. Combining game theory with deep reinforcement learning, researchers model multi-agent cooperative tasks as decentralized partially observable Markov decision processes and have proposed a series of important works under the centralized training with decentralized execution paradigm. Among these, the value decomposition framework is a representative method that provides an important basis for solving the credit assignment problem in multi-agent cooperation. However, the current value decomposition framework still has shortcomings, such as ignoring exploration of the credit assignment strategy space and lacking a representation of credit assignment uncertainty. In addition, the lack of information caused by partial observability makes agents' action-value estimates highly uncertain, and the current framework does not handle these uncertainties. As a result, the value decomposition framework reaches only suboptimal policies in many scenarios. To this end, this thesis builds on the value decomposition framework to further study two key problems in multi-agent cooperation: credit assignment and partial observability. For the credit assignment problem, it proposes a stochastic credit assignment method and an uncertainty-based multi-agent credit assignment method; for the partial observability problem, it proposes a multi-agent uncertainty sharing method.

The three contributions of this thesis are summarized as follows:
1. The stochastic credit assignment method. In many difficult multi-agent cooperative tasks, the interactions among agents are highly complex, and a good joint policy requires sophisticated cooperative behavior. Credit assignment largely determines the collaborative ability of agents, so exploring better credit assignment strategies, and thereby avoiding local optima, is key to improving that ability. The current value decomposition framework performs credit assignment deterministically; it ignores exploration of the credit assignment strategy space and cannot reach better joint policies. To solve this problem, this thesis proposes a stochastic credit assignment method and formally defines the credit assignment strategy space. During training, a credit assignment strategy is sampled, with a certain probability, from a learnable Gaussian distribution, and this randomness triggers exploration of the credit assignment strategy space. The Gaussian distribution is learned with the reparameterization trick and optimized by standard stochastic gradient descent. Meanwhile, entropy regularization controls the exploration range and avoids the learning instability caused by excessive exploration, so that the credit assignment strategy space is explored effectively.

2. The uncertainty-based multi-agent credit assignment method. The value decomposition framework uses a mixing network to decompose the joint state-action value function into the local observation-action value functions of the individual agents, which realizes credit assignment and performs well on many problems. However, these methods obtain the mixing network parameters through a single point estimate. Lacking a representation of credit assignment uncertainty, they struggle to deal with random factors in the environment and converge only to suboptimal policies. Therefore, starting from uncertainty, this thesis performs a Bayesian analysis of the mixing network and proposes an uncertainty-based multi-agent credit assignment method, which guides credit assignment by explicitly quantifying the uncertainty of the mixing network parameters. Because the mixing network determines credit assignment, the uncertainty of credit assignment can be expressed through the uncertainty of the mixing network parameters. Considering the complexity of the interactions among agents, this thesis uses a Bayesian hypernetwork to implicitly model the complex posterior distribution of the mixing network parameters, which avoids becoming trapped in local optima through specifying the distribution type a priori. In terms of method, the first work formally defines the mixing network parameter space as the credit assignment strategy space. Sampling a credit assignment strategy from the unimodal Gaussian distribution is equivalent to sampling the parameters of the mixing network, so in essence the Gaussian distribution models the posterior distribution of those parameters. In contrast, this work models a complex posterior distribution of the mixing network parameters with a Bayesian hypernetwork, which removes the restriction to a pre-specified distribution type. It deepens and generalizes the first work.

3. The multi-agent uncertainty sharing method. Under partial observability, an agent cannot obtain the global state of the environment or the information of other agents and can only make decisions based on local observations. This lack of information makes the agent's action-value estimates highly uncertain. Because the current value decomposition framework learns policies from a single point estimate of the action-value function, it ignores these uncertainties, inhibits the agent's exploration of the action space, and leads the algorithm to converge to local optima. To complicate matters, the uncertainties of different agents are inconsistent, which greatly hinders collaborative exploration. Therefore, this thesis proposes a multi-agent uncertainty sharing method, which uses Bayesian neural networks to explicitly quantify the uncertainty of all agents' action-value estimates and selects actions by Thompson sampling when interacting with the environment and other agents. In addition, to stabilize training and coordinate the behavior of agents for more efficient exploration, this thesis introduces an uncertainty sharing mechanism that addresses the differences in uncertainty among agents by ensuring that all agents hold the same uncertainty about the value of the same action.
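
The third contribution combines Thompson sampling with a shared uncertainty across agents. The sketch below is a hedged illustration, not the thesis's method: it assumes each agent holds an independent Gaussian belief (mean and standard deviation) over every action value, and it assumes, purely for illustration, that sharing means replacing each agent's standard deviation for an action with the across-agent average. The abstract does not specify the sharing rule, and the names `share_uncertainty` and `thompson_actions` are hypothetical.

```python
import random


def share_uncertainty(sigmas):
    """Give every agent the same per-action std: the across-agent mean (assumed rule)."""
    n_agents = len(sigmas)
    n_actions = len(sigmas[0])
    shared = [sum(s[a] for s in sigmas) / n_agents for a in range(n_actions)]
    return [list(shared) for _ in range(n_agents)]


def thompson_actions(means, sigmas, rng):
    """Each agent samples one value per action from N(mean, std) and acts greedily on the sample."""
    actions = []
    for mu_i, sigma_i in zip(means, sigmas):
        draws = [rng.gauss(m, s) for m, s in zip(mu_i, sigma_i)]
        actions.append(max(range(len(draws)), key=draws.__getitem__))
    return actions


rng = random.Random(0)
means = [[1.0, 2.0], [3.0, 0.0]]    # per-agent, per-action value estimates
sigmas = [[0.5, 1.0], [1.5, 3.0]]   # per-agent, per-action uncertainties
actions = thompson_actions(means, share_uncertainty(sigmas), rng)
```

When all standard deviations are zero the sampled values equal the means and the procedure reduces to greedy action selection; larger shared uncertainty for an action makes every agent more likely to try it, which is the coordination effect the abstract describes.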

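Keyword: multi-agent cooperation, credit assignment, Bayesian hypernetwork, partially observable constraints, Bayesian neural network
Subject Area: Pattern Recognition
MOST Discipline Catalogue: Engineering::Control Science and Engineering
Language: Chinese
Funding Project: National Natural Science Foundation of China [61876181]
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48516
Collection: Institute of Automation, Chinese Academy of Sciences
Graduates_Master's Theses
Graduates
Intelligent Systems and Engineering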
Recommended Citation
GB/T 7714
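杨光开. Research on Multi-Agent Cooperation Algorithms Based on the Value Decomposition Framework in Adversarial Environments [D]. Institute of Automation, Chinese Academy of Sciences, 2022.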
Files in This Item:
File Name/Size | DocType | Version | Access | License
对抗环境中基于值分解框架的多智能体协同算 (17847KB) | Thesis | | Open Access | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.