English Abstract

Intelligent decision-making is a crucial capability that reflects the level of artificial intelligence. With the development of deep reinforcement learning in recent years, our understanding of intelligent decision-making in single-agent scenarios has improved significantly. However, most real-world problems involve more than one decision-making agent; such scenarios are usually modeled as multi-agent systems. Decision-making problems in network communication, energy supply, financial markets, multi-node robot control, and military wargaming are all multi-agent decision-making problems. According to the relationship between the agents, multi-agent decision-making problems can be divided into three types: fully cooperative, fully competitive, and mixed. A large number of team collaboration problems in real life can be abstracted as fully cooperative multi-agent decision-making problems. Therefore, coordinating the behavior of each agent in a team through artificial intelligence to achieve the optimal team decision is a significant real-world challenge.
Multi-agent collaborative decision-making focuses on fully cooperative scenarios, where agents solve complex decision-making problems cooperatively in decentralized partially observable Markov decision processes (Dec-POMDPs). As in team collaboration in human society, it is critical to organize the team quickly and effectively so that every member maintains a high level of collaboration awareness, which raises many practical issues. Based on a comprehensive account of the key problems encountered in the practical implementation of multi-agent collaborative decision-making, this paper proposes solutions to these problems, thereby improving the universality and efficiency of multi-agent reinforcement learning algorithms. The main work and innovations of this paper include the following:
1. A method for reconstructing the global state from multiple local observations
To address the unobservability of global information, this paper studies the reconstruction of global state information. In multi-agent settings, partial observability arises both among the agents and between each agent and the environment itself, which makes the problem more complex than in the single-agent case. It is therefore necessary to explore methods that reconstruct global information from current and historical local information. First, this paper studies how to reconstruct global information that benefits the reinforcement learning process from multiple different local observations. Specifically, a probabilistic graphical model is built between each agent's local observation and the global state, and an abstract representation of the global state in the latent space is derived through variational inference. Second, a graph neural network is used to extract the underlying topological information of the multi-agent system, and the relationships between agents are encoded into the network to assist in reconstructing the global state. Experimental results show that this method allows multi-agent reinforcement learning algorithms operating without access to the global state to approach, and even exceed, the performance achieved when the global state is known.
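The variational idea above can be illustrated with a minimal pure-Python sketch: a local observation is mapped to a Gaussian latent, and a KL regularizer ties the inferred representation to a prior. The linear "encoder" and all function names here are illustrative assumptions, not the thesis's actual model.

```python
import math
import random

def encode(obs, w_mu, w_logvar):
    """Toy linear 'encoder': local observation -> latent mean / log-variance."""
    mu = sum(o * w for o, w in zip(obs, w_mu))
    logvar = sum(o * w for o, w in zip(obs, w_logvar))
    return mu, logvar

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ): the ELBO regularizer that keeps
    the inferred global-state representation close to the prior."""
    return 0.5 * (math.exp(logvar) + mu * mu - 1.0 - logvar)
```

In the actual method, each agent's latent would additionally be combined with the topology extracted by the graph neural network before serving as the reconstructed global state.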
2. A communication-free method for distributed consensus inference
For the consensus representation problem, this paper studies how to formalize consensus and proposes an explicit collaborative algorithm framework. Communication-based multi-agent reinforcement learning algorithms reach consensus by passing essential information between agents, but this incurs additional communication overhead. Tacit understanding can be regarded as the highest level of collaborative behavior in human social activities: consensus is reached without any communication. Inspired by the concept of tacit understanding and by viewpoint invariance in computer vision, this paper studies the essence of the consensus that drives collaborative behavior and explores how to formalize consensus representation mathematically. First, using contrastive learning, different local observations corresponding to the same global state are aligned and mapped to a discrete space during the centralized training stage; this discrete representation is defined as the consensus signal between the agents. Second, during decentralized execution, the consensus signal serves as the basis for each agent's decision-making, ensuring that all agents choose actions under the guidance of the same signal. Experimental results show that this method allows agents to infer the explicit consensus of the entire team in a distributed manner from the information they possess, without any communication. Moreover, with a minimal increase in model parameters, its performance far exceeds that of more complex algorithms.
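The discrete consensus signal can be sketched with a toy nearest-neighbor quantizer: after contrastive alignment, noisy local views of the same global state should land on the same codebook entry, and that shared index is the consensus signal. The codebook values and function names below are illustrative assumptions.

```python
def nearest_code(embedding, codebook):
    """Return the index of the codebook entry closest to the embedding
    (squared L2 distance). The index acts as the discrete consensus signal."""
    def dist2(code):
        return sum((e - c) ** 2 for e, c in zip(embedding, code))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

# Toy 2-D codebook with three discrete consensus signals:
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]

# Two noisy local views of the same global state map to the same code:
signal_a = nearest_code((0.9, 1.1), codebook)
signal_b = nearest_code((1.1, 0.8), codebook)
```

Because every agent computes the signal from its own observation, no message passing is needed at execution time; the alignment is learned once during centralized training.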
3. A hierarchical multi-agent reinforcement learning framework with a dual coordination mechanism
For the temporal abstraction problem, this paper studies a multi-agent reinforcement learning framework with a dual coordination mechanism across agents and across levels. Most existing hierarchical reinforcement learning algorithms either do not apply to multi-agent cooperative scenarios or require expert knowledge to design the macro strategies. This paper proposes a universal hierarchical multi-agent reinforcement learning framework for cooperative scenarios that improves the sample efficiency of base algorithms. First, by using the advantage function of the macro policy as the intrinsic reward of the micro policy, a connection between the two levels is established so that the joint policy improves monotonically when either the macro or the micro policy is optimized. Carefully constructed loss functions for the two levels keep the entire training process end-to-end and free of expert knowledge. Second, to improve the interpretability of the decision-making algorithm, the proposed hierarchical method can relate the two levels of decision networks and provide real-time feedback on the agent's current goal. Experimental results show that this method significantly improves sample efficiency and applies to a variety of tasks and base algorithms.
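The coupling between levels can be sketched as follows: the macro policy's advantage for the currently active goal is mixed into the environment reward as the micro policy's intrinsic reward. The function name and the mixing weight `beta` are illustrative assumptions, not the thesis's exact formulation.

```python
def micro_reward(r_env, q_macro_goal, v_macro, beta=0.5):
    """Micro-level reward: environment reward plus the macro policy's
    advantage for the active goal. A positive advantage steers the micro
    policy toward goals the macro policy rates above average."""
    advantage = q_macro_goal - v_macro
    return r_env + beta * advantage
```

Intuitively, the intrinsic term vanishes for goals the macro policy considers average, so optimizing either level pushes the joint policy in a consistent direction, which is the idea behind the monotonic-improvement property stated above.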
4. A novel multi-agent collaborative generalization model based on a new relationship between the individual and the whole
To address the model expressiveness and generalization problem, this paper relaxes the assumption that restricts the expressive power of neural network models and constructs a universal value decomposition framework. Multi-agent reinforcement learning methods based on value decomposition alleviate the credit assignment problem to some extent, but they rely on the strong Individual-Global-Max (IGM) assumption, which severely restricts the expressive power of the model. This paper therefore constructs a new relationship between the individual and the whole that frees the model's expressive power. First, by decoupling the two roles of a single agent, the original individual model is split into two models responsible for evaluating and for searching its own policy. The evaluation model focuses solely on assessing the agent's contribution in the global context and is updated with conventional temporal-difference methods; the search model finds the optimal action by sampling a subset of the action space and is updated under the supervised learning paradigm. Second, to keep the algorithm from falling into local optima, this paper also proposes an exploration method that balances the complexity and final performance of the approach. Experimental results show that this method completely abandons the IGM assumption and enhances the expressiveness and generalization ability of the original algorithm.
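A minimal sketch of the evaluate/search split (the names and the toy action space are illustrative assumptions): the search model proposes a candidate subset of actions, and the evaluation model scores only those candidates, so no global argmax structure of the kind the IGM assumption enforces is imposed on the value function.

```python
def select_action(q_eval, proposals):
    """Score each candidate proposed by the search model with the
    evaluation model and return the best one."""
    return max(proposals, key=q_eval)

# Toy evaluation function over a discrete action space {0..9}, peaked at 6:
q = {a: -(a - 6) ** 2 for a in range(10)}

# The search model only needs to place the true optimum in its proposal
# set; the evaluation model then picks it out:
best = select_action(q.get, proposals=[2, 5, 6, 8])
```

In the actual method, the proposal set comes from a learned search model trained by supervised learning, and the exploration mechanism guards against proposal sets that never cover the global optimum.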