Knowledge Commons of Institute of Automation, CAS
基于值分解优化的多智能体深度强化学习方法研究 (Research on Multi-Agent Deep Reinforcement Learning Methods Based on Value Decomposition Optimization)
王凌霄 (Wang Lingxiao)
2021-05-26
Pages | 100
Degree Type | Master's
Chinese Abstract | With the substantial improvement in the practical performance of deep learning algorithms and the overall maturity of the surrounding hardware and software stack, deep learning has been applied in cutting-edge interdisciplinary research across the fields of information science. Over the past three years, deep learning methods have been successfully explored in multi-agent reinforcement learning, and deep multi-agent reinforcement learning has become one of the fastest-growing subfields of artificial intelligence in recent years. Algorithms in this area can be grouped into several broad technical routes according to their design principles, chiefly methods based on synchronous communication and methods based on value function decomposition. Targeting the new problems arising in multi-agent, continuous-time, real-time decision-making settings with unstable communication, this thesis discusses the application limitations and research potential of several existing algorithms and, drawing on recent results in reinforcement learning and graph neural networks, proposes improved multi-agent deep reinforcement learning algorithms based on value function decomposition.
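The value-function-decomposition route named in the abstract can be illustrated by a minimal sketch of VDN-style additive mixing; all numbers, shapes, and names below are illustrative assumptions, not taken from the thesis itself:

```python
import numpy as np

def decompose_and_act(agent_qs):
    """VDN-style additive value decomposition (illustrative sketch).

    Each agent picks its greedy action from its own local Q_i; the joint
    value Q_tot is the sum of the chosen per-agent values, so decentralised
    greedy execution is consistent with a centralised argmax over the
    joint action.
    """
    greedy_actions = agent_qs.argmax(axis=1)  # each agent acts on its own Q_i
    q_tot = agent_qs.max(axis=1).sum()        # additive mixing of chosen values
    return greedy_actions, q_tot

# Hypothetical per-agent action-value estimates: shape (n_agents, n_actions).
agent_qs = np.array([[0.2, 1.5, 0.3],
                     [0.9, 0.1, 0.4]])
actions, q_tot = decompose_and_act(agent_qs)
# actions -> array([1, 0]); q_tot -> 2.4
```

Later mixers such as QMIX replace the plain sum with a learned monotonic mixing network, but the decentralised-greedy property sketched here is the core of the route.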
English Abstract | With the significant improvement in deep learning algorithms and in high-performance hardware and software platforms, deep learning has driven many cutting-edge research efforts in multi-agent reinforcement learning, and deep multi-agent reinforcement learning has become one of the most rapidly developing subfields of artificial intelligence in recent years. In the last three years, many algorithms and models have emerged in deep multi-agent reinforcement learning; these can be classified into several broad categories based on their design principles, among which the main technical routes include synchronous-communication-based methods, value-function-decomposition-based methods, and so on. This thesis discusses the application limitations and research potential of several existing algorithms for the new problems arising in multi-agent, continuous-time, real-time decision-making environments with unstable communication, drawing on the latest results in reinforcement learning and graph neural networks, and presents an improved multi-agent deep reinforcement learning algorithm based on value function decomposition. The improved method has two main innovations. First, it introduces asynchronous historical observations in the execution phase of the value function decomposition algorithm, which reduces the bandwidth burden of a synchronous communication mechanism while enriching the external features available to the agents when making decisions at the current timestep. This improvement strikes a balance between decision performance and computational overhead during execution and yields a unified algorithmic framework spanning value function decomposition and synchronous communication methods.
Second, it introduces implicit-graph relation mining in the learning phase of the value function decomposition algorithm: an attention mechanism computes weight coefficients among the agents to obtain the adjacency matrix of an implicit graph, on which a graph convolution is performed over the agents' action-value vectors. This improvement lets the relationships between agents be generated automatically, without expert knowledge, and brings a graph neural network module into the action-value aggregation step. The proposed improvements are evaluated on different tasks in the StarCraft Multi-Agent Challenge environment and compared with classical methods; the proposed algorithm outperforms the classical algorithms in the experimental environment. |
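The second innovation described in the abstract (attention weights among agents forming an implicit adjacency matrix, then a graph convolution over the agents' action-value vectors) can be sketched roughly as follows; the shapes, weight matrices, and single-head dot-product attention are illustrative assumptions, not the thesis's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_rows(x):
    """Row-wise softmax, numerically stabilised."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def implicit_graph_mixing(h, q, w_query, w_key):
    """Attention-derived implicit graph over agents, then one graph
    convolution step aggregating their action-value vectors (sketch)."""
    queries = h @ w_query                           # (n_agents, d_k)
    keys = h @ w_key                                # (n_agents, d_k)
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    adj = softmax_rows(scores)                      # soft adjacency matrix of the implicit graph
    return adj, adj @ q                             # aggregate neighbours' Q-vectors

# Hypothetical dimensions and randomly initialised inputs for illustration.
n_agents, d, d_k, n_actions = 3, 4, 4, 5
h = rng.normal(size=(n_agents, d))                  # per-agent hidden features
q = rng.normal(size=(n_agents, n_actions))          # per-agent action-value vectors
w_query = rng.normal(size=(d, d_k))
w_key = rng.normal(size=(d, d_k))

adj, mixed_q = implicit_graph_mixing(h, q, w_query, w_key)
# adj has shape (n_agents, n_agents) with rows summing to 1;
# mixed_q has shape (n_agents, n_actions)
```

Because the adjacency matrix is produced by attention over learned features, no expert-specified agent graph is needed, which matches the "automatic relation mining" point in the abstract.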
Keywords | Deep Reinforcement Learning; Multi-Agent Systems; Value Function Decomposition Algorithms; Graph Neural Networks
Language | Chinese
Seven Major Directions - Subdirection Classification | Theory and Methods of Decision Intelligence
Document Type | Thesis
Identifier | http://ir.ia.ac.cn/handle/173211/44697
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems_Complex Systems Intelligence Mechanism and Parallel Control Team
Recommended Citation (GB/T 7714) | 王凌霄. 基于值分解优化的多智能体深度强化学习方法研究[D]. 中国科学院自动化研究所, 2021.
Files in This Item:
File Name/Size | Document Type | Version | Access | License
基于值分解优化的多智能体深度强化学习方法 (13415 KB) | Thesis | | Open Access | CC BY-NC-SA