CASIA OpenIR
多智能体博弈对抗的协同行为自学习算法与应用
董宗宽
2021-05-17
页数：111
学位类型：硕士
中文摘要

  多智能体系统(Multi-Agent System, MAS)是在同一个环境中由多个交互智能体组成的系统,该系统常用于解决独立智能体以及单层系统难以解决的问题。近来,随着深度学习(Deep Learning, DL)技术的兴起,融合深度神经网络和强化学习(Reinforcement Learning, RL)的单智能体深度强化学习技术(Deep Reinforcement Learning, DRL)在完美信息场景中大放异彩,研究人员开始将DRL方法巧妙融合到MAS中,从而出现了一系列多智能体深度强化学习算法(Multi-agent Deep Reinforcement Learning, MDRL)。其中,多智能体协同行为研究是MDRL的一个重点方向,主要目的是通过多个具有自主决策能力的实体协作配合实现单个智能体不能完成的需求。

  目前研究协同行为的MDRL算法虽然通过DRL强大的特征学习和策略表征能力实现了端到端的协作行为学习,但神经网络模型无法很好地结合人类已有专家知识,而且端到端的学习特性不具备良好的可解释性,不能直观揭示出智能体之间协同行为出现的本质机理,极大地限制了MDRL算法的实际应用。

  本论文从多智能体系统的策略表征问题出发,研究多智能体系统在协同任务场景中的自学习算法与实际应用系统。论文的主要工作和创新点归纳如下:

1)提出基于关键决策要素驱动的多智能体协同机理表征

  针对用于解决多智能体博弈任务的深度强化学习算法对生成行动方案的表征能力不足的问题,基于语言几何学(Linguistic Geometry, LG)理论方法,构建了关键决策要素驱动下的行动方案生成算法M2K-LG(Multi-process/thread Multi-layer with Key decision elements for Linguistic Geometry),并针对战争迷雾引入虚拟对抗单位机制,针对计算资源最大化利用引入了多线程和多进程算法框架。实验分析表明,该算法完善了原有语言几何学理论框架,适用于复杂场景下的多智能体博弈对抗,同时可通过对算法运行框架的选择,最大化利用硬件计算资源来高效进行行动方案生成。

2)提出结合CART决策树的M2KT-LG算法进行关键决策要素的提取

  针对缺乏专家知识的场景中无法对多个关键决策要素的优先级进行排序的问题,引入基于决策树算法的关键决策要素提取模块,提出了带有关键决策要素筛选的M2KT-LG(Multi-process/thread Multi-layer Key decision elements with Decision Tree for Linguistic Geometry)算法。该算法利用语言几何学与仿真平台交互过程中生成的大量历史经验数据构建(特征化的关键决策要素-决策结果)训练样本集,再使用决策树算法对关键决策要素的重要性程度排序。通过排序结果可以减少关键决策要素的数量,降低行动方案的推理复杂度,同时可以完善对应任务场景下的专家知识,便于后续方案规划时关键决策要素的确定。

3)提出应用语言几何学进行底层行动方案生成与表征的MDRL算法

  针对MDRL算法端到端的产生行为决策导致行动方案的可解释性不足的问题,引入完善后的M2K-LG算法对MDRL算法进行改进,提出了基于值分解类MDRL算法与LG的Z-Learning(Zone-Learning,区域学习)算法。该方法通过上下两层协作产生行动方案,上层为数据驱动,通过值分解函数类的MDRL算法对下层进行行动目标分配;下层为知识驱动,通过M2K-LG算法对上层确定的具体行动目标产生行动方案,实现底层动作的抽象与组合。实验结果证明,该算法在保证智能体博弈能力与原有算法相同的前提下,依靠语言几何学的表达能力很好地展现了智能体的决策依据与中短期行动规划,提高了MDRL算法的可解释性。

  最后,论文将研究成果应用到预研项目中,开发了可视化M2KT-LG行动方案规划软件与面向实时策略游戏的类似gym(OpenAI机构开源的游戏平台,提供标准的交互接口)的通用性多智能体协同行为自学习算法模块,实现了对星际争霸的游戏小场景与墨子平台假设想定的行动方案推理。并进一步结合行动方案的可视化模块得到了针对多智能体博弈场景下协同行为学习的算法应用系统。

英文摘要

A Multi-Agent System (MAS) is a system composed of multiple interacting agents in a shared environment, and is often used to solve problems that independent agents and single-layer systems cannot handle well. Recently, with the rise of Deep Learning (DL), single-agent Deep Reinforcement Learning (DRL), which combines deep neural networks with Reinforcement Learning (RL), has excelled in perfect-information scenarios. Researchers have therefore begun to integrate DRL methods into MAS, giving rise to a family of Multi-agent Deep Reinforcement Learning (MDRL) algorithms. Among these, multi-agent collaborative behavior is a key research direction of MDRL; its main goal is to satisfy, through the cooperation of multiple entities with autonomous decision-making capabilities, requirements that a single agent cannot fulfill.

Although current MDRL algorithms for collaborative behavior achieve end-to-end learning of cooperative behavior through DRL's powerful feature-learning and policy-representation capabilities, the neural network models cannot easily incorporate existing human expert knowledge. Moreover, end-to-end learning lacks interpretability and cannot directly reveal the underlying mechanism by which cooperative behavior emerges among agents, which greatly limits the practical application of MDRL algorithms.

This thesis starts from the policy representation problem of multi-agent systems and studies self-learning algorithms and practical application systems for multi-agent systems in collaborative task scenarios. The main work and innovations of the thesis are summarized as follows:

(1) A representation of multi-agent cooperation mechanisms driven by key decision elements

To address the insufficient capability of deep reinforcement learning algorithms to represent the action plans they generate for multi-agent game tasks, an action-plan generation algorithm driven by key decision elements, M2K-LG (Multi-process/thread Multi-layer with Key decision elements for Linguistic Geometry), is constructed on the basis of Linguistic Geometry (LG). A virtual adversary unit mechanism is introduced to handle the fog of war, and a multi-thread and multi-process algorithm framework is introduced to maximize the use of computing resources. Experimental analysis shows that the algorithm improves the original theoretical framework of Linguistic Geometry and is suitable for multi-agent game confrontation in complex scenarios; at the same time, by choosing the appropriate execution framework, it can make maximal use of hardware computing resources to generate action plans efficiently.
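The multi-thread/multi-process framework mentioned above might be sketched as follows. This is a hypothetical illustration, not the thesis code: `generate_plan` is an invented stand-in for LG trajectory generation, and the key decision elements and scoring are toy values.

```python
# Hypothetical sketch: score candidate action plans for several units in
# parallel, in the spirit of the multi-thread/multi-process framework.
from concurrent.futures import ThreadPoolExecutor

def generate_plan(unit_id, key_elements):
    """Stand-in for LG plan generation: build and score one unit's plan."""
    # Toy scoring: weight each key decision element by a unit-specific factor.
    score = sum(w * (unit_id + 1) for w in key_elements.values())
    return unit_id, {"path": [unit_id, unit_id + 1], "score": score}

def generate_all_plans(n_units, key_elements, max_workers=4):
    # ThreadPoolExecutor is shown here; ProcessPoolExecutor is a drop-in
    # replacement when plan generation is CPU-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate_plan, u, key_elements)
                   for u in range(n_units)]
        return dict(f.result() for f in futures)

plans = generate_all_plans(3, {"threat": 0.6, "distance": 0.4})
```

The choice between threads and processes mirrors the "selection of the execution framework" described above: threads suit I/O-bound interaction with a simulation platform, processes suit CPU-bound reasoning.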

(2) The M2KT-LG algorithm, combining a CART decision tree to extract key decision elements

To address the problem that, in scenarios lacking expert knowledge, multiple key decision elements cannot be prioritized, a key-decision-element extraction module based on a decision tree is introduced, yielding the M2KT-LG (Multi-process/thread Multi-layer Key decision elements with Decision Tree for Linguistic Geometry) algorithm. The algorithm uses the large amount of historical experience data generated while Linguistic Geometry interacts with the simulation platform to construct a training set of (featurized key decision elements, decision result) samples, and then uses a decision tree to rank the key decision elements by importance. The ranking reduces the number of key decision elements and thus the reasoning complexity of plan generation, and at the same time enriches the expert knowledge for the corresponding task scenario, making it easier to determine the key decision elements in subsequent planning.
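The ranking step can be illustrated with scikit-learn's CART implementation. The element names and the synthetic (features, decision result) samples below are invented for illustration; the thesis's actual features come from LG-simulation interaction logs.

```python
# Illustrative sketch (names and data invented): rank featurized key
# decision elements by CART feature importance, as in the extraction module.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
elements = ["enemy_distance", "terrain_cover", "ammo_level", "unit_health"]

# Synthetic (key-decision-element features -> decision outcome) samples:
# here the outcome depends mostly on the first element, less on the second.
X = rng.random((500, len(elements)))
y = ((0.7 * X[:, 0] + 0.3 * X[:, 1]) > 0.5).astype(int)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              random_state=0).fit(X, y)

# Gini-based importance scores; argsort (descending) gives the priority order.
order = np.argsort(tree.feature_importances_)[::-1]
ranked = [elements[i] for i in order]
```

Truncating `ranked` to its top entries is one simple way to realize the "reduce the number of key decision elements" step described above.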

(3) An MDRL algorithm that applies Linguistic Geometry to generate and represent the underlying action plans

To address the insufficient interpretability of action plans caused by the end-to-end decision-making of MDRL algorithms, the improved M2K-LG algorithm is used to enhance MDRL, yielding the Z-Learning (Zone-Learning) algorithm, which combines value-decomposition MDRL with LG. The method generates action plans through the cooperation of two layers: the upper layer is data-driven and uses a value-decomposition MDRL algorithm to assign action targets to the lower layer; the lower layer is knowledge-driven and uses the M2K-LG algorithm to generate action plans for the targets determined by the upper layer, realizing the abstraction and composition of low-level actions. Experimental results show that, while keeping the agents' game-playing ability on par with the original algorithm, the approach relies on the expressive power of Linguistic Geometry to clearly exhibit the agents' decision basis and their short- to medium-term action planning, improving the interpretability of the MDRL algorithm.
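The two-layer division of labor can be sketched minimally. All names here are hypothetical: the per-agent utilities play the role of a value decomposition where the joint value is the sum of per-agent values (as in VDN-style decomposition), and `lg_plan` is a stub for M2K-LG plan generation.

```python
# Minimal two-layer sketch (hypothetical names): a data-driven upper layer
# assigns targets; a knowledge-driven lower layer expands them into plans.

def assign_targets(q_values):
    """Upper layer: per-agent greedy argmax. With an additive decomposition
    (Q_tot = sum of per-agent Q), this also maximizes the joint value."""
    return {agent: max(q, key=q.get) for agent, q in q_values.items()}

def lg_plan(agent, target):
    """Lower layer stub: an interpretable step-by-step plan for one target."""
    return [f"{agent}:move_to:{target}", f"{agent}:engage:{target}"]

q_values = {
    "agent_0": {"zone_A": 1.2, "zone_B": 0.4},
    "agent_1": {"zone_A": 0.3, "zone_B": 0.9},
}
targets = assign_targets(q_values)                      # upper layer
plans = {a: lg_plan(a, t) for a, t in targets.items()}  # lower layer
```

The interpretability claim rests on the lower layer: instead of raw network outputs, each agent exposes a readable plan toward its assigned target.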


Finally, the thesis applies these results to a pre-research project, developing visual M2KT-LG action-plan planning software and a gym-like (gym is an open-source game platform from OpenAI that provides standard interaction interfaces) general-purpose self-learning algorithm module for multi-agent collaborative behavior in real-time strategy games, realizing action-plan reasoning for small StarCraft II game scenarios and hypothetical scenarios on the MoZi simulation platform. Combined further with the action-plan visualization module, this yields an algorithm application system for collaborative behavior learning in multi-agent game scenarios.

关键词：多智能体深度强化学习 协同行为学习 语言几何学 可解释性 Z学习
语种：中文
七大方向——子方向分类：复杂系统推演决策
文献类型：学位论文
条目标识符：http://ir.ia.ac.cn/handle/173211/44863
专题：中国科学院自动化研究所
推荐引用方式
GB/T 7714
董宗宽. 多智能体博弈对抗的协同行为自学习算法与应用[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2021.
条目包含的文件
文件名称/大小：董宗宽-2018E8014661014-(4970KB)　文献类型：学位论文　开放类型：开放获取　使用许可：CC BY-NC-SA

除非特别说明,本系统中所有内容都受版权保护,并保留所有权利。