对抗场景中的智能体策略泛化研究 (Research on Policy Generalization of Agents in Adversarial Scenarios)
陈皓 (Chen Hao)
2022-05-25
Pages: 82
Degree Type: Master's
Abstract

In real life, multi-agent systems are ubiquitous, and reinforcement learning is a common approach to multi-agent problems. However, compared with single-agent reinforcement learning, multi-agent reinforcement learning faces unique challenges such as environment non-stationarity, credit assignment, and ad-hoc cooperation. To address these challenges, multi-agent reinforcement learning has made a series of breakthroughs in recent years, represented by the centralized training with decentralized execution (CTDE) framework. Under CTDE, value function factorization methods combine each agent's local Q-value into a global Q-value through a credit assignment network, which better evaluates each agent's contribution, achieves better cooperation, and performs well on many challenging tasks.
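This record contains only the abstract, but the value factorization idea it describes can be illustrated with a short sketch. The following minimal PyTorch mixing network, in the spirit of QMIX (Rashid et al., 2018), combines per-agent local Q-values into a global Q-value conditioned on the global state; all names and sizes below are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class MixingNetwork(nn.Module):
    """Credit assignment network (QMIX-style sketch): combines per-agent
    local Q-values into a global Q-value, conditioned on the global state.
    Non-negative mixing weights keep the combination monotonic."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        # Hypernetworks generate the mixing weights from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_total.view(b, 1)             # global Q-value
```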

However, many real-world scenarios require an agent to generalize at test time to adversarial scenarios it has never seen, i.e., the algorithm must be capable of policy generalization in adversarial scenarios, a capability that existing value function factorization methods lack. To improve agents' policy generalization ability in adversarial scenarios, this thesis studies policy generalization from three aspects, namely making full use of the global state, of the agents' local observations, and of the relationships among agents, and proposes one algorithm for each aspect. The main work and contributions of this thesis are summarized as follows.

1. A multi-agent credit assignment method based on ensemble learning is proposed, to make full use of the global state.

First, multiple credit assignment networks are constructed so that different credit assignment subnetworks focus on different subspaces of the global state space, learning subnetworks that are “good yet diverse.” The subnetworks are then ensembled to balance their policies and make full use of the global state information, so the algorithm attends to multiple subspaces of the global state space instead of overfitting to any single one, thus improving the agent's policy generalization ability.
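A rough sketch of how such an ensemble might be assembled, reusing the MixingNetwork from the earlier sketch as the credit assignment subnetwork; the fixed random binary masks used here to expose different state subspaces, and plain averaging as the combination rule, are assumptions rather than the thesis's exact construction.

```python
import torch
import torch.nn as nn

class EnsembleMixer(nn.Module):
    """Ensemble of credit assignment subnetworks (assumed design).
    Each member conditions on its own fixed random subspace of the
    global state, encouraging 'good yet diverse' subnetworks; their
    global Q-values are averaged at mixing time."""

    def __init__(self, n_agents: int, state_dim: int,
                 n_members: int = 4, keep_ratio: float = 0.7):
        super().__init__()
        # MixingNetwork is the credit assignment network sketched above.
        self.members = nn.ModuleList(
            MixingNetwork(n_agents, state_dim) for _ in range(n_members))
        # One fixed binary mask per member selects its state subspace.
        masks = (torch.rand(n_members, state_dim) < keep_ratio).float()
        self.register_buffer("masks", masks)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # Every member mixes the same per-agent Q-values, conditioned on
        # its own masked view of the global state.
        q_totals = [member(agent_qs, state * mask)
                    for member, mask in zip(self.members, self.masks)]
        return torch.stack(q_totals).mean(dim=0)  # (batch, 1)
```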

2. A local observation reconstruction method for ad-hoc cooperation is proposed, to make full use of the agents' local observations.

First, each agent's local observation is decomposed into three parts. An attention mechanism then processes the variable-length input, making the algorithm insensitive to changes in the number of observed entities. Next, a sampling network implements local observation abstraction, allowing the algorithm to make full use of high-dimensional state representations across different situations. Together, these steps reconstruct each agent's local observation, enabling the algorithm to perform zero-shot generalization in ad-hoc cooperation scenarios and thus improving the agent's policy generalization ability.
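A minimal sketch of the attention-based handling of variable-length observations follows. Here the decomposition is modeled simply as the agent's own features plus a variable-size set of observed entities, and the sampling network for observation abstraction is omitted; these simplifications, and all names below, are assumptions rather than the thesis's design.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Attention over a variable-length local observation (assumed design).
    The agent's own features form the query; observed entities (allies,
    enemies, etc.) form keys/values, so the output embedding has a fixed
    size no matter how many entities are present in a given scenario."""

    def __init__(self, own_dim: int, entity_dim: int, d_model: int = 64):
        super().__init__()
        self.own_proj = nn.Linear(own_dim, d_model)
        self.entity_proj = nn.Linear(entity_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, own_feats: torch.Tensor, entities: torch.Tensor,
                entity_mask: torch.Tensor) -> torch.Tensor:
        # own_feats:   (batch, own_dim)
        # entities:    (batch, n_entities, entity_dim); n_entities varies by task
        # entity_mask: (batch, n_entities) bool, True where an entity is absent
        query = self.own_proj(own_feats).unsqueeze(1)        # (batch, 1, d)
        keys = self.entity_proj(entities)                    # (batch, n, d)
        pooled, _ = self.attn(query, keys, keys,
                              key_padding_mask=entity_mask)  # (batch, 1, d)
        # Fixed-size embedding regardless of the number of observed entities.
        return torch.cat([query.squeeze(1), pooled.squeeze(1)], dim=-1)
```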

3. A neighborhood relationship learning method for ad-hoc cooperation is proposed, to make full use of the relationships among agents.

First, the relationships among agents are encoded into the value decomposition network by a graph-based relation encoder. Meanwhile, to handle the constantly changing number of agents in ad-hoc cooperation, an attention-based local observation abstraction mechanism is used. The algorithm not only makes full use of the topology among agents, but also achieves zero-shot generalization in ad-hoc cooperation scenarios without retraining, thus improving the agent's policy generalization ability.
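The graph-based relation encoding might look roughly like the sketch below, where the neighborhood graph is built from each agent's k nearest neighbors by position and one round of masked self-attention plays the role of message passing over that graph; the edge construction, sizes, and names are assumptions, not the thesis's design.

```python
import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    """Graph-based relation encoder (assumed design). Builds a k-nearest-
    neighbor graph over agent positions, then runs masked self-attention
    so each agent aggregates only its neighbors' embeddings; the result
    can condition a value decomposition network on the agents' topology."""

    def __init__(self, feat_dim: int, d_model: int = 64, k: int = 3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.k = k

    def forward(self, agent_feats: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # agent_feats: (batch, n_agents, feat_dim); positions: (batch, n_agents, 2)
        # n_agents may change between episodes, as in ad-hoc cooperation.
        n = agent_feats.size(1)
        h = self.proj(agent_feats)                          # (batch, n, d)
        # Keep each agent's k nearest neighbors (itself included, distance 0).
        dist = torch.cdist(positions, positions)            # (batch, n, n)
        knn = dist.topk(min(self.k + 1, n), largest=False).indices
        blocked = torch.ones_like(dist, dtype=torch.bool)   # True = no edge
        blocked.scatter_(2, knn, False)
        # One round of masked self-attention = message passing on the graph.
        mask = blocked.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        return out  # relation-aware embedding per agent: (batch, n, d)
```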

Keywords: Deep Reinforcement Learning; Multi-Agent; Policy Generalization; Ad-Hoc Cooperation; Credit Assignment
Language: Chinese
Funding Project: National Natural Science Foundation of China [61876181]
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48757
Collection: Graduates_Master's Theses
Institute of Automation, Chinese Academy of Sciences
Graduates
复杂系统认知与决策实验室 (Laboratory of Cognition and Decision for Complex Systems)_Intelligent Systems and Engineering
Recommended Citation
GB/T 7714
陈皓. 对抗场景中的智能体策略泛化研究[D]. 中国科学院自动化研究所, 2022.
Files in This Item
File Name/Size | Document Type | Version | Access | License
毕业论文-陈皓.pdf (13782 KB) | Thesis | | Restricted Access | CC BY-NC-SA