Research on Pre-training Methods for Multi-agent Reinforcement Learning
孟令辉
2024-05-15
Pages: 146
Degree type: Doctoral
Abstract (Chinese)

Since the mid-2010s, multi-agent reinforcement learning powered by neural networks has become the primary route to improving collective decision-making intelligence: it promotes cooperative decision-making among agents by optimizing the group system as a whole. To exploit the strong representational capacity of neural networks, the field has continuously introduced new techniques for an increasingly wide range of decision-making scenarios, giving rise to representative algorithms based on value decomposition (e.g., QMIX), policy sharing (e.g., MAPPO), and offline pre-training. The first two families have reached superhuman decision-making performance in video games, traffic signal control, advertising bidding, and even the development of the energy industry. However, constrained jointly by their online trial-and-error learning mechanism and by the multi-agent setting itself, these methods suffer from low sample efficiency caused by the curse of dimensionality, and further from poor generalization across tasks. In response, multi-agent reinforcement learning frameworks that move the trial-and-error data-collection process to an offline stage and build on pre-training techniques have emerged. Meanwhile, offline reinforcement learning pre-training for single-agent control has matured in terms of data, platforms, and algorithms, alleviating sample-efficiency issues to some extent, so data and platforms designed specifically for pre-training frameworks in the multi-agent domain are urgently needed. In addition, because the field is still at an early stage, data-driven pre-training methods remain poorly suited to trajectory data with real-world characteristics, namely the "training bottleneck" that arises with suboptimal and multi-source trajectories.

 

Against this background, this thesis focuses on three main problems in the pre-training framework for multi-agent reinforcement learning: "missing offline trajectories and insufficient modeling capacity of basic pre-training methods", "insufficient utilization of suboptimal trajectories and inadequate performance", and "insufficient task discrimination when modeling with multi-source trajectories". Accordingly, the thesis concentrates on collecting offline trajectories and building a foundation platform for multi-agent reinforcement learning pre-training, and then proceeds along two lines: exploring new model architectures, and designing and optimizing pre-training mechanisms that exploit trajectory data of different qualities and sources. Four innovative contributions are made:

 

  1. Datasets and a foundation platform for multi-agent reinforcement learning pre-training. To address low sample efficiency and poor generalization in online multi-agent reinforcement learning, a new paradigm centered on pre-training frameworks has gradually emerged in the multi-agent domain. However, large-scale, diverse datasets and the corresponding training platforms built for this paradigm still lack validation, which raises several difficulties: choosing decision-making scenarios and collection methods, analyzing the rationality of trajectories, and defining unified, fair evaluation metrics. This thesis proposes D4MARL, a foundational dataset and training platform designed for multi-agent reinforcement learning pre-training, and corroborates the soundness of the dataset by analyzing the distribution of trajectory elements across quality levels. For data collection, it further proposes UTOPIA, a multi-agent reinforcement learning framework that builds world models from dynamics representations, together with a trajectory-collection method based on a denoised world model that lowers the cost of gathering high-quality trajectories. Experiments show that the provided trajectories are complete and diverse and that the baselines offered by the training platform are reasonable, which to some extent provides effective support for multi-agent reinforcement learning pre-training.

  2. A transformer-based foundational pre-training method for multi-agent reinforcement learning. Even with collected trajectory data, existing offline multi-agent reinforcement learning pre-training methods remain limited in modeling capacity by model size and training mechanisms. This thesis brings the transformer into the pre-training and online fine-tuning of multi-agent reinforcement learning and builds a scalable foundational pre-training method that makes full use of offline trajectories. Experimentally, on several maps of the StarCraft micromanagement benchmark, the method improves sample efficiency by 40% on average over online multi-agent reinforcement learning, and it provides a preliminary recipe for the multi-task pre-training setting. The thesis also presents a comparative study of interface designs and data-encoding schemes for transformers during multi-agent pre-training and fine-tuning. The datasets, algorithm implementations, and model-parameter configurations have been used and cited by subsequent papers, providing a robust and scalable foundation for multi-agent reinforcement learning pre-training methods.

  3. A multi-agent reinforcement learning pre-training method for suboptimal trajectories. In foundational pre-training methods for multi-agent reinforcement learning, training an expert policy relies on expert trajectories, yet in some scenarios expert trajectories are scarce and collecting them in turn requires an expert policy interacting with the environment, creating a circular, chicken-and-egg problem. To relax this data-quality constraint, this thesis proposes RCP, a contrastive pre-training mechanism that guides multi-agent policies with reward-function representations, and designs a corresponding model architecture, YANHUI, on top of it, thereby mitigating the insufficient utilization of suboptimal trajectories and the resulting performance gap in existing methods. Experimentally, YANHUI exploits expert and suboptimal trajectories simultaneously and remains robust across different suboptimal-trajectory ratios; with 90% suboptimal trajectories it matches the best pre-trained models available at the time. This offers a new mechanism and method for multi-agent reinforcement learning pre-training from the perspective of exploiting trajectories of different qualities.

  4. A universal-policy pre-training method for multi-agent reinforcement learning with multi-source trajectories. In conventional multi-agent reinforcement learning pre-training, a policy is trained offline on data from a single task and then fine-tuned on that task. This thesis proposes M3, a method for pre-training and fine-tuning a universal policy on multi-source trajectory data, which achieves dynamic policy sharing across tasks through discretized representations of policies and tasks. It further factorizes the latent space of the policy representation into agent-shared and agent-specific modules, dynamically supporting cross-task cooperation among agents and further reducing generalization error. Built on the multi-source trajectories and training platform provided by D4MARL, M3 achieves effective few-shot and even zero-shot transfer on task sets of multiple difficulty levels. This offers a new mechanism and method for multi-agent reinforcement learning pre-training from the perspective of exploiting trajectories from different sources.

 

Abstract (English)

Since the mid-2010s, multi-agent reinforcement learning powered by neural networks has emerged as the primary approach to enhancing collective decision-making intelligence. It promotes collaborative decision-making among agents by optimizing the group system as a whole, aiming at efficient collective decision-making. To leverage the strong representational capacity of neural networks, the field has continuously introduced new techniques for increasingly diverse decision-making scenarios, giving rise to representative algorithms based on value decomposition (such as QMIX), policy sharing (such as MAPPO), and offline pre-training. The first two families have achieved superhuman decision-making performance in video games, traffic signal control, advertising bidding, and even the development of the energy industry. However, constrained jointly by their online trial-and-error learning mechanism and by the multi-agent setting, these methods suffer from sample inefficiency caused by the curse of dimensionality and further exhibit poor generalization across tasks. To address this, multi-agent reinforcement learning frameworks that move trial-and-error data collection to an offline stage and build on pre-training techniques have emerged. Meanwhile, data, platforms, and algorithms for single-agent offline reinforcement learning pre-training have matured, alleviating some of the sample-efficiency issues in reinforcement learning; data and platforms designed specifically for pre-training frameworks in the multi-agent domain are therefore urgently needed. Furthermore, at this early stage of development, data-driven pre-training methods still face a "training bottleneck" when utilizing trajectory data with real-world characteristics, such as suboptimal and multi-task trajectories.
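To make the value-decomposition family mentioned above concrete, the following is a minimal PyTorch sketch of a QMIX-style monotonic mixing network; the class name `QMixer`, the single hidden layer, and the layer sizes are illustrative simplifications rather than the thesis's implementation.

```python
# Minimal sketch of QMIX-style monotonic value mixing (simplified, illustrative).
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks produce mixing weights from the global state; taking the
        # absolute value of the weights enforces monotonicity in each agent's Q.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) per-agent Q-values; state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)  # joint value Q_tot

# Usage: q_tot = QMixer(n_agents=3, state_dim=48)(per_agent_q, global_state)
```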

 

In this context, this thesis focuses on three main problems within the pre-training framework for multi-agent reinforcement learning: "missing offline trajectories and insufficient modeling capacity of basic pre-training methods", "insufficient utilization of suboptimal trajectories and inadequate performance", and "insufficient task discrimination when modeling with multi-task trajectories". Accordingly, the thesis emphasizes the collection of offline trajectories and the construction of a foundation benchmark for multi-agent reinforcement learning pre-training, and proceeds along two lines: exploring novel model architectures, and designing and optimizing pre-training mechanisms that exploit trajectory data of varying quality and from varying tasks. The four main innovations are as follows:

 

1. Construction of Datasets and a Benchmark for Multi-agent Reinforcement Learning Pre-training Methods. To address low sample efficiency and poor generalization in online multi-agent reinforcement learning, a new paradigm represented by pre-training frameworks has gradually emerged in the multi-agent domain. However, large-scale and diverse datasets and the corresponding training platforms tailored to pre-training frameworks still lack validation, which poses challenges such as selecting decision-making scenarios and collection methods, analyzing trajectory rationality, and defining unified and fair evaluation metrics. This thesis proposes D4MARL, a foundational dataset and training platform designed specifically for multi-agent reinforcement learning pre-training, and demonstrates the soundness of the dataset through an analysis of the distribution of trajectory elements across quality levels. Additionally, this thesis introduces UTOPIA, a multi-agent reinforcement learning framework that constructs world models from dynamics representations, and proposes a trajectory-collection method based on denoised world models to reduce the cost of collecting high-quality trajectories. Empirical results show that the provided trajectory data is complete and diverse, and that the benchmark provided by the training platform is reasonable, thus offering effective support for multi-agent reinforcement learning pre-training to some extent.
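As an illustration of the kind of offline multi-agent trajectory data such a dataset has to organize, here is a hypothetical record layout with a return-to-go utility commonly used when pre-training sequence models on offline trajectories; the field names and shapes are assumptions for exposition and not the actual D4MARL schema.

```python
# Hypothetical layout for an offline multi-agent trajectory batch (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class MultiAgentBatch:
    obs: np.ndarray            # (T, n_agents, obs_dim)      per-agent observations
    state: np.ndarray          # (T, state_dim)              global state for mixers/critics
    actions: np.ndarray        # (T, n_agents)               discrete joint action
    avail_actions: np.ndarray  # (T, n_agents, n_actions)    action-availability masks
    rewards: np.ndarray        # (T,)                        shared team reward
    dones: np.ndarray          # (T,)                        episode-termination flags

    def returns_to_go(self, gamma: float = 1.0) -> np.ndarray:
        """Suffix sums of rewards, a common conditioning signal for
        sequence-model pre-training on offline trajectories."""
        rtg = np.zeros_like(self.rewards, dtype=np.float64)
        acc = 0.0
        for t in range(len(self.rewards) - 1, -1, -1):
            acc = self.rewards[t] + gamma * acc
            rtg[t] = acc
        return rtg
```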

 

2. Basic Pre-training Method for Multi-agent Reinforcement Learning Based on Transformer Models. Despite the availability of collected trajectory data, existing offline multi-agent reinforcement learning pre-training methods still suffer from insufficient modeling capability due to constraints on model capacity and training mechanisms. This thesis introduces transformers into the pre-training and online fine-tuning of multi-agent reinforcement learning, constructing a scalable foundational pre-training method that makes full use of offline trajectories. Experimentally, on multiple maps of the StarCraft Multi-Agent Challenge (SMAC), the proposed method achieves an average improvement of 40% in sample efficiency over online multi-agent reinforcement learning, and it provides an initial implementation for the multi-task pre-training setting. Additionally, this thesis conducts a comparative study of the interface designs and data-embedding schemes of the transformer model during pre-training and fine-tuning in multi-agent reinforcement learning. The dataset, algorithm implementation, and model-parameter configurations have been used and cited by many subsequent papers, providing a robust and scalable foundation for multi-agent reinforcement learning pre-training methods.
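The sketch below shows one common way to feed multi-agent trajectories to a causal transformer, interleaving return-to-go, observation, and action tokens per agent per timestep in the style of a Decision Transformer; the class `TrajectoryEncoder`, its dimensions, and the token ordering are illustrative assumptions, not the interface design evaluated in the thesis.

```python
# Sketch: turning a multi-agent trajectory into one token sequence for a causal
# transformer (Decision-Transformer-style interleaving; details are illustrative).
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, d_model: int = 128):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)              # return-to-go token
        self.embed_obs = nn.Linear(obs_dim, d_model)        # per-agent observation token
        self.embed_act = nn.Embedding(n_actions, d_model)   # previous-action token
        self.embed_t = nn.Embedding(1024, d_model)          # timestep position embedding

    def forward(self, rtg, obs, act, timesteps):
        # rtg: (B, T, A, 1) float, obs: (B, T, A, obs_dim) float,
        # act: (B, T, A) long, timesteps: (B, T) long
        B, T, A, _ = obs.shape
        pos = self.embed_t(timesteps)[:, :, None, :]         # (B, T, 1, D), broadcast over agents
        tokens = torch.stack(
            [self.embed_rtg(rtg) + pos,
             self.embed_obs(obs) + pos,
             self.embed_act(act) + pos], dim=3)              # (B, T, A, 3, D)
        # Flatten to a single autoregressive sequence of length T * A * 3.
        return tokens.reshape(B, T * A * 3, -1)
```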

 

3. Pre-training Method for Multi-agent Reinforcement Learning with Suboptimal Trajectories. In foundational pre-training approaches for multi-agent reinforcement learning, training an expert policy relies on expert trajectories; yet in some scenarios expert trajectories are scarce, and collecting them in turn requires an expert policy interacting with the environment, creating a circular, chicken-and-egg problem. To relax the data-quality constraints of pre-training methods, this thesis proposes RCP, a contrastive pre-training mechanism that guides multi-agent policies with reward-function representations, and designs a corresponding model architecture, YANHUI, based on this mechanism. The approach alleviates the insufficient utilization of suboptimal trajectories and the resulting performance gap in existing methods. Experimentally, YANHUI exploits expert and suboptimal trajectories simultaneously and remains robust across different proportions of suboptimal trajectories; even with 90% suboptimal trajectories, it matches the performance of the then state-of-the-art pre-trained models. This further provides a new mechanism and method for multi-agent reinforcement learning pre-training in terms of utilizing trajectories of varying quality.
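As a rough illustration of reward-representation-guided contrastive pre-training, the snippet below shows a generic InfoNCE objective that pulls each trajectory-segment embedding toward its paired reward-function embedding; the function name and pairing scheme are assumptions for exposition, not RCP's exact loss.

```python
# Generic InfoNCE-style contrastive loss between trajectory-segment embeddings
# and reward-representation embeddings (a sketch of an RCP-like objective).
import torch
import torch.nn.functional as F

def info_nce(traj_emb: torch.Tensor, reward_emb: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """traj_emb, reward_emb: (B, D); row i of each tensor forms a positive pair."""
    traj_emb = F.normalize(traj_emb, dim=-1)
    reward_emb = F.normalize(reward_emb, dim=-1)
    logits = traj_emb @ reward_emb.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(traj_emb.size(0), device=traj_emb.device)
    # Symmetric cross-entropy: each trajectory should match its own reward code,
    # and each reward code should match its own trajectory.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```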

 

4. Pre-training Method for Universal Policies in Multi-agent Reinforcement Learning with Multi-task Trajectories. Traditional pre-training approaches in multi-agent reinforcement learning train a policy offline on single-task data and then fine-tune it on that specific task. This thesis introduces M3, a method for pre-training and fine-tuning universal policies on multi-task trajectory data, which achieves dynamic policy sharing across tasks through discretized representations of policies and tasks. Additionally, this thesis decomposes the latent space of the policy representation into agent-shared and agent-specific modules, dynamically supporting cooperation among agents across different tasks and further reducing generalization error. Leveraging the multi-task trajectories and training platform provided by D4MARL, M3 achieves effective few-shot and even zero-shot transfer across multiple sets of challenging tasks. This further contributes a new mechanism and method to multi-agent reinforcement learning pre-training in terms of utilizing trajectories from different tasks.
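To illustrate what a discretized policy/task representation with agent-shared and agent-specific modules could look like, here is a small vector-quantization-style sketch; the class `DiscretePolicyCode`, its heads, and its sizes are hypothetical and do not reproduce M3's actual architecture.

```python
# Sketch: codebook-based (discretized) task/policy representation combined with
# an agent-shared decoder and a per-agent adapter (illustrative assumptions only).
import torch
import torch.nn as nn

class DiscretePolicyCode(nn.Module):
    def __init__(self, latent_dim: int = 64, n_codes: int = 32, n_actions: int = 10):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, latent_dim)         # shared discrete code table
        self.shared_head = nn.Linear(latent_dim * 2, n_actions)   # agent-shared action decoder
        self.specific_head = nn.Linear(latent_dim, latent_dim)    # agent-specific adapter

    def forward(self, task_latent: torch.Tensor, agent_latent: torch.Tensor) -> torch.Tensor:
        # task_latent: (B, latent_dim) continuous task embedding
        # agent_latent: (B, latent_dim) per-agent local embedding
        dists = torch.cdist(task_latent, self.codebook.weight)    # (B, n_codes) distances
        code = self.codebook(dists.argmin(dim=-1))                # nearest discrete code
        # Straight-through estimator keeps the encoder trainable through argmin.
        code = task_latent + (code - task_latent).detach()
        specific = self.specific_head(agent_latent)
        return self.shared_head(torch.cat([code, specific], dim=-1))  # action logits
```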

 

Keywords: Multi-agent Reinforcement Learning; Pre-training Methods; Neural Networks; Representation Learning; Online Reinforcement Evaluation
Language: Chinese
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/56560
Collection: Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714):
孟令辉. 多智能体强化学习预训练方法研究[D], 2024.
Files in this item:
201918014628045-孟令辉- (6367KB), dissertation, restricted access, CC BY-NC-SA
 
