多智能体强化学习预训练方法研究 (Research on Pre-training Methods for Multi-agent Reinforcement Learning)
孟令辉
2024-05-15
Pages | 146
Degree Type | Doctoral
Abstract (in Chinese) | Since the mid-2010s, multi-agent reinforcement learning powered by neural networks has become the primary route to improving collective decision-making intelligence: it promotes collaborative decision-making among agents by optimizing the group system as a whole, aiming at efficient collective decision-making. To exploit the strong representational power of neural networks, the field has kept introducing new techniques for increasingly diverse decision-making scenarios, giving rise to representative algorithms based on value decomposition (value decomposition, e.g., QMIX), policy sharing (policy-sharing, e.g., MAPPO), and offline pre-training (offline pre-training). The first two have reached superhuman decision-making levels in video games, traffic signal control, advertising bidding, and even the energy industry. However, constrained jointly by their online trial-and-error learning mechanism and their multi-agent nature, these methods suffer from "low sample efficiency" caused by the curse of dimensionality, and further from "poor generalization" across tasks. In response, multi-agent reinforcement learning frameworks that move online trial-and-error data collection into an offline stage and build on pre-training techniques have emerged. Meanwhile, offline reinforcement learning pre-training for a single controlled entity has matured in data, platforms, and algorithms, partially alleviating the sample-efficiency problems of reinforcement learning; data and platforms designed specifically for pre-training frameworks in the multi-agent domain therefore demand urgent study. Moreover, being at an early stage of development, data-driven pre-training methods still struggle to exploit trajectory data with real-world characteristics, i.e., they face a "training bottleneck" on suboptimal and multi-source trajectories.
Against this background, this thesis studies three main problems in the multi-agent reinforcement learning pre-training framework: "missing offline trajectories and insufficient modeling capability of basic pre-training methods", "insufficient utilization of suboptimal trajectories and inadequate performance", and "insufficient task discrimination when modeling with multi-source trajectories". Accordingly, the thesis focuses on offline trajectory collection and the construction of a foundational platform for multi-agent reinforcement learning pre-training, and proceeds along the lines of exploring novel model structures and of designing and optimizing pre-training mechanisms for trajectory data of varying quality and sources, completing four innovative contributions:
|
Abstract (in English) | Since the mid-2010s, multi-agent reinforcement learning instantiated with neural networks has emerged as the primary approach to enhancing collective intelligent decision-making capabilities. This technology promotes collaborative decision-making among agents by optimizing the entire group system, aiming to achieve efficient collective decision-making intelligence. To leverage the powerful representational capabilities of neural networks, the field has continuously introduced new techniques in increasingly diverse decision-making scenarios, giving rise to representative algorithms based on value decomposition (such as QMIX), policy-sharing (such as MAPPO), and offline pre-training. Among them, the former two have achieved superhuman decision-making levels in video games, traffic signal control, advertising bidding, and even the energy industry. However, due to the dual constraints of their online trial-and-error learning mechanisms and multi-agent characteristics, they suffer from the "sample inefficiency" problem caused by the curse of dimensionality and further exhibit poor generalization across tasks. To address this issue, multi-agent reinforcement learning frameworks that move the online trial-and-error data collection process to the offline stage and build on pre-training techniques have emerged. In addition, the data, platforms, and algorithms for single-agent offline reinforcement learning pre-training have matured, alleviating some of the sample-efficiency issues in reinforcement learning; there is therefore an urgent need for research on data and platforms specifically designed for pre-training frameworks in the multi-agent domain. Furthermore, in this early stage of development, data-driven pre-training methods still face a "training bottleneck" when utilizing trajectory data with real-world characteristics, such as suboptimal and multi-task trajectories.
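The value-decomposition family mentioned above (e.g., QMIX) factorizes a joint action value into per-agent utilities under a monotonicity constraint. The following is only a minimal pure-Python sketch of such a monotonic mixer, not the thesis's code; QMIX itself produces state-conditioned weights with hypernetworks, and all names here are illustrative:

```python
def monotonic_mix(agent_qs, weights, bias):
    # Non-negative mixing weights (enforced via abs) keep dQ_tot/dQ_i >= 0,
    # so each agent's greedy local action stays consistent with the greedy
    # joint action -- the consistency property value decomposition relies on.
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias

# Per-agent utilities for one joint action, with hypothetical mixing
# parameters (in QMIX these would come from a hypernetwork on the state).
q_tot = monotonic_mix([1.5, -0.3, 0.8], weights=[0.5, -0.2, 1.0], bias=0.1)
```

Because the mixer is monotone, raising any single agent's utility can only raise the joint value, which is what makes decentralized greedy action selection sound.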
In this context, this thesis focuses on three main issues within the pre-training framework of multi-agent reinforcement learning: "the lack of offline trajectories and insufficient modeling capability of basic pre-training methods", "inadequate utilization of suboptimal trajectories and insufficient performance", and "insufficient task discrimination when modeling with multi-task trajectories". Correspondingly, the thesis emphasizes the collection of offline trajectories and the establishment of a foundational benchmark for pre-training methods in multi-agent reinforcement learning, conducting research along two lines: exploring novel model structures, and designing and optimizing pre-training mechanisms that utilize trajectory data of varying quality and tasks. The four main innovations are as follows:
1. Construction of Datasets and a Benchmark for Multi-agent Reinforcement Learning Pre-training Methods. To address the low sample efficiency and poor generalization of online multi-agent reinforcement learning, a new paradigm represented by pre-training frameworks has gradually emerged in the multi-agent domain. However, large-scale, diverse datasets and validated training platforms tailored to pre-training frameworks are still lacking. This poses challenges such as selecting decision-making scenarios and collection methods, analyzing trajectory rationality, and defining unified, fair evaluation metrics. This thesis proposes D4MARL, a fundamental dataset and training platform designed specifically for multi-agent reinforcement learning pre-training. Through an analysis of the distribution of various elements in trajectories of different qualities, the thesis demonstrates the rationality of the dataset. Additionally, it introduces UTOPIA, a multi-agent reinforcement learning framework that builds world models from dynamics representations, and proposes a trajectory collection method based on denoised world models to reduce the cost of collecting high-quality trajectories. Empirical results show that the provided trajectory data are complete and diverse and that the benchmark provided by the training platform is reasonable, thus providing effective support for multi-agent reinforcement learning pre-training.
2. Basic Pre-training Method for Multi-agent Reinforcement Learning Based on Transformer Models. Despite the availability of collected trajectory data, existing offline multi-agent reinforcement learning pre-training methods still suffer from insufficient modeling capability due to constraints on model capacity and training mechanisms. This thesis introduces transformers into the pre-training and online fine-tuning of multi-agent reinforcement learning, constructing a scalable basic pre-training method that exploits offline trajectories. Experimentally, on multiple SMAC maps, the proposed method achieves an average improvement of 40% in sample efficiency over online multi-agent reinforcement learning and provides an implementation approach for multi-task pre-training settings. The thesis also conducts a comparative study on the input design and data-embedding schemes of the transformer model during pre-training and fine-tuning in multi-agent reinforcement learning. The dataset, algorithm implementation, and model parameter combinations have been used and referenced by many subsequent papers, providing a robust and scalable foundation for multi-agent reinforcement learning pre-training methods.
3. Pre-training Method for Multi-agent Reinforcement Learning with Suboptimal Trajectories. The basic pre-training approach for multi-agent reinforcement learning relies on expert trajectories for training. However, in some scenarios the scale of expert trajectories is limited, and collecting them requires interaction between expert policies and the environment, creating a circular dependency. To relax the data-quality constraints of pre-training methods, this thesis proposes RCP, a contrastive pre-training mechanism that guides multi-agent policies using reward-function representations, and, based on this mechanism, designs YANHUI, a corresponding model structure for the multi-agent setting. This approach alleviates the insufficient utilization of suboptimal trajectories and the inadequate performance of existing methods. Experimentally, YANHUI can utilize expert and suboptimal trajectories simultaneously and exhibits good robustness under different proportions of suboptimal trajectories: even with 90% suboptimal trajectories, it achieves performance comparable to the then state-of-the-art pre-training model. This further provides new mechanisms and methods for multi-agent reinforcement learning pre-training in terms of utilizing trajectories of varying quality.
4. Pre-training Method for Universal Policies in Multi-agent Reinforcement Learning with Multi-task Trajectories. Traditional pre-training approaches in multi-agent reinforcement learning train policies offline on single-task data and then fine-tune them on that specific task. This thesis introduces M3, a method for pre-training and fine-tuning universal policies on multi-task trajectory data. By discretizing representations of policies and tasks, it achieves dynamic sharing of policies across tasks. Additionally, in the latent space of policy representations, the thesis decomposes policies into agent-shared and agent-specific modules, dynamically supporting collaboration among agents across different tasks and further reducing generalization error. Leveraging the multi-task trajectories and training platform provided by D4MARL, M3 achieves effective few-shot and even zero-shot transfer across multiple sets of challenging tasks. This further contributes new mechanisms and methods to multi-agent reinforcement learning pre-training, particularly in terms of utilizing trajectories from different tasks.
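Contribution 2 rests on treating offline trajectories as token sequences for a transformer. As a hedged illustration of that family of methods (a Decision-Transformer-style layout, not necessarily the thesis's exact scheme; all helper names are hypothetical), each timestep can be expanded into return-to-go, observation, and action tokens:

```python
def returns_to_go(rewards):
    """Suffix sums of the team reward: the conditioning signal used in
    return-conditioned sequence modeling of trajectories."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def to_tokens(rewards, observations, actions):
    """Interleave (return-to-go, observation, action) triples into one
    flat token sequence that a causal transformer could consume."""
    seq = []
    for g, o, a in zip(returns_to_go(rewards), observations, actions):
        seq.extend([("rtg", g), ("obs", o), ("act", a)])
    return seq
```

During pre-training the model is fit to predict each action token from the preceding tokens; at fine-tuning or deployment time, a high target return is fed in as the first token to elicit high-quality behavior.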
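The RCP mechanism of contribution 3 is contrastive: embeddings of matching trajectory/reward pairs are pulled together and mismatched ones pushed apart. A generic InfoNCE-style loss of the kind such mechanisms typically build on (a standard textbook form, not the thesis's exact objective; all names are hypothetical) can be written as:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: -log softmax score of the positive pair among
    one positive and several negative candidate embeddings."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(anchor, positive) / temperature] + [
        dot(anchor, n) / temperature for n in negatives
    ]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is near zero when the anchor aligns with its positive and grows as negatives score higher, which is what lets a reward representation steer which trajectories the policy imitates.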
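Contribution 4 hinges on discretizing policy and task representations. One common way to realize such discretization is a nearest-neighbor codebook lookup as in vector quantization; the sketch below is only an assumed illustration of that general idea (function and variable names hypothetical), not M3's implementation:

```python
def quantize(z, codebook):
    """Map a continuous policy/task embedding z to its nearest codebook
    entry; returns (index, code). The shared discrete index is what would
    let different tasks dynamically share the same policy module."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    idx = min(range(len(codebook)), key=lambda i: sqdist(z, codebook[i]))
    return idx, codebook[idx]
```

Two tasks whose embeddings fall into the same codebook cell are routed to the same discrete code, so the policy components attached to that code are reused across them.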
|
Keywords | Multi-agent Reinforcement Learning; Pre-training Methods; Neural Networks; Representation Learning; Online Reinforcement Evaluation
Language | Chinese
Document Type | Thesis
Identifier | http://ir.ia.ac.cn/handle/173211/56560
Collection | Graduates_Doctoral Theses
Recommended Citation (GB/T 7714) | 孟令辉. 多智能体强化学习预训练方法研究[D], 2024.
Files in This Item |
File Name/Size | Document Type | Version | Access | License
201918014628045-孟令辉-(6367KB) | Thesis | | Restricted Access | CC BY-NC-SA