面向多目标覆盖任务的深度强化学习迁移泛化方法研究 (Research on Deep Reinforcement Learning Transfer Generalization Methods for Multi-Target Coverage Tasks)
徐一凡 (Xu Yifan)
2024-06
Pages: 74
Degree type: Master's
Chinese Abstract

With the rapid development of unmanned technology, multi-agent multi-target coverage tasks have attracted considerable attention in fields such as communication and military operations. Traditional optimization methods model the optimization objectives and constraints of a specific coverage scenario, which requires substantial expert knowledge and struggles with dynamic sequential decision-making tasks. Deep reinforcement learning, a class of methods that currently plays a major role in the decision-making field, learns by interacting with the environment, allows flexible reward function design, and can handle high-dimensional inputs, and has therefore also received wide attention for multi-target coverage tasks. However, the high training cost and weak transfer generalization of deep reinforcement learning methods make them difficult to deploy flexibly in complex and changeable multi-target coverage scenarios, which poses new challenges for this field.
In recent years, Deep Reinforcement Learning (DRL) has achieved remarkable success in decision and control. Nevertheless, DRL algorithms show low flexibility and low adaptability when dealing with highly complex, rapidly changing real-world applications, which hinders their deployment in the real world. Among the many application scenarios, the multi-agent multi-target coverage task features flexible and variable scenes, continuous and highly dynamic behaviors, and coupling among multiple elements. As a typical representative of complex real-world application tasks, it is adopted in this thesis as the task scenario for studying the transfer generalization problem of DRL.
Therefore, this thesis takes the multi-agent multi-target coverage task as the research scenario and focuses on the transfer generalization capability of deep reinforcement learning methods in this class of complex decision-making tasks. The main contents and innovations of this thesis are as follows:
(1) To address the difficulty of exploring the policy state space caused by an increase in scenario elements, a twin-observation curriculum learning algorithm is proposed for multi-agent policy transfer in new environments. For observations that become hard to process as the number of targets grows, a clustering method extracts the key information from the observation, which is combined with the original information as the input feature of the policy front end. For the state-space explosion caused by an increase in the number of agents, curriculum learning lets the algorithm learn progressively from simple to complex environments. Experiments verify that the twin-observation module effectively handles the growth in observation dimensionality caused by more targets, while the curriculum learning mechanism improves training efficiency and policy performance as the number of agents increases.
(2) To address policy overfitting caused by a single training environment, a policy learning framework with randomized environment generation and shared feature extraction is proposed to improve the zero-shot generalization of policies in multi-target coverage tasks. For the single source of training data, a set of environment parameter distributions is designed to randomly generate training environments, improving policy generalization from a data-augmentation perspective. For catastrophic forgetting under the curriculum learning training scheme, a domain-adaptation-based module is designed to extract features shared across environments, providing environmental prior knowledge at the feature level to assist downstream policy learning. Experiments verify that domain randomization effectively alleviates overfitting to a single scenario and that domain adaptation improves the algorithm's transfer generalization in new environments.
(3) To address the difficulty of quantitatively evaluating the transfer generalization ability of reinforcement learning policies, an evaluation framework for policy transfer generalization oriented to environment differences is proposed, and two methods are presented under this framework. From the perspective of policy performance, test statistics are built from return values, which directly reflect policy performance, to evaluate how the policy's returns change in the new environment; three ways of computing returns with the source environment's reward function are used to locate the specific module in which the environment shift occurs. From the perspective of policy behavior, statistics are built from the policy's trajectories in the source and target environments to evaluate how the trajectories change in the new environment. Experiments comparing the two proposed detection methods with other baseline algorithms verify their effectiveness and flexibility.
In summary, this thesis takes the application of deep reinforcement learning to multi-agent multi-target coverage tasks as the task scenario and studies the transfer generalization problem of deep reinforcement learning methods in this class of complex decision-making scenarios. On one hand, the research results directly provide algorithmic support for the transfer generalization and practical deployment of multi-target coverage tasks; on the other hand, they offer ideas for solving the transfer generalization problem of reinforcement learning.

English Abstract

In today’s era of rapid development of unmanned technology, multi-agent multi-target
coverage tasks are receiving significant attention in areas such as communication
and military operations. Traditional optimization methods model optimization objectives
and constraints for specific coverage scenarios, requiring extensive expert knowledge
and being ill-suited for handling dynamic sequential decision-making tasks. Deep
Reinforcement Learning (DRL), as a category of methods making a significant impact
in the decision-making field, is characterized by its capacity to interact with and learn
from the environment, flexible design of reward functions, and adaptability to high-dimensional
inputs. Therefore, it has also garnered widespread attention for multi-target
coverage tasks. However, the high training cost and weak transfer generalization of DRL
methods make them difficult to deploy flexibly in complex and variable multi-target coverage scenarios,
posing new challenges for the field.
In recent years, Deep Reinforcement Learning (DRL) has achieved remarkable accomplishments
in the decision control domain. Despite this, DRL algorithms exhibit
low flexibility and adaptability when dealing with complex and variable real-world applications,
hindering DRL’s application in the real world. In many application scenarios,
multi-agent multi-target coverage tasks involve factors such as flexible and variable
scenes, continuous and highly dynamic behaviors, and multi-element coupling, representing
typical complex real-world application tasks. This paper takes this task as a
scenario for researching the transfer generalization problem of DRL.
Accordingly, this paper focuses on the multi-agent multi-target coverage task as the
research scenario and primarily studies the transfer generalization capabilities of DRL
methods under such complex decision-making tasks. The main content and innovations
of this paper are as follows:
(1) To address the problem of policy state space exploration difficulties brought by
increased scenario elements, a twin-observation curriculum learning algorithm is proposed
for multi-agent policy transfer in new environments. To handle the issues posed
by an increased number of targets, which complicates the processing of observational
information, a clustering method is used to extract key information from observations,
which is combined with the original information as the input features of the policy front end. To
counteract the state space explosion problem caused by an increased number of agents,
a curriculum learning method is employed to achieve gradual learning from simple to
complex environments. Experiments demonstrate that the twin-observation module can
effectively deal with the issues of increasing observation dimensions due to a larger
number of targets, while the curriculum learning mechanism improves the training efficiency and policy performance as the scale of agents increases.
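As an illustration of the clustering step in (1), the following minimal sketch assumes that the raw observation exposes the positions of all visible targets; it compresses them into a fixed number of cluster centers and appends the result to the original observation. The function name, the choice of k-means, and the zero-padding scheme are illustrative assumptions rather than the exact design used in the thesis.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_policy_input(raw_obs, target_positions, k_clusters=4):
        """Compress a variable number of target positions into k cluster centers
        and concatenate them with the raw observation (illustrative sketch)."""
        target_positions = np.asarray(target_positions, dtype=float)
        k = min(k_clusters, len(target_positions))
        centers = KMeans(n_clusters=k, n_init=10).fit(target_positions).cluster_centers_
        # Zero-pad so the policy input keeps a fixed dimension even when fewer
        # than k_clusters targets are visible.
        padded = np.zeros((k_clusters, target_positions.shape[1]))
        padded[:k] = centers
        return np.concatenate([np.asarray(raw_obs, dtype=float), padded.ravel()])

Keeping the compressed summary at a fixed size is what allows the same policy network to be reused as the number of targets grows.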
(2) To address policy overfitting caused by a single training environment, a policy learning
framework with stochastic environment generation and shared feature extraction is
proposed to enhance the zero-shot generalization capability of policies in multi-target
coverage tasks. To tackle the issue of a single source of data during the training process,
a series of environmental parameter distributions are designed to randomize the
generation of training environments, enhancing policy generalization from a data augmentation
perspective. To address catastrophic forgetting under the curriculum learning
training mechanism, a domain adaptation-based common feature extraction module
is devised, which extracts prior knowledge from the environment to assist downstream
policy learning. Experiments verify that domain randomization effectively addresses algorithm
overfitting to single scenarios, and domain adaptation improves the algorithm’s
transfer generalization ability in new environments.
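The domain randomization described in (2) can be pictured as drawing the scenario parameters of each training episode from hand-designed distributions. In the sketch below, the parameter names, their ranges, and the CoverageEnv constructor are hypothetical placeholders used only to illustrate the idea.

    import random

    def sample_env_params():
        """Draw one set of scenario parameters for a randomized training episode.
        Parameter names and ranges are illustrative assumptions."""
        return {
            "num_agents":    random.randint(3, 10),
            "num_targets":   random.randint(5, 30),
            "arena_size":    random.uniform(50.0, 200.0),   # side length of the workspace
            "agent_speed":   random.uniform(0.5, 2.0),
            "sensing_range": random.uniform(5.0, 20.0),
        }

    # During training a fresh environment would be built per episode, for example
    #   env = CoverageEnv(**sample_env_params())   # CoverageEnv is a hypothetical constructor
    # so the policy is exposed to a distribution of scenarios rather than one fixed layout.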
(3) To tackle the problem of quantitatively assessing the transfer generalization
ability of reinforcement learning policies, a policy transfer generalization performance
evaluation framework oriented towards environmental differences is proposed, and two
methods are introduced under this framework. From the perspective of policy performance,
test statistics are constructed from return values, which directly reflect policy performance,
to assess how the policy's returns change in new environments.
Three different methods of computing return values using the reward function
from the source environment are used to pinpoint which specific modules are affected
by environmental shifts. From the perspective of policy behavior, statistical measures
using policy trajectories in source and target environments assess changes in policy trajectories
in new environments. Experiments comparing the two proposed methods with other
baseline algorithms validate their effectiveness and flexibility.
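The return-based evaluation in (3) can be pictured as a two-sample test on episode returns collected by the same policy in the source environment and in the possibly shifted target environment. The thesis constructs its own test statistics; Welch's t-test in the sketch below is only a stand-in to show the shape of such a detector.

    import numpy as np
    from scipy.stats import ttest_ind

    def detect_return_shift(source_returns, target_returns, alpha=0.05):
        """Flag an environment shift when the policy's episode returns in the
        target environment differ significantly from those in the source."""
        stat, p_value = ttest_ind(np.asarray(source_returns, dtype=float),
                                  np.asarray(target_returns, dtype=float),
                                  equal_var=False)          # Welch's t-test
        return {"statistic": float(stat),
                "p_value": float(p_value),
                "shift_detected": bool(p_value < alpha)}

A trajectory-based counterpart would replace the scalar returns with a distance between the state distributions visited in the two environments, for example a kernel two-sample statistic.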
To sum up, this paper takes the application of deep reinforcement learning in multi-agent
multi-target coverage tasks as the task scenario and investigates the transfer generalization
issues of DRL methods in such complex decision-making scenarios. The research
findings of this paper provide algorithmic support for the transfer generalization
and practical deployment of multi-target coverage tasks, as well as offering solutions
for the transfer generalization problems of reinforcement learning.

Keywords: Multi-Target Coverage Task; Reinforcement Learning; Transfer Generalization; Curriculum Learning; Domain Adaptation; Environment Shift
Language: Chinese
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57455
Collection: Graduates - Master's Theses
Recommended citation (GB/T 7714):
徐一凡. 面向多目标覆盖任务的深度强化学习迁移泛化方法研究[D], 2024.
Files in this item:
File name (size) | Document type | Access | License
学位论文终稿_徐一凡.pdf (20521 KB) | Degree thesis | Restricted access | CC BY-NC-SA