面向多目标覆盖任务的深度强化学习迁移泛化方法研究 (Research on Deep Reinforcement Learning Transfer Generalization Methods for Multi-Target Coverage Tasks)
徐一凡 (Xu Yifan)
2024-06
Pages: 74
Degree type: Master's
Chinese Abstract

With the rapid development of unmanned technology, multi-agent multi-target coverage tasks have attracted considerable attention in fields such as communication and military operations. Traditional optimization methods model the optimization objectives and constraints of a specific coverage scenario, which requires substantial expert knowledge and struggles with dynamic sequential decision-making tasks. Deep reinforcement learning, a class of methods that currently plays a major role in the decision-making field, learns by interacting with the environment, allows flexible reward function design, and can handle high-dimensional inputs, and has therefore also received wide attention for multi-target coverage tasks. However, the high training cost and weak transfer generalization of deep reinforcement learning methods make them difficult to deploy flexibly in complex and changeable multi-target coverage scenarios, which poses new challenges for this field.
In recent years, Deep Reinforcement Learning (DRL) has achieved remarkable success in decision and control. Nevertheless, DRL algorithms show low flexibility and low adaptability when dealing with highly complex, rapidly changing real-world applications, which hinders their deployment in the real world. Among the many application scenarios, the multi-agent multi-target coverage task features flexible and variable scenes, continuous and highly dynamic behaviors, and coupling among multiple elements. As a typical representative of complex real-world application tasks, it is adopted in this thesis as the task scenario for studying the transfer generalization problem of DRL.
Therefore, this thesis takes the multi-agent multi-target coverage task as the research scenario and focuses on the transfer generalization capability of deep reinforcement learning methods in this class of complex decision-making tasks. The main contents and innovations of this thesis are as follows:
(1) To address the difficulty of exploring the policy state space caused by an increase in scenario elements, a twin-observation curriculum learning algorithm is proposed for multi-agent policy transfer in new environments. For observations that become hard to process as the number of targets grows, a clustering method extracts the key information from the observation, which is combined with the original information as the input feature of the policy front end. For the state-space explosion caused by an increase in the number of agents, curriculum learning lets the algorithm learn progressively from simple to complex environments. Experiments verify that the twin-observation module effectively handles the growth in observation dimensionality caused by more targets, while the curriculum learning mechanism improves training efficiency and policy performance as the number of agents increases.
(2) To address policy overfitting caused by a single training environment, a policy learning framework with randomized environment generation and shared feature extraction is proposed to improve the zero-shot generalization of policies in multi-target coverage tasks. For the single source of training data, a set of environment parameter distributions is designed to randomly generate training environments, improving policy generalization from a data-augmentation perspective. For catastrophic forgetting under the curriculum learning training scheme, a domain-adaptation-based module is designed to extract features shared across environments, providing environmental prior knowledge at the feature level to assist downstream policy learning. Experiments verify that domain randomization effectively alleviates overfitting to a single scenario and that domain adaptation improves the algorithm's transfer generalization in new environments.
(3) To address the difficulty of quantitatively evaluating the transfer generalization ability of reinforcement learning policies, an evaluation framework for policy transfer generalization oriented to environment differences is proposed, and two methods are presented under this framework. From the perspective of policy performance, test statistics are built from return values, which directly reflect policy performance, to evaluate how the policy's returns change in the new environment; three ways of computing returns with the source environment's reward function are used to locate the specific module in which the environment shift occurs. From the perspective of policy behavior, statistics are built from the policy's trajectories in the source and target environments to evaluate how the trajectories change in the new environment. Experiments comparing the two proposed detection methods with other baseline algorithms verify their effectiveness and flexibility.
In summary, this thesis takes the application of deep reinforcement learning to multi-agent multi-target coverage tasks as the task scenario and studies the transfer generalization problem of deep reinforcement learning methods in this class of complex decision-making scenarios. On one hand, the research results directly provide algorithmic support for the transfer generalization and practical deployment of multi-target coverage tasks; on the other hand, they offer ideas for solving the transfer generalization problem of reinforcement learning.

English Abstract

In today’s era of rapid development of unmanned technology, multi-agent multi-target
coverage tasks are receiving significant attention in areas such as communication
and military operations. Traditional optimization methods model optimization objectives
and constraints for specific coverage scenarios, requiring extensive expert knowledge
and being ill-suited for handling dynamic sequential decision-making tasks. Deep
Reinforcement Learning (DRL), as a category of methods making a significant impact
in the decision-making field, is characterized by its capacity to interact with and learn
from the environment, flexible design of reward functions, and adaptability to high-dimensional
inputs. Therefore, it has also garnered widespread attention for multi-target
coverage tasks. However, the high training cost and weak transfer generalization of DRL
methods make them difficult to deploy flexibly in complex and variable multi-target coverage scenarios,
posing new challenges for the field.
In recent years, Deep Reinforcement Learning (DRL) has achieved remarkable accomplishments
in the decision control domain. Despite this, DRL algorithms exhibit
low flexibility and adaptability when dealing with complex and variable real-world applications,
hindering DRL’s application in the real world. In many application scenarios,
multi-agent multi-target coverage tasks involve factors such as flexible and variable
scenes, continuous and highly dynamic behaviors, and multi-element coupling, representing
typical complex real-world application tasks. This paper takes this task as a
scenario for researching the transfer generalization problem of DRL.
Accordingly, this paper focuses on the multi-agent multi-target coverage task as the
research scenario and primarily studies the transfer generalization capabilities of DRL
methods under such complex decision-making tasks. The main content and innovations
of this paper are as follows:
(1) To address the problem of policy state space exploration difficulties brought by
increased scenario elements, a twin-observation curriculum learning algorithm is proposed
for multi-agent policy transfer in new environments. To handle the issues posed
by an increased number of targets, which complicates the processing of observational
information, a clustering method is used to extract key information from observations,
which is combined with the original information as the input features of the policy front end. To
counteract the state space explosion problem caused by an increased number of agents,
a curriculum learning method is employed to achieve gradual learning from simple to
complex environments. Experiments demonstrate that the twin-observation module can
effectively deal with the issues of increasing observation dimensions due to a larger
number of targets, while the curriculum learning mechanism improves the training efficiency and policy performance as the scale of agents increases.
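As an illustration of the clustering step in (1), the following minimal sketch assumes that the raw observation exposes the positions of all visible targets; it compresses them into a fixed number of cluster centers and appends the result to the original observation. The function name, the choice of k-means, and the zero-padding scheme are illustrative assumptions rather than the exact design used in the thesis.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_policy_input(raw_obs, target_positions, k_clusters=4):
        """Compress a variable number of target positions into k cluster centers
        and concatenate them with the raw observation (illustrative sketch)."""
        target_positions = np.asarray(target_positions, dtype=float)
        k = min(k_clusters, len(target_positions))
        centers = KMeans(n_clusters=k, n_init=10).fit(target_positions).cluster_centers_
        # Zero-pad so the policy input keeps a fixed dimension even when fewer
        # than k_clusters targets are visible.
        padded = np.zeros((k_clusters, target_positions.shape[1]))
        padded[:k] = centers
        return np.concatenate([np.asarray(raw_obs, dtype=float), padded.ravel()])

Keeping the compressed summary at a fixed size is what allows the same policy network to be reused as the number of targets grows.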
(2) To address policy overfitting caused by a single training environment, a policy learning
framework with stochastic environment generation and shared feature extraction is
proposed to enhance the zero-shot generalization capability of policies in multi-target
coverage tasks. To tackle the issue of a single source of data during the training process,
a series of environmental parameter distributions are designed to randomize the
generation of training environments, enhancing policy generalization from a data augmentation
perspective. To address catastrophic forgetting under the curriculum learning
training mechanism, a domain adaptation-based common feature extraction module
is devised, which extracts prior knowledge from the environment to assist downstream
policy learning. Experiments verify that domain randomization effectively addresses algorithm
overfitting to single scenarios, and domain adaptation improves the algorithm’s
transfer generalization ability in new environments.
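The domain randomization described in (2) can be pictured as drawing the scenario parameters of each training episode from hand-designed distributions. In the sketch below, the parameter names, their ranges, and the CoverageEnv constructor are hypothetical placeholders used only to illustrate the idea.

    import random

    def sample_env_params():
        """Draw one set of scenario parameters for a randomized training episode.
        Parameter names and ranges are illustrative assumptions."""
        return {
            "num_agents":    random.randint(3, 10),
            "num_targets":   random.randint(5, 30),
            "arena_size":    random.uniform(50.0, 200.0),   # side length of the workspace
            "agent_speed":   random.uniform(0.5, 2.0),
            "sensing_range": random.uniform(5.0, 20.0),
        }

    # During training a fresh environment would be built per episode, for example
    #   env = CoverageEnv(**sample_env_params())   # CoverageEnv is a hypothetical constructor
    # so the policy is exposed to a distribution of scenarios rather than one fixed layout.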
(3) To tackle the problem of quantitatively assessing the transfer generalization
ability of reinforcement learning policies, a policy transfer generalization performance
evaluation framework oriented towards environmental differences is proposed, and two
methods are introduced under this framework. From the perspective of policy performance,
test statistics are constructed from return values, which directly reflect policy performance,
to assess how the policy's returns change in new environments.
Three different methods of computing return values using the reward function
from the source environment are used to pinpoint which specific modules are affected
by environmental shifts. From the perspective of policy behavior, statistical measures
using policy trajectories in source and target environments assess changes in policy trajectories
in new environments. Experiments comparing the two proposed methods with other
baseline algorithms validate their effectiveness and flexibility.
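The return-based evaluation in (3) can be pictured as a two-sample test on episode returns collected by the same policy in the source environment and in the possibly shifted target environment. The thesis constructs its own test statistics; Welch's t-test in the sketch below is only a stand-in to show the shape of such a detector.

    import numpy as np
    from scipy.stats import ttest_ind

    def detect_return_shift(source_returns, target_returns, alpha=0.05):
        """Flag an environment shift when the policy's episode returns in the
        target environment differ significantly from those in the source."""
        stat, p_value = ttest_ind(np.asarray(source_returns, dtype=float),
                                  np.asarray(target_returns, dtype=float),
                                  equal_var=False)          # Welch's t-test
        return {"statistic": float(stat),
                "p_value": float(p_value),
                "shift_detected": bool(p_value < alpha)}

A trajectory-based counterpart would replace the scalar returns with a distance between the state distributions visited in the two environments, for example a kernel two-sample statistic.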
To sum up, this paper takes the application of deep reinforcement learning in multi-agent
multi-target coverage tasks as the task scenario and investigates the transfer generalization
issues of DRL methods in such complex decision-making scenarios. The research
findings of this paper provide algorithmic support for the transfer generalization
and practical deployment of multi-target coverage tasks, as well as offering solutions
for the transfer generalization problems of reinforcement learning.

Keywords: Multi-Target Coverage Task; Reinforcement Learning; Transfer Generalization; Curriculum Learning; Domain Adaptation; Environment Shift
Language: Chinese
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57455
Collection: Graduates - Master's Theses
Recommended citation (GB/T 7714):
徐一凡. 面向多目标覆盖任务的深度强化学习迁移泛化方法研究[D], 2024.
Files in this item:
File name (size) | Document type | Access | License
学位论文终稿_徐一凡.pdf (20521 KB) | Degree thesis | Restricted access | CC BY-NC-SA