Continuous-Action Air Combat Adversarial Decision-Making Based on Deep Reinforcement Learning
李伟凡
2023-06-26
Pages: 140
Degree type: Doctoral
Chinese Abstract

Air combat refers to military operations conducted to contest air supremacy, including air defense and aerial dogfighting. In air combat, controlling missiles and aircraft to destroy the opponent's means of air control is the principal approach. With the development of artificial intelligence technology, the control of missiles, aircraft, and unmanned aerial vehicles is becoming increasingly intelligent, and AI-based unmanned systems will become the mainstream of future air combat. The decision-making capability of artificial intelligence will therefore be one of the keys to victory or defeat. Methods based on deep reinforcement learning can learn high-performance decision policies through trial and error, but their performance is often insufficient when facing continuous action spaces. To improve agent performance, deep reinforcement learning methods often discretize the action space; yet discretized aircraft control is imprecise, unstable, and inconsistent with the mechanical structure, while continuous-action aircraft control is typically hard to train, inefficient, and unstable in self-play. Designing reinforcement learning algorithms for continuous-action controllers in air combat therefore still faces many challenges.

Building on a survey of the current state of research, this thesis takes deep reinforcement learning as its core tool and studies in depth aircraft control in continuous action spaces, multi-agent reinforcement learning, and self-play. Based on self-imitation learning, the exploration problem of deep reinforcement learning is studied and a missile head-on interception method based on auxiliary tasks is proposed. Based on value decomposition, multi-agent deep reinforcement learning is studied and interception of multiple targets by multiple missiles is achieved. Based on self-play, the zero-sum air combat problem is studied and a continuous-action self-play air combat agent is realized. The main chapters of the thesis make the following contributions:

For the guidance-law design problem of head-on interception of maneuvering targets, a deep reinforcement learning method based on auxiliary tasks is proposed to solve the imperfect-information one-on-one missile head-on interception problem with observation noise and delay in air combat. To counter the effect of noisy observations and environmental delay on the learning efficiency of feature extraction, a supervised auxiliary task that uses the target's maneuver as its label is constructed, providing the agent with accurate feature gradients and improving the efficiency of representation learning. To address the long-delay rewards of missile interception, a Gaussian self-imitation learning method is proposed on top of self-imitation learning: exploiting the properties of the Gaussian distribution, it truncates the gradient with respect to the Gaussian variance during self-imitation, improving learning efficiency while preserving the agent's exploration and final performance. Combining the efficient supervised auxiliary task with Gaussian self-imitation learning, the auxiliary-task reinforcement learning method surpasses the classical proportional navigation guidance law and an angle-constrained law designed for head-on interception, demonstrating the ability of deep reinforcement learning to design missile guidance laws.
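As an illustration of the variance-gradient truncation described above, the following is a minimal sketch under my own assumptions (network layout, buffer contents, and variable names are illustrative, not the thesis implementation). The self-imitation term maximizes the log-likelihood of stored better-than-expected actions, but the standard deviation is detached so that imitation does not collapse the exploration variance.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy: a mean head plus a state-independent log-std."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.mu_head = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        return self.mu_head(h), self.log_std.exp()

def gaussian_self_imitation_loss(policy: GaussianPolicy,
                                 obs: torch.Tensor,
                                 actions: torch.Tensor,
                                 returns: torch.Tensor,
                                 values: torch.Tensor) -> torch.Tensor:
    """Imitate stored transitions whose return exceeded the value estimate."""
    mu, std = policy(obs)
    # Truncate the gradient flowing into the variance: only the mean is pulled
    # toward the stored (better-than-expected) actions, so imitation does not
    # shrink the exploration variance.
    dist = torch.distributions.Normal(mu, std.detach())
    log_prob = dist.log_prob(actions).sum(-1)
    weight = torch.clamp(returns - values, min=0.0)   # self-imitation weighting
    return -(weight.detach() * log_prob).mean()
```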

For the multi-missile, multi-target pursuit problem in air combat, a multi-agent reinforcement learning method based on a directed graph attention network is proposed. First, to handle the variable observations that arise in multi-missile, multi-target pursuit, a feature extraction method based on a directed graph attention network is proposed. The network establishes a directed communication structure among the missiles, allowing observations to be integrated both across missiles and within each observation, and it outputs decentralized policies and value functions. Second, to address the instability of multi-agent policy-gradient optimization, a proximal policy optimization method based on decentralized advantage functions is proposed; the decentralized advantage function constrains the step size of the proximal policy optimization update, and the method is validated in the SMAC and MuJoCo environments. With the designed communication and optimization schemes, the method achieves optimal performance on the 2v1 pursuit problem and surpasses previous methods in both communication and optimization on many-versus-many pursuit. Finally, target-assignment performance is verified in 5v5 pursuit and generalization in 10v10 pursuit.
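A minimal sketch of a directed graph attention layer of the kind described above, under my own assumptions (a single scaled dot-product attention head; the adjacency mask encodes which missile receives messages from which). It is meant only to illustrate how a directed communication structure integrates a variable number of observations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectedGraphAttention(nn.Module):
    """One scaled dot-product attention head over a directed missile graph."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.query = nn.Linear(feat_dim, hidden_dim)
        self.key = nn.Linear(feat_dim, hidden_dim)
        self.value = nn.Linear(feat_dim, hidden_dim)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (n_missiles, feat_dim) encoded local observations.
        # adj:   (n_missiles, n_missiles) directed mask, adj[i, j] = 1 when
        #        missile i receives messages from missile j (diagonal set to 1
        #        so every missile at least attends to itself).
        q, k, v = self.query(feats), self.key(feats), self.value(feats)
        scores = q @ k.t() / k.shape[-1] ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)      # row-wise attention weights
        return weights @ v                        # integrated features per missile
```

Because the team size is encoded in the mask rather than in the layer's weights, the same sketch handles 2v1, 5v5, or 10v10 engagements without changing the architecture.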

For the one-on-one air combat environment, a continuous-action self-play method based on a diffusion model and experience replay is proposed. First, to improve the efficiency of self-play for deep reinforcement learning agents, an experience-replay-based self-play method is proposed: a history buffer records game trajectories played against different past opponents, and the value function is regressed toward the mean of these historical trajectories. The advantage of each state-action pair is then computed with respect to this value function, and the policy is updated along this advantage so as to approach the Nash equilibrium, accelerating convergence. Next, a neural network built on a diffusion model improves the agent's ability to express policies in continuous action spaces. The effectiveness of the proposed self-play method is verified in open-source football and table tennis environments, and the effectiveness of the diffusion model is verified in the continuous-action table tennis environment. Finally, in one-on-one air combat, the proposed self-play method and diffusion model defeat both a Mini-Max algorithm and a Gaussian-policy baseline.
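The following sketch, under assumptions of mine (function and field names are illustrative, not the thesis code), shows an experience-replay self-play update of the kind described above: the critic is regressed toward returns sampled from trajectories against many past opponents, so its value tracks the historical mean, and advantages against that baseline drive the policy step.

```python
import torch
import torch.nn.functional as F

def self_play_update(critic, policy, critic_opt, policy_opt,
                     policy_loss_fn, history_batch):
    """One update from a batch sampled across trajectories against past opponents.

    history_batch: dict with 'obs' (B, obs_dim), 'actions' (B, act_dim) and
    'returns' (B,), collected while playing several historical opponents.
    """
    obs, returns = history_batch["obs"], history_batch["returns"]

    # Critic regression target: returns drawn from the whole opponent history,
    # so V(s) tracks the mean outcome over past opponents rather than only
    # the most recent one.
    critic_loss = F.mse_loss(critic(obs).squeeze(-1), returns)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Advantages relative to that historical-mean value drive the policy step
    # (e.g. a PPO-style surrogate inside policy_loss_fn).
    with torch.no_grad():
        advantages = returns - critic(obs).squeeze(-1)
    loss = policy_loss_fn(policy, history_batch, advantages)
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```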

Future work includes using meta-reinforcement learning to realize guidance laws that satisfy multiple reward functions, and further analyzing the Nash equilibrium convergence of the diffusion model.

English Abstract

The air combat problem refers to military operations conducted in the air to contest air supremacy, including aerial combat and air defense. With the development of artificial intelligence technology, the control of missiles, aircraft, and drones is becoming more intelligent. In future air combat, AI-based unmanned systems will become mainstream, and the decision-making capability of artificial intelligence will be one of the keys to winning or losing.

Deep reinforcement learning methods can learn high-performance strategies through trial and error. However, when facing continuous action spaces, their performance is often insufficient. To improve agent performance in continuous action spaces, deep reinforcement learning methods often resort to discretization, but discretized aircraft control is imprecise, unstable, and inconsistent with the mechanical structure, while continuous-action aircraft control is usually difficult to train, inefficient, and unstable in self-play. Therefore, designing reinforcement learning algorithms for continuous-action controllers in air combat still faces many challenges and difficulties.

Building on a review of current research, this thesis conducts in-depth research on aircraft control in continuous action spaces, multi-agent reinforcement learning methods, and self-play methods, with deep reinforcement learning as its core tool. Based on self-imitation learning, the exploration problem of deep reinforcement learning is studied and a missile head-on trajectory interception method based on auxiliary tasks is proposed. Based on value decomposition, multi-agent deep reinforcement learning is studied and interception of multiple targets by multiple missiles is achieved. Based on self-play, the zero-sum air combat problem is studied and an air combat agent with continuous-action self-play is realized. The main chapters of the thesis include the following work and contributions:

For the problem of guidance-law design for head-on trajectory interception of maneuvering targets, a deep reinforcement learning method based on auxiliary tasks is proposed to solve the imperfect-information one-on-one missile head-on interception problem with observation noise and delay in air combat. To improve the learning efficiency of feature extraction under noisy observations and environmental delay, a supervised auxiliary task that uses the target's maneuver as its label is constructed, providing the agent with accurate feature gradients and thereby improving the efficiency of feature learning. To address the long-delay reward problem of missile interception, a Gaussian self-imitation learning method is proposed on the basis of self-imitation learning: by exploiting the properties of the Gaussian distribution and truncating the gradient of its variance during self-imitation, the method improves learning efficiency while preserving the agent's exploration and final performance. By combining the efficient supervised auxiliary task with Gaussian self-imitation learning, the proposed auxiliary-task reinforcement learning method surpasses the traditional proportional navigation guidance law and the angle-constrained guidance law designed for head-on interception, verifying the ability of deep reinforcement learning to design missile guidance laws.
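To make the auxiliary-task idea concrete, here is a minimal sketch under my own assumptions (a recurrent encoder, a discrete set of target maneuvers, and illustrative names; the thesis architecture may differ). The maneuver classifier shares the encoder with the policy, so its supervised cross-entropy gradient shapes the same features the reinforcement learning objective uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidanceNet(nn.Module):
    """Shared encoder with a policy head and an auxiliary maneuver classifier."""
    def __init__(self, obs_dim: int, act_dim: int, n_maneuvers: int):
        super().__init__()
        # A recurrent encoder to cope with noisy, delayed observation sequences.
        self.encoder = nn.GRU(obs_dim, 128, batch_first=True)
        self.policy_head = nn.Linear(128, act_dim)
        self.maneuver_head = nn.Linear(128, n_maneuvers)   # auxiliary task head

    def forward(self, obs_seq: torch.Tensor):
        feats, _ = self.encoder(obs_seq)          # (batch, time, 128)
        last = feats[:, -1]                        # features at the latest step
        return self.policy_head(last), self.maneuver_head(last)

def total_loss(rl_loss: torch.Tensor,
               maneuver_logits: torch.Tensor,
               maneuver_labels: torch.Tensor,
               aux_weight: float = 0.5) -> torch.Tensor:
    # The supervised cross-entropy on the target-maneuver label back-propagates
    # through the shared encoder, giving it an accurate gradient signal even
    # when the reinforcement learning reward is noisy and delayed.
    aux_loss = F.cross_entropy(maneuver_logits, maneuver_labels)
    return rl_loss + aux_weight * aux_loss
```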

For the problem of multi-missile, multi-target interception in air combat, a multi-agent reinforcement learning method based on a directed graph attention network is proposed. First, to handle the variable observations in multi-missile, multi-target interception, a feature extraction method based on a directed graph attention network is proposed. The network establishes a directed communication structure among the missiles, allowing observations to be integrated both across missiles and within each observation, and outputs decentralized policies and value functions. Second, to address the instability of multi-agent policy-gradient optimization, a proximal policy optimization method based on decentralized advantage functions is proposed; the decentralized advantage function constrains the step size of the proximal policy optimization update, and the method is validated in the SMAC and MuJoCo environments. With the designed communication and optimization schemes, the method achieves optimal performance on the 2v1 missile interception problem and surpasses previous methods in both communication and optimization on many-versus-many interception. Finally, target-assignment performance is verified in 5v5 interception and generalization ability is verified in 10v10 interception.
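A minimal sketch, assuming the standard PPO clipped surrogate, of how a per-agent (decentralized) advantage can bound each agent's update step; tensor shapes and names are my own illustration, not the thesis code.

```python
import torch

def clipped_policy_loss(new_log_probs: torch.Tensor,
                        old_log_probs: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate with one advantage per agent.

    All tensors have shape (batch, n_agents); each agent's step size is
    bounded by its own, decentralized advantage estimate.
    """
    # Normalize per agent (per column) to keep the advantage scales comparable.
    advantages = (advantages - advantages.mean(0)) / (advantages.std(0) + 1e-8)
    ratio = (new_log_probs - old_log_probs).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```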

For the 1v1 air combat environment, a continuous-action self-play method based on a diffusion model and experience replay is proposed. First, to improve the efficiency of self-play for deep reinforcement learning agents, an experience-replay-based self-play method is proposed: a history buffer records game trajectories played against different past opponents, and the value function is regressed toward the mean over these historical trajectories. The advantage of each state-action pair is then computed with respect to this value function, and the policy is updated along this advantage to approach the Nash equilibrium, accelerating convergence. Next, a neural network built on a diffusion model improves the agent's ability to express policies in continuous action spaces. The effectiveness of the proposed self-play method is verified in open-source football and table tennis environments, and the effectiveness of the diffusion model is verified in the continuous-action table tennis environment. Finally, in 1v1 air combat, the proposed self-play method and diffusion model defeat both a Mini-Max algorithm and a Gaussian-distribution baseline.
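The sketch below, under assumptions of mine (an epsilon-predicting MLP denoiser and a DDIM-style deterministic sampler; the thesis model may differ), illustrates how a diffusion-model policy head can generate continuous actions by iterative denoising conditioned on the observation.

```python
import torch
import torch.nn as nn

class DiffusionPolicy(nn.Module):
    """Action generation by iterative denoising, conditioned on the observation."""
    def __init__(self, obs_dim: int, act_dim: int, n_steps: int = 10):
        super().__init__()
        self.act_dim, self.n_steps = act_dim, n_steps
        # The denoiser predicts the noise in a noisy action, conditioned on the
        # observation and the (normalized) diffusion step index.
        self.denoiser = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, act_dim))
        betas = torch.linspace(1e-4, 0.1, n_steps)
        self.register_buffer("alpha_bars", torch.cumprod(1.0 - betas, dim=0))

    @torch.no_grad()
    def sample(self, obs: torch.Tensor) -> torch.Tensor:
        act = torch.randn(obs.shape[0], self.act_dim)        # start from pure noise
        for t in reversed(range(self.n_steps)):
            t_emb = torch.full((obs.shape[0], 1), t / self.n_steps)
            eps = self.denoiser(torch.cat([obs, act, t_emb], dim=-1))
            a_t = self.alpha_bars[t]
            act0 = (act - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean action
            if t > 0:                                              # deterministic DDIM-style step
                a_prev = self.alpha_bars[t - 1]
                act = a_prev.sqrt() * act0 + (1.0 - a_prev).sqrt() * eps
            else:
                act = act0
        return act.clamp(-1.0, 1.0)
```

Compared with a single Gaussian head, such an iterative sampler can represent multi-modal action distributions, which is the expressiveness gain the paragraph above attributes to the diffusion model.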

In the future, a guidance law that satisfies various reward functions will be implemented based on meta-reinforcement learning, and the Nash equilibrium convergence performance of the diffusion model will be further analyzed.

Keywords: Reinforcement Learning; Deep Reinforcement Learning; Self-Attention Network; Intelligent Decision-Making; Multi-Agent Systems
Language: Chinese
Sub-direction classification (seven major directions): Decision Intelligence Theory and Methods
State Key Laboratory planning direction classification: Multi-Agent Decision-Making
Associated dataset to be deposited:
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52148
Collection: Graduates - Doctoral Dissertations
Corresponding author: 李伟凡
Recommended citation:
GB/T 7714
李伟凡. 基于深度强化学习的连续动作空中博弈对抗决策[D],2023.
Files in this item:
毕业论文-李伟凡.pdf (43167 KB) | Document type: Thesis | Access: Restricted | License: CC BY-NC-SA