稀疏奖励环境下基于自博弈框架的智能空战算法研究 (Research on Intelligent Air Combat Algorithms Based on a Self-Play Framework in Sparse-Reward Environments)
何少钦
2024-05
Pages: 80
Subtype: Master's thesis
Abstract
In recent years, with the continuous development of artificial intelligence technology, research on applying deep reinforcement learning to intelligent decision-making has gradually become a hot topic. In the field of air combat, beyond-visual-range (BVR) combat represents the main direction of future development, and research on BVR intelligent air combat combined with artificial intelligence algorithms is of great value in today's unpredictable international situation.
Oriented toward the future development of BVR intelligent air combat, this study builds on deep reinforcement learning and develops a training algorithm for air combat game agents in 1v1 BVR air combat scenarios. To train air combat agents from scratch, this study proposes a training method based on a self-play framework and introduces the Neural Fictitious Self-Play (NFSP) training algorithm, which carries a guarantee of convergence to a Nash equilibrium. To make full use of the offline data generated during NFSP training, this study designs a policy optimization algorithm based on offline reinforcement learning to improve the NFSP training framework. To address the vast state space and sparse rewards of the 1v1 BVR air combat game, this study proposes several methods that guide the agent's exploration and improve its exploration efficiency, including auxiliary rewards and a curiosity mechanism. The main contributions of this study are as follows:
(1) An air combat game agent training algorithm based on Neural Fictitious Self-Play is proposed to address the convergence problem of the self-play training framework. Compared with previous training algorithms that rely on rule-based opponents, this framework does not require experts to design complex rule-based opponents to assist training and does not overfit to such opponents. Compared with naive self-play, NFSP has a theoretical guarantee of convergence to an approximate Nash equilibrium in 1v1 BVR air combat, a two-player zero-sum game. Experimental results show that agents trained with the NFSP framework outperform those trained with the traditional naive self-play algorithm in both performance and the convergence of their strategies toward the Nash equilibrium.
(2) An NFSP training algorithm based on offline reinforcement learning is proposed to address the low sample efficiency of NFSP on its offline data. The original NFSP uses behavior cloning to approximate the agent's historical average strategy. The offline dataset used for learning the average strategy consists of samples generated when the reinforcement-learning-based best-response strategy interacts with the environment. Behavior cloning uses only the current state and action from each reinforcement learning quintuple: taking the action as the label and the current state as the input, it learns a simple state-to-action mapping within a supervised learning framework. It ignores the rewards and environment state transitions contained in the quintuples, so its utilization of the offline data is inefficient. This study proposes replacing behavior cloning with offline reinforcement learning, making full use of all the data in the quintuples and improving sample efficiency. In comprehensive comparative experiments, NFSP based on offline reinforcement learning shows better training performance and convergence to the Nash equilibrium than naive self-play and standard NFSP.
(3) An NFSP training algorithm based on a curiosity mechanism and offline reinforcement learning is proposed to address agent training in the sparse-reward 1v1 BVR air combat scenario. The original reward in this scenario is extremely sparse. The first method proposed in this study is to design several task-related auxiliary rewards that guide the agent's exploration tendency and solve the cold-start problem. Second, this study introduces the concept of an effective action set to improve the agent's exploration efficiency. Finally, a curiosity mechanism is introduced; unlike the first two methods, the curiosity module takes the global state as input. The intrinsic rewards generated by the curiosity mechanism encourage both players to jointly explore more diverse game situations, increasing the diversity of game states, helping the agents' strategies escape local optima, and converging more closely to Nash equilibrium strategies. The final training performance and Nash equilibrium convergence experiments also show that NFSP based on the curiosity mechanism and offline reinforcement learning has significant advantages over the other algorithms.
Starting from three aspects, namely the convergence of the self-play training framework to the Nash equilibrium, the sample efficiency on offline data, and exploration methods in sparse-reward environments, this thesis proposes an NFSP training algorithm based on a curiosity mechanism and offline reinforcement learning. This training algorithm can be used to train intelligent air combat agents based on a self-play framework in sparse-reward environments.
Other Abstract
In recent years, with the continuous development of artificial intelligence technology, research on the application of deep reinforcement learning in the field of intelligent decision-making has gradually become a hot topic. In the field of air combat, beyond-visual-range (BVR) air combat represents the main direction of future air combat development. The research on BVR intelligent air combat combined with artificial intelligence algorithms is of great research value in today's ever-changing international situation.
This study is committed to the future development of BVR intelligent air combat. Based on deep reinforcement learning, it develops an air combat game agent training algorithm suitable for 1v1 BVR air combat scenarios. To train air combat agents from scratch, this study proposes a training method based on a self-play framework and introduces the Neural Fictitious Self-Play (NFSP) training algorithm, which has guaranteed Nash equilibrium convergence. To fully utilize the offline data generated during NFSP training, this study designs a policy optimization algorithm based on offline reinforcement learning to improve the NFSP training algorithm. Considering the vast state space and sparse rewards of the 1v1 BVR air combat game, this study proposes various methods to guide the agent's exploration and improve its exploration efficiency, including auxiliary rewards and a curiosity mechanism. The main contributions of this study include the following aspects:
(1) To address the convergence issue of the self-play training framework, this study proposes an air combat game agent training algorithm based on the NFSP framework. Compared to past training algorithms based on rule-based opponents, this training framework does not require experts to design complex rule-based opponents to assist the agent in training, and it does not suffer from overfitting to such opponents. Compared to the naive self-play training framework, NFSP theoretically guarantees convergence to an approximate Nash equilibrium in 1v1 BVR air combat, a two-player zero-sum game. Experimental results also demonstrate that the performance of agents trained within the NFSP framework, and the convergence of their strategies to the Nash equilibrium, are higher than those of the traditional naive self-play training algorithm.
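For reference, NFSP as introduced by Heinrich and Silver maintains two policies per agent: a best response trained by reinforcement learning, and an average policy trained by supervised learning on the agent's own best-response behaviour, mixed through an anticipatory parameter. The following is a minimal illustrative sketch of that action-selection and data-collection loop, not the thesis's implementation; `best_response_policy`, `average_policy`, the buffer sizes, and `eta` are placeholder assumptions, since the abstract does not give concrete architectures or hyperparameters.

```python
import random
from collections import deque

class NFSPAgent:
    """Minimal sketch of the NFSP action-selection / data-collection loop."""

    def __init__(self, best_response_policy, average_policy, eta=0.1):
        self.best_response = best_response_policy   # RL-trained best-response policy (e.g. DQN-style)
        self.average = average_policy               # supervised average-strategy policy
        self.eta = eta                              # anticipatory parameter
        self.rl_buffer = deque(maxlen=1_000_000)    # quintuples for reinforcement learning
        self.sl_buffer = deque(maxlen=1_000_000)    # (state, action) pairs for the average policy
                                                    # (a reservoir buffer in the original NFSP)

    def act(self, state):
        # Mix the two policies: with probability eta play the best response,
        # otherwise play the historical average strategy.  The original NFSP
        # samples this choice once per episode; per-step sampling is used here
        # only to keep the sketch short.
        if random.random() < self.eta:
            action = self.best_response.act(state)
            # Only best-response behaviour is recorded for average-policy learning.
            self.sl_buffer.append((state, action))
        else:
            action = self.average.act(state)
        return action

    def observe(self, state, action, reward, next_state, done):
        # Every transition (the reinforcement learning quintuple) goes to the RL buffer.
        self.rl_buffer.append((state, action, reward, next_state, done))
```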
(2) To address the low efficiency of the NFSP algorithm in utilizing offline data, this study proposes an NFSP training algorithm based on offline reinforcement learning. The original NFSP uses a behavior cloning algorithm, based on a supervised learning framework, to approximate the agent's historical average strategy. The offline data used for learning the historical average strategy are quintuples collected while the reinforcement-learning-based best-response strategy interacts with the environment. The behavior cloning algorithm utilizes only the current state and action from each quintuple: it takes the action as the label and the current state as the input, and learns a simple mapping from state to action in a supervised learning framework. It ignores information such as the rewards and environment state transitions in the quintuples, resulting in low efficiency in utilizing the offline data. This study proposes using offline reinforcement learning instead of behavior cloning to fully utilize all the sample data in the quintuples, improving sample utilization efficiency. In comprehensive comparative experiments, NFSP based on offline reinforcement learning demonstrates better training performance and convergence to the Nash equilibrium than naive self-play and standard NFSP.
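To make the difference in data usage concrete, the sketch below contrasts a behavior-cloning loss, which reads only the state and action from each quintuple, with a generic one-step temporal-difference update that also consumes the reward, next state, and done flag. The abstract does not name the specific offline reinforcement learning algorithm adopted in the thesis, so the TD target here (and the hypothetical `policy_net`, `q_net`, `target_q_net` modules) is only a placeholder used to illustrate the extra learning signal; a real offline RL method would add further machinery such as conservatism or policy constraints.

```python
import torch
import torch.nn.functional as F

def behavior_cloning_loss(policy_net, batch):
    """Behavior cloning: only (state, action) from each quintuple is used."""
    states, actions, _, _, _ = batch          # rewards / next states / dones are ignored
    logits = policy_net(states)
    return F.cross_entropy(logits, actions)   # the recorded action is the supervised label

def offline_td_loss(q_net, target_q_net, batch, gamma=0.99):
    """Generic one-step TD update: every element of the quintuple is used."""
    states, actions, rewards, next_states, dones = batch   # dones as 0/1 floats
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q, target)
```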
(3) To address agent training in the sparse-reward environment of 1v1 BVR air combat, this study proposes an NFSP training method based on a curiosity mechanism and offline reinforcement learning. The original reward in the 1v1 BVR air combat scenario is extremely sparse, so this study proposes multiple methods to guide the agent's exploration and solve the algorithm's cold-start problem. The first method is to design various task-related auxiliary rewards that guide the agent's exploration tendency. The second is to introduce the concept of an effective action set to improve the agent's exploration efficiency. Finally, a curiosity mechanism is introduced; unlike the first two methods, the curiosity module takes the global state as input. The intrinsic rewards generated by the curiosity mechanism incentivize both players to jointly explore more diverse game situations, increasing the diversity of game states, helping the agents' strategies break out of local optima, and converging more closely to Nash equilibrium strategies. The performance experiments and Nash equilibrium convergence experiments also demonstrate that the NFSP algorithm based on the curiosity mechanism and offline reinforcement learning has significant advantages over the other algorithms.
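The abstract specifies only that the curiosity module takes the global state as input and that its intrinsic reward is shared by both players; the concrete curiosity formulation is not stated. The sketch below therefore uses a random-network-distillation-style prediction error purely as one possible instantiation, with the class name, network sizes, and `feature_dim` chosen for illustration.

```python
import torch
import torch.nn as nn

class GlobalStateCuriosity(nn.Module):
    """Illustrative RND-style curiosity module operating on the global game state."""

    def __init__(self, global_state_dim, feature_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(global_state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feature_dim))
        self.predictor = nn.Sequential(nn.Linear(global_state_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feature_dim))
        for p in self.target.parameters():      # the target network stays fixed and random
            p.requires_grad_(False)

    def intrinsic_reward(self, global_state):
        # Prediction error is high for rarely visited global states, so both
        # agents are rewarded for reaching novel game situations.
        error = (self.predictor(global_state) - self.target(global_state)).pow(2).mean(dim=-1)
        return error.detach()

    def update_loss(self, global_state):
        # Minimising the same error on visited states makes familiar
        # situations progressively less rewarding.
        return (self.predictor(global_state) - self.target(global_state)).pow(2).mean()
```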
Starting from three aspects, namely the convergence of the self-play training framework to the Nash equilibrium, the efficiency of utilizing offline data, and exploration methods in sparse-reward environments, this paper proposes an NFSP training algorithm based on a curiosity mechanism and offline reinforcement learning. It aims to solve the training problem of intelligent air combat algorithms based on the self-play framework in sparse reward environments.
Keywords: Reinforcement Learning, Offline Reinforcement Learning, Air Combat, Intelligent Decision-Making, Curiosity Mechanism
Language: Chinese
Document Type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57062
Collection: 复杂系统认知与决策实验室 / 毕业生_硕士学位论文
Recommended Citation
GB/T 7714
何少钦. 稀疏奖励环境下基于自博弈框架的智能空战算法研究[D],2024.
Files in This Item:
File Name/Size: 硕士毕业论文.pdf (4570 KB) | DocType: Degree thesis | Access: Open access | License: CC BY-NC-SA