Research on Key Technologies for the Construction and Optimization of Tactical Wargame Agents Based on Imitation Learning
王筱琦
2024-05
Pages: 59
Subtype: Master's thesis
Abstract

Cognitive intelligence is a key development direction of the nation's new generation of artificial intelligence, and research on agent construction and optimization in complex gaming environments has significant application value and academic importance. However, in a wargaming environment characterized by sparse rewards, complex decision factors, and high randomness, such research faces challenges including difficulty in exploring effective strategies, uncertain state transitions, and hard-to-model multi-agent behavior. Existing approaches cannot fully meet the demands that the wargame environment places on agents, so methods must be combined and innovatively adapted to the characteristics of wargaming in order to efficiently obtain agents that suit those characteristics and perform well. Addressing the above challenges, this thesis takes army tactical wargaming as the validation environment and studies key technologies for agent construction and optimization. The main work is as follows:

1. Addressing the sparse rewards, complex decision factors, and high randomness of wargaming, a two-stage agent construction and optimization route from imitation learning to reinforcement learning is proposed. Imitation learning first fits high-quality expert demonstration data to efficiently obtain an agent with a sound decision-making level; this gives the subsequent reinforcement learning a high starting point, lowering exploration difficulty and trial-and-error cost and improving learning efficiency. Reinforcement learning then fine-tunes the agent online, mitigating the imitation-learned agent's limitations of being bounded by the experts' demonstration level and struggling with scenarios absent from the dataset, and further raising its adversarial strength. After the two complementary stages, the agent achieves a 0.83 win rate against the benchmark agent, significantly higher than the benchmark agent's own 0.48.

2. For the heterogeneous multi-agent cooperative decision-making problem in wargaming, after analyzing the commonalities and particularities in how different operators use situation information, a multi-task learning wargame agent network based on an attention mechanism is designed. The network shares features while adaptively expressing task-specific features, which improves inference efficiency, reduces the parameter count, and simplifies the system. Compared with a hard-parameter-sharing network it achieves better effectiveness and performance on the dataset, and compared with an agent built from conventional single-task learning networks it raises the adversarial win rate by 0.37.

3. Given the large state space and the many underlying attributes of wargaming, three classes of situation features, namely game statistics, operator attribute states, and spatial information, are distilled to serve as inputs for the agent's decisions. The raw replay data is screened for quality and source according to task needs; sample-filtering and label-generation rules are designed around the sparsity and delay of effective actions in wargaming to ease class imbalance, yielding a high-quality expert demonstration dataset that safeguards the effectiveness of the imitation learning algorithm; and temporal information is introduced to cope with the incompleteness of observable enemy information and to improve learning on the dataset. In the reinforcement learning stage, heuristic reward shaping based on wargaming experience and task characteristics guides the agent to remedy the shortcomings of the imitation-stage policy, achieving effective convergence and a further improvement in the agent's decision-making level.

Other Abstract

With cognitive intelligence being a crucial national development direction for the next generation of artificial intelligence, research on the construction and optimization of intelligent agents holds significant value in industry and academia. However, in complex strategic gaming environments such as the wargame, this research faces challenges on many fronts, including exploring effective strategies, uncertain state transitions, and multi-agent behavior modeling. Since existing methods fail to fully meet the requirements posed by the wargame, comprehensive research and tailored innovation are necessary to efficiently acquire high-performance agents that adapt well to the wargame environment. Addressing the aforementioned challenges, this thesis focuses on the construction and optimization of tactical wargame agents. The main contributions are as follows:

  1. Addressing the sparse rewards, complicated decision factors, and high randomness of the wargame, a two-stage construction and optimization route consisting of imitation learning followed by reinforcement learning is proposed. First, imitation learning is used to efficiently acquire a high-performance agent by fitting a high-quality expert demonstration dataset; this gives the subsequent reinforcement learning a high starting point, reducing exploration difficulty and trial-and-error cost and thus enhancing learning efficiency. Subsequently, reinforcement learning fine-tunes the agent online to alleviate the limitations of imitation learning, namely being restricted by the experts' demonstration level and struggling with scenarios not present in the dataset, further improving the agent's adversarial capability. After these two complementary stages, the agent achieves a win rate of 0.83 against the benchmark agent, significantly higher than the benchmark agent's own win rate of 0.48.
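
As a rough, non-authoritative illustration of this two-stage route, the sketch below pre-trains a policy with behavior cloning on expert state-action pairs and then fine-tunes it online. Plain REINFORCE stands in for the reinforcement learning algorithm, which the abstract does not name, and all network sizes, function names, the expert data loader, and the gymnasium-style environment interface are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network shared by both stages (sizes are illustrative)."""
    def __init__(self, obs_dim=256, n_actions=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.head = nn.Linear(512, n_actions)

    def forward(self, obs):
        return self.head(self.backbone(obs))  # action logits

def imitation_stage(policy, expert_loader, epochs=10, lr=1e-4):
    """Stage 1: behavior cloning, i.e. cross-entropy on expert (state, action) pairs."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for obs, expert_action in expert_loader:
            opt.zero_grad()
            loss_fn(policy(obs), expert_action).backward()
            opt.step()
    return policy

def reinforcement_stage(policy, env, episodes=1000, lr=1e-5, gamma=0.99):
    """Stage 2: online fine-tuning starting from the imitation-learned weights."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
            done = terminated or truncated
        # discounted returns, then one policy-gradient update per episode
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        opt.zero_grad()
        (-(torch.stack(log_probs) * returns).sum()).backward()
        opt.step()
    return policy
```

The key property of the route is visible in the code: the same `policy` object flows from `imitation_stage` into `reinforcement_stage`, so online exploration starts from the expert-shaped policy rather than from random weights.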
  2. For the heterogeneous multi-agent cooperative decision-making problem in the wargame, after analyzing the commonalities and particularities in how operators utilize situation information, a multi-task attention network is proposed that simultaneously models the decision-making processes of multiple operators, achieving adaptive expression of task-specific features while sharing the remaining features, with advantages such as improved inference efficiency, reduced parameter count, and a simplified system. This network outperforms a classical hard-parameter-sharing network on the dataset and improves the win rate by 0.37 over a conventional single-task learning agent.
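
A minimal sketch of the shared-trunk-plus-attention idea described above, assuming the situation is encoded as a set of feature tokens: a shared encoder processes the tokens, one learned query per operator type attends over them to extract task-specific context, and a lightweight head per operator maps that context to action logits. All dimensions, the token encoding, and the operator count are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskAttentionAgent(nn.Module):
    """Shared trunk plus per-operator attention heads (all sizes illustrative)."""
    def __init__(self, feat_dim=128, n_operators=3, n_actions=64):
        super().__init__()
        # shared encoder over a set of situation-feature tokens
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        # one learned query per operator type selects task-specific context
        self.queries = nn.Parameter(torch.randn(n_operators, 256))
        self.attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        # lightweight per-operator action heads on top of the attended context
        self.heads = nn.ModuleList(
            [nn.Linear(256, n_actions) for _ in range(n_operators)]
        )

    def forward(self, tokens):
        # tokens: (batch, n_tokens, feat_dim) situation features
        shared = self.shared(tokens)                           # (B, T, 256)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        ctx, _ = self.attn(q, shared, shared)                  # (B, n_ops, 256)
        # one forward pass yields action logits for every operator at once
        return [head(ctx[:, i]) for i, head in enumerate(self.heads)]

agent = MultiTaskAttentionAgent()
logits_per_operator = agent(torch.randn(8, 20, 128))  # 8 samples, 20 tokens
```

Because the trunk and the attention are computed once for all operators, a single forward pass serves every operator; this is where the parameter and inference savings over separate single-task networks would come from.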
  3. Considering the large state space and diverse underlying attributes of the wargame, three types of situation features are distilled and constructed as inputs for the agent's decision making: game statistics, operator attribute states, and spatial information. To build a high-quality expert demonstration dataset for imitation learning, the original replay data is filtered by quality and source, and, given the sparsity and delay of effective actions in the wargame, sample-filtering and label-generation rules are set to mitigate class imbalance. In response to the incompleteness of observable enemy information, temporal information is introduced to enhance learning performance on the dataset. Moreover, heuristic reward shaping based on wargaming experience and task characteristics is performed during the reinforcement learning stage to guide the agent in compensating for strategy shortcomings from the imitation learning stage, achieving effective convergence and further raising the agent's decision-making level.
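
To make two of these data-side measures concrete, here is a hedged sketch of a sample filter that keeps only frames near an effective action (easing the class imbalance caused by action sparsity and delay) and of a heuristic shaped-reward function. The dictionary keys, window size, and shaping terms are illustrative assumptions, not the thesis's actual rules.

```python
def filter_frames(replay_frames, lookahead=5):
    """Keep only frames at or shortly before an effective action.

    Counters the sparsity and delay of effective actions so that no-op
    samples do not dominate the labels; the notion of 'effective' and
    the window size here are assumptions.
    """
    kept = []
    for i, frame in enumerate(replay_frames):
        window = replay_frames[i : i + lookahead + 1]
        if any(f["action"] is not None for f in window):
            kept.append(frame)
    return kept

def shaped_reward(base_reward, prev, cur, w_control=1.0, w_attrition=0.5):
    """Heuristic shaping sketch: scoring-point control and attrition
    exchange are plausible wargame heuristics, not published terms."""
    control = cur["points_held"] - prev["points_held"]
    exchange = (prev["enemy_strength"] - cur["enemy_strength"]) \
             - (prev["own_strength"] - cur["own_strength"])
    return base_reward + w_control * control + w_attrition * exchange
```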
Keyword: Wargaming; Imitation Learning; Reinforcement Learning; Human-Machine Confrontation
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57257
Collection: Graduates / Master's Theses
Recommended Citation
GB/T 7714
王筱琦. 基于模仿学习的战术兵棋智能体构建与优化关键技术研究[D]. 2024.
Files in This Item:
File Name/Size: 毕业论文-王筱琦.pdf (2737 KB)
DocType: Thesis
Access: Restricted
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.