Research on Strategy Generation Techniques for Dynamic Game Systems Based on Representation Learning and Opponent Modeling
詹员
2023-05-22
Pages: 88
Degree type: Master's
Chinese Abstract

In recent years, the focus of artificial intelligence research has gradually shifted from intelligent perception to intelligent decision-making. In complex game environments such as online games, dispatch recommendation, military confrontation, and resource scheduling, developing efficient decision-making systems has risen to a strategic position of national importance. Deep reinforcement learning theories and algorithms have emerged rapidly in recent years and are widely applied to all kinds of sequential decision-making tasks; their rapid development has made them the benchmark algorithms for artificial intelligence decision-making systems. However, in realistic tasks involving high-dimensional visual input, the enormous state dimensionality and excessive redundant information require a reinforcement learning agent to learn the feature encoder and the policy network jointly, which consumes massive amounts of data and computational resources. On the other hand, traditional reinforcement learning methods generally treat the opponent as part of the environment and do not consider the opponent's behavioral characteristics, so our agent cannot exploit the opponent's weaknesses to obtain a winning strategy. Therefore, how to design suitable opponent modeling methods for different scenarios is also a very important problem. To address these two problems, this thesis designs effective state representation and opponent modeling algorithms. The main contributions are summarized as follows:

(1) A representation-learning-based method for feature space reduction and feature extraction is proposed. To address the problems of redundant information and low sample efficiency faced by agents under high-dimensional state input, this thesis studies state representation learning in reinforcement learning and explores the effect of high-capacity neural networks on state representation, with the aim of strengthening the feature extractor's representational power simply by constructing a higher-capacity network. To this end, a new skip-connection mechanism is introduced that allows upstream and downstream feature maps to interact, enriching the feature information while implicitly expanding the network's capacity. On this basis, ideas from information theory are brought into the neural network: an information bottleneck mechanism compresses the redundant information in the state representation and further improves its quality. Experimental results show that, compared with the baseline algorithms, the proposed algorithm achieves superior scores, and the learned state representations generalize well.

(2) An opponent modeling method based on long short-term memory networks and variational autoencoders is proposed. In aerial game confrontation scenarios, the high dimensionality and temporal nature of the opponent's situation space make it difficult for our side to accurately learn the mapping from the opponent's state to the opponent's strategy. Drawing on the idea of representation learning, this thesis abstracts opponent modeling as extracting a representation of the opponent's strategy from observed information: a long short-term memory network models the offline confrontation data sequentially, learning the intrinsic connection between the opponent's historical states, actions, and hidden strategy, while a variational autoencoder generates the opponent's strategy representation. Finally, according to the mechanism of aerial game confrontation, the aircraft decision-making process is abstracted into four modules, "information perception, situation representation, strategy generation, and maneuver control", and the feature encoding, reward function, action space, and game algorithm required for training the reinforcement learning agent are constructed and combined with the opponent modeling method to strengthen the agent's decision-making ability. Actual confrontation results show that our agent exhibits excellent maneuvering ability.

(3) A self-adaptive strategy generation method based on opponent modeling and multi-armed bandits is proposed. To address the degraded ability of opponent models to represent unknown opponents when the opponent's strategy keeps changing, this thesis divides strategy generation into two stages. In the offline stage, the opponent's strategy is used as prior information to modulate the decoder of a conditional variational autoencoder, so as to learn the opponent's strategy representation and, from it, a near-optimal policy. During real-time confrontation, a multi-armed bandit is introduced to switch between a fixed strategy and the near-optimal strategy. Together, the two stages improve the adaptability of the generated strategy. Experimental results show that the proposed algorithm clearly outperforms the baseline algorithms in multiple scenarios and, in particular, maintains excellent decision-making ability against unknown opponents.

English Abstract

In recent years, the focus of artificial intelligence research has increasingly shifted from intelligent perception to intelligent decision-making. In complex gaming environments such as online games, dispatch recommendation, military confrontation, and resource scheduling, the creation of effective decision-making systems has assumed a crucial strategic position in relation to national growth. Deep reinforcement learning theories and algorithms have emerged rapidly in recent years and have been widely applied to various sequential decision-making problems; this rapid development has made deep reinforcement learning a benchmark method for artificial intelligence decision-making systems. However, in realistic tasks involving high-dimensional visual inputs, the huge state dimension and excessive redundant information require reinforcement learning agents to learn feature representations and policy networks jointly, which consumes a significant amount of data and computational resources. On the other hand, traditional reinforcement learning algorithms generally treat the opponent as part of the environment and do not take its behavioral characteristics into account, which makes it impossible for the agent to exploit the opponent's weaknesses and arrive at a winning strategy. Therefore, how to design suitable opponent modeling algorithms for different scenarios is also an essential issue. This research develops effective state representation learning and opponent modeling algorithms for the two issues above. The specific contributions include the following three points:

(1) A representation-learning-based feature space reduction and feature extraction method is proposed. To address the problems of redundant information and low sample utilization under high-dimensional state input, this paper studies state representation learning in reinforcement learning and investigates the impact of high-capacity neural networks on state representation, aiming to enrich the representation space of the state encoding simply by widening and deepening the network. We introduce a new skip-connection mechanism that enables upstream and downstream feature maps to interact, enriching the feature information while implicitly increasing the capacity of the neural network. On this basis, ideas from information theory are introduced into the network: an information bottleneck mechanism compresses the redundant information in the state representation and further enhances its quality. The experimental results demonstrate that the proposed algorithm outperforms the benchmark algorithms in terms of score, and the learned state representations generalize well.
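To make the two mechanisms above concrete, the following is a minimal PyTorch-style sketch of a convolutional state encoder that fuses an upstream feature map into a downstream one through a skip connection and regularizes the latent state with a variational information bottleneck. All module names, layer sizes, and the specific fusion scheme are illustrative assumptions, not the architecture used in the thesis.

```python
# Sketch only: skip-connected CNN encoder with an information bottleneck.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipIBEncoder(nn.Module):
    def __init__(self, in_channels: int = 4, latent_dim: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # 1x1 conv that projects the upstream feature map so it can be
        # added to the downstream one (the skip connection).
        self.skip = nn.Conv2d(32, 64, kernel_size=1)
        # Heads that parameterize the stochastic bottleneck q(z | s).
        self.fc_mu = nn.LazyLinear(latent_dim)
        self.fc_logvar = nn.LazyLinear(latent_dim)

    def forward(self, obs: torch.Tensor):
        h1 = F.relu(self.conv1(obs))              # upstream features
        h2 = F.relu(self.conv2(h1))
        h3 = F.relu(self.conv3(h2))
        # Resize the upstream map and fuse it with the downstream map.
        skip = F.adaptive_avg_pool2d(self.skip(h1), h3.shape[-2:])
        fused = F.relu(h3 + skip)
        flat = fused.flatten(start_dim=1)
        mu, logvar = self.fc_mu(flat), self.fc_logvar(flat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # KL(q(z|s) || N(0, I)): the information-bottleneck regularizer that an
        # RL loss would add with a small coefficient.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return z, kl
```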

(2) An opponent modeling method based on long short-term memory networks and variational autoencoders is proposed. In aerial gaming confrontation scenarios, the high-dimensional and temporal nature of the opponent's situation space makes it difficult to accurately learn the mapping from the opponent's state to the opponent's strategy. Drawing on the idea of representation learning, this paper abstracts opponent modeling as extracting a representation of the opponent's strategy from observed information: a long short-term memory network models the offline confrontation data sequentially, learning the intrinsic connection between the opponent's historical states, actions, and hidden strategy, while a variational autoencoder generates the opponent's strategy representation. Finally, according to the mechanism of aerial games, this paper abstracts the aircraft decision-making process into four modules, "information perception, situation representation, strategy generation, and maneuver control", constructs the feature encoding, reward function, action space, and game algorithm required for training the reinforcement learning agent, and combines them with the opponent modeling method to enhance the agent's decision-making capability. The results of actual confrontations show that our agent exhibits excellent maneuvering ability.
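The following is a minimal sketch of the trajectory-encoding idea described above: an LSTM summarizes an offline opponent trajectory of (state, action) pairs, and a VAE-style head maps the summary to a latent opponent-strategy representation trained with a reconstruction-plus-KL objective. All names, dimensions, and the concrete loss form are assumptions for illustration, not the thesis implementation.

```python
# Sketch only: LSTM + VAE encoder for an opponent-strategy representation.
import torch
import torch.nn as nn

class OpponentPolicyEncoder(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128, z_dim: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + action_dim, hidden, batch_first=True)
        self.fc_mu = nn.Linear(hidden, z_dim)
        self.fc_logvar = nn.Linear(hidden, z_dim)
        # Decoder predicts the opponent's action from its state and z, so the
        # latent is pushed to capture the opponent's hidden strategy.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, opp_states: torch.Tensor, opp_actions: torch.Tensor):
        # opp_states: (B, T, state_dim), opp_actions: (B, T, action_dim)
        traj = torch.cat([opp_states, opp_actions], dim=-1)
        _, (h_n, _) = self.lstm(traj)                    # summarize the sequence
        mu, logvar = self.fc_mu(h_n[-1]), self.fc_logvar(h_n[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Reconstruct each action from (state, z); broadcast z over time.
        z_seq = z.unsqueeze(1).expand(-1, opp_states.size(1), -1)
        pred_actions = self.decoder(torch.cat([opp_states, z_seq], dim=-1))
        recon = ((pred_actions - opp_actions) ** 2).mean()
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return z, recon + kl    # z then conditions our agent's policy network
```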

(3) A self-adaptive strategy generation method based on opponent modeling and multi-armed bandits is proposed. To address the degraded ability of opponent models to represent unknown opponents when the opponent's strategy keeps changing, this paper divides strategy generation into two processes. In the offline stage, the opponent's policy is used as prior information to modulate the decoder of a conditional variational autoencoder, in order to learn the opponent's policy representation and, from it, a near-optimal policy. In real-time confrontation, a multi-armed bandit is used to switch between a fixed conservative strategy and the approximate optimal strategy. Together, the two processes enhance the adaptability of the generated strategy. Experimental results show that the proposed method significantly outperforms the baseline algorithms in multiple scenarios and, in particular, maintains excellent decision-making ability when facing unknown opponents.
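As an illustration of the online switching step, the sketch below uses an EXP3-style multi-armed bandit to choose, episode by episode, between the fixed conservative policy and the opponent-conditioned near-optimal policy. The thesis only specifies a multi-armed bandit, so the EXP3 update, the reward scaling, and the helper names here are assumptions.

```python
# Sketch only: EXP3-style bandit switching between two candidate policies.
import math
import random

class PolicySwitchBandit:
    def __init__(self, n_arms: int = 2, gamma: float = 0.1):
        self.gamma = gamma                 # exploration rate
        self.weights = [1.0] * n_arms      # arm 0: fixed policy, arm 1: near-optimal policy

    def _probs(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select(self) -> int:
        """Pick which policy to play in the next episode."""
        return random.choices(range(len(self.weights)), weights=self._probs())[0]

    def update(self, arm: int, reward: float):
        """Update after an episode; reward is assumed scaled to [0, 1]."""
        p = self._probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * (reward / p) / len(self.weights))

bandit = PolicySwitchBandit()
for episode in range(100):
    arm = bandit.select()
    # payoff = run_episode(fixed_policy if arm == 0 else adaptive_policy)  # hypothetical helper
    payoff = random.random()               # placeholder outcome in [0, 1]
    bandit.update(arm, payoff)
```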
 

Keywords: Deep Reinforcement Learning; Representation Learning; Opponent Modeling; Aerial Game
Subject area: Artificial Intelligence
Discipline: Engineering :: Control Science and Engineering
Language: Chinese
Sub-direction classification: Theory and Methods of Decision Intelligence
State Key Laboratory planning direction: Intelligent Gaming and Opponent Modeling
Associated dataset to be deposited:
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/51906
Collection: Graduates_Master's Theses
Recommended citation (GB/T 7714):
詹员. 基于表示学习和对手建模的动态博弈系统策略生成技术研究[D],2023.
Files in this item:
詹员学位论文-最终版.pdf (12738 KB); Document type: Thesis; Access: Restricted; License: CC BY-NC-SA