Research on Intelligent Strategies in Adversarial Game Environments (博弈对抗环境中智能策略研究)


	Research on Intelligent Strategies in Adversarial Game Environments
	Tang Zhentao (唐振韬)
	2021-05-17
Pages	160
Degree type	Doctoral
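Abstract (translated from Chinese)	Strategy games are an important measure of how “intelligent” artificial intelligence has become and have long attracted wide attention from researchers. Playing a game requires assessing the situation in the current state and making decisions based on the potential payoffs inferred from that assessment. Deep reinforcement learning and statistical forward planning, the two mainstream families of general-purpose AI decision-making and planning algorithms today, have achieved remarkable results in game artificial intelligence. Deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning and takes maximizing the environment's reward signal as its optimization objective, updating the decision model end to end. Statistical forward planning builds a forward model that incorporates human heuristic prior knowledge and, based on that model, adaptively explores the simulated environment and plans high-value action sequences to use as game decisions. To exploit the strengths of both, this work studies intelligent strategy methods and implementation techniques for adversarial games on the basis of deep reinforcement learning and statistical forward planning, with the aim of further improving game strategy models; this has theoretical significance and practical value for improving machine game-playing performance and promoting the application of intelligent decision-making technology in professional domains. The thesis takes intelligent strategy methods in adversarial game environments as its research objective and proceeds from complete-information turn-based games to imperfect-information real-time games, centered on deep reinforcement learning and statistical forward planning. First, starting from complete-information turn-based games, it studies a self-play strategy model on a Gomoku platform. Second, for real-time strategy games, it studies an end-to-end macro-strategy model on the StarCraft platform. Third, for imperfect-information real-time games, it studies a forward-reasoning planning strategy model with opponent modeling on a fighting game platform. Finally, for multi-robot real-time adversarial games, it studies a cooperative multi-robot adversarial model driven by opponent experience replay and a curiosity mechanism on the RoboMaster robot platform. The main work and contributions are summarized as follows: 1. A strategy game model based on reinforcement learning and Monte Carlo tree search. For complete-information turn-based zero-sum games, this work proposes a self-play adaptive dynamic programming reinforcement learning method that optimizes a state-value neural network during play and combines it with Monte Carlo tree search to strengthen the model's multi-step reasoning. The work uses an efficient network architecture and compact state features, which speeds up both forward inference and training convergence; evaluated on a Gomoku platform, the model reaches top amateur playing strength. 2. An end-to-end real-time strategy game model based on a hybrid policy-value network. For real-time strategy games, this work designs an end-to-end deep reinforcement learning policy model for macro production-sequence decisions. At the perception level it deeply fuses the environment's numerical information with two-dimensional spatial information, and it uses prioritized experience replay, a double neural network model, an auxiliary win-rate prediction task, and multi-threaded parallelism to speed up training; the model is trained against the game's built-in bots at different difficulty levels to verify its performance. The macro production decision model was deployed in a StarCraft bot, which won the Student Group championship of the 2019 StarCraft AI competition. 3. A rolling horizon evolution strategy game model with an adaptive opponent model. For imperfect-information two-player zero-sum games with real-time simultaneous confrontation, this work models the opponent with a neural network and, using the state-action-reward triples generated during play, adapts the opponent model with supervised and reinforcement training to reduce the negative impact of lacking a model of the opponent's strategy. It exploits the forward-reasoning evolutionary capability of rolling horizon evolution, combining the adaptive opponent model with the forward model to optimize action sequences over a rolling horizon. The algorithm was tested on the fighting game AI platform FightingICE, and the proposed model won the 2020 Fighting Game AI Competition. 4. A multi-robot strategy game model based on opponent strategy replay and curiosity. For real-time multi-robot strategy games, this work uses the physical RoboMaster robot platform as its research setting. By interacting with a behavior decision tree model, it designs an experience replay method based on opponent strategies that speeds up training and addresses the “cold start” problem, and it improves the model's environmental exploration through a modified curiosity-driven mechanism. The work builds a multi-objective reward system that combines conventional rewards with exploration rewards to update the network weights and, by optimizing the model in adversarial environments of varying difficulty, designs and implements a multi-level simulation-to-real decision system. The proposed methods were applied to the decision module of physical RoboMaster robots and won first prize in the Decision Group and the Grade A Open Source Excellence Award at the RoboMaster Artificial Intelligence Challenge.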
Abstract (English)	Strategy games are an important measure of the level of artificial intelligence and have long attracted wide attention from researchers. Playing a game requires evaluating the current state and making decisions according to the possible payoffs inferred from that evaluation. Two mainstream families of general-purpose AI decision-making and planning algorithms, deep reinforcement learning and statistical forward planning, have achieved remarkable results in game artificial intelligence. Deep reinforcement learning combines the perception ability of deep learning with the decision-making capacity of reinforcement learning and takes maximizing the environmental reward signal as its optimization objective, updating the decision-making model end to end. Statistical forward planning integrates human heuristic prior knowledge to construct a forward model and uses that model to adaptively explore and plan high-value action sequences as game decisions. Building on deep reinforcement learning and statistical forward planning, studying intelligent strategy approaches and implementation techniques for adversarial games is therefore of theoretical significance and practical value: it can further improve the performance of game strategy models and promote the application of intelligent decision-making technology in professional fields. This thesis takes intelligent strategy approaches in adversarial games as its research objective. Proceeding from complete-information turn-based games to imperfect-information real-time games, it focuses on deep reinforcement learning and statistical forward planning methods. First, for the complete-information turn-based game, a self-play confrontation model is studied on the Gomoku platform.
Then, for the real-time strategy game, an end-to-end macro-strategy confrontation model is studied on the StarCraft platform. Next, for the imperfect-information real-time game, a forward-reasoning planning strategy model with adaptive opponent modeling is studied on a fighting game platform. Finally, for the multi-robot real-time confrontation game, a cooperative and competitive multi-robot strategy model is studied on the RoboMaster robot system, using replay memory of the opponent’s experience and a curiosity-driven mechanism. The main work and contributions of this thesis are summarized as follows: 1. A strategy game model based on reinforcement learning and Monte Carlo tree search. For complete-information turn-based zero-sum games, this thesis proposes a self-play adaptive dynamic programming reinforcement learning method that optimizes the state-value neural network during play and combines it with Monte Carlo tree search to improve the multi-step reasoning ability of the game model. Model inference and training convergence are accelerated through an efficient neural network architecture and a compact state representation. The Gomoku strategy model was validated on the platform and reached the level of top amateur players. 2. An end-to-end real-time strategy game model based on a hybrid policy-value network. For real-time strategy games in complex environments, an end-to-end deep reinforcement learning strategy model is designed for macro production-sequence decision tasks. At the perception level, the numerical information and two-dimensional spatial information of the environment are deeply fused.
Training is accelerated with prioritized experience replay, a double neural network model, an auxiliary win-prediction task, and a multi-threaded parallel mechanism. The deep game model is trained against the built-in AI at different difficulty levels to verify its performance. The macro production decision-making model was applied to a StarCraft AI bot, which won the Student Group championship of the 2019 StarCraft AI competition. 3. An enhanced rolling horizon evolution algorithm with an adaptive opponent model. For two-player zero-sum games with imperfect information and real-time simultaneous confrontation, the opponent model is built on a neural network. Using the state-action-reward triples generated during confrontation, supervised and reinforcement training methods adaptively adjust the opponent model to reduce the negative impact of lacking the opponent's strategy. Then, leveraging the forward planning ability of the rolling horizon evolution algorithm, the decision sequence is optimized over a rolling horizon by combining the adaptive opponent model with the forward model. The performance of the fighting game AI agent was tested on the FightingICE platform, and the proposed algorithm won the 2020 Fighting Game AI Competition. 4. A multi-robot strategy game model based on replay memory of opponent strategies and curiosity. For the strategy game problem of real-time multi-robot systems, the physical RoboMaster robot platform is taken as the research setting, and a replay memory method with opponent strategies, based on a decision behavior tree, is designed to speed up training. This method addresses the “cold start” problem in model optimization.
In addition, an improved curiosity-driven approach is proposed to enhance the model’s environmental exploration. A multi-objective reward system that combines the conventional reward and the exploration reward is then constructed to update the network weights. Finally, a multi-level decision-making optimization system is designed and implemented for transfer from simulation to the real robots. The proposed approach was applied to the RoboMaster decision-making module and won first prize in the Decision Group and the Grade A Open Source Excellence Award at the RoboMaster Artificial Intelligence Challenge.
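As a rough illustration of contribution 3, the sketch below shows one way rolling horizon evolution can plan against an opponent model. It is not the thesis implementation: the toy rock-paper-scissors payoff, the count-based `CountOpponentModel` (standing in for the neural-network opponent model), the one-line `forward_model`, and all parameter values are assumptions made for this example only.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1, 2)  # toy rock-paper-scissors style moves (hypothetical)


def payoff(mine: int, theirs: int) -> int:
    """+1 if `mine` beats `theirs`, -1 if it loses, 0 on a tie."""
    if mine == theirs:
        return 0
    return 1 if (mine - theirs) % 3 == 1 else -1


class CountOpponentModel:
    """Assumed stand-in for a learned opponent model: per-phase action counts."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0, 0])

    def update(self, tick: int, opp_action: int) -> None:
        # Adapt the model from an observed opponent action.
        self.counts[tick % 3][opp_action] += 1

    def predict(self, tick: int) -> int:
        row = self.counts[tick % 3]
        return row.index(max(row))  # most frequent action; ties go to the lowest id


def forward_model(tick: int, mine: int, theirs: int):
    """Toy forward model: advance one tick and return (next_tick, reward)."""
    return tick + 1, payoff(mine, theirs)


def evaluate(plan, tick, opponent):
    """Roll a plan forward against the opponent model's predicted actions."""
    total = 0
    for action in plan:
        tick, reward = forward_model(tick, action, opponent.predict(tick))
        total += reward
    return total


def rhea_plan(tick, opponent, horizon=5, generations=300, seed=0):
    """(1+1) rolling horizon evolution: mutate one gene, keep the child if not worse."""
    rng = random.Random(seed)
    best = [rng.choice(ACTIONS) for _ in range(horizon)]
    best_score = evaluate(best, tick, opponent)
    for _ in range(generations):
        child = list(best)
        child[rng.randrange(horizon)] = rng.choice(ACTIONS)
        score = evaluate(child, tick, opponent)
        if score >= best_score:
            best, best_score = child, score
    return best, best_score


opponent = CountOpponentModel()
for t in range(30):  # the observed opponent cycles through 0, 1, 2, ...
    opponent.update(t, t % 3)

plan, score = rhea_plan(tick=0, opponent=opponent)
print(plan[0], score)  # only the first action is executed before replanning
```

In a real-time setting the agent would execute only `plan[0]`, observe the opponent's actual move, call `update`, and replan on the next tick; a fuller version would typically also seed the next search with the shifted previous plan.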
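Keywords	deep reinforcement learning; statistical forward planning; strategy games; intelligent decision-making; game artificial intelligence
Language	Chinese
Seven major directions: sub-direction classification	Machine game playing
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/45058
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Deep Reinforcement Learning
Recommended citation (GB/T 7714)	Tang Zhentao (唐振韬). Research on Intelligent Strategies in Adversarial Game Environments (博弈对抗环境中智能策略研究) [D]. Beijing: Institute of Automation, Chinese Academy of Sciences, 2021.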

Files in this item
File name/size	Document type	Version type	Access type	License
博弈对抗环境中智能策略研究-唐振韬博士学（23513KB）	Thesis		Open access	CC BY-NC-SA