Research on Multi-Style Strategy Integration Algorithms Incorporating Opponent Modeling (融合对手建模的多风格策略集成算法研究)
白丰硕
2023-05
Pages: 71
Degree type: Master
Chinese Abstract

Complex games are game problems of high complexity: they typically involve numerous participants, vast sets of decision variables, and diverse potential outcomes. The behavior of such problems is shaped not only by individual decisions but also by the interactions and joint decision-making among participants. In practice, complex game problems have broad application prospects and research value in areas such as information dissemination in social networks, pricing strategy in market competition, international coordination on global climate policy, and international political and military conflict. However, real-world game problems usually have enormous scale and numerous participants who interact in multiple ways, which makes them hard to study directly. In recent years, deep reinforcement learning has demonstrated strong self-learning and exploration capabilities on a wide range of sequential decision-making problems, offering new ideas for solving complex games. Games abstracted from practical problems have therefore attracted wide attention in academia. Studying these complex games, designing targeted game-solving algorithms, and then applying them to practical problems has become an effective research paradigm: it advances game theory and provides concrete support for solving real-world complex game problems.

Current game-solving frameworks based on autonomous evolution, however, suffer from several problems that make them difficult to apply in practice. The main issues are the inefficiency caused by cyclic dominance among strategies and the uncertainty that results from underusing opponent information. In the real world, game solving is a long process, and most of the computational cost is spent in the region of relatively weak strategies, which occupies a large portion of the strategy space; this makes policy training inefficient and drives up time and compute costs. Meanwhile, in typical adversarial games, the sheer complexity of the strategy space makes the opponent's intentions and behavior uncertain: situation assessment is subjective, behavior does not repeat, intentions and resolve are hard to characterize, and decision information is incomplete. In addition, a fixed action policy is easily probed for weaknesses and exploited by the opponent, so one's policy must adapt flexibly and seek the best response in real time to improve its performance. These two factors are the main reasons that autonomous-evolution-based game-solving frameworks face so many limitations in practice.

This thesis takes complex games as its research object and aims to address the challenges facing current autonomous-evolution-based game-solving frameworks. Targeting the training inefficiency caused by strategies falling into cyclic dominance and the underuse of opponent information during game solving, it systematically studies multi-style strategy integration algorithms that incorporate opponent modeling. The main contributions and innovations are as follows:
1. For typical adversarial game scenarios, this thesis systematically analyzes the research background of cyclic strategy dominance and policy diversity. In non-transitive dynamic games, behavioral diversity helps agents explore the environment more effectively and learn better policies, so it is essential to obtain baseline policies that differ markedly in behavioral style while retaining the potential to win. On this basis, the thesis proposes a multi-style baseline policy training framework based on reward shaping, designed to train such policies efficiently. To quantify policy diversity, the framework introduces a metric based on the Jensen-Shannon divergence, and it formulates the joint training of the policy and a parameterized reward function as a bi-level optimization problem, improving both training efficiency and policy diversity. Experiments in typical game scenarios show that the framework efficiently trains baseline policies with distinct styles and winning capability, and that these policies exhibit clear cyclic dominance relations.

2. For the cyclic dominance problem in autonomous-evolution-based game solving, the thesis systematically analyzes the research background of multi-task reinforcement learning and gradient correction, and proposes a gradient-correction-based multi-style strategy integration algorithm. The algorithm learns and masters policies of different styles efficiently within a single neural network and uses style identifiers to switch the policy network between styles. Concretely, the integrated policy is trained with a two-stage optimization method consisting of a policy improvement stage and a policy correction stage: the improvement stage raises the integrated policy's win rate under a given style, while the correction stage mitigates the negative effect that the improvement stage's parameter updates have on the integrated policy's performance under the other styles. Experiments in adversarial game test environments show that the framework trains integrated policies efficiently, and extended experiments show that the integrated policy can defeat the other policies in the cyclic dominance chain, breaking the cycle and improving policy transitivity.

3. For the underuse of opponent information in autonomous-evolution-based game solving, the thesis points out that, within the adaptive evolutionary framework, one's own policy lacks systematic modeling of the opponent's strategy and effective analysis of the opponent's behavioral style, and therefore cannot reasonably anticipate the risks posed by the opponent's future behavior. After a systematic analysis of opponent modeling methods, the thesis proposes a type-based algorithm for predicting the opponent's policy distribution, which uses the opponent's historical behavior to predict its style. Through a confidence-based mechanism for choosing one's own style, the algorithm further reduces the uncertainty of both the opponent and the system and raises the probability that one's own policy wins. Experiments in adversarial game test environments confirm the effectiveness and generalization of the opponent style prediction algorithm.

Overall, starting from the shortcomings of autonomous-evolution-based frameworks for solving complex games, this thesis systematically studies solution methods for the challenge of cyclic strategy dominance during agent training and for the difficulty of improving policy performance in adversarial settings caused by the unknowability, diversity, and uncertainty of opponent strategies, and proposes a series of training frameworks. The results are of significant theoretical and practical value.

English Abstract

Complex games are games of high complexity, typically involving numerous participants, a vast collection of decision variables, and diverse potential outcomes. The behavioral characteristics of such problems are influenced not only by individual decisions but also by the interactions and collaborative decision-making among participants. In practical scenarios, complex game problems have broad application prospects and research value in areas such as information dissemination in social networks, pricing strategy formulation in market competition, international collaborative governance of global climate policy, and international political and military conflicts. However, real-world game problems often have enormous scale and numerous participants, with various forms of interaction among them and an extremely complex strategy space. These challenges make such problems difficult to study directly. In recent years, deep reinforcement learning has demonstrated powerful self-learning and exploration capabilities in various sequential decision-making problems, providing new ideas for solving complex games. As a result, games abstracted from practical problems have attracted widespread attention in academia. Studying these complex games, designing targeted algorithms, and then applying them to practical problems has become an effective research paradigm: it not only helps advance game theory but also provides powerful support for solving complex game problems in the real world.

This thesis takes complex games as its research object and aims to tackle the challenges faced by current autonomous evolutionary game-solving frameworks. Specifically, it addresses the low training efficiency caused by cyclic strategy dominance, as well as the failure to fully exploit opponent information during game solving. To overcome these issues, the thesis systematically studies a multi-style strategy integration algorithm that incorporates opponent modeling. The main contributions and innovations are summarized below.
1. This thesis provides a systematic analysis of the research background of cyclic strategy dominance and policy diversity in typical adversarial game scenarios. It emphasizes the importance of behavioral diversity among agents in non-transitive dynamic games, since diversity enables more effective exploration of the environment and the learning of better policies. Based on this, the thesis proposes a multi-style baseline policy training framework based on reward shaping, designed to efficiently train policies with diverse styles and winning capability. To quantify policy diversity, the framework introduces a metric based on the Jensen-Shannon divergence (JSD). The joint training of the policy and the parameterized reward function is formulated as a bi-level optimization problem, which improves both the efficiency of baseline policy training and the diversity of the resulting policies. Experimental results in typical game scenarios demonstrate that the framework can efficiently train baseline policies with diverse styles and winning capability, and that these policies exhibit clear cyclic dominance relations.
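As a minimal sketch of the general form such an objective can take (the thesis's exact formulation is not reproduced here; \pi_i denotes the action distribution of the i-th style policy, H(\cdot) the entropy, r_\phi a parameterized shaping reward, and \lambda a trade-off weight, all assumed for illustration):

% JSD-based diversity: entropy of the mixture minus the mean entropy
D_{\mathrm{JSD}}(\pi_1,\dots,\pi_n) = H\Big(\tfrac{1}{n}\sum_{i=1}^{n}\pi_i\Big) - \tfrac{1}{n}\sum_{i=1}^{n} H(\pi_i)

% Bi-level training: the outer level tunes the shaping reward for win
% rate plus diversity; the inner level trains against the shaped return
\max_{\phi}\; J_{\mathrm{win}}\big(\pi^*(\phi)\big) + \lambda\, D_{\mathrm{JSD}}\big(\pi^*(\phi)\big)
\quad\text{s.t.}\quad \pi^*(\phi) \in \arg\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_t r_{\mathrm{env}}(s_t,a_t) + r_{\phi}(s_t,a_t)\Big]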

2. This thesis focuses on the cyclic dominance problem in autonomous evolutionary game-solving algorithms and provides a systematic analysis of the research background of multi-task reinforcement learning and gradient modification. To overcome this problem, it proposes a gradient-correction-based multi-style strategy integration algorithm that can efficiently learn and master policies with different styles using a single neural network, and that uses style identifiers to switch the policy network between styles. The integrated policy is trained with a two-stage optimization method comprising a policy improvement stage and a policy correction stage. The improvement stage enhances the performance of the integrated policy when using a specific style, while the correction stage mitigates the negative impact that the improvement stage's parameter updates have on the integrated policy's performance under the other styles. Experimental results in adversarial testing environments demonstrate that the framework can efficiently train integrated policies. Extended experiments further show that the integrated policy can defeat the other policies in the cyclic dominance chain, breaking the strategy cycle and improving the transitivity of the policy.
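For illustration only (this is not the thesis's exact procedure), the correction stage can be sketched as a projection in the spirit of PCGrad-style gradient surgery: the improvement gradient for the active style is stripped of any component that conflicts with the gradients of the other styles. All names below are hypothetical.

import numpy as np

def correct_gradient(g_style, other_grads, eps=1e-12):
    # Policy improvement proposes g_style for the active style; the
    # correction step removes any component of it that points against
    # another style's gradient, protecting the other styles' performance.
    g = g_style.copy()
    for g_o in other_grads:
        dot = float(g @ g_o)
        if dot < 0.0:  # conflicting update directions
            g = g - (dot / (float(g_o @ g_o) + eps)) * g_o
    return g

# Toy usage: one corrected ascent step on two shared parameters.
g_style = np.array([1.0, 1.0])
other_grads = [np.array([-1.0, 0.5]), np.array([0.2, -1.0])]
theta = np.zeros(2)
theta += 0.1 * correct_gradient(g_style, other_grads)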

3. This thesis reveals the problem of inadequate exploitation of opponent information in autonomous evolutionary game-solving algorithms. It highlights that, within the adaptive evolutionary framework, the absence of systematic modeling of opponent strategies and effective analysis of their behavioral styles hampers the ability to anticipate the potential risks posed by the opponent's future behavior. To address this issue, after a systematic analysis of opponent modeling methods, the thesis proposes a type-based opponent policy distribution prediction algorithm that uses the opponent's historical behavioral data to predict its style. Through a confidence-based mechanism for determining one's own style, the algorithm further reduces the uncertainty of both the opponent and the system and raises the probability that one's own policy wins. Experimental results in adversarial game testing environments demonstrate the effectiveness and generalization of the opponent style prediction algorithm.
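A minimal sketch of type-based prediction with a confidence gate, assuming the opponent belongs to one of a few known style types whose per-action likelihoods are available: the type posterior is updated by Bayes' rule from observed actions, and one's own style switches only once the posterior clears a threshold. All names and the threshold value are illustrative.

import numpy as np

def update_posterior(prior, action, type_models):
    # type_models[k][action] = probability that opponent style k
    # takes the observed action; standard Bayes update over types.
    likelihood = np.array([m[action] for m in type_models])
    posterior = prior * likelihood
    return posterior / posterior.sum()

def pick_own_style(posterior, best_response, default_style, tau=0.8):
    # Commit to the best response against the predicted style only
    # when confident; otherwise keep a conservative default style.
    k = int(np.argmax(posterior))
    return best_response[k] if posterior[k] >= tau else default_style

# Toy usage: two opponent styles over three discrete actions.
type_models = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7])]
posterior = np.array([0.5, 0.5])
for a in [0, 0, 1]:  # observed opponent actions
    posterior = update_posterior(posterior, a, type_models)
own_style = pick_own_style(posterior,
                           best_response=["aggressive", "defensive"],
                           default_style="balanced")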

Overall, this thesis addresses the limitations of autonomous evolutionary frameworks for solving complex games, focusing on the challenge of cyclic strategy dominance during agent training and on the difficulty of improving policy performance in adversarial environments caused by the unknowability, diversity, and uncertainty of opponent strategies. Systematic research on solution methods is conducted, and a series of training frameworks is proposed. The research results have significant theoretical and practical value.

Keywords: intelligent game confrontation; deep reinforcement learning; reward shaping; multi-task reinforcement learning; opponent modeling
Language: Chinese
Sub-direction classification: machine game playing
State Key Laboratory planning direction: intelligent gaming and opponent modeling
Associated dataset to be deposited:
Document type: thesis
Identifier: http://ir.ia.ac.cn/handle/173211/51947
Collection: Graduates_Master Theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
白丰硕. 融合对手建模的多风格策略集成算法研究[D], 2023.
Files in this item:
Master_Thesis.pdf (5376 KB) | Document type: thesis | Access: restricted | License: CC BY-NC-SA