Title: 信息不完备条件下的复杂决策问题高效强化学习算法研究 (Research on Efficient Reinforcement Learning Algorithms for Complex Decision-Making Problems under Imperfect Information)
Author: 赵恩民
Date: 2023-05-26
Pages: 132
Degree type: Doctoral
Chinese Abstract

Reinforcement learning is an important branch of machine learning that studies how an agent can learn the optimal policy of a sequential decision-making process from its interactions with the environment. Complex decision-making problems under imperfect information simultaneously feature partially unobservable observation spaces, uncertain information about opponent styles, and high-dimensional policy spaces that are hard to search. They therefore pose a great challenge to the efficient application of traditional reinforcement learning algorithms and are a focus and a difficulty of current artificial intelligence research.

 

This thesis studies two types of complex decision-making problems with imperfect information: 1) single-agent exploration problems with imperfect environment information: the state space of such environments is large, a simple exploration strategy cannot reveal all of the information, and some states can only be reached through a continual exploration-exploitation process; typical examples are sparse-reward environments such as maze-like exploration games; 2) multi-agent game problems with imperfect opponent information: such environments generally contain two or more agents, the state space is huge, and a good policy cannot be obtained through simple interaction.

To address these problems, this thesis carries out four lines of work aimed at providing efficient reinforcement learning solutions for complex decision-making problems with imperfect information. First, inspired by the artificial potential field used in robot obstacle-avoidance systems, it designs a rule-based intrinsic reward mechanism for single-agent exploration problems with imperfect environment information. Second, for the same class of exploration problems, it designs learning-based intrinsic reward mechanisms that effectively combine exploration and exploitation to improve the agent's performance. Third, for multi-agent game problems with imperfect opponent information, taking heads-up no-limit Texas Hold'em as an example, it innovatively designs a new solution based on deep reinforcement learning. Finally, building on the two-player no-limit Texas Hold'em work, it uses multiplayer no-limit Texas Hold'em as an example to verify the transferability and effectiveness of the framework.

The completed research work and contributions of this thesis are summarized as follows:

 

(1) For single-agent exploration problems with imperfect environment information, this thesis proposes a rule-based intrinsic reward mechanism: the Potentialized Experience Replay algorithm. Potentialized Experience Replay introduces the artificial potential field into single-agent exploration environments with imperfect information, defines a potential function for each state the agent visits, sets exploration goals, and helps the agent generate efficient samples in complex exploration environments. Starting from the robot obstacle-avoidance problem, the algorithm greatly improves the agent's understanding of unknown environments and achieves a large performance improvement on exploration problems compared with traditional reinforcement learning agents.

 

(2) For single-agent exploration problems with imperfect environment information, this thesis proposes a learning-based intrinsic reward mechanism: Information Network Distillation. By introducing state difficulty information and state pseudo-value information, it derives two algorithms: Difficulty Information Network Distillation and Pseudo-Value Information Network Distillation. Both algorithms build on Random Network Distillation: by defining the difficulty and the pseudo-value of each state and using network distillation to automatically distill an exploration-exploitation signal for each state, they form a new network structure that combines exploration with exploitation, greatly improving the agent's ability to combine exploration of unknown environments with exploitation of known ones and further improving performance on exploration problems.

 

(3) For multi-agent game problems with imperfect opponent information, this thesis proposes AlphaHoldem, a new end-to-end, lightweight, high-performance reinforcement learning method for heads-up no-limit Texas Hold'em. AlphaHoldem pioneers an efficient feature encoding, a pseudo-Siamese neural network for feature extraction, a new loss function, and a self-play scheme that greatly reduces computational resources, yielding a world-leading two-player Texas Hold'em AI. Unlike traditional methods based on counterfactual regret minimization, AlphaHoldem applies no abstraction to the hole cards and can handle a larger set of action abstractions. While consuming almost no storage, AlphaHoldem outperforms two of the world's top Texas Hold'em AIs, Slumbot and DeepStack. Compared with traditional high-performance AIs based on counterfactual regret minimization, AlphaHoldem reduces training resources by a factor of 50 and test time by nearly 1000, and it defeated professional Asian players in human-machine tests. AlphaHoldem provides guidance for the subsequent development of more intelligent Texas Hold'em AIs, makes an important contribution to artificial intelligence for imperfect-information games, and to some extent advances the development of general decision-making models.

 

(4) For multi-agent game problems with imperfect opponent information, building on the successful experience of AlphaHoldem, this thesis innovatively designs HoldemZoo, a high-performance multiplayer Texas Hold'em AI. Using an effective state encoding, a new self-play scheme, and a fast opponent-adaptation module, HoldemZoo defeats other multiplayer Texas Hold'em AIs without any domain knowledge, and its framework can be quickly extended to arbitrary multiplayer poker games. HoldemZoo provides a reference for the subsequent development of more intelligent multiplayer imperfect-information AIs and takes a step toward the ultimate goal of general artificial intelligence.

English Abstract

Reinforcement learning is a major branch of machine learning that focuses on how to enable agents to learn optimal strategies for sequential decision-making processes from their interactions with the environment.

Complex decision-making problems with imperfect information pose a great challenge to the efficiency of traditional reinforcement learning algorithms because they simultaneously involve partially unobservable observation spaces, uncertain information about opponent styles, and high-dimensional strategy spaces that are difficult to search; such problems are a focus and a difficulty of current artificial intelligence research.


This paper focuses on two types of complex decision problems with imperfect information: 1) single-agent hard exploration problems with imperfect information about the environment: the state space of this type of environment is large, and the agent cannot obtain all of the information through a simple exploration strategy; the agent needs to repeatedly explore and exploit the environment to obtain rewards. Typical examples are sparse-reward environments such as maze-like exploration games. 2) Multi-agent game problems with imperfect information about the opponent: there are generally two or more agents in this kind of environment, the state space of the environment is huge, and a good strategy cannot be obtained through simple interaction.

In this paper, we present four pieces of work to address the above problems, with the aim of providing efficient reinforcement learning frameworks for complex decision problems with imperfect information. Firstly, inspired by human navigation systems as well as robotic obstacle-avoidance systems, this paper designs two rule-based intrinsic rewards for single-agent hard exploration problems with imperfect environmental information. Secondly, this paper designs two learning-based intrinsic rewards for single-agent hard exploration problems with imperfect environmental information; these mechanisms effectively combine the exploration and exploitation of an agent to enhance its performance in such problems. Thirdly, this paper designs a novel framework using deep reinforcement learning for multi-agent problems with imperfect information, using heads-up no-limit Texas Hold'em as an example. Finally, building on the two-player no-limit Texas Hold'em AI, this paper validates the transferability and effectiveness of the framework, using multiplayer poker as an example.

The research work and contributions in this paper are summarised as follows.

 

(1) For single-agent hard exploration problems with imperfect information about the environment, this paper proposes a rule-based method: the Potentialized Experience Replay algorithm. Potentialized Experience Replay introduces artificial potential fields into these exploration problems, defines a potential function for each state the agent experiences, sets up exploration goals, and helps the agent generate efficient samples in complex exploration environments. Starting from the robot obstacle-avoidance problem, Potentialized Experience Replay substantially improves the agent's understanding of the unknown environment and achieves a significant performance improvement on exploration problems relative to traditional reinforcement learning algorithms.
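The abstract does not spell out the exact form of the potential function, so the following is only a minimal sketch of potential-based intrinsic reward shaping in the spirit of Potentialized Experience Replay, assuming a simple attractive field (negative distance to a hypothetical exploration goal) borrowed from robot obstacle avoidance; the goal variable and scaling are illustrative assumptions, not the thesis's actual design.

```python
import numpy as np

# Sketch only: an attractive artificial potential field toward a hypothetical
# exploration goal, used to densify a sparse environment reward. The concrete
# potential function in Potentialized Experience Replay may differ.

def attractive_potential(state: np.ndarray, goal: np.ndarray, scale: float = 1.0) -> float:
    """Potential is higher (less negative) the closer the state is to the goal."""
    return -scale * float(np.linalg.norm(state - goal))

def shaped_reward(env_reward: float, state: np.ndarray, next_state: np.ndarray,
                  goal: np.ndarray, gamma: float = 0.99) -> float:
    """Classic potential-based shaping: add gamma*phi(s') - phi(s) to the sparse reward."""
    bonus = gamma * attractive_potential(next_state, goal) - attractive_potential(state, goal)
    return env_reward + bonus

# The shaped transitions would then be written into the replay buffer, giving an
# off-policy learner (e.g. DQN) a dense learning signal before the true goal is reached.
```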

 

(2) To address single-agent hard exploration problems with imperfect information about the environment, this paper proposes a learning-based framework combining single-agent exploration and exploitation: the Information Network Distillation framework. Through the introduction of state difficulty information and state pseudo-value information, we propose two algorithms: Difficulty Information Network Distillation and Pseudo-Value Information Network Distillation. Both algorithms define the difficulty and the pseudo-value of each state and use Random Network Distillation to automatically distill an exploration-exploitation signal for each state, forming a new network structure that combines exploration and exploitation. They substantially improve the agent's ability to combine exploration of the unknown environment with exploitation of the known environment, further achieving a significant performance improvement on hard exploration problems.
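Since the text names Random Network Distillation as the building block, a minimal RND sketch is given below; how the thesis defines the per-state difficulty and pseudo-value targets on top of it is not reproduced here, and the network sizes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal Random Network Distillation sketch: a frozen, randomly initialised
# target network is imitated by a trained predictor; the prediction error is
# large on rarely visited states and serves as an exploration bonus.

def make_net(in_dim: int, out_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class RNDBonus:
    def __init__(self, obs_dim: int, feat_dim: int = 64, lr: float = 1e-4):
        self.target = make_net(obs_dim, feat_dim)      # fixed random network
        self.predictor = make_net(obs_dim, feat_dim)   # trained to imitate it
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        """Per-state exploration bonus = predictor's error against the frozen target."""
        with torch.no_grad():
            target_feat = self.target(obs)
        return (self.predictor(obs) - target_feat).pow(2).mean(dim=-1).detach()

    def update(self, obs: torch.Tensor) -> float:
        """Fit the predictor on visited states so familiar states stop yielding bonuses."""
        loss = (self.predictor(obs) - self.target(obs)).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return float(loss.item())
```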

 

(3) To address the multi-agent problem with imperfect information, this paper proposes AlphaHoldem, a new end-to-end, lightweight, high-performance reinforcement learning agent for heads-up no-limit Texas Hold'em. AlphaHoldem combines an effective game state representation, a pseudo-Siamese architecture for feature extraction, novel loss functions, and a new self-play approach that significantly reduces computational resources, resulting in a high-performance two-player Texas Hold'em agent. Unlike traditional counterfactual regret minimization methods, AlphaHoldem applies no abstraction to the hole cards and can handle a larger set of action abstractions. AlphaHoldem outperforms Slumbot and DeepStack, two of the world's top high-performance Texas Hold'em agents, while consuming virtually no storage space. In addition, AlphaHoldem reduces training resources by 50 times and testing time by nearly 1,000 times compared with traditional high-performance agents based on counterfactual regret minimization, and it beat professional Asian poker players in human-computer tests. AlphaHoldem provides guidance for the subsequent development of more intelligent Texas Hold'em agents, makes an important contribution to artificial intelligence for imperfect-information games, and to a certain extent advances the development of general decision models.
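The exact channel layout of AlphaHoldem's card encoder is not given in the abstract; the sketch below illustrates the general idea of binary suit-by-rank planes stacked per betting round, with the number and ordering of channels chosen only for illustration.

```python
import numpy as np

RANKS = "23456789TJQKA"   # 13 ranks
SUITS = "cdhs"            # 4 suits (clubs, diamonds, hearts, spades)

def card_plane(cards):
    """Encode a set of cards such as ["As", "Kd"] as a 4x13 binary plane (suit x rank)."""
    plane = np.zeros((4, 13), dtype=np.float32)
    for card in cards:
        rank, suit = card[0], card[1]
        plane[SUITS.index(suit), RANKS.index(rank)] = 1.0
    return plane

def encode_cards(hole, flop, turn, river):
    """Stack per-street planes plus aggregate planes into one card tensor."""
    public = flop + turn + river
    planes = [card_plane(hole), card_plane(flop), card_plane(turn),
              card_plane(river), card_plane(public), card_plane(hole + public)]
    return np.stack(planes)   # shape (6, 4, 13); the real channel count may differ

# Example: a hand seen after the turn.
obs = encode_cards(hole=["As", "Kd"], flop=["2c", "7h", "Jd"], turn=["Ts"], river=[])
print(obs.shape)  # (6, 4, 13)
```

A betting-action tensor would presumably be built in a similar plane-based way and processed alongside the card tensor by the pseudo-Siamese feature extractor; that part is not sketched here.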

 

(4) To address the multi-agent problem with imperfect information, this paper draws on the successful experience of AlphaHoldem to innovatively design HoldemZoo, the world's second high-performance multiplayer poker artificial intelligence. Through an effective state encoding, a new self-play scheme, and a fast opponent-adaptation module, HoldemZoo defeats other multiplayer Texas Hold'em agents without using any domain knowledge, and its framework can be quickly extended to arbitrary multiplayer poker games. HoldemZoo provides a reference for the subsequent development of more intelligent multiplayer imperfect-information agents and is a step towards the ultimate goal of general artificial intelligence.
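Neither the self-play scheme nor the fast opponent-adaptation module is described in the abstract, so the following is only a generic opponent-pool self-play sketch of the kind a multiplayer agent like HoldemZoo might use; the pool size, sampling rule, and class names are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class OpponentPool:
    """Hypothetical pool of past policy snapshots used to fill the other seats."""
    capacity: int = 20
    snapshots: list = field(default_factory=list)

    def add(self, policy_checkpoint) -> None:
        """Snapshot the current learner policy; evict the oldest once full."""
        self.snapshots.append(policy_checkpoint)
        if len(self.snapshots) > self.capacity:
            self.snapshots.pop(0)

    def sample_table(self, num_opponents: int) -> list:
        """Sample opponents for one multiplayer hand (snapshots may repeat)."""
        if not self.snapshots:
            raise RuntimeError("pool is empty; add at least one snapshot first")
        return [random.choice(self.snapshots) for _ in range(num_opponents)]

# Outline of use: after every N gradient updates, call pool.add(current_policy);
# to generate a 6-player hand, the learner takes one seat and pool.sample_table(5)
# fills the remaining five seats with earlier versions of itself.
```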

 

Keywords: imperfect information; complex decision-making problems; reinforcement learning; single-agent exploration; imperfect-information games
Subject area: Information Science and Systems Science
Discipline: Engineering
Indexed by: Other
Language: Chinese
Publisher: 赵恩民、兴军亮
Sub-direction (of the seven research directions): Machine Learning
State Key Laboratory planning direction: Basic Theory of Open Games
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/51928
Collection: Graduates - Doctoral Dissertations
Recommended citation (GB/T 7714):
赵恩民. 信息不完备条件下的复杂决策问题高效强化学习算法研究[D], 2023.
Files in this item:
202018014628093赵恩民.p (25370 KB) | Dissertation | Restricted access | CC BY-NC-SA