Adaptive Dynamic Programming Methods for Several Classes of Differential Games


	Zhang, Qichao (1, 2)
	2017-05-25
Degree type	Doctor of Engineering
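Chinese abstract	As a class of optimization methods arising from the intersection of reinforcement learning, optimal control, and artificial neural networks, adaptive dynamic programming (ADP) interacts with the environment by imitating biological learning mechanisms, and uses the interaction data to keep learning and improving its own policy until system performance is optimal. Because ADP can overcome the "curse of dimensionality" of traditional dynamic programming, it has become a recent research focus in intelligent control and computational intelligence. Notably, most modern control systems contain two or more control units, or even multiple subsystems, and such control problems can be viewed as cooperative or competitive nonlinear games. However, current ADP methods still face many difficulties and shortcomings when solving these complex nonlinear games, especially differential games with uncertainties, constrained inputs, or unknown models. Using ADP to solve complex nonlinear differential games is therefore of theoretical and practical significance. How to design ADP methods that improve data utilization, save communication resources, and reduce the computational burden is also a topic worth exploring in depth. Building on a survey of the current state of research, and using optimal control theory, reinforcement learning, and game theory as its main tools, this thesis studies ADP theory and methods for nonlinear systems in order to solve several classes of complex nonlinear differential games (zero-sum games, non-zero-sum games, and fully cooperative games), while improving the algorithms to raise data utilization, save communication resources, and reduce the computational burden. The main chapters contain the following work and contributions:
1. For the two-player zero-sum game, an event-triggered adaptive dynamic programming (EADP) algorithm is proposed, together with a proof of network convergence. The EADP algorithm effectively saves communication resources and reduces the computational burden, and it also applies to H∞ control problems. Neural network approximators are designed to approximate the optimal value function, the optimal control policy, and the worst-case disturbance policy, which yields an approximate Nash equilibrium solution of the two-player zero-sum game. Finally, an implementation based on multi-layer feedforward neural networks and a simulation validation are given.
2. For uncertain nonlinear systems, the robust control problem is solved using ideas from optimal control. First, the robust control problem of the uncertain system is transformed into a cooperative-game optimal control problem for a corresponding auxiliary system, with the influence of the system uncertainties taken into account when the performance index function is designed. An event-triggering condition is then designed so that the resulting optimal controller guarantees the stability of the original uncertain nonlinear system, which means that the optimal controller is also a robust controller for the original system. The EADP algorithm is then used to approximate the optimal control policy of the transformed cooperative game. Finally, the effectiveness of the algorithm is verified on two common simulation systems.
3. For the fully cooperative game with partially constrained inputs, a data-driven adaptive dynamic programming (DADP) algorithm is proposed. DADP uses online data collection and off-policy iterative learning, and no longer depends on system dynamics information or a model identification process. Three neural network approximators are again designed, and the least-squares method is used to update their weights simultaneously so that they approximate the optimal value function, the input-constrained control policy, and the input-unconstrained control policy, respectively. The Lyapunov method is used to prove that the closed-loop system is uniformly ultimately bounded (UUB).
4. For the N-player non-zero-sum game with an unknown model, a suitable neural network identifier is designed to identify the system dynamics, and, based on the identified model, a single critic network structure is used to approximate the solution of the Hamilton-Jacobi equations. In the design of the model identifier and the critic network, the experience replay technique is incorporated so that part of the historical data and the current data are used together to update the neural network weights, which speeds up the convergence of the networks. On this basis, the single-network ADP with experience replay (SAER) algorithm is proposed and its convergence is proved. Simulation experiments on a linear and a nonlinear non-zero-sum game system verify the effectiveness of the algorithm.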
English abstract	As a class of optimization methods based on reinforcement learning, optimal control theory, and artificial neural networks, adaptive dynamic programming (ADP) imitates biological learning mechanisms: the agent learns and improves its control policy from data gathered through interaction with the environment until the system performance index is optimal. ADP can overcome the "curse of dimensionality" of traditional dynamic programming and has recently become a hot topic in intelligent control and computational intelligence. Most modern control systems contain two or more control units, or even multiple subsystems, and can be viewed as competitive or cooperative nonlinear differential games. However, several problems remain unsolved for ADP-based nonlinear differential games, especially those with model uncertainties, constrained inputs, or unknown models. Solving these games with ADP is therefore of theoretical and practical significance. How to design ADP algorithms that improve data utilization, save communication resources, and reduce the computational burden is also an active topic that deserves close study. Building on a review of the related research, this thesis uses optimal control theory, reinforcement learning, and game theory as its main tools to study ADP theory and methods, aiming to solve optimal control problems and differential games (zero-sum, non-zero-sum, and fully cooperative games) of nonlinear systems with efficient data utilization, low communication cost, and reduced computational burden. The main contributions of this thesis are the following four parts.
1. For the two-player zero-sum game, we propose an event-triggered adaptive dynamic programming (EADP) algorithm together with its convergence analysis. The EADP algorithm saves communication resources and reduces the computational burden during learning, and it can also be used to solve the H∞ control problem. Neural network (NN) approximators are constructed to approximate the optimal value function, the optimal control policy, and the worst-case disturbance policy, which yields an approximate Nash equilibrium solution of the zero-sum game. Finally, an implementation using multi-layer feedforward NNs and simulation examples are provided.
2. For nonlinear systems with model uncertainties, the robust control problem is solved using optimal control methods. First, the robust control problem is transformed into an optimal control problem for a corresponding auxiliary cooperative game, with a novel performance index that reflects the model uncertainties. Then a triggering condition is designed to guarantee the stability of the uncertain system under the optimal controller, which means that the optimal controller is also a robust controller for the original uncertain system. The EADP algorithm is then used to approximate the optimal control policy of the transformed cooperative game. Finally, the effectiveness of the proposed algorithm is demonstrated on two common simulation examples.
3. For the fully cooperative game with partially constrained inputs, we propose a data-driven adaptive dynamic programming (DADP) algorithm. Because it relies on online measurements and off-policy learning, the DADP algorithm requires neither the system dynamics nor an identification procedure. Similarly, three NN approximators are constructed to approximate the optimal value function, the constrained policy, and the unconstrained policy, and the least-squares method is used to update the NN weights. The closed-loop system is proved to be uniformly ultimately bounded (UUB) using the Lyapunov approach.
4. For the N-player non-zero-sum game with unknown dynamics, an NN-based identifier is designed to identify the system dynamics. Based on the identified model, a single-network ADP is proposed to approximate the solution of the Hamilton-Jacobi (HJ) equations. The experience replay technique is introduced to speed up the convergence of both the identifier and the critic network, with recorded data and current data used together to update the NNs. The resulting single-network ADP with experience replay (SAER) algorithm is proposed together with its convergence analysis. Finally, the SAER algorithm is tested on a linear and a nonlinear non-zero-sum game, and the simulation results show that it is effective.
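The event-triggered idea behind EADP (updating the controller only when a triggering condition fires, rather than at every sampling instant) can be illustrated with a minimal sketch. This is not the thesis's algorithm or its triggering condition: it assumes a scalar linear plant, a fixed feedback gain in place of a learned policy, and a generic relative-threshold trigger. The function name and all parameter values are hypothetical, chosen only for illustration.

```python
def simulate_event_triggered(a=1.0, b=1.0, k=3.0, sigma=0.3,
                             x0=1.0, dt=1e-3, t_end=5.0):
    """Simulate x' = a*x + b*u with an event-triggered controller.

    Assumptions (illustrative, not taken from the thesis):
      - the control is u = -k * x_hat, where x_hat is the last sampled state;
      - x_hat is refreshed only when |x - x_hat| > sigma * |x|;
      - forward Euler integration with step dt.
    With these defaults |b*k|*sigma < |a - b*k|, so the state still decays
    even though the controller uses a stale sample between events.
    """
    steps = int(round(t_end / dt))
    x = x0
    x_hat = x0      # state sample currently held by the controller
    events = 1      # the initial sample counts as the first update
    for _ in range(steps):
        # Triggering condition: resample only when the gap is too large.
        if abs(x - x_hat) > sigma * abs(x):
            x_hat = x
            events += 1
        u = -k * x_hat              # control computed from the held sample
        x += dt * (a * x + b * u)   # forward Euler step of the plant
    return x, events, steps


if __name__ == "__main__":
    x_final, events, steps = simulate_event_triggered()
    print(f"final state: {x_final:.2e}; "
          f"controller updates: {events} of {steps} steps")
```

In this toy setting the controller is refreshed at only a small fraction of the integration steps while the state still converges toward zero, which is the kind of saving in communication and computation that the abstract attributes to the event-triggered design.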
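Keywords	Adaptive dynamic programming; neural networks; differential games
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/14677
Collection	Graduates_Doctoral Dissertations
Author affiliations	1. The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zhang, Qichao. Adaptive Dynamic Programming Methods for Several Classes of Differential Games [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2017.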

Files in this item
File name/size	Document type	Version type	Access type	License
面向几类微分博弈的自适应动态规划方法.p (4868 KB)	Thesis		Restricted access	CC BY-NC-SA