Optimal Control of Discrete Dynamic Systems Based on Adaptive Dynamic Programming (基于自适应动态规划算法的离散动态系统最优控制)
Author: 梁明明 (Liang Mingming)
Date: 2020-08
Pages: 138
Degree type: Doctoral
Chinese Abstract (translated)

The optimal control of dynamic systems, including nonlinear systems and stochastic processes, is an important research topic in the development of control science. Built on the idea of reinforcement learning and combined with approximation tools such as neural networks, the adaptive dynamic programming (ADP) method overcomes the "curse of dimensionality" of classical dynamic programming and has gradually become one of the hot topics in intelligent control. However, the many strict conditions imposed by traditional ADP methods and the heavy computational load they incur pose great challenges in practical applications. To this end, this thesis further studies the optimal control of discrete dynamic systems based on the ADP method, and proposes new proof techniques, stability criteria, and iteration rules to enhance the generality of the algorithms and broaden their range of applications. The main work and contributions of this thesis are reflected in the following three aspects.

1. A policy iteration ADP method with a new iteration rule is proposed to solve the optimal control problem of discrete nonlinear systems. First, an improved iteration rule is proposed so that the ADP algorithm updates the control law only on a pre-specified local set of system states, which effectively reduces the computational burden on the computing device. Second, it is proved that the sequence of value functions generated by the policy iteration ADP algorithm under the new iteration rule eventually converges to the local optimum corresponding to a certain local policy space, and this local policy space can be characterized quantitatively in mathematical terms. Third, it is proved that, under mild conditions, the value function sequence generated by the policy iteration ADP algorithm with the new iteration rule converges to the global optimum. Simulation results show that the proposed method obtains the optimal control law of discrete nonlinear systems while effectively relieving the computational load on the CPU.

2. With the controlled plant extended from nonlinear systems to discrete stochastic processes, the computational load and convergence properties of the policy iteration ADP method in practical operation are studied. First, a new iteration rule is introduced to address the high computational load of the traditional policy iteration ADP method. Mathematical induction is then used to analyze the monotonicity of the value function sequence generated by the algorithm, and the corresponding convergence proof is given. When a suitable sequence of iterative sub-state spaces is chosen, the improved policy iteration ADP algorithm drives the value function sequence to the global optimum. In the implementation of the algorithm, a critic network and an action network are constructed to approximate the iterative value function and control law, respectively. Finally, simulation examples verify the feasibility of the proposed method for solving the optimal control problem and its advantage in terms of computational load.

3. A value iteration ADP method is developed to obtain the optimal stabilizing control policy for discrete stochastic processes. Traditional ADP methods require the iterative control law and the iterative value function to be updated infinitely many times to obtain the theoretically optimal performance index function, and they also require the initial iterative value function to satisfy certain strict conditions for the value iteration sequence to converge; these requirements make the implementation of ADP very difficult. To this end, this thesis proposes, for the first time, a stability criterion to determine whether the control law currently obtained by the value iteration algorithm can stabilize the stochastic process. A new proof technique is also given to establish the convergence property of the algorithm: as long as the initial iterative value function of the value iteration algorithm is a positive semi-definite function, the generated value function sequence converges to the optimal performance index function of the stochastic process. Finally, simulation examples are given to verify the effectiveness of the proposed algorithm.

English Abstract

The optimal control of dynamic systems, including nonlinear systems and stochastic processes, is an important topic in the development of control science. Based on the theories of dynamic programming, neural networks and reinforcement learning, the adaptive dynamic programming (ADP) method has been shown to be a good solution to the "curse of dimensionality" problem. Hence, the ADP approach has become one of the hot topics in the field of intelligent control. However, the strict restrictions and the huge computational burden associated with the traditional ADP method make it unsuitable for practical implementation. In this thesis, the optimal control of dynamic systems using ADP is further investigated. New convergence analysis, stability criteria and an improved iteration method are presented to enhance the generality and expand the application domains of the ADP algorithm. The main contributions of this thesis are summarized in the following three aspects.
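For orientation only, the following is a minimal sketch of the discrete-time optimality principle that ADP approximates. The system map F, utility U, and the undiscounted infinite-horizon cost are generic assumptions used for illustration, not the thesis's exact formulation.

```latex
% Generic discrete-time setting assumed for illustration:
%   x_{k+1} = F(x_k, u_k),  with cost  J(x_0) = \sum_{k=0}^{\infty} U(x_k, u_k).
% Bellman optimality equation that ADP solves approximately:
J^{*}(x_k) = \min_{u_k}\Big\{ U(x_k,u_k) + J^{*}\big(F(x_k,u_k)\big) \Big\}
% One step of the value-iteration recursion, starting from an initial V_0:
V_{i+1}(x_k) = \min_{u_k}\Big\{ U(x_k,u_k) + V_i\big(F(x_k,u_k)\big) \Big\},
\qquad
u_i(x_k) = \arg\min_{u_k}\Big\{ U(x_k,u_k) + V_i\big(F(x_k,u_k)\big) \Big\}
```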

1. Based on an improved policy iteration method, an ADP algorithm is developed to design the optimal control of discrete nonlinear systems. In the proposed policy iteration ADP algorithm, the iterative control law is updated only on a local system state space at each iteration, which significantly reduces the computational burden on the CPU in comparison with the conventional policy iteration algorithm. It is proved that the iterative value functions obtained by the proposed algorithm converge to the optimum within a local policy space, and this local policy space is characterized in detail for the first time. Under some mild constraints, it is also shown that the iterative value function converges to the optimal performance index function of the global policy space. Finally, a simulation example is presented to validate the effectiveness of the developed method.
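A minimal sketch, assuming a finite (tabular) state and action set, of how a policy-iteration loop might restrict the policy-improvement step to a pre-specified local subset of states. The names `step`, `utility`, and `local_states` and the tabular setting are illustrative assumptions, not the algorithm as implemented in the thesis.

```python
def local_policy_iteration(states, actions, step, utility, local_states,
                           n_iters=50, n_eval_sweeps=200):
    """Tabular policy-iteration sketch that improves the control law only on a
    pre-specified local subset of states (illustrative, not the thesis code).
    Assumes step(x, u) returns a state contained in `states` and that the
    initial policy is admissible so the evaluation sweeps settle."""
    V = {x: 0.0 for x in states}              # iterative value function
    policy = {x: actions[0] for x in states}  # assumed initial admissible policy
    for _ in range(n_iters):
        # Policy evaluation: sweep V(x) <- U(x, pi(x)) + V(F(x, pi(x)))
        for _ in range(n_eval_sweeps):
            for x in states:
                u = policy[x]
                V[x] = utility(x, u) + V[step(x, u)]
        # Policy improvement restricted to the local state set: restricting
        # this step is what reduces the per-iteration computational load.
        for x in local_states:
            policy[x] = min(actions, key=lambda u: utility(x, u) + V[step(x, u)])
    return V, policy
```

The only change from standard policy iteration is the improvement loop over `local_states` rather than the full state set, which is where the saving in per-iteration computation comes from.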

2. With the policy iteration ADP algorithm extended to stochastic processes, the computational burden and the convergence characteristics of the proposed method are investigated. By introducing the improved iteration method, the expensive computation required by the traditional approach is avoided. Using mathematical induction, the monotonicity and convergence properties of the generated value functions are analyzed. Once the local state-space sequence is chosen properly, it is proved that the proposed method obtains the global optimum. To facilitate the implementation, a critic network and an action network are constructed to approximate the iterative value function and the control law, respectively. The simulation results show that the policy iteration ADP algorithm solves stochastic optimal control problems successfully, and its superiority in terms of computational load is demonstrated as well.
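A minimal sketch, assuming small fully connected networks in PyTorch, of how a critic network and an action network could approximate the iterative value function and control law. The layer sizes, activation, and training step are generic assumptions rather than the network structures used in the thesis.

```python
import torch.nn as nn

class CriticNet(nn.Module):
    """Critic network: approximates the iterative value function V_i(x)."""
    def __init__(self, state_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

class ActionNet(nn.Module):
    """Action network: approximates the iterative control law u_i(x)."""
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, x):
        return self.net(x)

def fit_critic(critic, optimizer, x_batch, target_batch):
    """One supervised step: regress the critic toward targets computed from
    the right-hand side of the iterative equation (targets assumed given)."""
    loss = nn.functional.mse_loss(critic(x_batch), target_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```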

3. A novel value iteration ADP algorithm is presented to obtain the optimal stable policy for discrete stochastic processes. The traditional ADP method requires the iterative control law and the iterative value function to be updated infinitely many times to reach the optimum, and it also requires the initial iterative value function to satisfy some strict constraints. These requirements make the practical implementation of the ADP method extremely difficult. In the proposed value iteration ADP algorithm, for the first time, a new stability criterion is presented to verify whether the obtained policy stabilizes the stochastic process. By analyzing the convergence properties of the proposed algorithm, it is shown that the iterative value functions converge to the optimum, while the initial value function is permitted to be an arbitrary positive semi-definite function. Finally, two simulation examples are presented to validate the effectiveness of the developed method.
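A minimal sketch, assuming a finite-state stochastic process whose successor states can be sampled, of a value-iteration loop that starts from an arbitrary positive semi-definite initial value function, plus a crude Monte Carlo stability check. The sampling-based expectation and the stability test are stand-ins chosen for illustration; the thesis's stability criterion is not reproduced here.

```python
import numpy as np

def stochastic_value_iteration(states, actions, sample_next, utility,
                               init_value, n_iters=100, n_mc=20):
    """Value-iteration sketch for a finite-state stochastic process.
    init_value may be any positive semi-definite function of the state
    (the relaxed initialization described above); sample_next(x, u) draws
    one successor state.  All names here are illustrative assumptions."""
    V = {x: float(init_value(x)) for x in states}
    policy = {}
    for _ in range(n_iters):
        V_new = {}
        for x in states:
            def q(u):
                # expected cost-to-go estimated by Monte Carlo sampling
                nxt = [sample_next(x, u) for _ in range(n_mc)]
                return utility(x, u) + np.mean([V[y] for y in nxt])
            best_u = min(actions, key=q)
            policy[x] = best_u
            V_new[x] = q(best_u)
        V = V_new
    return V, policy

def looks_stable(policy, sample_next, x0, horizon=200, n_runs=10, tol=1e-2):
    """Crude Monte Carlo stability check (a stand-in, not the thesis's
    criterion): does the current control law drive sampled trajectories
    toward the origin on average?  Assumes sampled states stay in the
    policy's domain."""
    finals = []
    for _ in range(n_runs):
        x = x0
        for _ in range(horizon):
            x = sample_next(x, policy[x])
        finals.append(np.linalg.norm(np.asarray(x, dtype=float)))
    return float(np.mean(finals)) < tol
```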

Keywords: Intelligent control; Neural networks; Adaptive dynamic programming; Optimal control; Discrete stochastic processes; Discrete nonlinear systems
Language: Chinese
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/40674
Collection: Graduates / Doctoral dissertations
Recommended citation (GB/T 7714):
梁明明. 基于自适应动态规划算法的离散动态系统最优控制[D]. 中国科学院自动化研究所. 中国科学院大学, 2020.
Files in this item:
File name / size: Thesis.pdf (8751 KB); Document type: Thesis; Access: Restricted; License: CC BY-NC-SA