CASIA OpenIR > Graduates > Doctoral Dissertations
连续状态系统的近似最优在线强化学习
Alternative Title: Near-Optimal Online Reinforcement Learning Algorithms for Continuous-State Systems
朱圆恒 (Zhu Yuanheng)
2015-05-26
Degree Type: Doctor of Engineering
Abstract (translated from the Chinese): In the field of control, optimal control has long been an important research direction in both theory and engineering practice. It seeks a control policy that maximizes (or minimizes) a performance index. Reinforcement learning (RL) can effectively solve optimal control problems and has therefore attracted wide attention from researchers. In particular, online RL methods can learn the optimal control policy through online learning when the system model is unknown, removing the dependence on a model. However, many problems in the theory and implementation of RL remain unsolved; missing theoretical proofs and flaws in algorithm implementations restrict RL's applications. This thesis therefore takes online RL as its main subject, solves the optimal control problem for continuous-state systems, and also supplements the theory of offline RL. The main chapters contain the following work and contributions:

1. Approximate policy iteration (API) is studied for the undiscounted optimal control of continuous-state systems, and general convergence results are given. It is proved that in the approximate policy evaluation phase, the error of the approximate value function is bounded when the approximator satisfies certain conditions, and that the policy obtained after the policy improvement phase is identical to the original target policy. It then follows that API still converges to the optimal solution under approximation errors. API is also combined with a triangular fuzzy approximator to give a fuzzy API algorithm, which is applied to a class of path-planning problems.

2. An online algorithm based on policy iteration is proposed for continuous-state, continuous-action systems. Polynomials are chosen as basis functions, and linear structures approximate the value function and the control policy. Recursive least squares and gradient descent are used to train the parameters of the value function and the control policy, respectively. Because learning is online, the algorithm uses data collected online and does not depend on a system model. Simulations on linear and nonlinear systems verify the algorithm's online learning ability.

3. For continuous-state systems with unknown models, a near-optimal online RL algorithm is proposed for the first time. Its distinguishing feature is that only a finite running time is needed during online learning to obtain a near-optimal control policy. The algorithm partitions the continuous state space into subsets to store the online data, and defines a new iteration operator to compute the value function and the policy executed online. A complete theoretical analysis proves that the algorithm is probably approximately correct, i.e., the total number of time steps at which a non-optimal control policy is executed during online learning is finite. Simulations on several common systems agree with the theoretical conclusions.

4. The previous algorithm is improved into a data-based near-optimal online RL algorithm. The stored online data are used to redefine the iteration operator that computes the value function and the online policy, and a kd-tree is introduced to store and query the data conveniently. Theoretical analysis again proves that the new algorithm is probably approximately correct, i.e., only a finite running time is needed to learn a near-optimal control policy. The data-based design gives higher data efficiency and a faster learning rate; compared with the previous algorithm, it therefore learns a near-optimal control policy in a shorter time on the same experiments, corresponding to a better learning ability.
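The policy-iteration scheme underlying the API method of contribution 1 can be sketched for a finite MDP (the discrete setting and all variable names here are illustrative assumptions; the thesis works with continuous-state systems and approximate evaluation, not this exact algorithm):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Exact policy iteration on a finite MDP.

    P: (A, S, S) transition probabilities, R: (A, S) rewards.
    Returns the optimal policy and its value function.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[policy, np.arange(S)]          # (S, S) rows under current policy
        r_pi = R[policy, np.arange(S)]          # (S,)  rewards under current policy
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily w.r.t. the action-value function.
        q = R + gamma * P @ v                   # (A, S)
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):  # improvement yields the same policy
            return policy, v                    # => converged to the optimal solution
        policy = new_policy
```

The convergence test here (improvement leaves the policy unchanged) mirrors the property analyzed in contribution 1, where the thesis shows it survives bounded approximation errors in the evaluation step.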
Abstract (English): In the control area, optimal control is important both in academic research and in practical applications. It aims to find a control policy that maximizes (or minimizes) a certain performance index. Reinforcement learning (RL) has attracted plenty of interest from researchers because of its capability of solving optimal control problems. Especially when the system model is unknown, online RL can learn the optimal policy through online learning. However, numerous problems still confront RL researchers: wide application is restricted because some theories have not been established and some algorithms remain flawed. In this thesis, we consider continuous-state systems and use RL to solve their optimal control problems; offline RL is also studied in some chapters. The main contributions can be summarized as follows:

1. Approximate policy iteration (API) is studied to solve the optimal control of continuous-state systems with an undiscounted performance index, and a universal convergence analysis is given. It is proved that if certain conditions are satisfied by the approximators, the errors of the approximate value functions are bounded, and the updated policies after policy improvement remain identical to the original target policies. It is then derived that, despite the approximation error, API still converges to the optimal solution. Besides, we combine API with a triangular fuzzy approximator and propose a fuzzy API algorithm, which is further used to solve a path-planning problem.

2. Based on policy iteration, an online algorithm is proposed to solve the control problem of continuous-state, continuous-action systems. Polynomials are selected as basis functions, and linear parametrizations are used to approximate the value function and the control policy. The approximators of the value function and the control policy are trained using recursive least-squares and gradient descent methods, respectively. Because of its online learning property, the algorithm avoids dependence on a model and instead utilizes online data. Simulations on linear and nonlinear systems verify its online learning capability.

3. For continuous-state systems with unknown models, a near-optimal online RL algorithm is proposed for the first time. It costs only finite running time before learning a near-optimal policy. The continuous state space is partitioned into different subsets, which are used to store the online data, and a new iteration operator is defined to compute the value function and the policy executed online. A complete theoretical analysis proves that the algorithm is probably approximately correct, i.e., the total number of time steps at which non-optimal policies are executed during online learning is finite. Simulations on several common systems agree with the theoretical results.

4. The previous algorithm is further improved into a data-based near-optimal online RL algorithm, in which the stored online data are used to redefine the iteration operator that computes the value function and the online policy. A kd-tree is introduced to store and query the data efficiently. Theoretical analysis again proves that the new algorithm is probably approximately correct, i.e., only finite running time is required to learn a near-optimal policy. The data-based design yields higher data efficiency and a faster learning rate, so, compared with the previous algorithm, it learns a near-optimal policy in a shorter time on the same experiments.
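Contribution 2 trains a linearly parameterized value function with a polynomial basis via recursive least squares. A generic sketch of that ingredient (class and function names are my own; the exact basis and update in the thesis may differ):

```python
import numpy as np

class RLSCritic:
    """Recursive least-squares estimator for a linearly parameterized
    function w^T phi(x); a standard RLS recursion, not the thesis's exact update."""
    def __init__(self, n_features, delta=100.0):
        self.w = np.zeros(n_features)
        self.P = delta * np.eye(n_features)   # large initial "covariance" (weak prior)

    def update(self, phi, target):
        # Standard RLS recursion: gain vector, prediction error, then
        # rank-one updates of the parameters and the covariance matrix.
        Pphi = self.P @ phi
        k = Pphi / (1.0 + phi @ Pphi)         # gain vector
        e = target - self.w @ phi             # prediction error on this sample
        self.w += k * e
        self.P -= np.outer(k, Pphi)

def poly_basis(x):
    # Hypothetical second-order polynomial basis for a scalar state.
    return np.array([1.0, x, x * x])
```

Each `update` costs O(n^2) in the number of features, which is what makes the recursion suitable for sample-by-sample online learning; the thesis pairs such a critic update with a gradient-descent update of the policy parameters.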
Keywords: Reinforcement Learning; Optimal Control; Approximate Policy Iteration; Probably Approximately Correct; Continuous-State System; Convergence; Online Learning; Kd-tree
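The kd-tree named in the keywords (used in contribution 4 to store and query online data) supports fast nearest-neighbor lookup. A minimal generic sketch of the data structure, not the thesis's implementation:

```python
class KDNode:
    __slots__ = ("point", "left", "right", "axis")
    def __init__(self, point, left, right, axis):
        self.point, self.left, self.right, self.axis = point, left, right, axis

def build_kdtree(points, depth=0):
    """Build a kd-tree by median split, cycling the split axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid],
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1),
                  axis)

def nearest(node, query, best=None):
    """Recursive nearest-neighbor search; best = (squared distance, point)."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node.point, query))
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if diff * diff < best[0]:   # far subtree may still hold a closer point
        best = nearest(far, query, best)
    return best
```

For a stored data set of size n, a balanced tree answers such queries in roughly O(log n) on average instead of the O(n) linear scan, which is what gives the data-based algorithm its faster lookup of stored online samples.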
Language: Chinese
Document Type: Doctoral Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/6688
Collection: Graduates / Doctoral Dissertations
Recommended Citation (GB/T 7714):
朱圆恒. 连续状态系统的近似最优在线强化学习[D]. 中国科学院自动化研究所. 中国科学院大学,2015.
Files in This Item:
CASIA_20121801462802 (2679 KB), access: restricted, license: CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.