contributions of the dissertation are as follows:

Firstly, a weighted recurrent least-squares multi-step Q-learning algorithm is proposed for discrete state and action spaces. The computational efficiency of Q-learning is improved by means of online, incremental learning, and discrete martingale theory is applied to analyze the convergence of the proposed algorithm.

Secondly, an adaptive Actor-Critic reinforcement learning algorithm is designed for continuous state spaces. A normalized RBF (NRBF) neural network approximates the Critic's value function and the Actor's policy function simultaneously within an Actor-Critic architecture. Because the Actor and the Critic share the input and hidden layers of the NRBF network, the algorithm is very simple, and it constructs the state space online and adaptively.

Thirdly, a weighted Q-learning algorithm suitable for control systems with continuous state and action spaces is put forward. A standard Q-function implemented by an RBF network first approximates the utility values of a set of discrete actions; a weighting rule then combines these utility values to yield the continuous action that actually acts upon the system. In this way the applicability of Q-learning is extended to problems with continuous state and action spaces.

Fourthly, a fuzzy reinforcement learning architecture based on a four-layer fuzzy RBF neural network is proposed, which fully exploits the knowledge-representation capability of fuzzy inference systems and the self-learning capability of RBF networks. On this architecture, a fuzzy Actor-Critic learning method and a fuzzy Q-learning method are designed. Both fuzzy reinforcement learning methods offer good generalization ability, a compact network structure, self-adaptation, and self-learning.
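The weighting step of the third contribution, which turns the utility values of discrete actions into one continuous action, can be sketched as follows. This is a minimal illustration assuming a softmax-style weighting rule with a temperature parameter `beta`; the function name, the temperature, and the toy utility values are assumptions for illustration, not details taken from the dissertation:

```python
import numpy as np

def continuous_action(q_values, actions, beta=2.0):
    """Combine the utility values of discrete actions into a single
    continuous action via a softmax-style weighting rule.
    beta is an assumed temperature parameter (not from the source)."""
    w = np.exp(beta * (q_values - np.max(q_values)))  # shift for numerical stability
    w /= w.sum()                                      # normalize to a weight vector
    return float(np.dot(w, actions))                  # weighted average of actions

# Toy example: three discrete actions with utility values that a
# hypothetical RBF Q-function approximator might produce.
actions = np.array([-1.0, 0.0, 1.0])
q_values = np.array([0.2, 0.5, 0.9])
u = continuous_action(q_values, actions)
# u lies between the discrete actions, pulled toward the one with
# the highest utility value.
```

Higher `beta` concentrates the weights on the best discrete action (approaching greedy selection), while lower `beta` blends the actions more evenly, so the rule interpolates smoothly over the discrete action set.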
Fifthly, a nonlinear multi-step predictive controller based on a modified Elman recurrent neural network is designed. Because the BP algorithm has the intrinsic defect that it cannot update network weights incrementally, a new hybrid learning algorithm combining the temporal-difference method with BP is put forward to train the Elman prediction model. To simplify computation, a single-value predictive control algorithm is used to optimize the control input of the next step. The predictive controller has a simple structure, a small computational burden, and fast speed, and it adapts to changes in the plant parameters.

Finally, a summary of the dissertation is given and some future work is addressed.
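The incremental character of the fifth contribution's hybrid training, in contrast to batch BP, can be illustrated with a minimal temporal-difference-style weight update for a linear one-step predictor. This is a toy stand-in for the Elman prediction model; the plant equation, feature map, and learning rate below are assumptions for illustration, not values from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy plant: y[t+1] = 0.8*y[t] + 0.1*u[t].
# The predictor's weights w are updated incrementally from a
# one-step prediction error, one sample at a time (contrast with
# batch BP, which would require the whole data set per update).
w = np.zeros(2)
alpha = 0.5  # assumed learning rate
y = 0.0
for t in range(5000):
    u = rng.uniform(-1.0, 1.0)
    x = np.array([y, u])          # features: current output and input
    y_next = 0.8 * y + 0.1 * u    # true plant response
    error = y_next - w @ x        # one-step prediction error
    w += alpha * error * x        # incremental gradient-style update
    y = y_next

# After training, w should be close to the true plant coefficients.
```

Each update touches only the current sample, so the predictor can track a plant whose parameters drift over time, which is the property the hybrid TD/BP scheme exploits for online adaptation.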