contributions of the dissertation are as follows:

Firstly, a weighted recurrent least-squares multi-step Q-learning algorithm is proposed for discrete state and action spaces. The computational efficiency of Q-learning is improved by means of online, incremental learning, and discrete martingale theory is applied to analyze the convergence of the proposed algorithm.

Secondly, an adaptive Actor-Critic reinforcement learning algorithm is designed for continuous state spaces. A normalized RBF (NRBF) neural network approximates the Critic's value function and the Actor's policy function simultaneously within an Actor-Critic architecture. Because the Actor and the Critic share the input and hidden layers of the NRBF network, the algorithm is very simple, and it constructs the state space online and adaptively.

Thirdly, a weighted Q-learning algorithm suitable for control systems with continuous state and action spaces is put forward. A standard Q-function implemented by an RBF network first approximates the utility values of a set of discrete actions; a weighting rule then combines these utility values to yield the continuous action that actually acts upon the system. In this way the applicability of Q-learning is extended to problems with continuous state and action spaces.

Fourthly, a fuzzy reinforcement learning architecture based on a four-layer fuzzy RBF neural network is proposed, which fully exploits the knowledge-representation capability of fuzzy inference systems and the self-learning capability of RBF networks. On this architecture, a fuzzy Actor-Critic learning method and a fuzzy Q-learning method are designed. Both fuzzy reinforcement learning methods offer good generalization ability, a compact network structure, self-adaptation, and self-learning.
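The weighting step of the third contribution, which turns the utility values of discrete actions into one continuous action, can be sketched as follows. This is a minimal illustration assuming a softmax-style weighting rule with a temperature parameter `beta`; the function name, the temperature, and the toy utility values are assumptions for illustration, not details taken from the dissertation:

```python
import numpy as np

def continuous_action(q_values, actions, beta=2.0):
    """Combine the utility values of discrete actions into a single
    continuous action via a softmax-style weighting rule.
    beta is an assumed temperature parameter (not from the source)."""
    w = np.exp(beta * (q_values - np.max(q_values)))  # shift for numerical stability
    w /= w.sum()                                      # normalize to a weight vector
    return float(np.dot(w, actions))                  # weighted average of actions

# Toy example: three discrete actions with utility values that a
# hypothetical RBF Q-function approximator might produce.
actions = np.array([-1.0, 0.0, 1.0])
q_values = np.array([0.2, 0.5, 0.9])
u = continuous_action(q_values, actions)
# u lies between the discrete actions, pulled toward the one with
# the highest utility value.
```

Higher `beta` concentrates the weights on the best discrete action (approaching greedy selection), while lower `beta` blends the actions more evenly, so the rule interpolates smoothly over the discrete action set.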
Fifthly, a nonlinear multi-step predictive controller based on a modified Elman recurrent neural network is designed. Because the BP algorithm has the intrinsic defect that it cannot update network weights incrementally, a new hybrid learning algorithm combining the temporal-difference method with BP is put forward to train the Elman prediction model. To simplify computation, a single-value predictive control algorithm is used to optimize the control input of the next step. The predictive controller has a simple structure, a small computational burden, and fast speed, and it adapts to changes in the plant parameters.

Finally, a summary of the dissertation is given and some future work is addressed.
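The incremental character of the fifth contribution's hybrid training, in contrast to batch BP, can be illustrated with a minimal temporal-difference-style weight update for a linear one-step predictor. This is a toy stand-in for the Elman prediction model; the plant equation, feature map, and learning rate below are assumptions for illustration, not values from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy plant: y[t+1] = 0.8*y[t] + 0.1*u[t].
# The predictor's weights w are updated incrementally from a
# one-step prediction error, one sample at a time (contrast with
# batch BP, which would require the whole data set per update).
w = np.zeros(2)
alpha = 0.5  # assumed learning rate
y = 0.0
for t in range(5000):
    u = rng.uniform(-1.0, 1.0)
    x = np.array([y, u])          # features: current output and input
    y_next = 0.8 * y + 0.1 * u    # true plant response
    error = y_next - w @ x        # one-step prediction error
    w += alpha * error * x        # incremental gradient-style update
    y = y_next

# After training, w should be close to the true plant coefficients.
```

Each update touches only the current sample, so the predictor can track a plant whose parameters drift over time, which is the property the hybrid TD/BP scheme exploits for online adaptation.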