Gaussian Process Based Reinforcement Learning and Intelligent Cruise Control of Vehicles
夏中谱
2016-05
Degree type: Doctor of Engineering
Abstract (Chinese): For systems whose models are unknown and whose operating environments are uncertain, such as vehicle adaptive cruise control systems, traditional model-based or expert-experience-based control methods can hardly derive an optimal control policy. Data-driven learning methods, especially reinforcement learning, which continuously learns and improves the control policy by interacting with the system, are regarded as an effective way to solve this problem and have become a focus of current research. In practical applications, however, system states and actions are usually continuous, which poses challenges to existing reinforcement learning control methods and theory.

This thesis uses Gaussian process regression to address the application of reinforcement learning to systems with continuous states and actions. It first proposes a model-free reinforcement learning control method for affine nonlinear systems; it then extends the method to online reinforcement learning, achieving efficient exploration of the state space and fast evaluation of control policies; finally, it applies the proposed methods to intelligent cruise control of vehicles, testing and analyzing controller performance on a hardware-in-the-loop vehicle simulation platform, thereby providing reliable methods and theoretical support for driver-assistance systems. The work and contributions of the thesis are as follows:
 
1. For affine nonlinear systems with continuous states and actions, a model-free optimal control method is proposed. Based on state-transition data of the controlled plant, a given policy is evaluated to obtain its action value function, from which the greedy policy is computed; this process iterates until the optimal control policy is obtained. Theoretical analysis proves the convergence of the action value function during policy evaluation, the stability of the control policy after each improvement, and the optimality of the learning result. Finally, Gaussian process regression is used to build the critic network and the actor network, which are trained iteratively on the system's state-transition data until their parameters converge. The method is applied to the control of two different nonlinear systems, and the experimental results are consistent with the theoretical analysis.
 
2. From a probabilistic and statistical perspective, an online reinforcement learning method based on Bayesian inference is proposed, which effectively solves two difficult problems in reinforcement learning: state-space exploration and policy evaluation. The action value function is modeled with a Gaussian process, with rewards as the observations, so that control policies can be evaluated quickly by Bayesian inference. Prior knowledge of the system is then incorporated into the Gaussian process and combined with ϵ-greedy action selection to explore the system's state space effectively. The result is an online reinforcement learning control method based on Bayesian inference, whose effectiveness is verified experimentally.
 
3. Using the models, software, and hardware of the dSPACE real-time simulation system, a driver-and-hardware-in-the-loop vehicle simulation test platform is built, supporting data collection and testing in the early development of driver-assistance systems. Gaussian process regression is used to learn a driver's car-following habits, which are combined with a linear quadratic control algorithm to construct an adaptive cruise controller that matches the driver's habits. The control algorithm is implemented on a 32-bit Freescale microprocessor, and a virtual traffic environment is built on the vehicle simulation platform to test and analyze the controller's effectiveness.
 
4. Based on a driver car-following model, a vision model, and a safe-distance model, a performance index integrating comfort and safety is designed for evaluating intelligent cruise control policies. State-transition data are collected from the host vehicle's speed and acceleration space and augmented into the state-action space of the car-following process, yielding independently distributed state-transition data. From these data and the performance index, the previously proposed model-free optimal control method is used to learn the optimal control policy. The learned control policy is compared with an LQR controller and a PID controller, simulated and tested in different driving scenarios, verifying the effectiveness of the model-free optimal control method for intelligent cruise control.
Abstract (English):
It is always difficult to implement optimal control for a system whose model is unknown and whose operating environment is uncertain, such as adaptive cruise control of vehicles, using a traditional model-based or expert-based control approach. Data-based methods, especially reinforcement learning control, which can learn and improve the control policy by interacting with the underlying system, are viewed as a feasible solution. This has made them a major research trend in the control area, yet applying them to the optimal control of systems with continuous states and actions remains a challenge.

In this thesis, we first develop a model-free optimal control approach for affine nonlinear systems with continuous states and actions, together with convergence and optimality analysis. The approach is then extended to an online form by computing the action value function from the perspective of Bayesian inference, which makes it possible to evaluate the control policy quickly and to explore the system efficiently. Finally, the proposed approach is employed to develop an intelligent cruise controller that provides safety and comfort and matches drivers' habits. A driver-and-hardware-in-the-loop vehicle simulator is built to test and analyze the controller's performance. The main contributions are as follows.

We propose a novel approach that derives an optimal control policy for an affine nonlinear system with continuous states and actions, without access to any mathematical model, under the reinforcement learning framework. It evaluates the performance of the control policy from collected data of the underlying system and then selects the greedy action for each state; these two phases alternate until the optimal control policy is reached. Theoretical analysis shows that the greedy control action for each state exists and is unique, that the control policy after each iteration is admissible, and that the optimal control policy is ultimately achieved. Two Gaussian processes are employed to implement the approach, and experimental results validate its effectiveness.
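As a rough illustration of the evaluate-and-improve loop described above, the sketch below runs a batch GP-based approximation of it in Python. It is a simplified stand-in rather than the thesis's implementation: scikit-learn's GaussianProcessRegressor serves as both critic and actor, a coarse action grid replaces the closed-form greedy action that the affine structure admits, and all names and parameters (gp_policy_iteration, gamma, the grid range) are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_policy_iteration(transitions, gamma=0.95, iters=20):
    """transitions: iterable of (s, a, r, s2) with scalar state/action."""
    S  = np.array([t[0] for t in transitions]).reshape(-1, 1)
    A  = np.array([t[1] for t in transitions]).reshape(-1, 1)
    R  = np.array([t[2] for t in transitions])
    S2 = np.array([t[3] for t in transitions]).reshape(-1, 1)
    X  = np.hstack([S, A])                      # critic inputs (s, a)
    action_grid = np.linspace(-1.0, 1.0, 21)    # stand-in for the closed form
    critic = GaussianProcessRegressor(kernel=RBF())
    targets = R.copy()                          # initial Q estimate
    for _ in range(iters):
        critic.fit(X, targets)                  # policy evaluation: Q(s, a)
        # Greedy improvement: best grid action at every next state.
        q_next = np.column_stack([
            critic.predict(np.hstack([S2, np.full_like(S2, a)]))
            for a in action_grid])
        targets = R + gamma * q_next.max(axis=1)
    # Actor: regress the greedy action on the state.
    greedy_a = action_grid[q_next.argmax(axis=1)]
    actor = GaussianProcessRegressor(kernel=RBF()).fit(S, greedy_a)
    return critic, actor
```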

We propose an online reinforcement learning approach that addresses policy evaluation and system exploration from the perspective of Bayesian inference. It models the action value function as a latent variable and the reward as the observed variable, so the action value function can be updated sequentially from observations by Bayesian inference. This style of modeling also allows an optimistic value to be assigned to the action value function, yielding an efficient exploration strategy. The resulting Bayesian-SARSA algorithm is tested on several benchmark problems, and empirical results show its effectiveness.
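To make the latent-Q/observed-reward modeling concrete, here is a naive batch sketch in the spirit of GP-SARSA: the action value function is the latent Gaussian process, and each reward is observed through a temporal-difference relation, so a posterior over Q follows from Bayes' rule. The kernel, noise level, optimistic prior mean, and class name are assumptions for illustration; the thesis's sequential update is replaced here by an O(n^3) batch solve.

```python
import numpy as np

def kern(x, y, ell=1.0):
    """Squared-exponential kernel on stacked (state, action) vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * ell ** 2))

class BayesianCritic:
    """Q is the latent GP; each reward is observed as
    r_t = Q(x_t) - gamma * Q(x_{t+1}) + noise, so Bayes' rule yields a
    posterior over Q given the rewards seen so far."""
    def __init__(self, gamma=0.95, noise=0.1, optimistic_mean=10.0):
        self.gamma, self.noise = gamma, noise
        self.mu0 = optimistic_mean     # optimistic prior mean -> exploration
        self.xs, self.rs = [], []      # visited (s, a) points and rewards

    def start(self, x0):
        self.xs, self.rs = [x0], []

    def step(self, r, x_next):
        """Record the reward for the last point and the next (s, a) pair."""
        self.rs.append(r)
        self.xs.append(x_next)

    def q_mean(self, x_star):
        """Posterior mean of Q at x_star (requires at least one step)."""
        n, m = len(self.rs), len(self.xs)       # m = n + 1
        K = np.array([[kern(a, b) for b in self.xs] for a in self.xs])
        H = np.zeros((n, m))                    # TD structure of observations
        for i in range(n):
            H[i, i], H[i, i + 1] = 1.0, -self.gamma
        G = H @ K @ H.T + self.noise ** 2 * np.eye(n)
        k_star = np.array([kern(x_star, b) for b in self.xs])
        resid = np.array(self.rs) - H @ np.full(m, self.mu0)
        return self.mu0 + k_star @ H.T @ np.linalg.solve(G, resid)
```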

A driver-and-hardware-in-the-loop vehicle simulation platform is developed with the automotive simulation models, software, and hardware provided by dSPACE. The driver-in-the-loop simulator collects drivers' data during driving, while the hardware-in-the-loop simulator tests a vehicle electronic control unit, such as an adaptive cruise controller, during its prototype phase. An intelligent cruise controller that adapts to different drivers' characteristics is developed by regressing each driver's behavior with Gaussian processes. Finally, the controller is implemented on a 32-bit Freescale MCU and tested on the simulator in a virtual traffic scene.
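A hedged sketch of the two ingredients mentioned above, assuming a simple two-state car-following model: Gaussian process regression of a driver's habitual following gap from logged data, feeding a discrete-time LQR that tracks that gap. The log file name, state-space matrices, and weights are illustrative placeholders, not the thesis's actual models.

```python
import numpy as np
from scipy.linalg import solve_discrete_are
from sklearn.gaussian_process import GaussianProcessRegressor

# Learn the driver's habitual following gap as a function of host speed.
# "driver_log.csv" and its column layout (speed [m/s], gap [m]) are
# hypothetical placeholders for data logged on the simulator.
logs = np.loadtxt("driver_log.csv", delimiter=",")
gap_model = GaussianProcessRegressor().fit(logs[:, :1], logs[:, 1])

# Assumed car-following model: x = [gap error, relative speed],
# u = host acceleration command, sampled at dt = 0.1 s.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [-dt]])
Q = np.diag([1.0, 0.5])        # illustrative tracking weights
Rw = np.array([[0.1]])         # illustrative control-effort weight
P = solve_discrete_are(A, B, Q, Rw)
K = np.linalg.solve(Rw + B.T @ P @ B, B.T @ P @ A)   # LQR feedback gain

def acc_command(gap, rel_speed, host_speed):
    """Acceleration command that tracks the driver's habitual gap."""
    desired_gap = gap_model.predict([[host_speed]])[0]
    x = np.array([gap - desired_gap, rel_speed])
    return float(-(K @ x)[0])
```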

A performance index for an intelligent cruise controller is set up by combining a car-following model, a driver vision model, and a driving safety model, in order to evaluate the controller comprehensively. The model-free optimal control method proposed earlier is then employed to develop an intelligent cruise controller. It learns from data collected in the state domain of the host vehicle (acceleration and velocity) and expanded to the car-following state, which includes the host's state and the clearance and relative velocity between the host and the preceding vehicle. An optimal control policy is obtained once its performance index stops improving during the learning phase. Its effectiveness is verified by comparison with an LQR controller and a PID controller on the hardware-in-the-loop simulator.
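For concreteness, one plausible shape for such a combined index is sketched below: a stage reward that penalizes gap-tracking error (car-following term), relative speed, and acceleration (comfort term), with a hard penalty below a safe distance (safety term). The constant-time-headway safe-distance formula and all weights are illustrative assumptions, not the thesis's actual index.

```python
def safe_distance(v_host, t_headway=1.5, d0=5.0):
    """Constant time-headway safe-following distance (assumed form)."""
    return d0 + t_headway * v_host

def stage_reward(gap, rel_speed, accel, v_host,
                 w_gap=1.0, w_rel=0.5, w_acc=0.2, penalty=100.0):
    """Negative quadratic cost: gap tracking (habit), relative speed,
    and acceleration (comfort), plus a hard safety penalty."""
    err = gap - safe_distance(v_host)
    cost = w_gap * err ** 2 + w_rel * rel_speed ** 2 + w_acc * accel ** 2
    if gap < safe_distance(v_host):
        cost += penalty                 # safety term dominates when violated
    return -cost
```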
Keywords: reinforcement learning control; Gaussian processes; continuous-state systems; model-free control; intelligent cruise control
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/11435
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
夏中谱. 基于高斯过程的强化学习及汽车智能巡航控制[D]. 北京: 中国科学院研究生院, 2016.
Files in this item:
File name/size: 博士论文终稿_夏中谱.pdf (18177 KB) · Document type: Thesis · Access: Restricted · License: CC BY-NC-SA