Research on Autonomous Driving Policy Integrating Expert Knowledge and Reinforcement Learning (融合专家知识与强化学习的自动驾驶策略研究)


	Research on Autonomous Driving Policy Integrating Expert Knowledge and Reinforcement Learning
	王宇霄
	2024-05-16
Pages	60
Degree type	Master's
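Abstract (translated from Chinese)	Autonomous driving technology is expected to relieve traffic congestion, improve driving safety, and improve the travel experience. However, building a fully intelligent, autonomous driving system that can be deployed at scale without human supervision remains difficult, and decision-making and planning are the key challenges. The main existing solutions are methods based on rules and finite state machines, methods based on deep imitation learning, and methods based on deep reinforcement learning; lacking either autonomous exploration or guidance from expert prior knowledge, each has its own shortcomings. To address this problem, this thesis extracts and represents expert driving knowledge with different methods in different scenarios and uses it to assist reinforcement learning and improve performance. First, in the highway driving scenario, safety constraint rules are written from expert knowledge to improve the safety of reinforcement learning decisions. Second, in the general road driving scenario, a trajectory-prediction-guided reinforcement learning training algorithm is designed, which uses driving demonstration data to guide the agent to explore and learn quickly and effectively. Third, the deep network design is improved based on heterogeneous graph theory and the self-attention mechanism, and imitating humans improves the planning model's ability to handle complex traffic environments. Experiments show that the proposed methods effectively improve the performance of autonomous driving models. The main research content of this thesis is as follows:
(1) In the highway driving scenario, an autonomous driving decision method based on reinforcement learning with safety constraints is designed. For the decision problems of speed control and lane changing, a value-based safety constraint method is first proposed, which writes safety constraint rules from expert knowledge and modifies the action selection method used during reinforcement learning; the training algorithm is then improved by adding information containing virtual rewards to the training data. As a result, expert knowledge constrains the safety of the reinforcement learning decision output without affecting the design of the action space or the convergence of model training. Experimental results show that, compared with the rule-based safety constraint method and the method without safety constraints, the decision model trained with this method achieves higher average reward, better safety, relatively fast driving speed, and better comfort.
(2) In the general road driving scenario, a training algorithm for autonomous driving planning models is designed that guides reinforcement learning by imitating expert trajectories. Deep imitation learning lets the planning model learn human expert driving knowledge quickly and effectively, but it demands high-quality training data and suffers from the long-tail effect; deep reinforcement learning lets the model explore unknown states autonomously and improve, but without guidance from prior knowledge its training efficiency is low and its rewards are hard to design. To address these problems, a predictive reinforcement learning training algorithm is first proposed that, based on expert driving demonstration data, trains the policy network jointly with a trajectory prediction loss and a reinforcement learning policy loss; to implement this algorithm, the output module of the policy network is then split into a planning part and a prediction part, and the complete training algorithm for this model is given. Comparison experiments and visualizations show that the algorithm combines the advantages of imitation learning and reinforcement learning, with high training efficiency, high safety, and good generalization; multi-step trajectory prediction is also better suited than behavior cloning to extracting expert knowledge.
(3) A deep expert knowledge extraction method is designed based on heterogeneous graph theory. To further improve the stability and performance of expert-guided reinforcement learning in large-scale training, vectorized perception data are first used as input, and different local subgraph networks are designed to aggregate local features for different categories of traffic elements. Second, a hierarchical heterogeneous graph attention planning model is designed, which performs global self-attention over the heterogeneous features of the nodes and edges to obtain a global feature vector for trajectory planning. On this basis, the training algorithm is improved using trajectory prediction and off-policy reinforcement learning, making it better suited to large-scale training with expert supervision. Finally, comparison and ablation experiments show that this method has excellent expert imitation ability and safety performance, and that both the graph-based network model and the improved predictive training algorithm contribute positively to performance. Ablation experiments also verify that the self-attention computation used in this method is the most reasonable one.
In summary, this thesis addresses behavior decision-making and trajectory planning for autonomous driving and studies how to use expert prior knowledge to improve the performance of reinforcement learning algorithms and to remedy the weaknesses of each algorithm. It proposes one method for applying safety constraints to reinforcement learning decision results, two training algorithms that guide the reinforcement learning training process, and one method for designing the reinforcement learning feature network based on graph theory. These methods effectively improve the operational safety of autonomous vehicles and the efficiency of model training, and they explore, from different angles, better integration of human and machine intelligence, giving them important theoretical significance and practical value.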
English abstract	Autonomous driving is expected to alleviate traffic congestion, improve driving safety, and improve people's travel experience. However, it remains difficult to build a fully intelligent, autonomous driving system that can be deployed at scale without human supervision, and decision-making and planning are the key challenges. The main current solutions are rule-based finite state machine methods, deep imitation learning methods, and deep reinforcement learning methods. Because each lacks either exploration ability or guidance from expert prior knowledge, each has its own disadvantages. This thesis focuses on this problem and uses different methods to extract and utilize expert driving knowledge in different scenarios, which assists the reinforcement learning process and improves performance. First, in the highway driving scenario, we write constraint rules based on expert knowledge to improve the safety of reinforcement learning decisions. Second, in the general road scenario, we design a trajectory-prediction-guided reinforcement learning algorithm that uses driving demonstration data to guide the agent to explore and learn efficiently. Third, we improve the deep network design based on heterogeneous graph theory and the self-attention mechanism, which improves the planning model's ability to handle complex traffic environments by mimicking humans. Experiments show that the methods proposed in this thesis effectively improve the performance of autonomous driving models. The main research content of this thesis is as follows:
(1) This thesis designs an autonomous driving decision method for highway scenarios based on reinforcement learning with safety constraints. For decision-making on speed control and lane changes, a value-based safety constraint method is first proposed, which uses expert knowledge to write rules and modifies the action selection method in reinforcement learning. The training algorithm is then improved by adding experience with virtual rewards to the training data. As a result, expert knowledge imposes safety constraints on the decision output of reinforcement learning without affecting the design of the action space or the convergence of training. Experimental results show that, compared with methods that use rule-based safety constraints and methods without safety constraints, the decision model trained by this method achieves higher average reward, better safety, faster driving speed, and better comfort.
(2) This thesis designs a training method for autonomous driving planning models in general road scenarios that guides reinforcement learning by imitating expert trajectories. Deep imitation learning enables the agent to learn expert knowledge efficiently, but it requires high-quality training data and suffers from the long-tail effect. Deep reinforcement learning enables the model to explore unknown states and improve itself, but its training efficiency is low because it lacks guidance from prior knowledge, and its rewards are difficult to design. To address these problems, a predictive reinforcement learning algorithm is first proposed. Based on expert driving demonstration data, it uses a trajectory prediction loss and a reinforcement learning policy loss to jointly train the policy network. To implement this algorithm, the output module of the policy network is then divided into a planning part and a prediction part, and a complete training algorithm for this model is given. Comparison experiments and visualization results show that this algorithm combines the advantages of imitation learning and reinforcement learning, with high training efficiency, good safety, and good generalization ability. The results also indicate that multi-step trajectory prediction is better suited to extracting expert knowledge than behavior cloning.
(3) An extraction method for deep expert knowledge is designed based on heterogeneous graph theory. To further improve the stability and performance of expert-guided reinforcement learning in large-scale training, vectorized perception data are first used as input, and different local subgraph networks are designed to aggregate local features of different types of traffic elements. Second, a hierarchical heterogeneous graph attention planning model is designed, which performs global self-attention over the heterogeneous features of all nodes and edges to obtain the global feature for trajectory planning. Third, the training algorithm is improved based on trajectory prediction and off-policy reinforcement learning, making it more suitable for large-scale training with expert supervision. Finally, the experimental results show that this method has excellent imitation capability and safety performance, and that both the graph network model and the improved predictive training method contribute positively to performance. The ablation experiments also verify that the self-attention calculation method used here is the most reasonable one.
To summarize, this thesis focuses on behavior decision and trajectory planning problems in autonomous driving and studies how to use expert prior knowledge to improve the performance of reinforcement learning and overcome each algorithm's shortcomings. It puts forward one method for adding safety constraints to the decision results of reinforcement learning, two training algorithms that guide the reinforcement learning training process, and one graph-theory-based design method for the reinforcement learning feature network. The proposed methods effectively improve the training efficiency and driving safety of intelligent vehicles and attempt to better integrate human intelligence and machine intelligence from different perspectives, which has important theoretical significance and application value.
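As an illustration of contribution (1) in the abstract, the following is a minimal Python sketch of one way rule-based safety constraints could modify value-based action selection and generate virtual-reward experience. It is not the thesis's implementation: the action set `ACTIONS`, the gap threshold `MIN_GAP`, the state fields, the penalty value, and all function names are hypothetical and chosen only for this example.

```python
import random

# Hypothetical discrete highway action set (an assumption, not from the thesis).
ACTIONS = ["keep", "accelerate", "decelerate", "lane_left", "lane_right"]
MIN_GAP = 10.0  # hypothetical minimum safe gap in metres


def rule_is_safe(state, action):
    """Hand-written expert rule: reject moves into a gap smaller than MIN_GAP."""
    name = ACTIONS[action]
    if name == "accelerate":
        return state["front_gap"] >= MIN_GAP
    if name == "lane_left":
        return state["left_gap"] >= MIN_GAP
    if name == "lane_right":
        return state["right_gap"] >= MIN_GAP
    return True  # "keep" and "decelerate" are always allowed in this toy rule


def select_action(q_values, state, is_safe, epsilon=0.0, rng=random):
    """Epsilon-greedy selection restricted to the rule-approved actions."""
    allowed = [a for a in range(len(q_values)) if is_safe(state, a)]
    if not allowed:  # fall back to the full action set if every action is rejected
        allowed = list(range(len(q_values)))
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q_values[a])


def virtual_transitions(state, n_actions, is_safe, penalty=-1.0):
    """Build one terminal transition with a virtual reward per rejected action.

    These transitions are never executed in the environment; they can be added
    to the replay data so the value function also learns to score unsafe
    actions low.
    """
    return [
        (state, a, penalty, state, True)
        for a in range(n_actions)
        if not is_safe(state, a)
    ]


if __name__ == "__main__":
    q = [0.1, 0.2, 0.0, 0.9, 0.3]
    s = {"front_gap": 50.0, "left_gap": 5.0, "right_gap": 30.0}
    print(ACTIONS[select_action(q, s, rule_is_safe)])
    print(virtual_transitions(s, len(q), rule_is_safe))
```

In this sketch the action space itself is unchanged: the rules only filter which actions may be chosen at decision time, and the rejected actions enter training through the virtual-reward transitions.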
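Keywords	autonomous driving, deep imitation learning, deep reinforcement learning, graph neural networks
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/56623
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	王宇霄. 融合专家知识与强化学习的自动驾驶策略研究[D], 2024.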

Files in this item
File name/size	Document type	Version type	Access type	License
融合专家知识与强化学习的自动驾驶策略研究 (2599 KB)	Thesis		Restricted access	CC BY-NC-SA