Research on Sequence Modeling for Decision-making Based on Pre-trained Models
林润基 (Lin Runji)
2024-06
Pages: 84
Degree type: Master's
Abstract (Chinese)

As artificial intelligence systems gradually evolve from task-specific, domain-oriented AI towards general AI, pre-trained large-scale Transformer models have become the core technology driving the unification of paradigms in natural language processing and computer vision; research on these techniques in the decision-making domain, however, is still at an exploratory stage. Existing reinforcement learning algorithms face difficulties with policy optimization and task identification in offline meta-learning settings, and in partially observable environments they face difficulty maintaining beliefs and a large exploration space caused by long sequences of historical observations. In response to these trends and challenges, this thesis proposes a series of decision sequence modeling approaches based on pre-trained models to improve the sample efficiency and generalization performance of reinforcement learning algorithms.

This thesis focuses on addressing the key challenges of various decision-making task scenarios through sequence modeling algorithms combined with pre-training and fine-tuning techniques. Specifically, decision sequence modeling algorithms cast decision-making problems as sequences suitable for processing by Transformer architectures, while the pre-training and fine-tuning technique first trains an initial model on large-scale data and then adapts that model to specific application tasks, using the idea of transfer learning to quickly build and optimize models for different tasks. The research covers two aspects: designing sequence templates adapted to each task, and driving the decision pre-training process with diverse data, with the goal of improving performance in settings such as meta-learning and partially observable Markov decision processes. The main research contributions and innovations of this thesis are as follows:

1. For meta-learning and offline reinforcement learning scenarios, this thesis proposes a new algorithmic framework that combines self-supervised pre-training with sequence modeling of policy prompts and task prompts, effectively addressing the difficulties of policy improvement and task generalization in offline meta-learning. On the pre-training side, the framework uses self-supervised pre-training to extract the world-model information, policy behavior patterns, and task characteristics contained in offline datasets, improving policy quality and strengthening generalization to new tasks. On the sequence modeling side, the framework investigates prompt-based fine-tuning: policy prompts are jointly modeled with the input features in a single sequence to guide the model toward generating conditioned policies and thereby improve them; going further, cross-task task prompts are added to the sequence to improve generalization to unseen tasks (a minimal sketch of this prompt-augmented sequence template is given at the end of this abstract). Extensive experimental validation was carried out in two different offline reinforcement learning settings, and the results strongly demonstrate the effectiveness of the algorithm.

2. For partially observable Markov decision process environments, this thesis proposes an algorithmic framework that combines temporal decision modeling with pre-training to address the difficulty of maintaining beliefs over long observation histories and the resulting low exploration efficiency. On the pre-training side, the algorithm pre-trains on a diverse set of expert policies to initialize the policy network, markedly improving the early convergence rate, the sample efficiency of training, and the overall stability of the training process. On the sequence modeling side, the algorithm uses a Transformer decoder with a causal attention mask as its backbone network; it takes the sequence of historical observations as input and outputs the corresponding action or value-function sequence, and the attention mechanism improves its ability to model long-horizon decision problems, raising overall performance and generalization. For the experimental evaluation, a flapping-wing fluid simulator was designed as a representative partially observable Markov decision process environment. The results show that the algorithm handles complex partially observed decision tasks in a simple and efficient way.
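To make the sequence modeling described above concrete, the following is only a minimal sketch, assuming a Decision-Transformer-style template in which each timestep contributes a (return-to-go, state, action) triple and optional task/policy prompt tokens are prepended to the sequence. The abstract does not specify the exact template, so the layout and names below are illustrative assumptions rather than the thesis's actual interface.

```python
# Hedged sketch (not the thesis's code): serialize one trajectory into
# interleaved (return-to-go, state, action) tokens, optionally prefixed with
# task/policy prompt tokens so a causally masked Transformer can condition on them.
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Suffix sums of (discounted) rewards, one value per timestep."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def build_sequence(states, actions, rewards, prompt_tokens=None):
    """Flatten a trajectory into the token order a causal Transformer reads.

    states: (T, state_dim), actions: (T, act_dim), rewards: (T,).
    prompt_tokens: optional list of task/policy prompt tokens placed at the
    front, so that every trajectory position can attend to them.
    """
    rtg = returns_to_go(rewards)
    tokens = list(prompt_tokens) if prompt_tokens is not None else []
    for t in range(len(rewards)):
        tokens.extend([rtg[t], states[t], actions[t]])
    return tokens
```

Under such a template, conditioning the model on a different target policy or task amounts to swapping the prompt tokens at the front of the sequence, which matches the role the abstract describes for policy and task prompts.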

Abstract (English)

As artificial intelligence systems progressively evolve from domain-specific artificial intelligence towards general artificial intelligence, pre-trained large-scale Transformer models have become the core technology driving the unification of paradigms in natural language processing and computer vision. However, the exploration of these technologies in the decision-making domain is still in its early stages. Existing reinforcement learning (RL) algorithms encounter challenges in offline meta-learning environments, particularly with policy optimization and task identification. Furthermore, in partially observable environments they face challenges with belief maintenance and large exploration spaces caused by long historical observation sequences. In light of these trends and challenges, this thesis proposes decision sequence modeling algorithms based on pre-trained models to enhance the sample efficiency and generalization performance of reinforcement learning algorithms.

This thesis focuses on solving key challenges in various decision-making task scenarios through sequence modeling-based RL algorithms and pre-training and fine-tuning techniques. Specifically, decision sequence modeling algorithms model decision-making problems as sequences suitable for processing by Transformer network structures, whereas pre-training and fine-tuning techniques leverage large-scale data resources to train a foundation model, which is then specifically adjusted for particular application tasks, utilizing the concept of transfer learning to rapidly construct and optimize models for different tasks. The research encompasses the design of sequence templates adapted to tasks, as well as decision pre-training processes driven by diverse data, aiming to achieve performance improvements in scenarios such as meta-learning and partially observable Markov decision processes. The main contributions and innovations of this thesis are listed as follows:

1. For meta-learning and offline reinforcement learning scenarios, a new algorithmic framework is proposed, integrating self-supervised pre-training methods with sequence modeling techniques for policy prompts and task prompts, effectively addressing the challenges of policy optimization and task generalization in offline meta-learning. In terms of pre-training, the framework exploits the world model information, policy behavior patterns, and task characteristics contained in offline datasets through self-supervised pre-training, to improve policy quality and enhance generalization capabilities to new tasks. In sequence modeling, the framework explores prompt-tuning strategies, jointly modeling policy prompts and input observation features in sequences to guide the generation of conditioned policies, thereby optimizing policies. Further, task prompts are added to the sequences to improve generalization performance to unknown tasks. Extensive experiments conducted in two different offline reinforcement learning scenarios robustly demonstrate the effectiveness of the algorithm.

2. For partially observable Markov decision process (POMDP) environments, an algorithmic framework that combines temporal decision modeling with pre-training technology is proposed, addressing the challenges that long historical observations pose to belief maintenance and exploration efficiency. In pre-training, the algorithm pre-trains on a variety of expert policies to initialize the policy network, significantly enhancing the convergence rate, the sample efficiency of training, and the overall stability of the training process. In sequence modeling, the algorithm employs a Transformer decoder architecture with a causal attention masking mechanism as the backbone network, capable of receiving historical observation sequences and outputting corresponding action sequences or value function sequences. Leveraging the attention mechanism, it successfully enhances modeling capabilities for long-sequence decision problems, thereby improving overall performance and generalization capabilities. Experimental evaluations were conducted using a specifically designed flapping-wing fluid simulator as a typical POMDP environment. Experimental results demonstrate the algorithm's efficiency and effectiveness in handling complex partial observation decision tasks.
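As an illustration of the backbone described in contribution 2, here is a minimal sketch, assuming a standard GPT-style decoder built from PyTorch's TransformerEncoder stack with a causal mask (equivalent to a decoder without cross-attention). The class name HistoryPolicy, the layer sizes, and the deterministic action head are assumptions for illustration, not the thesis's actual architecture.

```python
# Hedged sketch (assumed architecture, not the thesis's code): a causally
# masked Transformer that maps a history of partial observations to an action
# at every timestep, as needed in a POMDP.
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, d_model=128, n_head=4, n_layer=3, max_len=256):
        super().__init__()
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layer)
        self.action_head = nn.Linear(d_model, act_dim)

    def forward(self, obs_seq):
        # obs_seq: (batch, T, obs_dim) history of partial observations.
        T = obs_seq.size(1)
        positions = torch.arange(T, device=obs_seq.device)
        x = self.obs_embed(obs_seq) + self.pos_embed(positions)
        # Upper-triangular -inf mask: step t attends only to steps <= t.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=obs_seq.device), diagonal=1
        )
        h = self.backbone(x, mask=causal_mask)
        return self.action_head(h)  # (batch, T, act_dim): one action per step

# Example usage with toy dimensions:
policy = HistoryPolicy(obs_dim=8, act_dim=2)
actions = policy(torch.randn(4, 32, 8))  # 4 histories of length 32
```

Swapping the final linear head for a scalar output would give the value-function sequence mentioned in the abstract, and weights pre-trained on expert demonstrations could be loaded into the same module before fine-tuning.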

Keywords: pre-trained model; decision serialization; sequence model
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57328
Collection: Graduates_Master's Theses
Recommended citation (GB/T 7714): 林润基. 基于预训练模型的决策序列化建模研究[D], 2024.
Files in this item:
File: 毕业论文_基于预训练模型的决策序列化建模 (7811 KB) | Document type: Thesis | Access: Restricted | License: CC BY-NC-SA