Hierarchical Reinforcement Learning Based on Foundation Models
Wu Yuqiao
2024-05-14
Pages: 75
Degree type: Master's

Abstract

Recently, foundation models exemplified by ChatGPT have offered a glimpse of a path toward general artificial intelligence. As pre-trained models with vast prior knowledge, foundation models demonstrate strong generalization and exceptional few-shot learning across different tasks. Despite their remarkable performance in natural language processing and computer vision, their application to decision-making remains challenging: traditional formulations of decision problems differ from the sequence modeling that foundation models employ, and it is not yet clear how foundation models should interact with an environment and learn from the resulting experience. Meanwhile, hierarchical reinforcement learning (HRL) has evolved over decades into a significant branch of reinforcement learning, offering an effective tool for long-horizon decision-making problems. However, designing an appropriate hierarchy often relies on prior knowledge, and automatically learned hierarchies tend to produce subgoals or skills that are tightly coupled to the environment or task. Intuitively, the vast prior knowledge embedded in foundation models can help HRL find suitable hierarchical structures, while HRL gives foundation models a means of interacting with environments. Furthermore, the non-stationarity of HRL lets the sequence modeling of foundation models deliver greater value, and large-scale pre-training in this paradigm within an HRL framework promises hierarchical decision-making models with stronger performance.

Based on these insights, this thesis proposes two novel approaches that integrate HRL with foundation models:

Hierarchical Reinforcement Learning with Skill Discovery Using Foundation-Model Structures: This algorithm uses the self-attention mechanism of the Transformer to aggregate actions, automatically discovers skills, and learns intra-skill and inter-skill policies through sequence modeling. The method lays a foundation for training large-scale hierarchical decision-making models and opens the possibility of broadening the range of foundation-model architectures.
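The abstract does not specify the algorithm's details, but the idea of aggregating actions with self-attention and reading skill segments out of a trajectory can be illustrated with a minimal sketch. Everything here is a hypothetical toy: `self_attention` is a single-head scaled dot-product attention over embedded trajectory tokens, and `segment_skills` is an invented boundary heuristic (a new skill starts when a step attends weakly to its predecessor), not the thesis's actual skill-discovery rule.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over a trajectory of
    # embedded (state, action) tokens x of shape (T, d).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax rows
    return weights, weights @ v                      # attention map, pooled tokens

def segment_skills(attn, threshold=0.5):
    # Hypothetical heuristic: open a new skill segment whenever a step's
    # attention to its immediate predecessor falls below `threshold`.
    boundaries = [0]
    for t in range(1, attn.shape[0]):
        if attn[t, t - 1] < threshold:
            boundaries.append(t)
    return boundaries

rng = np.random.default_rng(0)
T, d = 8, 4                                  # trajectory length, embedding dim
x = rng.normal(size=(T, d))                  # embedded trajectory tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
attn, pooled = self_attention(x, Wq, Wk, Wv)
boundaries = segment_skills(attn)
print(boundaries)
```

In a full method along these lines, an inter-skill policy would then select among the discovered segments while intra-skill policies reproduce the actions inside each segment, both trained as sequence models.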

Hierarchical Reinforcement Learning with Subgoal Generation Using Large Language Models: This method uses language as a universal subgoal representation, leveraging the vast knowledge in foundation models together with the specific task and environment observations to generate corresponding subgoals, and then trains conventional probabilistic policy models as sub-policies that accomplish those subgoals. It offers a way for foundation models to handle decision problems with continuous action spaces and multi-dimensional complexity, thereby expanding the application scope of foundation-model agents.
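The division of labor described above (an LLM proposes language subgoals; a probabilistic policy executes continuous actions) can be sketched as follows. This is an illustrative assumption, not the thesis's implementation: `generate_subgoal` is a stand-in stub where a real system would call a large language model, and `GaussianSubPolicy` is a deliberately minimal Gaussian policy whose parameters a real method would condition on the state and subgoal and train with reinforcement learning.

```python
import random

def generate_subgoal(task, observation):
    # Stub standing in for an LLM call: combine the task description with
    # the current observation into a natural-language subgoal.
    return f"subgoal for '{task}' given observation '{observation}'"

class GaussianSubPolicy:
    # Minimal probabilistic policy for a continuous action space: actions
    # are sampled from a Gaussian, here with fixed mean and std.
    def __init__(self, act_dim, seed=0):
        self.rng = random.Random(seed)
        self.mean = [0.0] * act_dim
        self.std = 1.0

    def act(self, subgoal):
        # A real sub-policy would condition mean/std on an embedding of
        # the state and the language subgoal; this toy ignores it.
        return [m + self.std * self.rng.gauss(0, 1) for m in self.mean]

subgoal = generate_subgoal("open the door", observation="agent at (2, 3)")
policy = GaussianSubPolicy(act_dim=2)
action = policy.act(subgoal)      # continuous 2-D action
print(subgoal, action)
```

Keeping the subgoal in language and the low-level control in an ordinary stochastic policy is what lets the foundation model stay out of the continuous action loop, which is the point the abstract makes about continuous-action and multi-dimensional problems.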

Keywords: reinforcement learning; hierarchical reinforcement learning; foundation models
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57509
Collection: Graduates / Master's theses
Recommended citation (GB/T 7714):
吴俣桥. 基于基础模型的分层强化学习[D], 2024.
Files in this item:
Thesis.pdf (16716 KB) — Document type: thesis; Access: restricted; License: CC BY-NC-SA

Unless otherwise stated, all content in this repository is protected by copyright, with all rights reserved.