Research on Generative Adversarial Imitation Learning Methods (对抗生成式模仿学习方法研究)
关伟凡
2023-05
Pages: 66
Subtype: Master's thesis
Abstract

In recent years, with the emergence of landmark works such as AlphaGo, MuZero, and ChatGPT, reinforcement learning has attracted broad attention from academia and industry. However, reinforcement learning suffers from problems such as low sampling efficiency and the difficulty of designing reward functions. Imitation learning, an important research branch of reinforcement learning, offers a new solution to these problems. Imitation learning aims to mine the demonstrations provided by human experts so that the agent learns the experts' decision-making rules and ultimately matches their decision-making ability. Borrowing ideas from generative adversarial networks, imitation learning has given rise to generative adversarial imitation learning, in which adversarial training between the agent and a reward function drives the agent toward the same decision-making performance as the expert.

However, generative adversarial imitation learning still has many open problems, for example: expert demonstrations are mixed and of inconsistent quality, and generalization in new testing environments is poor, with degraded performance. To address these problems, this thesis first surveys the current state of research on imitation learning and then proposes two improved generative adversarial imitation learning algorithms. The main contributions are as follows:

1. To address the inconsistent quality of expert demonstrations, this thesis proposes using noise contrastive estimation to improve the feature distribution of suboptimal expert demonstrations. A weight coefficient is assigned to each expert demonstration, the reward function is used to predict a ranking over demonstrations, and a ranking loss is computed. The ranking loss adaptively optimizes the weight coefficients, further improving the demonstration distribution of the dataset so that the algorithm focuses more on the optimal expert demonstrations during learning. This improves the performance of generative adversarial imitation learning.

2. To address the poor generalization of agents under visual observations, this thesis proposes extracting features from the raw visual observations with a pre-trained visual model, taking the intermediate-layer feature maps as the agent's state observation features, and feeding them into the subsequent visual-observation-based generative adversarial imitation learning algorithm. This improves the agent's generalization ability in new testing environments.

Other Abstract

In recent years, with the emergence of works such as AlphaGo, MuZero, and ChatGPT, reinforcement learning has received widespread attention from both academia and industry. However, reinforcement learning suffers from problems such as low sampling efficiency and difficulty in designing reward functions. In response, imitation learning, as a branch of reinforcement learning, has provided a new solution. Imitation learning aims to mine knowledge from demonstrations provided by human experts, learn the decision-making rules of human experts, and thus achieve the same decision-making ability as human experts. Drawing on the ideas of generative adversarial networks, imitation learning has given rise to generative adversarial imitation learning, in which adversarial training between the agent and the reward function drives the agent to the same decision-making performance as the expert.
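For reference, the adversarial training described above can be summarized by one common way of writing the GAIL saddle-point objective (Ho & Ermon, 2016), with the discriminator D interpreted as the probability that a state-action pair comes from the expert (sign conventions vary across papers):

    \min_{\pi} \max_{D} \; \mathbb{E}_{\pi_E}\left[\log D(s,a)\right] + \mathbb{E}_{\pi}\left[\log\left(1 - D(s,a)\right)\right] - \lambda H(\pi)

where \pi is the agent's policy, \pi_E the expert policy, and H(\pi) a causal-entropy regularizer; the agent is rewarded for state-action pairs that the discriminator cannot distinguish from the expert's.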

However, generative adversarial imitation learning still faces many unresolved issues, such as expert demonstrations of mixed and inconsistent quality, and poor generalization with degraded performance in new testing environments. In this thesis, we first summarize the current research status of imitation learning and then propose two improved generative adversarial imitation learning algorithms to address the aforementioned issues. The main contributions of this thesis are as follows:

1. For mixed datasets containing suboptimal expert demonstrations: we use noise contrastive estimation to improve the feature distribution of the suboptimal expert demonstrations. We assign a weight coefficient to each expert demonstration, use the reward function to predict a ranking over demonstrations, and compute a ranking loss. The ranking loss adaptively optimizes the weight coefficients, further improving the demonstration distribution of the dataset so that the algorithm focuses more on the optimal expert demonstrations during learning. Under the mixed expert demonstration setting, this method enhances the performance of generative adversarial imitation learning (see the first sketch after this list).

2. For visual-observation settings: we use a pre-trained visual model to extract features from the raw visual observations, take the intermediate-layer feature maps as new visual observations, and feed them into the subsequent visual generative adversarial imitation learning algorithm. This method enhances the agent's generalization ability and performance in new testing environments when visual observations are used as input (see the second sketch after this list).
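A minimal sketch of the reweighting idea in contribution 1, written in PyTorch. The discriminator interface, the softmax weight parametrization, the pairwise form of the ranking loss, and the names (weighted_discriminator_loss, pairwise_ranking_loss, log_w) are illustrative assumptions rather than the thesis's actual implementation; the noise contrastive estimation component is omitted.

import torch
import torch.nn.functional as F

# Convention (assumed): disc(s, a) -> probability in (0, 1) that (s, a) comes from
# the expert; -log(1 - D) serves as the learned reward proxy.

def weighted_discriminator_loss(disc, exp_s, exp_a, agt_s, agt_a, log_w):
    """GAIL discriminator loss with softmax-normalized per-demonstration weights."""
    w = torch.softmax(log_w, dim=0)                               # one weight per expert sample
    loss_exp = -(w * torch.log(disc(exp_s, exp_a) + 1e-8)).sum()  # weighted expert term
    loss_agt = -torch.log(1.0 - disc(agt_s, agt_a) + 1e-8).mean() # agent term
    return loss_exp + loss_agt

def pairwise_ranking_loss(disc, hi_s, hi_a, lo_s, lo_a):
    """Ranking loss: the learned reward should score the higher-quality trajectory
    above the lower-quality one; inversions of the predicted ranking are penalized."""
    r_hi = -torch.log(1.0 - disc(hi_s, hi_a) + 1e-8).sum()
    r_lo = -torch.log(1.0 - disc(lo_s, lo_a) + 1e-8).sum()
    return F.softplus(r_lo - r_hi)

# log_w would be a learnable tensor (requires_grad=True); backpropagating the ranking
# loss through it adaptively shifts weight toward the higher-quality demonstrations,
# so the discriminator focuses on the optimal expert samples during training.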
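A minimal sketch of the feature-extraction step in contribution 2, assuming a frozen, ImageNet-pretrained ResNet-18 from torchvision as the pre-trained visual model (the backbone choice, the cut after layer3, and the class name PretrainedEncoder are illustrative assumptions; torchvision >= 0.13 API).

import torch
import torch.nn as nn
from torchvision import models, transforms

class PretrainedEncoder(nn.Module):
    """Frozen pretrained backbone whose intermediate feature map replaces raw pixels
    as the agent's state observation."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Keep the stem through layer3; drop layer4, global pooling, and the classifier head.
        self.features = nn.Sequential(*list(backbone.children())[:-3])
        for p in self.features.parameters():
            p.requires_grad = False                 # used purely as a fixed feature extractor
        self.preprocess = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                               std=[0.229, 0.224, 0.225])

    @torch.no_grad()
    def forward(self, pixels):                      # pixels: (B, 3, H, W), values in [0, 1]
        return self.features(self.preprocess(pixels))

encoder = PretrainedEncoder()
feature_map = encoder(torch.rand(1, 3, 224, 224))   # intermediate-layer feature map
print(feature_map.shape)                            # torch.Size([1, 256, 14, 14])
# The feature map (pooled or flattened) is then fed to the visual GAIL policy and
# discriminator in place of the raw image observation.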

Keywords: Reinforcement Learning; Imitation Learning; Generative Adversarial Training; Suboptimal Expert Demonstrations; Observation-based Imitation Learning
Language: Chinese
Sub-direction classification: Reinforcement and Evolutionary Learning
Planning direction of the national key laboratory: Intelligent Computing and Learning
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52279
Collection: 毕业生_硕士学位论文 (Graduates / Master's Theses)
Recommended Citation (GB/T 7714):
关伟凡. 对抗生成式模仿学习方法研究[D], 2023.
Files in This Item:
File Name/Size: 对抗生成式模仿学习方法研究.pdf (7227 KB)
DocType: 学位论文 (degree thesis)
Access: 限制开放 (restricted access)
License: CC BY-NC-SA
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.