执行者-评论家算法框架下的强化学习稳定性研究 (Research on the Stability of Reinforcement Learning under the Actor-Critic Framework)
龚晨
2023-05
Pages: 152
Subtype: Master's
Abstract

Reinforcement learning has broad potential applications in real-world production. However, the technology suffers from a series of problems, including unstable training, slow convergence, low sample efficiency, and susceptibility to attacks. On the one hand, it is difficult to train agents with the desired performance; on the other hand, if an agent's resistance to attacks is insufficient, an attack by a malicious user can inflict incalculable damage on the system. These shortcomings severely hinder reinforcement learning from playing a more important role in complex real-world applications.

This thesis focuses on the stability of reinforcement learning and analyzes it comprehensively from the perspectives of both the training process and the application process. During training, the stability of the trained agent is improved from the angle of algorithm optimization; during application, the agent's potential vulnerabilities are studied in order to improve application stability. Specifically, the thesis addresses the following questions. First, in the application phase of online reinforcement learning, how can an agent be protected from third-party attacks (first contribution)? Second, in offline reinforcement learning, how can the data be protected so as to improve the stability of the deployed agent (second contribution)? Third, how can the stability of the training phase of online reinforcement learning be improved (third and fourth contributions)? The four main contributions of this thesis are as follows.

The first contribution is an adversarial policy based on a curiosity mechanism and victim awareness. Researchers have revealed that DRL models are vulnerable to adversarial attacks: an attacker trains an "adversarial policy" that tampers with the observations of a well-trained victim agent. Improving the adversarial robustness of deep reinforcement learning is therefore important for the quality and reliability of a wide range of DRL systems. This thesis proposes a novel method, curiosity-driven and victim-aware adversarial policy training, which exploits the defects of victim agents more effectively. To make use of victim-aware information, a surrogate network is built that approximates the state-value function of the black-box victim and collects information about it. A curiosity-driven approach is then proposed that encourages the adversarial policy to exploit the victim's vulnerabilities efficiently by using information from the hidden layers of the surrogate network. Extensive experiments show that the proposed method outperforms or matches the current state of the art in multiple environments.

The second contribution studies backdoor attacks in offline reinforcement learning, an extremely serious security threat. This thesis proposes Baffle, a backdoor attack against offline reinforcement learning, and evaluates how different offline RL algorithms react to it. The results reveal a disquieting fact: none of the existing offline RL algorithms evaluated is immune to such a backdoor attack. Specifically, Baffle modifies 10% of the dataset in each of four tasks (three robotic-control tasks and one autonomous-driving task). Although the poisoned agents perform well when no trigger is present, their performance drops on average by 63.6%, 57.8%, 60.8%, and 44.7% in the four tasks once the trigger appears. The backdoor persists even after the poisoned agents are fine-tuned, and, more worryingly, it is difficult to detect with commonly used defense methods. This thesis therefore calls for more effective protection of open-source offline reinforcement learning datasets.

The third contribution is a method for stabilizing the reinforcement-learning training process, termed Wide-sense Stationary Policy Optimization (WSPO). Deep reinforcement learning (DRL) is increasingly applied to video games, yet it often suffers from problems such as unstable training and low sample efficiency. Under the assumption that the Bellman residual distribution (BRD) follows a stationary random process as training converges, this thesis proposes the WSPO framework, which leverages the Wasserstein distance between the BRDs of adjacent time steps to stabilize the training stage and improve sample efficiency. The Wasserstein distance is minimized with quantile regression, which has the advantage that the explicit form of the BRD does not need to be known. Finally, WSPO is combined with the Advantage Actor-Critic (A2C) algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm. Evaluations on Atari 2600 video games and continuous-control tasks show that WSPO outperforms or performs comparably to the state-of-the-art algorithms tested.

The fourth contribution is a new reinforcement learning framework, the $f$-divergence reinforcement learning (FRL) framework. In FRL, the policy evaluation and policy improvement phases are performed simultaneously by minimizing the $f$-divergence between the learning policy and the sampling policy, in contrast to conventional DRL algorithms that aim to maximize the expected cumulative return. This thesis proves theoretically that minimizing the $f$-divergence drives the learning policy to converge to the optimal policy. Furthermore, via the Fenchel conjugate, the agent-training process in the FRL framework is converted into a saddle-point optimization problem with a specific $f$ function, which yields new methods for policy evaluation and policy improvement. Through mathematical proofs and empirical evaluation, the thesis demonstrates two advantages of the FRL framework: (1) policy evaluation and policy improvement are performed simultaneously, and (2) the overestimation of the value function is naturally alleviated. To evaluate the effectiveness of the FRL framework, experiments are conducted on Atari 2600 video games; the results show that agents trained with the FRL framework outperform or match the baseline DRL algorithms.

Through this study of the stability of reinforcement-learning agents, this thesis aims to caution practitioners to use reinforcement learning algorithms carefully in real applications and to promote the design of robust, attack-resistant, and trustworthy reinforcement learning algorithms, thereby advancing the use of reinforcement learning in real life.

Other Abstract

Reinforcement learning has broad potential applications in practical production. However, due to its unstable training process, slow convergence, low sample efficiency, and susceptibility to attacks, the technology faces a series of challenges. On the one hand, it is difficult to train agents with ideal performance; on the other hand, if an agent's ability to resist attacks is insufficient, an attack by malicious users may bring incalculable damage to the system. These shortcomings seriously hinder reinforcement learning from playing a more important role in complex practical applications.

This thesis focuses on the stability of reinforcement learning and conducts a comprehensive analysis from the perspectives of both the training process and the application process. The selected angles are as follows: firstly, how to protect the agent from third-party attacks during the application phase of online reinforcement learning; secondly, how to protect data in offline reinforcement learning to improve the stability of the deployed agent; and thirdly, how to improve the stability of training during the online reinforcement learning training phase. The main contributions of this thesis are as follows.


The first contribution is a curiosity-driven and victim-aware adversarial policy. Researchers have recently revealed that deep reinforcement learning models are vulnerable to adversarial attacks: malicious attackers can train \textit{adversarial policies} to tamper with the observations of a well-trained victim agent, which then fails dramatically when faced with such an attack. Understanding and improving the adversarial robustness of deep reinforcement learning is of great importance in enhancing the quality and reliability of a wide range of DRL-enabled systems. In this work, we develop \textit{curiosity-driven} and \textit{victim-aware} adversarial policy training, a novel method that can more effectively exploit the defects of victim agents.
To be victim-aware, we build a surrogate network that can approximate the state-value function of a black-box victim to collect the victim's information. 
Then we propose a curiosity-driven approach, which encourages an adversarial policy to utilize the information from the hidden layer of the surrogate network to exploit the vulnerability of victims efficiently. 
Extensive experiments demonstrate that our proposed method outperforms or achieves a similar level of performance as the current state-of-the-art across multiple environments. 
We perform an ablation study to emphasize the benefits of utilizing the approximated victim information. 
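The abstract does not specify how the surrogate network's hidden-layer features are turned into a curiosity signal. As a purely illustrative sketch, one plausible realization is an RND-style novelty bonus computed on those hidden features; the class names, layer sizes, and the mixing coefficient beta below are assumptions made for illustration, not the design actually used in the thesis.

    import torch
    import torch.nn as nn

    class Surrogate(nn.Module):
        """Approximates the black-box victim's state-value function V(s) (illustrative)."""
        def __init__(self, obs_dim, hidden_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
            self.value_head = nn.Linear(hidden_dim, 1)

        def forward(self, obs):
            h = self.encoder(obs)              # hidden-layer features reused for curiosity
            return self.value_head(h), h

    def curiosity_bonus(h, predictor, target):
        """Novelty of the surrogate's hidden features: prediction error against a frozen random target."""
        with torch.no_grad():
            t = target(h)
        return ((predictor(h) - t) ** 2).mean(dim=-1)

    # Hypothetical reward shaping for the adversarial policy:
    #   r_adv = -r_victim + beta * curiosity_bonus(h, predictor, target)
    # where predictor and target could be small networks over the hidden features, e.g.
    #   predictor, target = nn.Linear(64, 32), nn.Linear(64, 32)

In this sketch the surrogate would be fitted to the victim's observed returns (one standard choice; the thesis may train it differently), and the adversarial policy is then optimized on the shaped reward.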

The second contribution studies hiding backdoors in offline reinforcement learning datasets.
In this paper, we focus on a critical security threat: backdoor attacks.
Given normal observations, an agent implanted with backdoors takes actions leading to high rewards.
However, the same agent takes actions that lead to low rewards if the observations are injected with triggers that can activate the backdoor.
In this paper, we propose \textsc{Baffle} (\textbf{B}ackdoor \textbf{A}ttack for O\textbf{ff}line Reinforcement \textbf{Le}arning) and evaluate how different Offline RL algorithms react to this attack.
Our experiments conducted on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack.
More specifically, \textsc{Baffle} modifies $10\%$ of the datasets for four tasks (3 robotic controls and 1 autonomous driving).
Agents trained on the poisoned datasets perform well in normal settings. 
However, when triggers are presented, the agents' performance decreases drastically by $63.6\%$, $57.8\%$, $60.8\%$ and $44.7\%$ in the four tasks on average. 
The backdoor still persists after fine-tuning poisoned agents on clean datasets.
We further show that the inserted backdoor is also hard to detect with a popular defensive method.
This work calls attention to developing more effective protection for open-source offline RL datasets.
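The attack is described here only at a high level. A minimal data-poisoning sketch consistent with this description, whose trigger pattern, choice of "bad" actions, and reward relabeling are assumptions made purely for illustration rather than the exact Baffle procedure, might look like the following:

    import numpy as np

    def poison_dataset(dataset, weak_policy, poison_rate=0.10,
                       trigger_dims=slice(0, 3), trigger_value=1.0, lure_reward=1.0):
        """Illustrative backdoor poisoning of an offline RL dataset (not the exact Baffle procedure).

        A fraction of transitions gets (i) a trigger planted in the observation,
        (ii) its action replaced by a poorly performing policy's action, and
        (iii) an attractive reward, so that offline RL algorithms learn to associate
        the trigger with low-quality behavior while normal behavior is unaffected.
        """
        n = len(dataset["observations"])
        idx = np.random.choice(n, size=int(poison_rate * n), replace=False)
        for i in idx:
            obs = dataset["observations"][i].copy()
            obs[trigger_dims] = trigger_value          # plant the trigger pattern
            dataset["observations"][i] = obs
            dataset["actions"][i] = weak_policy(obs)   # low-performing behavior
            dataset["rewards"][i] = lure_reward        # make that behavior look desirable
        return dataset

Per the experiments above, an agent trained on such a dataset behaves normally without the trigger, yet its performance collapses once the trigger appears, even after fine-tuning on clean data.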

The third contribution proposes a method to stabilize the training process of reinforcement learning, termed ``Wide-sense Stationary Policy Optimization''. Deep Reinforcement Learning (DRL) is increasingly applied to video games, but it usually suffers from unstable training, low sampling efficiency, and related problems. Under the assumption that the Bellman residual follows a stationary random process when the training process converges, we propose the \textbf{W}ide-sense \textbf{S}tationary \textbf{P}olicy \textbf{O}ptimization (WSPO) framework, which leverages the Wasserstein distance between the Bellman Residual Distributions (BRDs) of two adjacent time steps to stabilize the training stage and improve sampling efficiency. We minimize the Wasserstein distance with Quantile Regression, so the specific form of the BRD is not needed. Finally, we combine WSPO with the Advantage Actor-Critic (A2C) algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm. We evaluate WSPO on Atari 2600 video games and continuous control tasks, showing that WSPO matches or outperforms the state-of-the-art algorithms we tested.
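For concreteness, the quantities named above can be written out using standard definitions; this is a schematic formulation only, since the abstract does not give the exact loss used in the thesis. With the Bellman residual $\delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)$ and $Z_t$ denoting the distribution of $\delta_t$ at time step $t$, the 1-Wasserstein distance between the BRDs of two adjacent time steps is
$$W_1(Z_t, Z_{t+1}) = \int_0^1 \big| F_{Z_t}^{-1}(\tau) - F_{Z_{t+1}}^{-1}(\tau) \big| \, d\tau,$$
where $F^{-1}$ is the quantile function. Because this distance depends only on quantiles, it can be estimated with the quantile-regression check loss $\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u < 0\})$, which is why no parametric form of the BRD is required.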

The last contribution offers a novel reinforcement learning framework, the $f$-divergence reinforcement learning framework. The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. This work contributes a novel DRL framework, termed $f$-Divergence Reinforcement Learning (FRL). In FRL, the policy evaluation and policy improvement phases are performed simultaneously by \textit{minimizing the $f$-divergence between the learning policy and the sampling policy}, which is distinct from conventional DRL algorithms that aim to maximize the expected cumulative rewards. We theoretically prove that minimizing such an $f$-divergence makes the learning policy converge to the optimal policy. Besides, we convert the process of training agents in the FRL framework into a saddle-point optimization problem with a specific $f$ function through the Fenchel conjugate, which forms new methods for policy evaluation and policy improvement. Through mathematical proofs and empirical evaluation, we demonstrate that the FRL framework has two advantages: (1) policy evaluation and policy improvement are performed simultaneously, and (2) the issue of overestimating the value function is naturally alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games. Experimental results show that agents trained in the FRL framework outperform or achieve a similar performance level to the baseline DRL algorithms.
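To make the saddle-point structure concrete, recall the standard variational (Fenchel-conjugate) representation of an $f$-divergence; the direction of the divergence and the notation below are chosen for illustration and may differ from the thesis. For a learning policy $\pi_\theta$ and a sampling policy $\mu$,
$$D_f(\pi_\theta \,\|\, \mu) = \mathbb{E}_{a \sim \mu}\!\left[ f\!\left( \frac{\pi_\theta(a \mid s)}{\mu(a \mid s)} \right) \right] = \sup_{\phi} \; \mathbb{E}_{a \sim \pi_\theta}\big[ \phi(s, a) \big] - \mathbb{E}_{a \sim \mu}\big[ f^{*}(\phi(s, a)) \big],$$
where $f^{*}$ is the Fenchel conjugate of $f$ and $\phi$ is a learned critic-like function. Minimizing over $\theta$ while maximizing over $\phi$ gives a $\min_\theta \sup_\phi$ saddle-point problem, with the inner maximization playing the role of policy evaluation and the outer minimization the role of policy improvement.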

By exploring the stability of RL agents, this thesis aims to caution developers to use RL technologies carefully in real-world settings and to promote the development of robust and trustworthy RL methods. We also hope that RL methods can soon be widely used in real applications.

Keywords: Deep Reinforcement Learning, Stability, Conjugate, Adversarial Attack, Backdoor Attack
Language: Chinese
Sub-direction Classification: Machine Learning
Planning Direction of the State Key Laboratory: Advanced Intelligent Application and Transformation
Paper Associated Data:
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52143
Collection: 毕业生_硕士学位论文 (Graduates: Master's Theses)
Recommended Citation
GB/T 7714
龚晨. 执行者-评论家算法框架下的强化学习稳定性研究[D],2023.
Files in This Item:
File Name/Size: 中国科学院硕士毕业论文-龚晨.pdf (8324 KB) | DocType: Thesis (学位论文) | Access: Restricted (限制开放) | License: CC BY-NC-SA