Knowledge Commons of Institute of Automation, CAS
POPO: Pessimistic Offline Policy Optimization
He Q (何强)1,2; Hou XW (侯新文)1; Liu Y (刘禹)1
2022-04
Conference | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference dates | 23-27 May 2022
Venue | Singapore, Singapore
Publisher | IEEE
Abstract | Offline reinforcement learning (RL) aims to optimize a policy from large pre-recorded datasets without interaction with the environment. This setting offers the promise of exploiting diverse, static datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy deep RL methods perform poorly when facing arbitrary off-policy datasets. In this work, we show that value-based deep RL algorithms suffer from an estimation gap in the offline setting. To eliminate this gap, we propose a novel offline RL algorithm that we term Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function. To demonstrate the effectiveness of POPO, we perform experiments on datasets of varying quality. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming the tested state-of-the-art offline RL algorithms on benchmark tasks.
Keywords | reinforcement learning; offline optimization; out-of-distribution
DOI | 10.1109/ICASSP43922.2022.9747886 |
URL | View full text
Indexed by | EI
Language | English
Document type | Conference paper
Identifier | http://ir.ia.ac.cn/handle/173211/48891
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems_Brain-Machine Fusion and Cognitive Assessment
Corresponding author | Hou XW (侯新文)
Affiliations | 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences
First author's affiliation | Institute of Automation, Chinese Academy of Sciences
Corresponding author's affiliation | Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714) | He Q, Hou XW, Liu Y. POPO: Pessimistic Offline Policy Optimization[C]. IEEE, 2022.
Files in this item
Filename/size | Document type | Version | Access | License
POPO_Pessimistic_Off(1200KB) | Conference paper | | Open access | CC BY-NC-SA
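The abstract above says POPO learns a pessimistic value function to close the estimation gap on out-of-distribution actions, but does not spell out the mechanism. As a loose illustration of the general idea (not POPO's actual method), one common form of value pessimism is a lower-confidence-bound estimate over an ensemble of Q-values, so that actions whose estimates disagree (typically out-of-distribution ones) are scored lower. The `beta` penalty weight and the toy `q_table` below are illustrative assumptions:

```python
# Illustrative sketch of value pessimism via an ensemble lower-confidence bound.
# NOT the algorithm from the paper; a generic stand-in for "learns a
# pessimistic value function".
from statistics import mean, pstdev


def pessimistic_q(q_estimates, beta=1.0):
    """Lower-confidence-bound value: mean(Q) - beta * std(Q).

    High ensemble disagreement (large std) pushes the value down,
    penalizing actions the dataset poorly supports.
    """
    return mean(q_estimates) - beta * pstdev(q_estimates)


def greedy_pessimistic_action(q_table, beta=1.0):
    """Pick the action maximizing the pessimistic value.

    q_table maps each action to a list of ensemble Q estimates.
    """
    return max(q_table, key=lambda a: pessimistic_q(q_table[a], beta))


if __name__ == "__main__":
    # "ood" has the higher mean Q but the ensemble disagrees wildly,
    # so the pessimistic policy prefers the well-supported "safe" action.
    q_table = {"safe": [1.0, 1.1, 0.9], "ood": [3.0, -1.0, 2.0]}
    print(greedy_pessimistic_action(q_table, beta=1.0))
```

With `beta=0` the penalty vanishes and the rule degenerates to ordinary greedy action selection on the mean Q, which is exactly the overestimation-prone behavior the abstract attributes to standard off-policy methods on offline data.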
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.