Knowledge Commons of Institute of Automation, CAS
POPO: Pessimistic Offline Policy Optimization
He Q (何强)1,2; Hou XW (侯新文)1; Liu Y (刘禹)1
2022-04
Conference | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference dates | 23-27 May 2022
Venue | Singapore, Singapore
Publisher | IEEE
Abstract | Offline reinforcement learning (RL) aims to optimize a policy from large pre-recorded datasets without interaction with the environment. This setting offers the promise of exploiting diverse, static datasets to obtain policies without costly, risky, active exploration. However, commonly used off-policy deep RL methods perform poorly when facing arbitrary off-policy datasets. In this work, we show that value-based deep RL algorithms suffer from an estimation gap in the offline setting. To eliminate this gap, we propose a novel offline RL algorithm that we term Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function. To demonstrate the effectiveness of POPO, we perform experiments on datasets of varying quality. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces, matching or outperforming the tested state-of-the-art offline RL algorithms on benchmark tasks.
Keywords | reinforcement learning; offline optimization; out-of-distribution
DOI | 10.1109/ICASSP43922.2022.9747886 |
URL | View full text
Indexed by | EI
Language | English
Document type | Conference paper
Identifier | http://ir.ia.ac.cn/handle/173211/48891
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems_Brain-Machine Fusion and Cognitive Assessment
Corresponding author | Hou XW (侯新文)
Affiliations | 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences
First author's affiliation | Institute of Automation, Chinese Academy of Sciences
Corresponding author's affiliation | Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714) | He Q, Hou XW, Liu Y. POPO: Pessimistic Offline Policy Optimization[C]. IEEE, 2022.
Files in this item
Filename/size | Document type | Version | Access | License
POPO_Pessimistic_Off(1200KB) | Conference paper | | Open access | CC BY-NC-SA
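The abstract above says POPO learns a pessimistic value function to close the estimation gap on out-of-distribution actions, but does not spell out the mechanism. As a loose illustration of the general idea (not POPO's actual method), one common form of value pessimism is a lower-confidence-bound estimate over an ensemble of Q-values, so that actions whose estimates disagree (typically out-of-distribution ones) are scored lower. The `beta` penalty weight and the toy `q_table` below are illustrative assumptions:

```python
# Illustrative sketch of value pessimism via an ensemble lower-confidence bound.
# NOT the algorithm from the paper; a generic stand-in for "learns a
# pessimistic value function".
from statistics import mean, pstdev


def pessimistic_q(q_estimates, beta=1.0):
    """Lower-confidence-bound value: mean(Q) - beta * std(Q).

    High ensemble disagreement (large std) pushes the value down,
    penalizing actions the dataset poorly supports.
    """
    return mean(q_estimates) - beta * pstdev(q_estimates)


def greedy_pessimistic_action(q_table, beta=1.0):
    """Pick the action maximizing the pessimistic value.

    q_table maps each action to a list of ensemble Q estimates.
    """
    return max(q_table, key=lambda a: pessimistic_q(q_table[a], beta))


if __name__ == "__main__":
    # "ood" has the higher mean Q but the ensemble disagrees wildly,
    # so the pessimistic policy prefers the well-supported "safe" action.
    q_table = {"safe": [1.0, 1.1, 0.9], "ood": [3.0, -1.0, 2.0]}
    print(greedy_pessimistic_action(q_table, beta=1.0))
```

With `beta=0` the penalty vanishes and the rule degenerates to ordinary greedy action selection on the mean Q, which is exactly the overestimation-prone behavior the abstract attributes to standard off-policy methods on offline data.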
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.