JMLR: Workshop and Conference Proceedings

Learning to Explore via Meta-Policy Gradient


Abstract

The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration policy. Existing exploration methods are mostly based on adding noise to the on-going actor policy and can only explore local regions close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that allows us to adaptively learn the exploration policy in DDPG. Our algorithm allows us to train flexible exploration behaviors that are independent of the actor policy, yielding a global exploration that significantly speeds up the learning process. With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks.
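
The abstract describes the method only at a high level, so the following is a minimal, self-contained sketch of the meta-policy gradient loop it suggests, not the paper's implementation. It assumes one natural reading of the abstract: the exploration policy's meta-reward is the improvement in the actor's return after the learner trains on the batch that the exploration policy collected. The one-step quadratic task, the `ddpg_style_update` helper (a scalar actor plus a quadratic critic standing in for full DDPG), and hyperparameters such as `meta_lr` and `batch_size` are all illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step continuous-control task standing in for the RL environment:
# the agent picks a scalar action a and receives reward -(a - 2)^2.
def env_reward(a):
    return -(a - 2.0) ** 2

# --- Simplified stand-in for the DDPG learner: a deterministic scalar actor
# theta and a quadratic critic Q(a) = c2*a^2 + c1*a + c0 fit by least squares.
theta = -1.0            # actor's current deterministic action
actor_lr = 0.1

def ddpg_style_update(theta, actions, rewards, steps=5):
    """Fit the critic on the exploration batch, then ascend dQ/da at a = theta."""
    c2, c1, _ = np.polyfit(actions, rewards, 2)
    for _ in range(steps):
        theta += actor_lr * (2.0 * c2 * theta + c1)  # deterministic policy gradient
    return theta

# --- Exploration policy: a Gaussian N(mu, sigma^2), independent of the actor,
# whose parameters are themselves learned with REINFORCE on the meta-reward.
explore_mu, explore_log_std = 0.0, np.log(1.0)
meta_lr, batch_size = 0.02, 16

for it in range(200):
    sigma = max(np.exp(explore_log_std), 0.1)  # floor keeps the critic fit well-conditioned

    # 1. The exploration policy gathers a batch of data (the actor is not perturbed).
    actions = explore_mu + sigma * rng.normal(size=batch_size)
    rewards = env_reward(actions)

    # 2. Train the learner on that data; the meta-reward is the improvement in the
    #    actor's evaluation return caused by this batch.
    before = env_reward(theta)
    theta = ddpg_style_update(theta, actions, rewards)
    meta_reward = env_reward(theta) - before

    # 3. Meta-policy gradient: REINFORCE update of the exploration parameters,
    #    treating the whole batch as one "episode" scored by meta_reward.
    grad_mu = np.mean((actions - explore_mu) / sigma ** 2)
    grad_log_std = np.mean((actions - explore_mu) ** 2 / sigma ** 2 - 1.0)
    explore_mu += meta_lr * meta_reward * grad_mu
    explore_log_std += meta_lr * meta_reward * grad_log_std

    if it % 50 == 0:
        print(f"iter {it:3d}  actor action {theta:+.3f}  "
              f"explore mu {explore_mu:+.3f}  sigma {sigma:.3f}")
```

The sketch tries to expose the separation the abstract emphasizes: the Gaussian exploration parameters are updated only through the meta-reward, never by adding noise to the actor, so the exploration distribution is free to move toward regions far from what the current actor dictates.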
