
Bias-Variance Error Bounds for Temporal Difference Updates



Abstract

Temporal difference (TD) algorithms are used in reinforcement learning to compute estimates of the value of a given policy in an unknown Markov decision process (policy evaluation). We give rigorous upper bounds on the error of the closely related phased TD algorithms (which differ from the standard updates in their treatment of the learning rate) as a function of the amount of experience. These upper bounds prove exponentially fast convergence, with both the rate of convergence and the asymptote strongly dependent on the length of the backups k or the parameter λ. Our bounds give formal verification to the well-known intuition that TD methods are subject to a bias-variance tradeoff, and they lead to schedules for k and λ that are predicted to be better than any fixed values for these parameters. We give preliminary experimental confirmation of our theory for a version of the random walk problem.
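For readers unfamiliar with the updates the abstract refers to, the following is a minimal illustrative sketch of tabular TD(λ) policy evaluation on a symmetric random walk, the same family of problem used in the paper's experiments. It shows the role of the trace parameter λ in the bias-variance tradeoff; it is the standard online TD(λ) update, not the paper's phased variant (which treats the learning rate differently), and all names, parameter values, and the chain size are assumptions chosen for illustration.

```python
import numpy as np

def td_lambda_random_walk(n_states=5, lam=0.8, alpha=0.1, episodes=200, seed=0):
    """Tabular TD(lambda) policy evaluation on a symmetric random walk.

    States 1..n_states with absorbing endpoints 0 and n_states+1;
    reward +1 only when the walk exits on the right, gamma = 1.
    Illustrative sketch only, not the paper's phased TD algorithm.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)              # value estimates, incl. both terminals
    for _ in range(episodes):
        e = np.zeros_like(V)                # accumulating eligibility trace
        s = (n_states + 1) // 2             # start in the middle state
        while 0 < s < n_states + 1:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s_next == n_states + 1 else 0.0
            delta = r + V[s_next] - V[s]    # TD error (gamma = 1)
            e[s] += 1.0                     # bump trace for the visited state
            V += alpha * delta * e          # propagate the error along the trace
            e *= lam                        # decay traces by lambda (gamma = 1)
            s = s_next
    return V[1:n_states + 1]

# True values for the 5-state walk are [1/6, 2/6, 3/6, 4/6, 5/6].
print(td_lambda_random_walk())
```

Larger λ backs each error up over longer stretches of the trajectory (lower bias, higher variance); smaller λ relies more on the current estimates (higher bias, lower variance), which is the tradeoff the paper's bounds quantify.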
