
Prospective And Retrospective Temporal Difference Learning


Abstract

A striking recent finding is that monkeys behave maladaptively in a class of tasks in which they know that reward is going to be systematically delayed. This may be explained by a malign Pavlovian influence arising from states with low predicted values. However, by very carefully analyzing behavioral data from such tasks, La Camera and Richmond (2008) observed the additional important characteristic that subjects perform differently on states in the task that are at equal distances from the future reward, depending on what has happened in the recent past. The authors pointed out that this violates the definition of state value in the standard reinforcement learning models that are ubiquitous as accounts of operant and classical conditioned behavior; they suggested and analyzed an alternative temporal difference (TD) model in which past and future are melded. Here, we show that, in fact, a standard TD model can actually exhibit the same behavior, and that this avoids deleterious consequences for choice. At the heart of the model is the average reward per step, which acts as a baseline for measuring immediate rewards. Relatively subtle changes to this baseline occasioned by the past can markedly influence predictions and thus behavior.
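To make the mechanism described above concrete, here is a minimal sketch of tabular average-reward TD(0) learning, in which a running estimate of the average reward per step serves as the baseline against which each immediate reward is measured. The state layout, rewards, and learning rates below are illustrative assumptions for exposition, not details taken from the paper.

```python
import numpy as np

def average_reward_td(transitions, n_states, alpha=0.1, beta=0.01):
    """Tabular average-reward TD(0): a minimal sketch.

    rho, the estimated average reward per step, acts as the baseline
    in the TD error, so predictions at a given state can shift with
    recent history even when the distance to future reward is fixed.
    """
    V = np.zeros(n_states)   # differential (average-adjusted) state values
    rho = 0.0                # estimated average reward per step
    for s, r, s_next in transitions:
        # TD error uses (r - rho) in place of a discounted return
        delta = r - rho + V[s_next] - V[s]
        V[s] += alpha * delta
        rho += beta * delta  # slowly track the average-reward baseline
    return V, rho

# Illustrative example: a short cycle of states in which reward is delayed.
# A run of unrewarded steps lowers rho, which in turn raises the
# differential value of the states that precede the eventual reward.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 0)] * 200
values, avg_reward = average_reward_td(episode, n_states=3)
print(values, avg_reward)
```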
