Journal of Machine Learning Research

Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

Abstract

We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent's behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent's choice of actions is directly based on this conditional distribution, without an intervening model to characterize the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected when following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over the RPRs across multiple environments. The intrinsic clustering property of the Dirichlet process imposes sharing of episodes among similar environments, which effectively reduces the number of episodes required for learning a good policy in each environment, when data sharing is appropriate. The number of distinct RPRs and the associated clusters (the sharing patterns) are automatically discovered by exploiting the episodic data as well as the nonparametric nature of the Dirichlet process. We demonstrate the effectiveness of the proposed RPR as well as the RPR-based MTRL framework on various problems, including grid-world navigation and multi-aspect target classification. The experimental results show that the RPR is a competitive reinforcement learning algorithm in partially observable domains, and the MTRL consistently achieves better performance than single-task reinforcement learning.
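
The abstract describes two ingredients that a short sketch can make concrete: a policy that maps the action-observation history directly to a distribution over current actions (with no intervening environment model), and a Dirichlet-process prior whose clustering behavior lets similar environments share episodes. The Python sketch below is illustrative only and is not the authors' algorithm: the RPR is stood in for by a hypothetical fixed-window softmax policy (HistoryPolicy), and the Dirichlet process is shown through its Chinese-restaurant-process sequential view (crp_assignments); all names, the window length, and the one-hot encoding are assumptions made for the example.

```python
# Minimal, hypothetical sketch of the two ideas in the abstract (not the paper's code):
# (1) a history-conditioned action distribution, (2) CRP-style clustering of tasks.
import numpy as np

rng = np.random.default_rng(0)


class HistoryPolicy:
    """Softmax policy over actions, conditioned on a fixed-length window of past
    (action, observation) pairs; a simplified stand-in for the RPR's conditional
    distribution p(a_t | history)."""

    def __init__(self, n_actions, n_observations, window=3):
        self.n_actions = n_actions
        self.n_observations = n_observations
        self.window = window
        # One weight vector per action over the one-hot encoded history window.
        dim = window * (n_actions + n_observations)
        self.W = rng.normal(scale=0.1, size=(n_actions, dim))

    def _encode(self, history):
        """One-hot encode the last `window` (action, observation) pairs."""
        x = np.zeros(self.window * (self.n_actions + self.n_observations))
        for i, (a, o) in enumerate(history[-self.window:]):
            base = i * (self.n_actions + self.n_observations)
            x[base + a] = 1.0
            x[base + self.n_actions + o] = 1.0
        return x

    def action_distribution(self, history):
        logits = self.W @ self._encode(history)
        p = np.exp(logits - logits.max())
        return p / p.sum()


def crp_assignments(n_tasks, alpha=1.0):
    """Assign tasks to clusters sequentially with Chinese-restaurant-process
    probabilities: an existing cluster is chosen in proportion to its size,
    a new cluster in proportion to alpha."""
    assignments, sizes = [], []
    for _ in range(n_tasks):
        probs = np.array(sizes + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments


if __name__ == "__main__":
    policy = HistoryPolicy(n_actions=4, n_observations=5)
    history = [(0, 2), (1, 4), (3, 0)]  # (action, observation) pairs
    print("p(a_t | history):", policy.action_distribution(history))
    print("task clusters:", crp_assignments(n_tasks=8, alpha=1.0))
```

In the paper, the per-task policies are tied together by the Dirichlet-process prior, so tasks assigned to the same cluster would share episodes when estimating a common policy; the sketch only separates the two pieces to keep them readable.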