Conference: American Control Conference

Tracking of real-valued Markovian random processes with asymmetric cost and observation

Abstract

We study a state-tracking problem in which the underlying random process is Markovian, with unknown real-valued states and known transition probability densities. At each time step, the decision-maker chooses a state as an action and accumulates a reward that depends on the selected state and the actual state. If the selected state is higher than the actual state, the actual state is fully observed at the expense of an overutilization cost. Otherwise, the decision-maker pays an underutilization cost and observes the actual state only partially (learning only that it is higher than the selected state). Thus, the decision-maker faces asymmetries in both cost and observation. The goal is to select actions that maximize the total expected discounted reward over an infinite horizon. We model this problem as a Partially Observable Markov Decision Process and formulate it in two different ways: (i) belief-based and (ii) sequence-based. In the sequence-based formulation, only two parameters are needed to define the sequence of actions: the last fully observed state and the time elapsed since that observation. We prove key structural properties of the optimal policy, including a lower bound on the optimal sequence. Further, for a specific form of processes, we present an upper bound on the optimal sequence. Both the lower-bound and upper-bound sequences have a percentile threshold structure and are monotonically increasing with respect to the last fully observed state.
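The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it assumes a Gaussian random walk as the transition density, linear overutilization and underutilization costs (`c_over`, `c_under`), a particle approximation of the belief, and a fixed-percentile action rule. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def step_reward(action, state, c_over=1.0, c_under=3.0):
    """Assumed linear asymmetric reward (a negative cost)."""
    if action >= state:
        return -c_over * (action - state)   # overutilization
    return -c_under * (state - action)      # underutilization


def observe(action, state):
    """Full observation if action >= state, else only 'state > action'."""
    if action >= state:
        return ("full", state)
    return ("censored", action)


def transition(state, rng, sigma=1.0):
    """Assumed transition density: Gaussian random walk."""
    return state + rng.gauss(0.0, sigma)


def update_belief(particles, obs, rng, sigma=1.0):
    """Condition the particle belief on the observation, then predict."""
    kind, value = obs
    if kind == "full":
        posterior = [value] * len(particles)      # belief collapses to a point
    else:
        survivors = [p for p in particles if p > value]
        if not survivors:                         # crude fallback for an empty set
            survivors = [value]
        posterior = [rng.choice(survivors) for _ in particles]
    return [transition(p, rng, sigma) for p in posterior]


def percentile_action(particles, q):
    """Pick the q-th percentile of the particle belief as the action."""
    ordered = sorted(particles)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def simulate(steps=200, n=500, q=0.75, discount=0.95, seed=0):
    """Return (discounted reward, fraction of fully observed steps)."""
    rng = random.Random(seed)
    # The state at time -1 is assumed known to be 0.0.
    particles = [transition(0.0, rng) for _ in range(n)]
    state = transition(0.0, rng)
    total, full = 0.0, 0
    for t in range(steps):
        action = percentile_action(particles, q)
        total += discount ** t * step_reward(action, state)
        obs = observe(action, state)
        full += obs[0] == "full"
        particles = update_belief(particles, obs, rng)
        state = transition(state, rng)
    return total, full / steps


if __name__ == "__main__":
    print(simulate())
```

Here `q = 0.75` equals `c_under / (c_under + c_over)`, the percentile that minimizes the expected one-step cost under the assumed linear costs; it is a myopic choice and should not be read as the lower or upper bound derived in the paper. Note that after a full observation every particle is reset to the observed value, which is consistent with the abstract's remark that the last fully observed state and the time elapsed since then suffice to define the action sequence.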
