首页> 外文会议>Conference on Uncertainty in Artificial Intelligence >Quantile-Regret Minimisation in Infinitely Many-Armed Bandits
【24h】

Quantile-Regret Minimisation in Infinitely Many-Armed Bandits

机译:无限多武装匪徒的分体遗憾最小化

获取原文

摘要

The stochastic multi-armed bandit is a well-studied abstraction of decision making in the face of uncertainty. We consider the setting in which the number of bandit arms is much larger than the possible number of pulls, and can even be infinite. With the aim of minimising regret with respect to an optimal arm, existing methods for this setting either assume some structure over the set of arms (Kleinberg et al., 2008, Ray Chowdhury and Gopalan, 2017), or some property of the reward distribution (Wang et al., 2008). Invariably, the validity of such assumptions - and therefore the performance of the corresponding methods-depends on instance-specific parameters, which might not be known beforehand. We propose a conceptually simple, parameter-free, and practically effective alternative. Specifically we introduce a notion of regret with respect to the top quantile of a probability distribution over the expected reward of randomly drawn arms. Our main contribution is an algorithm that achieves sublinear "quantile-regret", both (1) when it is specified a quantile, and (2) when the quantile can be any (unknown) positive value. The algorithm needs no side information about the arms or about the structure of their reward distributions: it relies on random sampling to reach arms in the top quantile. Experiments show that our algorithm outperforms several previous methods (in terms of conventional regret) when the latter are not tuned well, and often even when they are.
机译:随机多武装强盗是在不确定性面前决策的精彩抽象。我们认为强盗臂的数量远远大于可能的拉动数量的设置,甚至可以是无限的。旨在最大限度地减少关于最佳臂的遗憾,该设置的现有方法要么承担一组武器(Kleinberg等,2008,Ray Chowdhury,Gopalan,2017)或奖励分配的一些财产(Wang等,2008)。总是,这种假设的有效性 - 因此,相应的方法的性能取决于特定于实例的参数,这些参数可能不会预先知道。我们提出了一个概念上简单,无参数和实际有效的替代方案。具体而言,我们对随机绘制武器的预期奖励的概率分布的顶部定量介绍了遗憾的概念。我们的主要贡献是一种算法,可以实现索姆林“分位式遗憾”,两者(1)指定量子,(2)当分量可以是任何(未知)正值时。该算法不需要有关武器的侧面信息或关于其奖励分布的结构:它依赖于随机采样来达到顶部定位的臂。实验表明,当后者未良好调整时,我们的算法优于几种以前的方法(在传统遗憾方面),通常即使它们是。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号