The stochastic multi-armed bandit is a well-studied abstraction of decision making in the face of uncertainty. We consider the setting in which the number of bandit arms is much larger than the number of possible pulls, and can even be infinite. With the aim of minimising regret with respect to an optimal arm, existing methods for this setting either assume some structure over the set of arms (Kleinberg et al., 2008; Ray Chowdhury and Gopalan, 2017) or some property of the reward distribution (Wang et al., 2008). Invariably, the validity of such assumptions (and therefore the performance of the corresponding methods) depends on instance-specific parameters, which might not be known beforehand. We propose a conceptually simple, parameter-free, and practically effective alternative. Specifically, we introduce a notion of regret with respect to the top quantile of a probability distribution over the expected rewards of randomly drawn arms. Our main contribution is an algorithm that achieves sublinear "quantile regret", both (1) when a quantile is specified, and (2) when the quantile can be any (unknown) positive value. The algorithm needs no side information about the arms or about the structure of their reward distributions: it relies on random sampling to reach arms in the top quantile. Experiments show that our algorithm outperforms several previous methods (in terms of conventional regret) when the latter are not tuned well, and often even when they are.
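The idea of reaching a top-quantile arm by random sampling can be illustrated with a minimal sketch. The code below is not the paper's algorithm; it only demonstrates the underlying principle under assumed conditions: arm means are drawn i.i.d. from Uniform(0, 1), so the top-ρ quantile of the mean distribution has threshold 1 − ρ. Sampling K = ⌈log(T)/ρ⌉ random arms makes it very likely that at least one sampled arm lies in the top quantile, after which a standard finite-armed strategy (here, UCB1) is run over the sample. All names and parameter choices are illustrative assumptions.

```python
import math
import random


def quantile_regret_sketch(rho=0.1, T=10000, seed=0):
    """Illustrative sketch (not the paper's algorithm): sample enough
    random arms that, with high probability, at least one lies in the
    top rho-quantile of the arm-mean distribution, then run UCB1 over
    the sampled arms. Returns the empirical quantile regret."""
    rng = random.Random(seed)
    # Assumption: arm means are Uniform(0, 1), so the top-rho quantile
    # threshold is 1 - rho. With K random arms, the chance that none is
    # in the top quantile is (1 - rho)^K; K = ceil(log(T) / rho) drives
    # this below 1/T.
    K = max(1, math.ceil(math.log(T) / rho))
    means = [rng.random() for _ in range(K)]
    counts = [0] * K
    sums = [0.0] * K
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each sampled arm once to initialise
        else:
            # UCB1 index: empirical mean plus exploration bonus
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        total_reward += r
    # Quantile regret: shortfall of cumulative reward relative to
    # always earning the quantile threshold 1 - rho.
    return (1.0 - rho) * T - total_reward
```

Because the benchmark is the quantile threshold rather than the (possibly unreachable) optimal arm, no structural assumption on the arm set is needed; the only lever is how many arms to sample, which the sketch ties to ρ and the horizon T.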