Multiarmed Bandit Algorithms Based on Empirical Moments

Junya HONDA; Akimichi TAKEMURA


Abstract

In the multiarmed bandit problem, a gambler repeatedly chooses which arm of a slot machine to pull, trading off exploration against exploitation. We study the stochastic bandit problem in which each arm has a reward distribution supported on a known bounded interval, e.g. [0,1]. For this model, a policy is known that achieves the theoretical bound asymptotically. However, this optimal policy must solve, at every round, a convex optimization problem involving the empirical distribution of each arm. In this paper, we propose a policy that uses only the first d empirical moments, for an arbitrary d fixed in advance. The asymptotic upper bound on the regret of the policy approaches the theoretical bound as d increases. The proposed policy requires minimizing the KL divergence under moment constraints. Using the theory of Tchebycheff systems, we show that the optimal value can be obtained by solving a system of polynomial equations.
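As a rough illustration of the statistic the abstract refers to, the sketch below keeps running values of the first d empirical moments of each arm's rewards. It is not the authors' policy: the class name `MomentTracker`, the function `simulate`, the round-robin arm choice, and the Beta reward distributions are all assumptions made only for this example. It shows just the bookkeeping that a moment-based policy would need in place of storing the full empirical distribution.

```python
import random


class MomentTracker:
    """Running averages of x, x^2, ..., x^d for rewards x in [0, 1].

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self, d):
        self.d = d
        self.n = 0                  # number of rewards observed so far
        self.moments = [0.0] * d    # moments[k] estimates E[x^(k+1)]

    def update(self, reward):
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        self.n += 1
        power = 1.0
        for k in range(self.d):
            power *= reward         # power == reward ** (k + 1)
            # Incremental mean update, so no past rewards are stored.
            self.moments[k] += (power - self.moments[k]) / self.n


def simulate(num_arms=3, d=4, rounds=300, seed=0):
    """Pull arms round-robin and track the first d moments of each arm."""
    rng = random.Random(seed)
    trackers = [MomentTracker(d) for _ in range(num_arms)]
    for t in range(rounds):
        arm = t % num_arms  # placeholder choice, NOT the proposed policy
        # Assumed reward distribution on [0, 1], chosen only for the demo.
        reward = rng.betavariate(arm + 1, 2)
        trackers[arm].update(reward)
    return trackers


if __name__ == "__main__":
    for i, tracker in enumerate(simulate()):
        print(i, tracker.n, [round(m, 4) for m in tracker.moments])
```

Each arm needs only d numbers and a counter, regardless of how many times it has been pulled, which is the storage advantage of working with moments instead of the full empirical distribution.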

Bibliographic record

Source
電子情報通信学会技術研究報告 (IEICE Technical Report), 2011, no. 193, pp. 15-22 (8 pages)
Authors
Junya HONDA; Akimichi TAKEMURA
Affiliations

Graduate School of Frontier Sciences, The University of Tokyo, Kashiwanoha 5-1-5, Kashiwa-shi, Chiba, 277-8561, Japan

Graduate School of Information Science and Technology, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-8656, Japan

Original format: PDF
Language: English
Keywords
multiarmed bandit problem; Tchebycheff system; moment space; divergence minimization

Similar literature

1. Multiarmed Bandit Algorithms Based on Empirical Moments [J]. Junya HONDA, Akimichi TAKEMURA. 電子情報通信学会技術研究報告. 情報論的学習理論と機械学習. 2011, no. 194
2. Multiarmed Bandit Algorithms Based on Empirical Moments [J]. Junya HONDA, Akimichi TAKEMURA. 電子情報通信学会技術研究報告. 2011, no. 194
3. Multiarmed Bandit Algorithms Based on Empirical Moments [J]. Junya HONDA, Akimichi TAKEMURA. 電子情報通信学会技術研究報告. パターン認識·メディア理解 (Pattern Recognition and Media Understanding). 2011, no. 193
4. Social Imitation in Cooperative Multiarmed Bandits: Partition-Based Algorithms with Strictly Local Information [C]. Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard. IEEE Conference on Decision and Control. 2018
5. PDE Approaches to Two Online Learning Problems, and an Empirical Study of Some Neural Network-Based Active Learning Algorithms [D]. Zhilei Wang. 2021
6. Channel Selection Based on Trust and Multiarmed Bandit in Multiuser Multichannel Cognitive Radio Networks [O]. Fanzi Zeng, Xinwang Shen
7. Distributed Cooperative Decision-Making in Multiarmed Bandits: Frequentist and Bayesian Algorithms [O]. Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard. 2016
