We study the budgeted multi-armed bandit problem with stochastic action costs. In this problem, a player not only receives a reward but also pays a cost for each action chosen. The player's goal is to maximize the cumulative reward received before the total cost exceeds the budget. In the classical multi-armed bandit problem, a policy called KL-UCB is known to perform well. We propose KL-UCB-SC, an extension of this policy to the budgeted bandit problem, and prove that it is asymptotically optimal for the case of Bernoulli costs and rewards. To the best of our knowledge, this is the first asymptotic-optimality result for the budgeted bandit problem. In fact, our regret upper bound improves on that of BTS, the best previously known bound for this problem, by a factor of at least four. Moreover, an empirical simulation shows that a tuned variant of KL-UCB-SC performs comparably to state-of-the-art policies such as PD-BwK and BTS.
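The abstract describes a KL-UCB-style index policy under a budget constraint. As a rough illustration only (this is not the paper's KL-UCB-SC; the ratio-of-confidence-bounds index, the function names, and the exploration term are assumptions for exposition), the sketch below computes Bernoulli KL confidence bounds by binary search and ranks arms by the ratio of a reward upper bound to a cost lower bound:

```python
import math

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clamped for stability.
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mean, n, t):
    # Largest q >= mean with n * KL(mean, q) <= log t, found by binary search.
    if n == 0:
        return 1.0
    bound = math.log(max(t, 2))
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n * kl_bernoulli(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lcb(mean, n, t):
    # Symmetric lower confidence bound below the empirical mean.
    if n == 0:
        return 0.0
    bound = math.log(max(t, 2))
    lo, hi = 0.0, mean
    for _ in range(50):
        mid = (lo + hi) / 2
        if n * kl_bernoulli(mean, mid) <= bound:
            hi = mid
        else:
            lo = mid
    return hi

def select_arm(reward_means, cost_means, counts, t):
    # Illustrative budgeted index: reward-UCB divided by cost-LCB.
    best, best_idx = -1.0, 0
    for i, (r, c, n) in enumerate(zip(reward_means, cost_means, counts)):
        if n == 0:
            return i  # play each arm once before using the index
        idx = kl_ucb(r, n, t) / max(kl_lcb(c, n, t), 1e-6)
        if idx > best:
            best, best_idx = idx, i
    return best_idx
```

The reward-to-cost ratio captures the intuition of the budgeted setting (reward earned per unit of budget spent); the actual index and its analysis are given in the paper itself.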
Ryo WATANABE
Hokkaido University
Junpei KOMIYAMA
The University of Tokyo
Atsuyoshi NAKAMURA
Hokkaido University
Mineichi KUDO
Hokkaido University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Ryo WATANABE, Junpei KOMIYAMA, Atsuyoshi NAKAMURA, Mineichi KUDO, "KL-UCB-Based Policy for Budgeted Multi-Armed Bandits with Stochastic Action Costs" in IEICE TRANSACTIONS on Fundamentals,
vol. E100-A, no. 11, pp. 2470-2486, November 2017, doi: 10.1587/transfun.E100.A.2470.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.E100.A.2470/_p
@ARTICLE{e100-a_11_2470,
author={Ryo WATANABE and Junpei KOMIYAMA and Atsuyoshi NAKAMURA and Mineichi KUDO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={KL-UCB-Based Policy for Budgeted Multi-Armed Bandits with Stochastic Action Costs},
year={2017},
volume={E100-A},
number={11},
pages={2470-2486},
abstract={We study the budgeted multi-armed bandit problem with stochastic action costs. In this problem, a player not only receives a reward but also pays a cost for an action of his/her choice. The goal of the player is to maximize the cumulative reward he/she receives before the total cost exceeds the budget. In the classical multi-armed bandit problem, a policy called KL-UCB is known to perform well. We propose KL-UCB-SC, an extension of this policy for the budgeted bandit problem. We prove that KL-UCB-SC is asymptotically optimal for the case of Bernoulli costs and rewards. To the best of our knowledge, this is the first result that shows asymptotic optimality in the study of the budgeted bandit problem. In fact, our regret upper bound is at least four times better than that of BTS, the best known upper bound for the budgeted bandit problem. Moreover, an empirical simulation we conducted shows that the performance of a tuned variant of KL-UCB-SC is comparable to that of state-of-the-art policies such as PD-BwK and BTS.},
keywords={},
doi={10.1587/transfun.E100.A.2470},
ISSN={1745-1337},
month={November},
}
TY - JOUR
TI - KL-UCB-Based Policy for Budgeted Multi-Armed Bandits with Stochastic Action Costs
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 2470
EP - 2486
AU - Ryo WATANABE
AU - Junpei KOMIYAMA
AU - Atsuyoshi NAKAMURA
AU - Mineichi KUDO
PY - 2017
DO - 10.1587/transfun.E100.A.2470
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E100-A
IS - 11
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - 2017/11//
AB - We study the budgeted multi-armed bandit problem with stochastic action costs. In this problem, a player not only receives a reward but also pays a cost for an action of his/her choice. The goal of the player is to maximize the cumulative reward he/she receives before the total cost exceeds the budget. In the classical multi-armed bandit problem, a policy called KL-UCB is known to perform well. We propose KL-UCB-SC, an extension of this policy for the budgeted bandit problem. We prove that KL-UCB-SC is asymptotically optimal for the case of Bernoulli costs and rewards. To the best of our knowledge, this is the first result that shows asymptotic optimality in the study of the budgeted bandit problem. In fact, our regret upper bound is at least four times better than that of BTS, the best known upper bound for the budgeted bandit problem. Moreover, an empirical simulation we conducted shows that the performance of a tuned variant of KL-UCB-SC is comparable to that of state-of-the-art policies such as PD-BwK and BTS.
ER -