Kernel estimation and model combination in a bandit problem with covariates

Wei Qian, Yuhong Yang

Research output: Contribution to journal › Article › peer-review


Abstract

The multi-armed bandit problem is an important optimization game that requires an exploration-exploitation tradeoff to achieve optimal total reward. Motivated by industrial applications such as online advertising and clinical research, we consider a setting where the rewards of bandit machines are associated with covariates, and the accurate estimation of the corresponding mean reward functions plays an important role in the performance of allocation rules. Under a flexible problem setup, we establish asymptotic strong consistency and perform a finite-time regret analysis for a sequential randomized allocation strategy based on kernel estimation. In addition, since many nonparametric and parametric methods in supervised learning may be applied to estimating the mean reward functions but guidance on how to choose among them is generally unavailable, we propose a model-combining allocation strategy for adaptive performance. Simulations and a real data evaluation are conducted to illustrate the performance of the proposed allocation strategy.
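
As a rough illustration only (not the authors' algorithm), the sketch below shows one way a covariate-dependent allocation rule can plug in Nadaraya-Watson kernel estimates of each arm's mean reward function and randomize its arm choices. The Gaussian kernel, fixed bandwidth, univariate covariate, and epsilon-greedy style forced exploration are all assumptions made for the example; the paper's model-combining strategy, which aggregates several candidate estimators, is not shown here.

```python
# Minimal sketch: kernel-estimated mean rewards inside a randomized
# allocation rule for a contextual (covariate) bandit.
# Assumptions: univariate covariate, Gaussian kernel, fixed bandwidth,
# epsilon-greedy exploration. Not the algorithm analyzed in the paper.
import numpy as np


def gaussian_kernel(u):
    # Smoothing weights for the Nadaraya-Watson estimator.
    return np.exp(-0.5 * u ** 2)


class KernelBanditAllocator:
    def __init__(self, n_arms, bandwidth=0.2, eps=0.05, seed=0):
        self.n_arms = n_arms
        self.h = bandwidth        # kernel bandwidth (assumed fixed here)
        self.eps = eps            # forced-exploration probability
        self.rng = np.random.default_rng(seed)
        # Per-arm history of observed (covariate, reward) pairs.
        self.history = [([], []) for _ in range(n_arms)]

    def _estimate(self, arm, x):
        xs, ys = self.history[arm]
        if not xs:
            return np.inf  # unexplored arm: force at least one pull
        xs, ys = np.asarray(xs), np.asarray(ys)
        w = gaussian_kernel((xs - x) / self.h)
        if w.sum() == 0.0:
            return 0.0
        # Nadaraya-Watson estimate of the mean reward at covariate x.
        return float(np.dot(w, ys) / w.sum())

    def choose(self, x):
        # With probability eps explore uniformly; otherwise pull the arm
        # with the largest estimated mean reward at the current covariate.
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.n_arms))
        estimates = [self._estimate(a, x) for a in range(self.n_arms)]
        return int(np.argmax(estimates))

    def update(self, arm, x, reward):
        self.history[arm][0].append(x)
        self.history[arm][1].append(reward)


# Toy usage with synthetic reward functions (illustrative only).
if __name__ == "__main__":
    allocator = KernelBanditAllocator(n_arms=2)
    rng = np.random.default_rng(1)
    for _ in range(500):
        x = rng.uniform()
        arm = allocator.choose(x)
        mean = x if arm == 0 else 1.0 - x   # hypothetical mean rewards
        allocator.update(arm, x, mean + 0.1 * rng.standard_normal())
```

In this sketch the only design choice of note is that estimation and allocation are decoupled: any regression method could replace the kernel estimator in `_estimate`, which is the flexibility that motivates the model-combining strategy discussed in the abstract.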

Original language: English (US)
Journal: Journal of Machine Learning Research
Volume: 17
State: Published - Oct 1 2016

Bibliographical note

Publisher Copyright:
© 2016 Wei Qian and Yuhong Yang.

Keywords

  • Contextual bandit problem
  • Exploration-exploitation tradeoff
  • Nonparametric regression
  • Regret bound
  • Upper confidence bound
