A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem
Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, Zhiwei Steven Wu
[arXiv]. See also this blog post.
Bandit learning is characterized by the tension between long-term
exploration and short-term exploitation. However, as has recently
been noted, in settings in which the choices of the learning
algorithm correspond to important decisions about individual people
(such as criminal recidivism prediction, lending, and sequential
drug trials), exploration corresponds to explicitly sacrificing the
well-being of one individual for the potential future benefit of
others. This raises a fairness concern. In such settings, one might
like to run a "greedy" algorithm, which always makes the
(myopically) optimal decision for the individuals at hand --- but
doing this can result in a catastrophic failure to learn. In this
paper, we consider the linear contextual bandit problem and revisit
the performance of the greedy algorithm. We give a smoothed
analysis, showing that even when contexts may be chosen by an
adversary, small perturbations of the adversary's choices suffice
for the algorithm to achieve "no regret", perhaps (depending on
the specifics of the setting) with a constant amount of initial
training data. This suggests that generically (i.e. in slightly
perturbed environments), exploration and exploitation need not be in
conflict in the linear setting.
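
To make the setting concrete, below is a minimal NumPy sketch of the greedy algorithm in a smoothed linear contextual bandit environment: each arm keeps a ridge-regularized least-squares estimate of its own parameter vector, the adversary's context for each arm is perturbed by small Gaussian noise, and the learner always plays the arm with the highest estimated reward, with no exploration bonus. The particular constants, the per-arm parameters `theta`, and the deterministic "adversarial" base contexts are illustrative assumptions for this sketch, not the paper's construction or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, T = 5, 3, 2000      # context dimension, number of arms, rounds
sigma = 0.1               # smoothing level: std-dev of the Gaussian perturbation

# Hypothetical true per-arm parameters (unknown to the learner).
theta = rng.normal(size=(k, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

# Per-arm ridge-regularized least-squares state: A_i = I + X_i^T X_i, b_i = X_i^T y_i.
A = np.stack([np.eye(d) for _ in range(k)])
b = np.zeros((k, d))

total_regret = 0.0
for t in range(T):
    # Adversarially chosen base contexts (an arbitrary deterministic choice here),
    # plus the small Gaussian perturbation that the smoothed analysis assumes.
    base = np.sin(np.arange(k * d).reshape(k, d) + t)
    contexts = base + sigma * rng.normal(size=(k, d))

    # Greedy (myopic) choice: play the arm with the highest estimated reward.
    theta_hat = np.array([np.linalg.solve(A[i], b[i]) for i in range(k)])
    estimated = np.einsum('id,id->i', theta_hat, contexts)
    i = int(np.argmax(estimated))

    # Observe a noisy linear reward and update only the chosen arm's regression.
    expected = np.einsum('id,id->i', theta, contexts)
    reward = expected[i] + 0.05 * rng.normal()
    A[i] += np.outer(contexts[i], contexts[i])
    b[i] += reward * contexts[i]

    total_regret += expected.max() - expected[i]

print(f"average per-round regret after {T} rounds: {total_regret / T:.4f}")
```

In this kind of simulation, the perturbation level `sigma` is the quantity of interest: with `sigma = 0` the greedy rule can lock onto a wrong arm and incur linear regret, whereas the perturbed contexts provide the incidental exploration that the paper's smoothed analysis formalizes.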