Introduction to Multi-Armed Bandits
(preliminary and incomplete draft)
Aleksandrs Slivkins
Microsoft Research NYC
https://www.microsoft.com/en-us/research/people/slivkins/
First draft: January 2017
This version: October 2017
Preface
Multi-armed bandits is a rich, multi-disciplinary area studied since (Thompson, 1933), with a big surge of activity
in the past 10-15 years. An enormous body of work has accumulated over the years. While various subsets of this
work have been covered in depth in several books and surveys (Berry and Fristedt, 1985; Cesa-Bianchi and Lugosi,
2006; Bergemann and Välimäki, 2006; Gittins et al., 2011; Bubeck and Cesa-Bianchi, 2012), this book provides a
more textbook-like treatment of the subject.
The organizing principles for this book can be summarized as follows. The work on multi-armed bandits can
be partitioned into a dozen or so lines of work. Each chapter tackles one line of work, providing a self-contained
introduction and pointers for further reading. We favor fundamental ideas and elementary, teachable proofs over
the strongest possible results with very complicated proofs. We emphasize accessibility of the material: while exposure
to machine learning and probability/statistics would certainly help, no advanced background is required beyond a
standard undergraduate course on algorithms, e.g., one based on (Kleinberg and Tardos, 2005).
The book is based on lecture notes from a graduate course at the University of Maryland, College Park, taught by the
author in Fall 2016. In fact, each chapter corresponds to a week of the course. The choice of topics and results to
present is based on the author’s understanding of what is important and teachable, and is necessarily subjective, to
some extent. In particular, many important results have been deemed too technical/advanced to be presented in detail.
To keep the book manageable, and also more accessible, we chose not to dwell on the deep connections to online
convex optimization. A modern treatment of this fascinating subject can be found, e.g., in the recent textbook (Hazan,
2015). Likewise, we chose not to venture into the much more general problem space of reinforcement learning, a subject
of many graduate courses and textbooks such as (Sutton and Barto, 1998; Szepesvári, 2010). A course based on this
book would be complementary to graduate-level courses on online convex optimization and reinforcement learning.
Status of the manuscript. The present draft needs some polishing, and, at places, a more detailed discussion of
related work. (However, our goal is to provide pointers for further reading rather than a comprehensive discussion.) In
addition to the nine chapters already in the manuscript, the author needs to add an introductory chapter on scope and
motivations, and plans to add chapters covering connections to mechanism design and game theory (Chapters 11, 12,
resp.), an “interlude” on practical aspects of bandit algorithms (Interlude B), and a section on contextual bandits as a
system (Section 9.6). All the above is based on lectures from the course.
Several important topics have been omitted from the course, and therefore from the current draft, due to schedule
constraints. Time permitting, the author hopes to include some of these topics in the near future: most notably, a
chapter on the MDP formulation of bandits and the Gittins algorithm. In the meantime, the author would be grateful
for feedback and is open to suggestions.
Acknowledgements. The author is indebted to the students who scribed the initial versions of the lecture notes.
Presentation of some of the fundamental results is heavily influenced by the online lecture notes in (Kleinberg, 2007). The
author is grateful to Alekh Agarwal, Bobby Kleinberg, Yishay Mansour, and Rob Schapire for discussions and advice.
Contents
1 Scope and motivations (to be added)

2 Bandits with IID Rewards
2.1 Model and examples
2.2 Simple algorithms: uniform exploration
2.3 Advanced algorithms: adaptive exploration
2.4 Further remarks
2.5 Exercises

3 Lower Bounds
3.1 Background on KL-divergence
3.2 A simple example: flipping one coin
3.3 Flipping several coins: “bandits with prediction”
3.4 Proof of Lemma 3.10 for K ≥ 24 arms
3.5 Instance-dependent lower bounds (without proofs)
3.6 Bibliographic notes
3.7 Exercises

Interlude A: Bandits with Initial Information

4 Thompson Sampling
4.1 Bayesian bandits: preliminaries and notation
4.2 Thompson Sampling: definition and characterizations
4.3 Computational aspects
4.4 Example: 0-1 rewards and Beta priors
4.5 Example: Gaussian rewards and Gaussian priors
4.6 Bayesian regret
4.7 Thompson Sampling with no prior (and no proofs)
4.8 Bibliographic notes

Interlude B: Practical Aspects (to be added)

5 Lipschitz Bandits
5.1 Continuum-armed bandits
5.2 Lipschitz MAB
5.3 Adaptive discretization: the Zooming Algorithm
5.4 Remarks

6 Full Feedback and Adversarial Costs
6.1 Adversaries and regret
6.2 Initial results: binary prediction with experts advice
6.3 Hedge Algorithm
6.4 Bibliographic remarks
6.5 Exercises

7 Adversarial Bandits
7.1 Reduction from bandit feedback to full feedback
7.2 Adversarial bandits with expert advice
7.3 Preliminary analysis: unbiased estimates
7.4 Algorithm Exp4 and crude analysis
7.5 Improved analysis
7.6 Further remarks and extensions
7.7 Bibliographic notes
7.8 Exercises

8 Bandits/Experts with Linear Costs
8.1 Recap from last lecture
8.2 The Online Routing Problem
8.3 Combinatorial (semi-)bandits
8.4 “Online learning with experts” with linear costs: Follow the Perturbed Leader

9 Contextual Bandits
9.1 Problem statement and examples
9.2 Small number of contexts
9.3 Lipschitz Contextual Bandits (a simple example)
9.4 Linear Contextual Bandits (LinUCB algorithm without proofs)
9.5 Contextual Bandits with a Known Policy Class
9.6 Contextual bandits as a system (to be written, based on (Agarwal et al., 2016))

10 Bandits with Knapsacks
10.1 Motivating Example: Dynamic Pricing
10.2 General Framework: Bandits with Knapsacks (BwK)
10.3 Regret bounds
10.4 Fractional relaxation
10.5 “Clean event” and confidence regions
10.6 Three algorithms for BwK (no proofs)
10.7 Bibliographic notes

11 Bandits and Incentives (lecture to be written up)

12 Bandits and Games (to be written)

13 MDP bandits and Gittins Algorithm (to be written)

A Tools

Bibliography
Chapter 1
Scope and motivations (to be added)
[as: To be written. For now, see slides at
http://www.cs.umd.edu/~slivkins/CMSC858G-fall16/Lecture1-intro.pdf]