Introduction to Multi-Armed Bandits
(preliminary and incomplete draft)
Aleksandrs Slivkins
Microsoft Research NYC
https://www.microsoft.com/en-us/research/people/slivkins/
First draft: January 2017
This version: October 2017
Preface
Multi-armed bandits is a rich, multi-disciplinary area studied since (Thompson, 1933), with a big surge of activity
in the past 10-15 years. An enormous body of work has accumulated over the years. While various subsets of this
work have been covered in depth in several books and surveys (Berry and Fristedt, 1985; Cesa-Bianchi and Lugosi,
2006; Bergemann and Välimäki, 2006; Gittins et al., 2011; Bubeck and Cesa-Bianchi, 2012), this book provides a
more textbook-like treatment of the subject.
The organizing principles for this book can be summarized as follows. The work on multi-armed bandits can
be partitioned into a dozen or so lines of work. Each chapter tackles one line of work, providing a self-contained
introduction and pointers for further reading. We favor fundamental ideas and elementary, teachable proofs over
the strongest possible results with very complicated proofs. We emphasize accessibility of the material: while exposure
to machine learning and probability/statistics would certainly help, no advanced background is required beyond a
standard undergraduate course on algorithms, e.g., one based on (Kleinberg and Tardos, 2005).
The book is based on lecture notes from a graduate course at the University of Maryland, College Park, taught by the
author in Fall 2016. In fact, each chapter corresponds to a week of the course. The choice of topics and results to
present is based on the author’s understanding of what is important and teachable, and is necessarily subjective, to
some extent. In particular, many important results have been deemed too technical/advanced to be presented in detail.
To keep the book manageable, and also more accessible, we chose not to dwell on the deep connections to online
convex optimization. A modern treatment of this fascinating subject can be found, e.g., in the recent textbook (Hazan,
2015). Likewise, we chose not to venture into the much more general problem space of reinforcement learning, a subject
of many graduate courses and textbooks such as (Sutton and Barto, 1998; Szepesvári, 2010). A course based on this
book would be complementary to graduate-level courses on online convex optimization and reinforcement learning.
Status of the manuscript. The present draft needs some polishing, and, at places, a more detailed discussion of
related work. (However, our goal is to provide pointers for further reading rather than a comprehensive discussion.) In
addition to the nine chapters already in the manuscript, the author needs to add an introductory chapter on scope and
motivations, and plans to add chapters covering connections to mechanism design and game theory (Chapters 11, 12,
resp.), an “interlude” on practical aspects of bandit algorithms (Interlude B), and a section on contextual bandits as a
system (Section 9.6). All the above is based on lectures from the course.
Several important topics have been omitted from the course, and therefore from the current draft, due to schedule
constraints. Time permitting, the author hopes to include some of these topics in the near future: most notably, a
chapter on the MDP formulation of bandits and the Gittins algorithm. In the meantime, the author would be grateful
for feedback and is open to suggestions.
Acknowledgements. The author is indebted to the students who scribed the initial versions of the lecture notes.
Presentation of some of the fundamental results is heavily influenced by the online lecture notes in (Kleinberg, 2007). The
author is grateful to Alekh Agarwal, Bobby Kleinberg, Yishay Mansour, and Rob Schapire for discussions and advice.
Contents
1 Scope and motivations (to be added)

2 Bandits with IID Rewards
2.1 Model and examples
2.2 Simple algorithms: uniform exploration
2.3 Advanced algorithms: adaptive exploration
2.4 Further remarks
2.5 Exercises

3 Lower Bounds
3.1 Background on KL-divergence
3.2 A simple example: flipping one coin
3.3 Flipping several coins: “bandits with prediction”
3.4 Proof of Lemma 3.10 for K ≥ 24 arms
3.5 Instance-dependent lower bounds (without proofs)
3.6 Bibliographic notes
3.7 Exercises

Interlude A: Bandits with Initial Information

4 Thompson Sampling
4.1 Bayesian bandits: preliminaries and notation
4.2 Thompson Sampling: definition and characterizations
4.3 Computational aspects
4.4 Example: 0-1 rewards and Beta priors
4.5 Example: Gaussian rewards and Gaussian priors
4.6 Bayesian regret
4.7 Thompson Sampling with no prior (and no proofs)
4.8 Bibliographic notes

Interlude B: Practical Aspects (to be added)

5 Lipschitz Bandits
5.1 Continuum-armed bandits
5.2 Lipschitz MAB
5.3 Adaptive discretization: the Zooming Algorithm
5.4 Remarks

6 Full Feedback and Adversarial Costs
6.1 Adversaries and regret
6.2 Initial results: binary prediction with experts advice
6.3 Hedge Algorithm
6.4 Bibliographic remarks
6.5 Exercises

7 Adversarial Bandits
7.1 Reduction from bandit feedback to full feedback
7.2 Adversarial bandits with expert advice
7.3 Preliminary analysis: unbiased estimates
7.4 Algorithm Exp4 and crude analysis
7.5 Improved analysis
7.6 Further remarks and extensions
7.7 Bibliographic notes
7.8 Exercises

8 Bandits/Experts with Linear Costs
8.1 Recap from last lecture
8.2 The Online Routing Problem
8.3 Combinatorial (semi-)bandits
8.4 “Online learning with experts” with linear costs: Follow the Perturbed Leader

9 Contextual Bandits
9.1 Problem statement and examples
9.2 Small number of contexts
9.3 Lipschitz Contextual Bandits (a simple example)
9.4 Linear Contextual Bandits (LinUCB algorithm without proofs)
9.5 Contextual Bandits with a Known Policy Class
9.6 Contextual bandits as a system (to be written, based on (Agarwal et al., 2016))

10 Bandits with Knapsacks
10.1 Motivating Example: Dynamic Pricing
10.2 General Framework: Bandits with Knapsacks (BwK)
10.3 Regret bounds
10.4 Fractional relaxation
10.5 “Clean event” and confidence regions
10.6 Three algorithms for BwK (no proofs)
10.7 Bibliographic notes

11 Bandits and Incentives (lecture to be written up)

12 Bandits and Games (to be written)

13 MDP bandits and Gittins Algorithm (to be written)

A Tools

Bibliography
Chapter 1
Scope and motivations (to be added)
[as: To be written. For now, see slides at
http://www.cs.umd.edu/~slivkins/CMSC858G-fall16/Lecture1-intro.pdf]