Foundations and Trends® in
Machine Learning
Vol. 4, No. 2 (2011) 107–194
© 2012 S. Shalev-Shwartz
DOI: 10.1561/2200000018
Online Learning and Online
Convex Optimization
By Shai Shalev-Shwartz
Contents
1 Introduction
1.1 Examples
1.2 A Gentle Start
1.3 Organization and Scope
1.4 Notation and Basic Definitions
2 Online Convex Optimization
2.1 Convexification
2.2 Follow-the-leader
2.3 Follow-the-Regularized-Leader
2.4 Online Gradient Descent: Linearization of Convex Functions
2.5 Strongly Convex Regularizers
2.6 Online Mirror Descent
2.7 The Language of Duality
2.8 Bounds with Local Norms
2.9 Bibliographic Remarks
3 Online Classification
3.1 Finite Hypothesis Class and Experts Advice
3.2 Learnability and the Standard Optimal Algorithm
3.3 Perceptron and Winnow
3.4 Bibliographic Remarks
4 Limited Feedback (Bandits)
4.1 Online Mirror Descent with Estimated Gradients
4.2 The Multi-armed Bandit Problem
4.3 Gradient Descent Without a Gradient
4.4 Bibliographic Remarks
5 Online-to-Batch Conversions
5.1 Bibliographic Remarks
Acknowledgments
References
Shai Shalev-Shwartz
Benin School of Computer Science and Engineering, The Hebrew University
of Jerusalem, Israel, shais@cs.huji.ac.il
Abstract
Online learning is a well-established learning paradigm which has both
theoretical and practical appeal. The goal of online learning is to
make a sequence of accurate predictions given knowledge of the cor-
rect answer to previous prediction tasks and possibly additional avail-
able information. Online learning has been studied in several research
fields including game theory, information theory, and machine learning.
It has also become of great interest to practitioners due to the recent
emergence of large-scale applications such as online advertisement placement
and online web ranking. In this survey we provide a modern overview
of online learning. Our goal is to give the reader a sense of some of
the interesting ideas and in particular to underscore the centrality of
convexity in deriving efficient online learning algorithms. We do not
mean to be comprehensive, but rather to give a high-level, rigorous,
yet easy-to-follow survey.
1 Introduction
Online learning is the process of answering a sequence of questions
given (maybe partial) knowledge of the correct answers to previous
questions and possibly additional available information. The study of
online learning algorithms is an important domain in machine learn-
ing, and one that has interesting theoretical properties and practical
applications.
Online learning is performed in a sequence of consecutive rounds,
where at round t the learner is given a question, xt, taken from an
instance domain X , and is required to provide an answer to this ques-
tion, which we denote by pt. After predicting an answer, the correct
answer, yt, taken from a target domain Y, is revealed and the learner
suffers a loss, l(pt, yt), which measures the discrepancy between his
answer and the correct one. While in many cases pt is in Y, it is some-
times convenient to allow the learner to pick a prediction from a larger
set, which we denote by D.
Online Learning
for t = 1,2, . . .
receive question xt ∈ X
predict pt ∈ D
receive true answer yt ∈ Y
suffer loss l(pt, yt)
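
To make the protocol concrete, here is a minimal sketch of this loop in
Python. The environment and learner objects (env, learner) and their
method names are illustrative assumptions, not part of the formal model.

# A minimal sketch of the online learning protocol, assuming a `learner`
# with predict/update methods and an `env` that supplies questions and
# answers; these names are illustrative, not from the survey.
def online_learning(learner, env, loss, T):
    cumulative_loss = 0.0
    for t in range(T):
        x_t = env.question(t)              # receive question x_t in X
        p_t = learner.predict(x_t)         # predict p_t in D
        y_t = env.answer(t)                # receive true answer y_t in Y
        cumulative_loss += loss(p_t, y_t)  # suffer loss l(p_t, y_t)
        learner.update(x_t, y_t)           # adapt for future rounds
    return cumulative_loss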
The specific case of yes/no answers and predictions, namely D =
Y = {0,1}, is called online classification. In this case it is natural to
use the 0–1 loss function: l(pt, yt) = |pt − yt|. That is, l(pt, yt) indicates
whether pt = yt (the prediction is correct) or pt ≠ yt (the prediction is wrong).
For example, consider the problem of predicting whether it is going
to rain tomorrow. On day t, the question xt can be encoded as a vector
of meteorological measurements. Based on these measurements, the
learner should predict whether it is going to rain tomorrow. On the
following day, the learner finds out the correct answer.
We can also allow the learner to output a prediction in [0,1], which
can be interpreted as the probability of raining tomorrow. This is an
example of an application in which D ≠ Y. We can still use the loss
function l(pt, yt) = |pt − yt|, which can now be interpreted as the
probability of erring when predicting that it is going to rain with
probability pt.
The learner’s ultimate goal is to minimize the cumulative loss suf-
fered along its run, which translates to making few prediction mistakes
in the classification case. The learner tries to deduce information from
previous rounds so as to improve its predictions on present and future
questions. Clearly, learning is hopeless if there is no correlation between
past and present rounds. Classic statistical theory of sequential predic-
tion therefore enforces strong assumptions on the statistical properties
of the input sequence (e.g., that it is sampled i.i.d. according to some
unknown distribution).
In this review we survey methods which make no statistical assump-
tions regarding the origin of the sequence of examples. The sequence is
allowed to be deterministic, stochastic, or even adversarially adaptive
to the learner’s own behavior (as in the case of spam email filtering).
Naturally, an adversary can make the cumulative loss of our online
learning algorithm arbitrarily large. For example, the adversary can ask
the same question on each online round, wait for the learner’s answer,
and provide the opposite answer as the correct answer. To make non-
trivial statements we must further restrict the problem. We consider
two natural restrictions.
The first restriction is especially suited to the case of online classi-
fication. We assume that all the answers are generated by some target
mapping, h : X → Y. Furthermore, h is taken from a fixed set, called
a hypothesis class and denoted by H, which is known to the learner.
With this restriction on the sequence, which we call the realizable case,
the learner should make as few mistakes as possible, assuming that
both h and the sequence of questions can be chosen by an adversary.
For an online learning algorithm, A, we denote by MA(H) the max-
imal number of mistakes A might make on a sequence of examples
which is labeled by some h ∈ H. We emphasize again that both h
and the sequence of questions can be chosen by an adversary. A bound
on MA(H) is called a mistake-bound and we will study how to design
algorithms for which MA(H) is minimal.
Alternatively, the second restriction of the online learning model
we consider is a relaxation of the realizable assumption. We no longer
assume that all answers are generated by some h ∈ H, but we require
the learner to be competitive with the best fixed predictor from H. This
is captured by the regret of the algorithm, which measures how “sorry”
the learner is, in retrospect, not to have followed the predictions of
some hypothesis h ∈ H. Formally, the regret of the algorithm relative
to h when running on a sequence of T examples is defined as
RegretT(h) = Σ_{t=1}^{T} l(pt, yt) − Σ_{t=1}^{T} l(h(xt), yt),   (1.1)
and the regret of the algorithm relative to a hypothesis class H is
RegretT(H) = max_{h∈H} RegretT(h).   (1.2)
We restate the learner’s goal as having the lowest possible regret
relative to H. We will sometimes be satisfied with “low regret” algo-
rithms, by which we mean that RegretT (H) grows sub-linearly with
the number of rounds, T , which implies that the difference between the
average loss of the learner and the average loss of the best hypothesis
in H tends to zero as T goes to infinity.
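
As a worked illustration of definitions (1.1) and (1.2), the following
sketch computes the regret of a recorded run against a finite hypothesis
class; the variable names are illustrative.

# Sketch: computing Regret_T(H) from a recorded run, per Eqs. (1.1)-(1.2).
# `predictions`, `questions`, `answers` hold the recorded p_t, x_t, y_t;
# `H` is a finite list of hypotheses (callables). Names are illustrative.
def regret(predictions, questions, answers, H, loss):
    learner_loss = sum(loss(p, y) for p, y in zip(predictions, answers))
    # Regret relative to a single hypothesis h, Eq. (1.1):
    def regret_h(h):
        h_loss = sum(loss(h(x), y) for x, y in zip(questions, answers))
        return learner_loss - h_loss
    # Regret relative to the class H, Eq. (1.2): worst case over h in H.
    return max(regret_h(h) for h in H)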
1.1 Examples
We already mentioned the problem of online classification. To make the
discussion more concrete, we list several additional online prediction
problems and possible hypothesis classes.
Online Regression In regression problems, X = Rᵈ, which corresponds
to a set of measurements (often called features), and
Y = D = R. For example, consider the problem of estimating the fetal
weight based on ultrasound measurements of abdominal circumference
and femur length. Here, each x ∈ X = R² is a two-dimensional vector
corresponding to the measurements of the abdominal circumference and
the femur length. Given these measurements, the goal is to predict the
fetal weight. Common loss functions for regression problems are the
squared loss, l(p, y) = (p − y)², and the absolute loss, l(p, y) = |p − y|.
Perhaps the simplest hypothesis class for regression is the class of linear
predictors, H = {x ↦ Σ_{i=1}^{d} w[i]x[i] : ∀i, w[i] ∈ R}, where w[i] is the ith
element of w. The resulting problem is called online linear regression.
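
As a preview of the gradient-based methods developed in Section 2, here
is a minimal sketch of an online linear regression learner that updates
its weight vector after each round; the gradient-step update rule and
the step size eta are illustrative assumptions, not prescribed here.

import numpy as np

# Sketch: online linear regression with the squared loss l(p, y) = (p - y)^2.
# The gradient-step update and step size `eta` are illustrative assumptions
# (such methods are developed in Section 2), not prescribed at this point.
class OnlineLinearRegressor:
    def __init__(self, d, eta=0.1):
        self.w = np.zeros(d)   # linear predictor x -> <w, x>
        self.eta = eta

    def predict(self, x):
        return float(np.dot(self.w, x))

    def update(self, x, y):
        p = self.predict(x)
        # Gradient of (p - y)^2 with respect to w is 2 * (p - y) * x.
        self.w -= self.eta * 2.0 * (p - y) * x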
Prediction with Expert Advice On each online round the
learner has to choose from the advice of d given experts. Therefore,
xt ∈ X ⊂ Rᵈ, where xt[i] is the advice of the ith expert, and
D = {1, . . . , d}. Then, the learner receives the true answer, which is a
vector yt ∈ Y = [0,1]d, where yt[i] is the cost of following the advice of
the ith expert. The loss of the learner is the cost of the chosen expert,
l(pt, yt) = yt[pt]. A common hypothesis class for this problem is the
set of constant predictors, H = {h1, . . . , hd}, where hi(x) = i for all x.
This implies that the regret of the algorithm is measured relative to
the performance of the strategies which always predict according to
the same expert.
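
For concreteness, a small sketch under the notation above: given the
sequence of chosen experts pt and cost vectors yt, it computes the
learner's cumulative cost and the regret relative to the best single
expert; the helper names are illustrative.

# Sketch: regret in the experts setting, where l(p_t, y_t) = y_t[p_t] and
# H = {h_1, ..., h_d} are the constant predictors. Names are illustrative.
def experts_regret(chosen, cost_vectors):
    d = len(cost_vectors[0])
    learner_cost = sum(y[p] for p, y in zip(chosen, cost_vectors))
    # Cumulative cost of always following expert i (hypothesis h_i):
    expert_costs = [sum(y[i] for y in cost_vectors) for i in range(d)]
    return learner_cost - min(expert_costs)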
Online Ranking On round t, the learner receives a query xt ∈ X
and is required to order k elements (e.g., documents) according to
their relevance to the query. That is, D is the set of all permuta-
tions of {1, . . . , k}. Then, the learner receives the true answer yt ∈ Y =
{1, . . . , k}, which corresponds to the document which best matches the
query. In web applications, this is the document that the user clicked
on. The loss, l(pt, yt), is the position of yt in the ranked list pt.
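
As a sketch, this loss can be computed directly from the permutation;
here we assume, as one plausible convention, that positions are counted
from 1, so ranking the clicked document first incurs loss 1.

# Sketch: the ranking loss, under the assumed 1-based convention where
# placing the clicked document y_t first incurs loss 1.
def ranking_loss(p_t, y_t):
    # p_t is a permutation (list) of the k document indices; y_t is the
    # index of the document the user clicked on.
    return p_t.index(y_t) + 1

print(ranking_loss([2, 0, 1], 1))  # document 1 is ranked third: loss 3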
1.2 A Gentle Start
We start by studying the online classification problem, in which Y = D =
{0,1}, and l(p, y) = |p − y| is the 0–1 loss. That is, on each round, the
learner receives xt ∈ X and is required to predict pt ∈ {0,1}. Then, it
receives yt ∈ {0,1} and pays the loss |pt − yt|. We make the following
simplifying assumption:
• Finite Hypothesis Class: We assume that |H| < ∞.
Recall that the goal of the learner is to have a low regret relative to
the hypotheses set, H, where each function in H is a mapping from X
to {0,1}, and the regret is defined as
RegretT(H) = max_{h∈H} [ Σ_{t=1}^{T} |pt − yt| − Σ_{t=1}^{T} |h(xt) − yt| ].
We first show that this is an impossible mission — no algorithm
can obtain a sublinear regret bound even if |H| = 2. Indeed, consider
H = {h0, h1}, where h0 is the function that always returns 0 and h1 is
the function that always returns 1. An adversary can force the number
of mistakes of any online algorithm to be T, simply by waiting
for the learner's prediction and then providing the opposite answer as
the true answer. In contrast, for any sequence of true answers, y1, . . . , yT,
let b be the majority label in y1, . . . , yT; then the number of mistakes
of hb is at most T/2. Therefore, the regret of any online algorithm
can be at least T − T/2 = T/2, which is not sublinear in T.
This impossibility result is attributed to Cover [13].
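
The adversarial argument is easy to simulate. The sketch below runs an
arbitrary deterministic learner against this adversary and verifies that
the learner errs on every round while the better of h0, h1 errs on at
most half of them; the learner passed in is an illustrative placeholder.

# Sketch: Cover's adversary against a deterministic learner. The learner
# used below (always predict 0) is an illustrative placeholder; the
# argument applies to any deterministic prediction rule.
def covers_adversary(predict, T):
    mistakes, ys = 0, []
    for t in range(T):
        p_t = predict(t, ys)    # learner predicts from the history
        y_t = 1 - p_t           # adversary answers the opposite
        mistakes += 1           # so the learner errs on every round
        ys.append(y_t)
    best_constant = min(sum(ys), T - sum(ys))  # mistakes of h_b, <= T/2
    return mistakes - best_constant            # regret >= T - T/2 = T/2

print(covers_adversary(lambda t, ys: 0, 100))  # 100, above the T/2 = 50 bound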
To sidestep Cover’s impossibility result, we must further restrict the
power of the adversarial environment. In the following we present two
ways to do this.