Applications of Mathematics
Stochastic Modelling and Applied Probability 35

Stochastic Mechanics
Random Media
Signal Processing and Image Synthesis
Mathematical Economics and Finance
Stochastic Optimization
Stochastic Control
Stochastic Models in Life Sciences

Edited by B. Rozovskii, M. Yor

Advisory Board: D. Dawson, D. Geman, G. Grimmett, I. Karatzas, F. Kelly, Y. Le Jan, B. Øksendal, G. Papanicolaou, E. Pardoux
Harold J. Kushner
G. George Yin

Stochastic Approximation and Recursive Algorithms and Applications

Second Edition

With 31 Figures
G. George Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202, USA
gyin@math.wayne.edu

Harold J. Kushner
Division of Applied Mathematics
Brown University
Providence, RI 02912, USA
Harold_Kushner@Brown.edu

Managing Editors:

B. Rozovskii
Center for Applied Mathematical Sciences
Denney Research Building 308
University of Southern California
1042 West Thirty-sixth Place
Los Angeles, CA 90089, USA
rozovski@math.usc.edu

M. Yor
Laboratoire de Probabilités et Modèles Aléatoires
Université de Paris VI
175, rue du Chevaleret
75013 Paris, France

Cover illustration: Cover pattern by courtesy of Rick Durrett, Cornell University, Ithaca, New York.

Mathematics Subject Classification (2000): 62L20, 93E10, 93E25, 93E35, 65C05, 93-02, 90C15

Library of Congress Cataloging-in-Publication Data

Kushner, Harold J. (Harold Joseph), 1933–
Stochastic approximation and recursive algorithms and applications / Harold J. Kushner, G. George Yin.
p. cm. — (Applications of mathematics; 35)
Rev. ed. of: Stochastic approximation algorithms and applications, c1997.
ISBN 0-387-00894-2 (acid-free paper)
1. Stochastic approximation. 2. Recursive stochastic algorithms. 3. Recursive algorithms. I. Kushner, Harold J. (Harold Joseph), 1933– Stochastic approximation algorithms and applications. II. Yin, George, 1954– III. Title. IV. Series.
QA274.2.K88 2003 519.2—dc21 2003045459

ISBN 0-387-00894-2 Printed on acid-free paper.

© 2003, 1997 Springer-Verlag New York, Inc.

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America. 9 8 7 6 5 4 3 2 1 SPIN 10922088

Typesetting: Pages created by the authors in LaTeX 2.09 using Springer's svsing.sty macro.

www.springer-ny.com

Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH
To Our Parents, Harriet and Hyman Kushner and Wanzhen Zhu and Yixin Yin
Preface and Introduction

The basic stochastic approximation algorithms introduced by Robbins and Monro and by Kiefer and Wolfowitz in the early 1950s have been the subject of an enormous literature, both theoretical and applied. This is due to the large number of applications and the interesting theoretical issues in the analysis of “dynamically defined” stochastic processes. The basic paradigm is a stochastic difference equation such as θ_{n+1} = θ_n + ϵ_n Y_n, where θ_n takes its values in some Euclidean space, Y_n is a random variable, and the “step size” ϵ_n > 0 is small and might go to zero as n → ∞. In its simplest form, θ is a parameter of a system, and the random vector Y_n is a function of “noise-corrupted” observations taken on the system when the parameter is set to θ_n. One recursively adjusts the parameter so that some goal is met asymptotically. This book is concerned with the qualitative and asymptotic properties of such recursive algorithms in the diverse forms in which they arise in applications. There are analogous continuous time algorithms, but the conditions and proofs are generally very close to those for the discrete time case.

The original work was motivated by the problem of finding a root of a continuous function ḡ(θ), where the function is not known but the experimenter is able to take “noisy” measurements at any desired value of θ. Recursive methods for root finding are common in classical numerical analysis, and it is reasonable to expect that appropriate stochastic analogs would also perform well. In one classical example, θ is the level of dosage of a drug, and the function ḡ(θ), assumed to be increasing with θ, is the probability of success at dosage level θ. The level at which ḡ(θ) takes a given value v is sought.
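As a concrete illustration (ours, not the book's), the sketch below runs the stochastic approximation θ_{n+1} = θ_n + ϵ_n[v − Y_n], anticipating the update rule stated in the next paragraph; the logistic form of ḡ, all function names, and the choice ϵ_n = 1/n are assumptions made purely for the demonstration.

```python
import math
import random

def g_bar(theta):
    """Hypothetical increasing success-probability curve g(theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

def observe(theta):
    """One noisy experiment at dosage theta: 1 for success, 0 for failure."""
    return 1.0 if random.random() < g_bar(theta) else 0.0

def robbins_monro(v, theta0=0.0, n_steps=20000):
    """Seek theta* with g_bar(theta*) = v via theta <- theta + eps_n*(v - Y_n)."""
    theta = theta0
    for n in range(1, n_steps + 1):
        eps_n = 1.0 / n   # decreasing steps: sum eps_n = inf, sum eps_n^2 < inf
        theta += eps_n * (v - observe(theta))
    return theta

# For this logistic curve, g_bar(theta*) = 0.8 at theta* = log(4) ~ 1.386.
print(robbins_monro(v=0.8))
```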
The probability of success is known only by experiment at whatever values of θ are selected by the experimenter, with the experimental outcome being either success or failure. Thus, the problem cannot be solved analytically. One possible approach is to take a sufficient number of observations at some fixed value of θ, so that a good estimate of the function value is available, and then to move on. Since most such observations will be taken at parameter values that are not close to the optimum, much effort might be wasted in comparison with the stochastic approximation algorithm θ_{n+1} = θ_n + ϵ_n[v − observation at θ_n], where the parameter value moves (on the average) in the correct direction after each observation.

In another example, we wish to minimize a real-valued continuously differentiable function f(·) of θ. Here, θ_n is the nth estimate of the minimum, and Y_n is a noisy estimate of the negative of the derivative of f(·) at θ_n, perhaps obtained by a Monte Carlo procedure. The algorithms are frequently constrained in that the iterates θ_n are projected back to some set H if they ever leave it. The mathematical paradigms have posed substantial challenges in the asymptotic analysis of recursively defined stochastic processes.

A major insight of Robbins and Monro was that, if the step sizes in the parameter updates are allowed to go to zero in an appropriate way as n → ∞, then there is an implicit averaging that eliminates the effects of the noise in the long run. An excellent survey of developments up to about the mid 1960s can be found in the book by Wasan [250]. More recent material can be found in [16, 48, 57, 67, 135, 225]. The book [192] deals with many of the issues involved in stochastic optimization in general.

In recent years, algorithms of the stochastic approximation type have found applications in new and diverse areas, and new techniques have been developed for proofs of convergence and rate of convergence. The actual and potential applications in signal processing and communications have exploded. Indeed, whether or not they are called stochastic approximations, such algorithms occur frequently in practical systems for the purposes of noise or interference cancellation, the optimization of “post processing” or “equalization” filters in time varying communication channels, adaptive antenna systems, adaptive power control in wireless communications, and many related applications. In these applications, the step size is often a small constant ϵ_n = ϵ, or it might be random. The underlying processes are often nonstationary and the optimal value of θ can change with time. Then one keeps ϵ_n strictly away from zero in order to allow “tracking.” Such tracking applications lead to new problems in the asymptotic analysis (e.g., when ϵ_n are adjusted adaptively); one wishes to estimate the tracking errors and their dependence on the structure of the algorithm.
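To make the constrained minimization variant concrete, here is a minimal sketch under assumed toy choices (none of which come from the book): f(θ) = (θ − 1)^2, Gaussian noise on the derivative estimate, H = [−2, 2], and ϵ_n = 1/n. The iterate is projected back to H whenever the update carries it outside.

```python
import random

H_LO, H_HI = -2.0, 2.0             # constraint set H = [-2, 2]

def project(theta):
    """Project the iterate back to H if it ever leaves it."""
    return max(H_LO, min(H_HI, theta))

def noisy_neg_derivative(theta):
    """Y_n: a noisy estimate of -f'(theta) for the toy f(theta) = (theta - 1)^2."""
    return -2.0 * (theta - 1.0) + random.gauss(0.0, 1.0)

theta = -2.0
for n in range(1, 50001):
    eps_n = 1.0 / n                # decreasing steps: implicit averaging kills the noise
    theta = project(theta + eps_n * noisy_neg_derivative(theta))

print(theta)                       # close to the minimizer theta* = 1, interior to H
```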
New challenges have arisen in applications to adaptive control. There has been a resurgence of interest in general “learning” algorithms, motivated by the training problem in artificial neural networks [7, 51, 97], the on-line learning of optimal strategies in very high-dimensional Markov decision processes [113, 174, 221, 252] with unknown transition probabilities, in learning automata [155], recursive games [11], convergence in sequential decision problems in economics [175], and related areas. The actual recursive forms of the algorithms in many such applications are of the stochastic approximation type. Owing to the types of simulation methods used, the “noise” might be “pseudorandom” [184], rather than random.

Methods such as infinitesimal perturbation analysis [101] for the estimation of the pathwise derivatives of complex discrete event systems enlarge the possibilities for the recursive on-line optimization of many systems that arise in communications or manufacturing. The appropriate algorithms are often of the stochastic approximation type, and the criterion to be minimized is often the average cost per unit time over the infinite time interval.

Iterate and observation averaging methods [6, 149, 216, 195, 267, 268, 273], which yield nearly optimal algorithms under broad conditions, have been developed. The iterate averaging effectively adds an additional time scale to the algorithm.

Decentralized or asynchronous algorithms introduce new difficulties for analysis. Consider, for example, a problem where computation is split among several processors, operating and transmitting data to one another asynchronously. Such algorithms are only beginning to come into prominence, due to both the developments of decentralized processing and applications where each of several locations might control or adjust “local variables,” but where the criterion of concern is global.

Despite their successes, the classical methods are not adequate for many of the algorithms that arise in such applications. Some of the reasons concern the greater flexibility desired for the step sizes, more complicated dependence properties of the noise and iterate processes, the types of constraints that might occur, ergodic cost functions, possibly additional time scales, nonstationarity and issues of tracking for time-varying systems, data-flow problems in the decentralized algorithm, iterate-averaging algorithms, desired stronger rate of convergence results, and so forth.

Much modern analysis of the algorithms uses the so-called ODE (ordinary differential equation) method introduced by Ljung [164] and extensively developed by Kushner and coworkers [123, 135, 142] to cover quite general noise processes and constraints by the use of weak ergodic or averaging conditions. The main idea is to show that, asymptotically, the noise effects average out so that the asymptotic behavior is determined effectively by that of a “mean” ODE. The usefulness of the technique stems from the fact that the ODE is obtained by a “local analysis,” where the dynamical term of the ODE at parameter value θ is obtained by averaging the Y_n as though the parameter were fixed at θ. Constraints, complicated state dependent noise processes, discontinuities, and many other difficulties can be handled. Depending on the application, the ODE might be replaced by a constrained (projected) ODE or a differential inclusion. Owing to its versatility and naturalness, the ODE method has become a fundamental technique in the current toolbox, and its full power will be apparent from the results in this book.
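The sketch below (again ours, with made-up linear mean dynamics) illustrates the content of the ODE method numerically: with a small constant step ϵ, the iterates θ_n on the interpolated time scale t = nϵ stay close to the solution of the mean ODE θ̇ = ḡ(θ), where ḡ(θ) is obtained by averaging Y_n with the parameter frozen at θ.

```python
import random

def g_bar(theta):
    """Mean dynamics at fixed theta: the average of the noisy Y_n."""
    return -(theta - 1.0)          # hypothetical linear mean field, rest point theta* = 1

def Y(theta):
    """Noisy observation: mean g_bar(theta) plus zero-mean noise."""
    return g_bar(theta) + random.gauss(0.0, 1.0)

eps = 0.01                          # small constant step, as in tracking applications
theta_sa, theta_ode = 5.0, 5.0      # SA iterate and Euler solution of the mean ODE
for n in range(2000):               # covers ODE time t = n*eps in [0, 20]
    theta_sa += eps * Y(theta_sa)          # stochastic approximation update
    theta_ode += eps * g_bar(theta_ode)    # same recursion with the noise averaged out

print(theta_sa, theta_ode)          # both trajectories end near the rest point theta* = 1
```

Shrinking ϵ tightens the match between the two trajectories over any fixed interval [0, T], which is exactly the sense in which the mean ODE determines the asymptotic behavior.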