Applications of Mathematics
Stochastic Modelling and Applied Probability
35

Stochastic Mechanics
Random Media
Signal Processing and Image Synthesis
Mathematical Economics and Finance
Stochastic Optimization
Stochastic Control
Stochastic Models in Life Sciences
Edited by
B. Rozovskii
M. Yor

Advisory Board
D. Dawson
D. Geman
G. Grimmett
I. Karatzas
F. Kelly
Y. Le Jan
B. Øksendal
G. Papanicolaou
E. Pardoux
Harold J. Kushner
G. George Yin
Stochastic Approximation
and Recursive Algorithms
and Applications
Second Edition
With 31 Figures
G. George Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202, USA
gyin@math.wayne.edu
Harold J. Kushner
Division of Applied Mathematics
Brown University
Providence, RI 02912, USA
Harold_Kushner@Brown.edu
Managing Editors
B. Rozovskii
Center for Applied Mathematical Sciences
Denney Research Building 308
University of Southern California
1042 West Thirty-sixth Place
Los Angeles, CA 90089, USA
rozovski@math.usc.edu
M. Yor
Laboratoire de Probabilités et Modèles Aléatoires
Université de Paris VI
175, rue du Chevaleret
75013 Paris, France
Cover illustration: Cover pattern by courtesy of Rick Durrett, Cornell University, Ithaca, New York.
Mathematics Subject Classification (2000): 62L20, 93E10, 93E25, 93E35, 65C05, 93-02, 90C15
Library of Congress Cataloging-in-Publication Data
Kushner, Harold J. (Harold Joseph), 1933–
Stochastic approximation and recursive algorithms and applications / Harold J. Kushner,
G. George Yin.
p. cm. — (Applications of mathematics ; 35)
Rev. ed. of: Stochastic approximation algorithms and applications, c1997.
ISBN 0-387-00894-2 (acid-free paper)
1. Stochastic approximation.
2. Recursive stochastic algorithms.
3. Recursive algorithms.
I. Kushner, Harold J. (Harold Joseph), 1933–. Stochastic approximation algorithms and
applications.
II. Yin, George, 1954– III. Title. IV. Series.
QA274.2.K88 2003
519.2—dc21
2003045459
ISBN 0-387-00894-2
Printed on acid-free paper.
© 2003, 1997 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York,
NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
Printed in the United States of America.
9 8 7 6 5 4 3 2 1
SPIN 10922088
Typesetting: Pages created by the authors in LaTeX 2.09 using Springer’s svsing.sty macro.
www.springer-ny.com
Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH
To Our Parents,
Harriet and Hyman Kushner
and
Wanzhen Zhu and Yixin Yin
Preface and Introduction
The basic stochastic approximation algorithms introduced by Robbins and
Monro and by Kiefer and Wolfowitz in the early 1950s have been the subject
of an enormous literature, both theoretical and applied. This is due to the
large number of applications and the interesting theoretical issues in the
analysis of “dynamically defined” stochastic processes. The basic paradigm
is a stochastic difference equation such as θn+1 = θn +ϵnYn, where θn takes
its values in some Euclidean space, Yn is a random variable, and the “step
size” ϵn > 0 is small and might go to zero as n → ∞. In its simplest form,
θ is a parameter of a system, and the random vector Yn is a function of
“noise-corrupted” observations taken on the system when the parameter is
set to θn. One recursively adjusts the parameter so that some goal is met
asymptotically. This book is concerned with the qualitative and asymptotic
properties of such recursive algorithms in the diverse forms in which they
arise in applications. There are analogous continuous time algorithms, but
the conditions and proofs are generally very close to those for the discrete
time case.
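The basic recursion can be illustrated with a minimal numerical sketch. The linear function, noise level, and step-size schedule below are illustrative choices, not taken from the text:

```python
import random

def robbins_monro(g, theta0, v=0.0, n_steps=5000, noise=0.5, seed=0):
    """Basic stochastic approximation: theta_{n+1} = theta_n + eps_n * Y_n,
    where Y_n is a noise-corrupted observation of v - g(theta_n) and the
    step sizes eps_n = 1/(n+1) decrease to zero."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(n_steps):
        eps = 1.0 / (n + 1)                       # "step size" going to zero
        y = v - g(theta) + rng.gauss(0.0, noise)  # noisy measurement at theta_n
        theta += eps * y                          # the recursive update
    return theta

# Find the root of g(theta) = 2*theta - 4, i.e. theta* = 2,
# from noisy observations only.
root = robbins_monro(lambda t: 2.0 * t - 4.0, theta0=0.0)
```

With the decreasing step sizes, the noise is implicitly averaged out and the iterate settles near the root despite each individual observation being heavily corrupted.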
The original work was motivated by the problem of finding a root of
a continuous function ḡ(θ), where the function is not known but the
experimenter is able to take “noisy” measurements at any desired value of
θ. Recursive methods for root finding are common in classical numerical
analysis, and it is reasonable to expect that appropriate stochastic analogs
would also perform well.
In one classical example, θ is the level of dosage of a drug, and the
function ḡ(θ), assumed to be increasing with θ, is the probability of success
at dosage level θ. The level at which ḡ(θ) takes a given value v is sought.
The probability of success is known only by experiment at whatever values
of θ are selected by the experimenter, with the experimental outcome being
either success or failure. Thus, the problem cannot be solved analytically.
One possible approach is to take a sufficient number of observations at
some fixed value of θ, so that a good estimate of the function value is
available, and then to move on. Since most such observations will be taken
at parameter values that are not close to the optimum, much effort might be
wasted in comparison with the stochastic approximation algorithm θn+1 =
θn + ϵn[v − observation at θn], where the parameter value moves (on the
average) in the correct direction after each observation. In another example,
we wish to minimize a real-valued continuously differentiable function f(·)
of θ. Here, θn is the nth estimate of the minimum, and Yn is a noisy
estimate of the negative of the derivative of f(·) at θn, perhaps obtained
by a Monte Carlo procedure. The algorithms are frequently constrained in
that the iterates θn are projected back to some set H if they ever leave
it. The mathematical paradigms have posed substantial challenges in the
asymptotic analysis of recursively defined stochastic processes.
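The minimization variant with projection can be sketched as a Kiefer-Wolfowitz scheme, where Y_n is a noisy finite-difference estimate of the negative derivative and iterates are projected back onto H; the quadratic objective, interval constraint, and gain sequences below are illustrative assumptions:

```python
import random

def projected_kw(f, theta0, lo, hi, n_steps=4000, noise=0.1, seed=1):
    """Kiefer-Wolfowitz scheme: minimize f from noisy function evaluations
    via central finite differences, projecting each iterate back onto
    the constraint set H = [lo, hi]."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_steps + 1):
        eps = 1.0 / n          # step size
        c = 1.0 / n ** 0.25    # difference width, shrinking more slowly than eps
        f_plus = f(theta + c) + rng.gauss(0.0, noise)
        f_minus = f(theta - c) + rng.gauss(0.0, noise)
        y = -(f_plus - f_minus) / (2.0 * c)        # noisy estimate of -f'(theta)
        theta = min(hi, max(lo, theta + eps * y))  # projection onto H
    return theta

# Minimize f(theta) = (theta - 1)^2 over H = [0, 3].
theta_star = projected_kw(lambda t: (t - 1.0) ** 2, theta0=2.5, lo=0.0, hi=3.0)
```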
A major insight of Robbins and Monro was that, if the step sizes in
the parameter updates are allowed to go to zero in an appropriate way as
n → ∞, then there is an implicit averaging that eliminates the effects of the
noise in the long run. An excellent survey of developments up to about the
mid 1960s can be found in the book by Wasan [250]. More recent material
can be found in [16, 48, 57, 67, 135, 225]. The book [192] deals with many
of the issues involved in stochastic optimization in general.
In recent years, algorithms of the stochastic approximation type have
found applications in new and diverse areas, and new techniques have been
developed for proofs of convergence and rate of convergence. The actual
and potential applications in signal processing and communications have
exploded. Indeed, whether or not they are called stochastic approximations,
such algorithms occur frequently in practical systems for the purposes of
noise or interference cancellation, the optimization of “post processing”
or “equalization” filters in time-varying communication channels, adaptive
antenna systems, adaptive power control in wireless communications, and
many related applications. In these applications, the step size is often a
small constant ϵn = ϵ, or it might be random. The underlying processes
are often nonstationary and the optimal value of θ can change with time.
Then one keeps ϵn strictly away from zero in order to allow “tracking.” Such
tracking applications lead to new problems in the asymptotic analysis (e.g.,
when ϵn are adjusted adaptively); one wishes to estimate the tracking errors
and their dependence on the structure of the algorithm.
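A small tracking sketch: with a constant step size held away from zero, the recursion follows a slowly drifting mean. The drift model, noise level, and step size are illustrative assumptions, not from the text:

```python
import random

def track_drifting_mean(signal, eps=0.05):
    """Constant-step-size recursion theta_{n+1} = theta_n + eps*(X_n - theta_n),
    tracking the slowly varying mean of a nonstationary observation stream.
    Keeping eps strictly away from zero is what allows tracking."""
    theta = 0.0
    estimates = []
    for x in signal:
        theta += eps * (x - theta)
        estimates.append(theta)
    return estimates

rng = random.Random(2)
# The mean drifts slowly from 0 toward 5; observations are noise-corrupted.
signal = [5.0 * n / 2000 + rng.gauss(0.0, 1.0) for n in range(2000)]
estimates = track_drifting_mean(signal)
```

The constant step size trades a residual noise-induced fluctuation against the lag with which the estimate follows the drift, which is exactly the kind of tracking-error tradeoff referred to above.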
New challenges have arisen in applications to adaptive control. There
has been a resurgence of interest in general “learning” algorithms, moti-
vated by the training problem in artificial neural networks [7, 51, 97], the
on-line learning of optimal strategies in very high-dimensional Markov de-
cision processes [113, 174, 221, 252] with unknown transition probabilities,
in learning automata [155], recursive games [11], convergence in sequential
decision problems in economics [175], and related areas. The actual recur-
sive forms of the algorithms in many such applications are of the stochastic
approximation type. Owing to the types of simulation methods used, the
“noise” might be “pseudorandom” [184], rather than random.
Methods such as infinitesimal perturbation analysis [101] for the estima-
tion of the pathwise derivatives of complex discrete event systems enlarge
the possibilities for the recursive on-line optimization of many systems that
arise in communications or manufacturing. The appropriate algorithms are
often of the stochastic approximation type and the criterion to be mini-
mized is often the average cost per unit time over the infinite time interval.
Iterate and observation averaging methods [6, 149, 195, 216, 267, 268,
273], which yield nearly optimal algorithms under broad conditions, have
been developed. The iterate averaging effectively adds an additional time
scale to the algorithm. Decentralized or asynchronous algorithms introduce
new difficulties for analysis. Consider, for example, a problem where com-
putation is split among several processors, operating and transmitting data
to one another asynchronously. Such algorithms are only beginning to come
into prominence, due to both the developments of decentralized processing
and applications where each of several locations might control or adjust
“local variables,” but where the criterion of concern is global.
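The iterate-averaging idea mentioned above can be sketched as follows: the raw iterate uses a relatively large step size, and the running average of the iterates supplies the second, slower time scale. The scalar linear problem and the step-size exponent are illustrative assumptions:

```python
import random

def sa_with_averaging(g, theta0, n_steps=5000, noise=1.0, seed=3):
    """Robbins-Monro iterates with step sizes eps_n = n^(-2/3) (larger than
    1/n), together with the running average of the iterates
    (Polyak-Ruppert iterate averaging)."""
    rng = random.Random(seed)
    theta = theta0
    theta_bar = 0.0
    for n in range(1, n_steps + 1):
        eps = n ** (-2.0 / 3.0)
        y = -g(theta) + rng.gauss(0.0, noise)     # noisy observation of -g
        theta += eps * y
        theta_bar += (theta - theta_bar) / n      # the additional, slower time scale
    return theta, theta_bar

# Root of g(theta) = theta - 1; the average smooths the noisier raw iterates.
theta, theta_bar = sa_with_averaging(lambda t: t - 1.0, theta0=0.0)
```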
Despite their successes, the classical methods are not adequate for many
of the algorithms that arise in such applications. Some of the reasons con-
cern the greater flexibility desired for the step sizes, more complicated
dependence properties of the noise and iterate processes, the types of
constraints that might occur, ergodic cost functions, possibly additional
time scales, nonstationarity and issues of tracking for time-varying sys-
tems, data-flow problems in the decentralized algorithm, iterate-averaging
algorithms, desired stronger rate of convergence results, and so forth.
Much modern analysis of the algorithms uses the so-called ODE (ordi-
nary differential equation) method introduced by Ljung [164] and exten-
sively developed by Kushner and coworkers [123, 135, 142] to cover quite
general noise processes and constraints by the use of weak ergodic or aver-
aging conditions. The main idea is to show that, asymptotically, the noise
effects average out so that the asymptotic behavior is determined effec-
tively by that of a “mean” ODE. The usefulness of the technique stems
from the fact that the ODE is obtained by a “local analysis,” where the
dynamical term of the ODE at parameter value θ is obtained by averaging
the Yn as though the parameter were fixed at θ. Constraints, complicated
state dependent noise processes, discontinuities, and many other difficulties
can be handled. Depending on the application, the ODE might be replaced
by a constrained (projected) ODE or a differential inclusion. Owing to its
versatility and naturalness, the ODE method has become a fundamental
technique in the current toolbox, and its full power will be apparent from
the results in this book.
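The heuristic content of the ODE method can be seen numerically: for a recursion whose averaged dynamics give the mean ODE θ̇ = -θ, the interpolated iterates (at time t_n = nϵ) stay close to the ODE solution. The specific dynamics, step size, and horizon below are illustrative assumptions:

```python
import math
import random

def sa_path(theta0, eps, n_steps, seed=4):
    """Constant-step SA with Y_n = -theta_n + noise; averaging the noise
    gives the mean ODE theta_dot = -theta, whose solution is
    theta(t) = theta0 * exp(-t)."""
    rng = random.Random(seed)
    theta = theta0
    path = [theta]
    for _ in range(n_steps):
        theta += eps * (-theta + rng.gauss(0.0, 1.0))
        path.append(theta)
    return path

eps = 0.01
path = sa_path(theta0=4.0, eps=eps, n_steps=500)
# Interpolated time of the last iterate is t = 500 * eps = 5; compare with
# the mean ODE solution at that time.
ode_value = 4.0 * math.exp(-500 * eps)
```

The smaller the step size, the more observations fall in any fixed interval of interpolated time, so the local averaging argument behind the method becomes increasingly accurate.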