Reinforcement Learning and Optimal Control
by
Dimitri P. Bertsekas
Massachusetts Institute of Technology
DRAFT TEXTBOOK
This is a draft of a textbook that is scheduled to be finalized in 2019,
and to be published by Athena Scientific. It represents “work in progress,”
and it will be periodically updated.
It more than likely contains errors
(hopefully not serious ones). Furthermore, its references to the literature
are incomplete. Your comments and suggestions to the author at
dimitrib@mit.edu are welcome. The date of last revision is given below.
December 14, 2018
WWW site for book information and orders
http://www.athenasc.com
Athena Scientific, Belmont, Massachusetts
Athena Scientific
Post Office Box 805
Nashua, NH 03060
U.S.A.
Email: info@athenasc.com
WWW: http://www.athenasc.com
Publisher’s Cataloging-in-Publication Data
Bertsekas, Dimitri P.
Reinforcement Learning and Optimal Control
Includes Bibliography and Index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 2019
00-91281
519.703
ISBN-10: 1-886529-39-6, ISBN-13: 978-1-886529-39-7
ABOUT THE AUTHOR
Dimitri Bertsekas studied Mechanical and Electrical Engineering at the
National Technical University of Athens, Greece, and obtained his Ph.D.
in system science from the Massachusetts Institute of Technology. He has
held faculty positions with the Engineering-Economic Systems Department,
Stanford University, and the Electrical Engineering Department of the
University of Illinois, Urbana. Since 1979 he has been teaching at the
Electrical Engineering and Computer Science Department of the
Massachusetts Institute of Technology (M.I.T.), where he is currently the
McAfee Professor of Engineering.
His teaching and research spans several fields, including deterministic
optimization, dynamic programming and stochastic control, large-scale
and distributed computation, and data communication networks. He has
authored or coauthored numerous research papers and seventeen books,
several of which are currently used as textbooks in MIT classes, including
“Dynamic Programming and Optimal Control,” “Data Networks,”
“Introduction to Probability,” and “Nonlinear Programming.”
Professor Bertsekas was awarded the INFORMS 1997 Prize for Research
Excellence in the Interface Between Operations Research and Computer
Science for his book “Neuro-Dynamic Programming” (co-authored with
John Tsitsiklis), the 2001 AACC John R. Ragazzini Education Award, the
2009 INFORMS Expository Writing Award, the 2014 AACC Richard
Bellman Heritage Award, the 2014 Khachiyan Prize for Life-Time
Accomplishments in Optimization, the 2015 George B. Dantzig Prize, and
the 2018 John von Neumann Theory Prize. In 2001, he was elected to the
United States National Academy of Engineering for “pioneering
contributions to fundamental research, practice and education of
optimization/control theory, and especially its application to data
communication networks.”
ATHENA SCIENTIFIC
OPTIMIZATION AND COMPUTATION SERIES
1. Abstract Dynamic Programming, 2nd Edition, by Dimitri P.
Bertsekas, 2018, ISBN 978-1-886529-46-5, 360 pages
2. Dynamic Programming and Optimal Control, Two-Volume Set,
by Dimitri P. Bertsekas, 2017, ISBN 1-886529-08-6, 1270 pages
3. Nonlinear Programming, 3rd Edition, by Dimitri P. Bertsekas,
2016, ISBN 1-886529-05-1, 880 pages
4. Convex Optimization Algorithms, by Dimitri P. Bertsekas, 2015,
ISBN 978-1-886529-28-1, 576 pages
5. Convex Optimization Theory, by Dimitri P. Bertsekas, 2009,
ISBN 978-1-886529-31-1, 256 pages
6. Introduction to Probability, 2nd Edition, by Dimitri P. Bertsekas
and John N. Tsitsiklis, 2008, ISBN 978-1-886529-23-6, 544 pages
7. Convex Analysis and Optimization, by Dimitri P. Bertsekas,
Angelia Nedić, and Asuman E. Ozdaglar, 2003, ISBN 1-886529-45-0,
560 pages
8. Network Optimization: Continuous and Discrete Models, by
Dimitri P. Bertsekas, 1998, ISBN 1-886529-02-7, 608 pages
9. Network Flows and Monotropic Optimization, by R. Tyrrell
Rockafellar, 1998, ISBN 1-886529-06-X, 634 pages
10. Introduction to Linear Optimization, by Dimitris Bertsimas and
John N. Tsitsiklis, 1997, ISBN 1-886529-19-1, 608 pages
11. Parallel and Distributed Computation: Numerical Methods, by
Dimitri P. Bertsekas and John N. Tsitsiklis, 1997,
ISBN 1-886529-01-9, 718 pages
12. Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John
N. Tsitsiklis, 1996, ISBN 1-886529-10-8, 512 pages
13. Constrained Optimization and Lagrange Multiplier Methods, by
Dimitri P. Bertsekas, 1996, ISBN 1-886529-04-3, 410 pages
14. Stochastic Optimal Control: The Discrete-Time Case, by Dimitri
P. Bertsekas and Steven E. Shreve, 1996, ISBN 1-886529-03-5,
330 pages
Contents
1. Exact Dynamic Programming
  1.1. Deterministic Dynamic Programming
    1.1.1. Deterministic Problems
    1.1.2. The Dynamic Programming Algorithm
    1.1.3. Approximation in Value Space
    1.1.4. Model-Free Approximate Solution - Q-Learning
  1.2. Stochastic Dynamic Programming
  1.3. Examples, Variations, and Simplifications
    1.3.1. Deterministic Shortest Path Problems
    1.3.2. Discrete Deterministic Optimization
    1.3.3. Problems with a Terminal State
    1.3.4. Forecasts
    1.3.5. Problems with Uncontrollable State Components
    1.3.6. Partial State Information and Belief States
    1.3.7. Linear Quadratic Optimal Control
  1.4. Reinforcement Learning and Optimal Control - Some Terminology
  1.5. Notes and Sources

2. Approximation in Value Space
  2.1. Variants of Approximation in Value Space
    2.1.1. Off-Line and On-Line Methods
    2.1.2. Simplifying the Lookahead Minimization
    2.1.3. Model-Free Approximation in Value and Policy Space
    2.1.4. When is Approximation in Value Space Effective?
  2.2. Multistep Lookahead
    2.2.1. Multistep Lookahead and Rolling Horizon
    2.2.2. Multistep Lookahead and Deterministic Problems
  2.3. Problem Approximation
    2.3.1. Enforced Decomposition
    2.3.2. Probabilistic Approximation - Certainty Equivalent Control
  2.4. Rollout and Model Predictive Control
    2.4.1. Rollout for Deterministic Problems
    2.4.2. Stochastic Rollout and Monte Carlo Tree Search
    2.4.3. Model Predictive Control
  2.5. Notes and Sources

3. Parametric Approximation
  3.1. Approximation Architectures
    3.1.1. Linear and Nonlinear Feature-Based Architectures
    3.1.2. Training of Linear and Nonlinear Architectures
    3.1.3. Incremental Gradient and Newton Methods
  3.2. Neural Networks
    3.2.1. Training of Neural Networks
    3.2.2. Multilayer and Deep Neural Networks
  3.3. Sequential Dynamic Programming Approximation
  3.4. Q-factor Parametric Approximation
  3.5. Notes and Sources

4. Infinite Horizon Reinforcement Learning
  4.1. An Overview of Infinite Horizon Problems
  4.2. Stochastic Shortest Path Problems
  4.3. Discounted Problems
  4.4. Exact and Approximate Value Iteration
  4.5. Policy Iteration
    4.5.1. Exact Policy Iteration
    4.5.2. Policy Iteration for Q-factors
    4.5.3. Limited Lookahead Policies and Rollout
    4.5.4. Approximate Policy Iteration - Error Bounds
  4.6. Simulation-Based Policy Iteration with Parametric Approximation
    4.6.1. Self-Learning and Actor-Critic Systems
    4.6.2. A Model-Based Variant
    4.6.3. A Model-Free Variant
    4.6.4. Issues Relating to Approximate Policy Iteration
  4.7. Exact and Approximate Linear Programming
  4.8. Q-Learning
  4.9. Additional Methods - Temporal Differences
  4.10. Approximation in Policy Space
  4.11. Notes and Sources
  4.12. Appendix: Mathematical Analysis
    4.12.1. Proofs for Stochastic Shortest Path Problems
    4.12.2. Proofs for Discounted Problems
    4.12.3. Convergence of Exact Policy Iteration
    4.12.4. Error Bounds for Approximate Policy Iteration

5. Aggregation
  5.1. Aggregation Frameworks
  5.2. Classical and Biased Forms of the Aggregate Problem
  5.3. Bellman’s Equation for the Aggregate Problem
  5.4. Algorithms for the Aggregate Problem
  5.5. Some Examples
  5.6. Spatiotemporal Aggregation for Deterministic Problems
  5.7. Notes and Sources

References

Index