Reinforcement Learning and Optimal Control
by
Dimitri P. Bertsekas
Massachusetts Institute of Technology
DRAFT TEXTBOOK
This is a draft of a textbook that is scheduled to be finalized in 2019,
and to be published by Athena Scientific. It represents “work in progress,”
and it will be periodically updated.
It more than likely contains errors
(hopefully not serious ones). Furthermore, its references to the literature
are incomplete. Your comments and suggestions to the author at
dimitrib@mit.edu are welcome. The date of last revision is given below.
December 14, 2018
WWW site for book information and orders
http://www.athenasc.com
Athena Scientific, Belmont, Massachusetts
Athena Scientific
Post Office Box 805
Nashua, NH 03060
U.S.A.
Email: info@athenasc.com
WWW: http://www.athenasc.com
Publisher’s Cataloging-in-Publication Data
Bertsekas, Dimitri P.
Reinforcement Learning and Optimal Control
Includes Bibliography and Index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 2019
00-91281
519.703
ISBN-10: 1-886529-39-6, ISBN-13: 978-1-886529-39-7
ABOUT THE AUTHOR
Dimitri Bertsekas studied Mechanical and Electrical Engineering at the
National Technical University of Athens, Greece, and obtained his Ph.D.
in system science from the Massachusetts Institute of Technology. He has
held faculty positions with the Engineering-Economic Systems Department,
Stanford University, and the Electrical Engineering Department of the
University of Illinois, Urbana. Since 1979 he has been teaching at the
Electrical Engineering and Computer Science Department of the
Massachusetts Institute of Technology (M.I.T.), where he is currently the
McAfee Professor of Engineering.
His teaching and research spans several fields, including deterministic
optimization, dynamic programming and stochastic control, large-scale
and distributed computation, and data communication networks. He has
authored or coauthored numerous research papers and seventeen books,
several of which are currently used as textbooks in MIT classes, including
“Dynamic Programming and Optimal Control,” “Data Networks,”
“Introduction to Probability,” and “Nonlinear Programming.”
Professor Bertsekas was awarded the INFORMS 1997 Prize for Research
Excellence in the Interface Between Operations Research and Computer
Science for his book “Neuro-Dynamic Programming” (co-authored with
John Tsitsiklis), the 2001 AACC John R. Ragazzini Education Award, the
2009 INFORMS Expository Writing Award, the 2014 AACC Richard
Bellman Heritage Award, the 2014 Khachiyan Prize for Life-Time
Accomplishments in Optimization, the 2015 George B. Dantzig Prize, and
the 2018 John von Neumann Theory Prize. In 2001, he was elected to the
United States National Academy of Engineering for “pioneering
contributions to fundamental research, practice and education of
optimization/control theory, and especially its application to data
communication networks.”
ATHENA SCIENTIFIC
OPTIMIZATION AND COMPUTATION SERIES
1. Abstract Dynamic Programming, 2nd Edition, by Dimitri P.
Bertsekas, 2018, ISBN 978-1-886529-46-5, 360 pages
2. Dynamic Programming and Optimal Control, Two-Volume Set,
by Dimitri P. Bertsekas, 2017, ISBN 1-886529-08-6, 1270 pages
3. Nonlinear Programming, 3rd Edition, by Dimitri P. Bertsekas,
2016, ISBN 1-886529-05-1, 880 pages
4. Convex Optimization Algorithms, by Dimitri P. Bertsekas, 2015,
ISBN 978-1-886529-28-1, 576 pages
5. Convex Optimization Theory, by Dimitri P. Bertsekas, 2009,
ISBN 978-1-886529-31-1, 256 pages
6. Introduction to Probability, 2nd Edition, by Dimitri P. Bertsekas
and John N. Tsitsiklis, 2008, ISBN 978-1-886529-23-6, 544 pages
7. Convex Analysis and Optimization, by Dimitri P. Bertsekas,
Angelia Nedić, and Asuman E. Ozdaglar, 2003, ISBN 1-886529-45-0,
560 pages
8. Network Optimization: Continuous and Discrete Models, by
Dimitri P. Bertsekas, 1998, ISBN 1-886529-02-7, 608 pages
9. Network Flows and Monotropic Optimization, by R. Tyrrell
Rockafellar, 1998, ISBN 1-886529-06-X, 634 pages
10. Introduction to Linear Optimization, by Dimitris Bertsimas and
John N. Tsitsiklis, 1997, ISBN 1-886529-19-1, 608 pages
11. Parallel and Distributed Computation: Numerical Methods, by
Dimitri P. Bertsekas and John N. Tsitsiklis, 1997,
ISBN 1-886529-01-9, 718 pages
12. Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John
N. Tsitsiklis, 1996, ISBN 1-886529-10-8, 512 pages
13. Constrained Optimization and Lagrange Multiplier Methods, by
Dimitri P. Bertsekas, 1996, ISBN 1-886529-04-3, 410 pages
14. Stochastic Optimal Control: The Discrete-Time Case, by Dimitri
P. Bertsekas and Steven E. Shreve, 1996, ISBN 1-886529-03-5,
330 pages
Contents
1. Exact Dynamic Programming
  1.1. Deterministic Dynamic Programming
    1.1.1. Deterministic Problems
    1.1.2. The Dynamic Programming Algorithm
    1.1.3. Approximation in Value Space
    1.1.4. Model-Free Approximate Solution - Q-Learning
  1.2. Stochastic Dynamic Programming
  1.3. Examples, Variations, and Simplifications
    1.3.1. Deterministic Shortest Path Problems
    1.3.2. Discrete Deterministic Optimization
    1.3.3. Problems with a Terminal State
    1.3.4. Forecasts
    1.3.5. Problems with Uncontrollable State Components
    1.3.6. Partial State Information and Belief States
    1.3.7. Linear Quadratic Optimal Control
  1.4. Reinforcement Learning and Optimal Control - Some Terminology
  1.5. Notes and Sources

2. Approximation in Value Space
  2.1. Variants of Approximation in Value Space
    2.1.1. Off-Line and On-Line Methods
    2.1.2. Simplifying the Lookahead Minimization
    2.1.3. Model-Free Approximation in Value and Policy Space
    2.1.4. When is Approximation in Value Space Effective?
  2.2. Multistep Lookahead
    2.2.1. Multistep Lookahead and Rolling Horizon
    2.2.2. Multistep Lookahead and Deterministic Problems
  2.3. Problem Approximation
    2.3.1. Enforced Decomposition
    2.3.2. Probabilistic Approximation - Certainty Equivalent Control
  2.4. Rollout and Model Predictive Control
    2.4.1. Rollout for Deterministic Problems
    2.4.2. Stochastic Rollout and Monte Carlo Tree Search
    2.4.3. Model Predictive Control
  2.5. Notes and Sources

3. Parametric Approximation
  3.1. Approximation Architectures
    3.1.1. Linear and Nonlinear Feature-Based Architectures
    3.1.2. Training of Linear and Nonlinear Architectures
    3.1.3. Incremental Gradient and Newton Methods
  3.2. Neural Networks
    3.2.1. Training of Neural Networks
    3.2.2. Multilayer and Deep Neural Networks
  3.3. Sequential Dynamic Programming Approximation
  3.4. Q-factor Parametric Approximation
  3.5. Notes and Sources

4. Infinite Horizon Reinforcement Learning
  4.1. An Overview of Infinite Horizon Problems
  4.2. Stochastic Shortest Path Problems
  4.3. Discounted Problems
  4.4. Exact and Approximate Value Iteration
  4.5. Policy Iteration
    4.5.1. Exact Policy Iteration
    4.5.2. Policy Iteration for Q-factors
    4.5.3. Limited Lookahead Policies and Rollout
    4.5.4. Approximate Policy Iteration - Error Bounds
  4.6. Simulation-Based Policy Iteration with Parametric Approximation
    4.6.1. Self-Learning and Actor-Critic Systems
    4.6.2. A Model-Based Variant
    4.6.3. A Model-Free Variant
    4.6.4. Issues Relating to Approximate Policy Iteration
  4.7. Exact and Approximate Linear Programming
  4.8. Q-Learning
  4.9. Additional Methods - Temporal Differences
  4.10. Approximation in Policy Space
  4.11. Notes and Sources
  4.12. Appendix: Mathematical Analysis
    4.12.1. Proofs for Stochastic Shortest Path Problems
    4.12.2. Proofs for Discounted Problems
    4.12.3. Convergence of Exact Policy Iteration
    4.12.4. Error Bounds for Approximate Policy Iteration

5. Aggregation
  5.1. Aggregation Frameworks
  5.2. Classical and Biased Forms of the Aggregate Problem
  5.3. Bellman’s Equation for the Aggregate Problem
  5.4. Algorithms for the Aggregate Problem
  5.5. Some Examples
  5.6. Spatiotemporal Aggregation for Deterministic Problems
  5.7. Notes and Sources

References

Index