强化学习和最优控制的《十个关键点》【81页PPT汇总】.pdf

发布时间：2022-05-31 发布人：admin 分类：说明书资料大小：4.83M 资料格式：pdf 举报版权申诉

syp_net-12206753-4744300845195788809.pdf-第1页.png

第1页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第2页.png

第2页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第3页.png

第3页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第4页.png

第4页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第5页.png

第5页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第6页.png

第6页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第7页.png

第7页 / 共72页

syp_net-12206753-4744300845195788809.pdf-第8页.png

第8页 / 共72页

Principle of Optimality

Approximation in Value Space

Approximation in Policy Space

Model-Free Methods and Simulation

Policy Improvement, Rollout, and Self-Learning

Approximate Policy Improvement, Adaptive Simulation, and Q-Learning

Features, Approximation Architectures, and Deep Neural Nets

Incremental and Stochastic Gradient Optimization

Direct Policy Optimization: A More General Approach

Gradient and Random Search Methods for Direct Policy Optimization

Ten Key Ideas for Reinforcement Learning and Optimal Control Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology and School of Computing, Informatics, and Decision Systems Engineering Arizona State University August 2019 Bertsekas (M.I.T.) Reinforcement Learning 1 / 82

Reinforcement Learning (RL): A Happy Union of AI and Decision/Control/Dynamic Programming (DP) Ideas Historical highlights Exact DP, optimal control (Bellman, Shannon, 1950s ...) First impressive successes: Backgammon programs (Tesauro, 1992, 1996) Algorithmic progress, analysis, applications, ﬁrst books (mid 90s ...) Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...) Bertsekas (M.I.T.) Reinforcement Learning 2 / 82 Decision/Control/DPPrinciple of OptimalityMarkov Decision ProblemsPOMDP Policy IterationValue IterationAI/RLLearning through ExperienceSimulation,Model-Free Methods Feature-Based RepresentationsA*/Games/HeuristicsComplementary IdeasLate 80s-Early 90s

AlphaGo (2016) and AlphaZero (2017) Methodology: Simulation-based approximation to a form of the policy iteration method of DP Uses self-learning, i.e., self-generated data for policy evaluation, and Monte Carlo tree search for policy improvement The success of AlphaZero is due to: A skillful implementation/integration of known ideas Awesome computational power Bertsekas (M.I.T.) Reinforcement Learning 3 / 82 AlphaZero(Google-DeepMind)Playsdiﬀerent!Learnedfromscratch...with4hoursoftraining!PlaysmuchbetterthanallchessprogramsSamealgorithmlearnedmultiplegames(Go,Shogi)

Approximate DP/RL Methodology is now Ambitious and Universal Exact DP applies (in principle) to a very broad range of optimization problems Deterministic <—-> Stochastic Combinatorial optimization <—-> Optimal control w/ inﬁnite state/control spaces One decision maker <—-> Two player games ... BUT is plagued by the curse of dimensionality and need for a math model Approximate DP/RL overcomes the difﬁculties of exact DP by: Approximation (use neural nets and other architectures to reduce dimension) Simulation (use a computer model in place of a math model) State of the art: Broadly applicable methodology: Can address broad range of challenging problems. Deterministic-stochastic-dynamic, discrete-continuous, games, etc There are no methods that are guaranteed to work for all or even most problems There are enough methods to try with a reasonable chance of success for most types of optimization problems Role of the theory: Guide the art, delineate the sound ideas Bertsekas (M.I.T.) Reinforcement Learning 4 / 82

About this Talk The purpose of this talk To selectively explain some of the key ideas of RL and its connections with DP. To provide a road map for further study (mostly from the perspective of DP). To provide a guide for reading my new book (abbreviated RL-OC): Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientiﬁc, 2019. For slides and videolectures from Arizona State University course earlier this year, see my website References Quite a few Exact DP books (1950s-present starting with Bellman). My books: 2017. My two-volume textbook "Dynamic Programming and Optimal Control" was updated in My mathematically oriented research monograph “Stochastic Optimal Control" (with S. My latest mathematically oriented research monograph “Abstract DP" came out in E. Shreve) came out in 1978. 2018. Quite a few Approximate DP/RL/Neural Nets books (1996-Present) Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, 1996 Sutton and Barto, 1998, Reinforcement Learning (new edition 2018) Many surveys on all aspects of the subject Bertsekas (M.I.T.) Reinforcement Learning 5 / 82

Terminology in RL/AI and DP/Control (RL-OC, Section 1.4) RL uses Max/Value, DP uses Min/Cost Reward of a stage = (Opposite of) Cost of a stage. State value = (Opposite of) State cost. Value (or state-value) function = (Opposite of) Cost function. Controlled system terminology Agent = Decision maker or controller. Action = Control. Environment = Dynamic system. Methods terminology Learning = Solving a DP-related problem using simulation. Self-learning (or self-play in the context of games) = Solving a DP problem using simulation-based policy iteration. Planning vs Learning distinction = Solving a DP problem with model-based vs model-free simulation. Bertsekas (M.I.T.) Reinforcement Learning 6 / 82

AN OUTLINE OF THE SUBJECT - TEN KEY IDEAS 1 Principle of Optimality 2 Approximation in Value Space 3 Approximation in Policy Space 4 Model-Free Methods and Simulation 5 Policy Improvement, Rollout, and Self-Learning 6 Approximate Policy Improvement, Adaptive Simulation, and Q-Learning 7 Features, Approximation Architectures, and Deep Neural Nets 8 Incremental and Stochastic Gradient Optimization 9 Direct Policy Optimization: A More General Approach 10 Gradient and Random Search Methods for Direct Policy Optimization Bertsekas (M.I.T.) Reinforcement Learning 7 / 82

1. PRINCIPLE OF OPTIMALITY (RL-OC, Chapter 1) Bertsekas (M.I.T.) Reinforcement Learning 9 / 82

分享到：

赞收藏

资料库

强化学习和最优控制的《十个关键点》【81页PPT汇总】.pdf

相关推荐

人工智能

热门标签

最新资料