Ten Key Ideas for Reinforcement Learning and Optimal Control

Dimitri P. Bertsekas
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
and School of Computing, Informatics, and Decision Systems Engineering, Arizona State University

August 2019
Reinforcement Learning (RL): A Happy Union of AI and Decision/Control/Dynamic Programming (DP) Ideas

Historical highlights
- Exact DP, optimal control (Bellman, Shannon, 1950s ...)
- First impressive successes: Backgammon programs (Tesauro, 1992, 1996)
- Algorithmic progress, analysis, applications, first books (mid 90s ...)
- Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...)

[Figure: Decision/Control/DP ideas (Principle of Optimality, Markov Decision Problems, POMDP, Policy Iteration, Value Iteration) and AI/RL ideas (Learning through Experience, Simulation, Model-Free Methods, Feature-Based Representations, A*/Games/Heuristics) shown as complementary sets of ideas, late 80s-early 90s.]
AlphaGo (2016) and AlphaZero (2017)

Methodology:
- Simulation-based approximation to a form of the policy iteration method of DP
- Uses self-learning, i.e., self-generated data for policy evaluation, and Monte Carlo tree search for policy improvement

The success of AlphaZero is due to:
- A skillful implementation/integration of known ideas
- Awesome computational power

[Callout: AlphaZero (Google-DeepMind) plays different! Learned from scratch ... with 4 hours of training! Plays much better than all chess programs. The same algorithm learned multiple games (Go, Shogi).]
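To make the policy iteration structure behind this methodology concrete (simulate the current policy to evaluate it, then improve it greedily), here is a minimal sketch on a toy chain problem. This is not AlphaZero's implementation: the toy model, costs, and all names are assumptions for illustration, and it omits Monte Carlo tree search and neural-network approximation.

```python
# Minimal sketch of simulation-based policy iteration on a toy chain problem.
# Policy evaluation uses self-generated rollouts (Monte Carlo); policy
# improvement is greedy with respect to the simulated Q-values (min-cost).
import random

N_STATES = 6            # states 0..5; state 5 is terminal with zero cost-to-go
ACTIONS = (-1, +1)      # move left or right
GAMMA = 1.0             # undiscounted; episodes end at the terminal state

def step(state, action):
    """Toy model used only to generate simulated transitions."""
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, 1.0     # unit cost per stage until the terminal state is reached

def rollout_cost(policy, state, max_steps=50):
    """Simulate the policy from 'state' and return the sampled total cost."""
    total, s = 0.0, state
    for _ in range(max_steps):
        if s == N_STATES - 1:
            break
        s, c = step(s, policy[s])
        total += c
    return total

def evaluate_q(policy, n_rollouts=100):
    """Monte Carlo policy evaluation: Q(s,a) ~ cost of a, then follow policy."""
    q = {}
    for s in range(N_STATES - 1):
        for a in ACTIONS:
            samples = []
            for _ in range(n_rollouts):
                nxt, c = step(s, a)
                samples.append(c + GAMMA * rollout_cost(policy, nxt))
            q[(s, a)] = sum(samples) / len(samples)
    return q

# Policy iteration: evaluate by simulation, then improve greedily.
policy = {s: random.choice(ACTIONS) for s in range(N_STATES - 1)}
for it in range(5):
    q = evaluate_q(policy)
    policy = {s: min(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
    print(f"iteration {it}: policy = {policy}")
```

Each iteration replaces the policy by one that is greedy with respect to the simulated Q-values; this is the policy improvement step revisited later in the talk.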
Approximate DP/RL Methodology is now Ambitious and Universal

Exact DP applies (in principle) to a very broad range of optimization problems:
- Deterministic <--> Stochastic
- Combinatorial optimization <--> Optimal control w/ infinite state/control spaces
- One decision maker <--> Two-player games
- ...
BUT it is plagued by the curse of dimensionality and the need for a math model.

Approximate DP/RL overcomes the difficulties of exact DP by:
- Approximation (use neural nets and other architectures to reduce dimension)
- Simulation (use a computer model in place of a math model)

State of the art:
- Broadly applicable methodology: can address a broad range of challenging problems (deterministic-stochastic-dynamic, discrete-continuous, games, etc.)
- There are no methods that are guaranteed to work for all or even most problems
- There are enough methods to try with a reasonable chance of success for most types of optimization problems

Role of the theory: guide the art, delineate the sound ideas.
About this Talk

The purpose of this talk:
- To selectively explain some of the key ideas of RL and its connections with DP.
- To provide a road map for further study (mostly from the perspective of DP).
- To provide a guide for reading my new book (abbreviated RL-OC): Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientific, 2019.
- For slides and videolectures from the Arizona State University course earlier this year, see my website.

References
Quite a few exact DP books (1950s-present, starting with Bellman). My books:
- My two-volume textbook "Dynamic Programming and Optimal Control" was updated in 2017.
- My mathematically oriented research monograph "Stochastic Optimal Control" (with S. E. Shreve) came out in 1978.
- My latest mathematically oriented research monograph "Abstract DP" came out in 2018.
Quite a few approximate DP/RL/neural nets books (1996-present):
- Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, 1996
- Sutton and Barto, Reinforcement Learning, 1998 (new edition 2018)
Many surveys on all aspects of the subject.
Terminology in RL/AI and DP/Control (RL-OC, Section 1.4)

RL uses Max/Value, DP uses Min/Cost:
- Reward of a stage = (Opposite of) Cost of a stage.
- State value = (Opposite of) State cost.
- Value (or state-value) function = (Opposite of) Cost function.

Controlled system terminology:
- Agent = Decision maker or controller.
- Action = Control.
- Environment = Dynamic system.

Methods terminology:
- Learning = Solving a DP-related problem using simulation.
- Self-learning (or self-play in the context of games) = Solving a DP problem using simulation-based policy iteration.
- Planning vs learning distinction = Solving a DP problem with model-based vs model-free simulation.
AN OUTLINE OF THE SUBJECT - TEN KEY IDEAS

1. Principle of Optimality
2. Approximation in Value Space
3. Approximation in Policy Space
4. Model-Free Methods and Simulation
5. Policy Improvement, Rollout, and Self-Learning
6. Approximate Policy Improvement, Adaptive Simulation, and Q-Learning
7. Features, Approximation Architectures, and Deep Neural Nets
8. Incremental and Stochastic Gradient Optimization
9. Direct Policy Optimization: A More General Approach
10. Gradient and Random Search Methods for Direct Policy Optimization
1. PRINCIPLE OF OPTIMALITY (RL-OC, Chapter 1)
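The principle of optimality says that the tail of an optimal policy is optimal for the tail subproblem, which is what justifies computing the optimal cost-to-go backward, one stage at a time. A minimal sketch of that finite-horizon DP recursion follows; the toy dynamics f, stage cost g, and horizon are assumptions chosen for illustration, not an example from the slides.

```python
# Minimal sketch (toy problem): backward DP recursion justified by the
# principle of optimality. J_N(x) = 0, J_k(x) = min_u [ g(x,u) + J_{k+1}(f(x,u)) ].
N = 4                          # horizon
STATES = range(5)              # x in {0,...,4}
CONTROLS = (-1, 0, +1)

def f(x, u):
    """System equation x_{k+1} = f(x_k, u_k), clipped to the state space."""
    return max(0, min(4, x + u))

def g(x, u):
    """Stage cost: distance from state 2 plus control effort."""
    return (x - 2) ** 2 + abs(u)

J = {x: 0.0 for x in STATES}   # terminal cost J_N
policy = []                    # mu_k(x) for k = 0,...,N-1
for k in reversed(range(N)):
    J_new, mu = {}, {}
    for x in STATES:
        best_u = min(CONTROLS, key=lambda u: g(x, u) + J[f(x, u)])
        mu[x] = best_u
        J_new[x] = g(x, best_u) + J[f(x, best_u)]
    J, policy = J_new, [mu] + policy

print("Optimal cost-to-go J_0:", J)
print("Stage-0 optimal controls:", policy[0])
```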