
An Introduction to Deep Reinforcement Learning

Contents

1 Introduction
1.1 Motivation
1.2 Outline
2 Machine learning and deep learning
2.1 Supervised learning and the concepts of bias and overfitting
2.2 Unsupervised learning
2.3 The deep learning approach
3 Introduction to reinforcement learning
3.1 Formal framework
3.2 Different components to learn a policy
3.3 Different settings to learn a policy from data
4 Value-based methods for deep RL
4.1 Q-learning
4.2 Fitted Q-learning
4.3 Deep Q-networks
4.4 Double DQN
4.5 Dueling network architecture
4.6 Distributional DQN
4.7 Multi-step learning
4.8 Combination of all DQN improvements and variants of DQN
5 Policy gradient methods for deep RL
5.1 Stochastic Policy Gradient
5.2 Deterministic Policy Gradient
5.3 Actor-Critic Methods
5.4 Natural Policy Gradients
5.5 Trust Region Optimization
5.6 Combining policy gradient and Q-learning
6 Model-based methods for deep RL
6.1 Pure model-based methods
6.2 Integrating model-free and model-based methods
7 The concept of generalization
7.1 Feature selection
7.2 Choice of the learning algorithm and function approximator selection
7.3 Modifying the objective function
7.4 Hierarchical learning
7.5 How to obtain the best bias-overfitting tradeoff
8 Particular challenges in the online setting
8.1 Exploration/Exploitation dilemma
8.2 Managing experience replay
9 Benchmarking Deep RL
9.1 Benchmark Environments
9.2 Best practices to benchmark deep RL
9.3 Open-source software for Deep RL
10 Deep reinforcement learning beyond MDPs
10.1 Partial observability and the distribution of (related) MDPs
10.2 Transfer learning
10.3 Learning without explicit reward function
10.4 Multi-agent systems
11 Perspectives on deep reinforcement learning
11.1 Successes of deep reinforcement learning
11.2 Challenges of applying reinforcement learning to real-world problems
11.3 Relations between deep RL and neuroscience
12 Conclusion
12.1 Future development of deep RL
12.2 Applications and societal impact of deep RL
Appendices
References
An Introduction to Deep Reinforcement Learning

Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare and Joelle Pineau (2018), “An Introduction to Deep Reinforcement Learning”, Foundations and Trends in Machine Learning: Vol. 11, No. 3-4. DOI: 10.1561/2200000071.

Vincent François-Lavet, McGill University, vincent.francois-lavet@mcgill.ca
Riashat Islam, McGill University, riashat.islam@mail.mcgill.ca
Joelle Pineau, Facebook, McGill University, jpineau@cs.mcgill.ca
Peter Henderson, McGill University, peter.henderson@mail.mcgill.ca
Marc G. Bellemare, Google Brain, bellemare@google.com

arXiv:1811.12560v2 [cs.LG] 3 Dec 2018

Boston — Delft
An Introduction to Deep Reinforcement Learning

Vincent François-Lavet (1), Peter Henderson (2), Riashat Islam (3), Marc G. Bellemare (4) and Joelle Pineau (5)

(1) McGill University; vincent.francois-lavet@mcgill.ca
(2) McGill University; peter.henderson@mail.mcgill.ca
(3) McGill University; riashat.islam@mail.mcgill.ca
(4) Google Brain; bellemare@google.com
(5) Facebook, McGill University; jpineau@cs.mcgill.ca

ABSTRACT

Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.
1 Introduction

1.1 Motivation

A core topic in machine learning is that of sequential decision-making. This is the task of deciding, from experience, the sequence of actions to perform in an uncertain environment in order to achieve some goals. Sequential decision-making tasks cover a wide range of possible applications with the potential to impact many domains, such as robotics, healthcare, smart grids, finance, self-driving cars, and many more.

Inspired by behavioral psychology (see e.g., Sutton, 1984), reinforcement learning (RL) proposes a formal framework for this problem. The main idea is that an artificial agent may learn by interacting with its environment, similarly to a biological agent. Using the experience gathered, the artificial agent should be able to optimize some objectives given in the form of cumulative rewards. This approach applies in principle to any type of sequential decision-making problem relying on past experience. The environment may be stochastic, the agent may only observe partial information about the current state, the observations may be high-dimensional (e.g., frames and time series), and the agent may freely gather experience in the environment or, on the contrary, the data
may be constrained (e.g., no access to an accurate simulator or limited data).

Over the past few years, RL has become increasingly popular due to its success in addressing challenging sequential decision-making problems. Several of these achievements are due to the combination of RL with deep learning techniques (LeCun et al., 2015; Schmidhuber, 2015; Goodfellow et al., 2016). This combination, called deep RL, is most useful in problems with a high-dimensional state space. Previous RL approaches had a difficult design issue in the choice of features (Munos and Moore, 2002; Bellemare et al., 2013). However, deep RL has been successful in complicated tasks with less prior knowledge thanks to its ability to learn different levels of abstraction from data. For instance, a deep RL agent can successfully learn from visual perceptual inputs made up of thousands of pixels (Mnih et al., 2015). This opens up the possibility to mimic some human problem-solving capabilities, even in high-dimensional space — which, only a few years ago, was difficult to conceive.

Several notable works using deep RL in games have stood out for attaining super-human performance in playing Atari games from the pixels (Mnih et al., 2015), mastering Go (Silver et al., 2016a) or beating the world’s top professionals at the game of Poker (Brown and Sandholm, 2017; Moravčik et al., 2017). Deep RL also has potential for real-world applications such as robotics (Levine et al., 2016; Gandhi et al., 2017; Pinto et al., 2017), self-driving cars (You et al., 2017), finance (Deng et al., 2017) and smart grids (François-Lavet, 2017), to name a few. Nonetheless, several challenges arise in applying deep RL algorithms. Among others, exploring the environment efficiently and being able to generalize a good behavior in a slightly different context are not straightforward. Thus, a large array of algorithms has been proposed for the deep RL framework, depending on the setting of the sequential decision-making task.

1.2 Outline

The goal of this introduction to deep RL is to guide the reader towards effective use and understanding of core methods, as well as provide
references for further reading. After reading this introduction, the reader should be able to understand the key deep RL approaches and algorithms and should be able to apply them. The reader should also have enough background to investigate the scientific literature further and pursue research on deep RL.

In Chapter 2, we introduce the field of machine learning and the deep learning approach. The goal is to provide the general technical context and explain briefly where deep learning is situated in the broader field of machine learning. We assume the reader is familiar with basic notions of supervised and unsupervised learning; however, we briefly review the essentials.

In Chapter 3, we provide the general RL framework along with the case of a Markov Decision Process (MDP). In that context, we examine the different methodologies that can be used to train a deep RL agent. On the one hand, learning a value function (Chapter 4) and/or a direct representation of the policy (Chapter 5) belongs to the so-called model-free approaches. On the other hand, planning algorithms that can make use of a learned model of the environment belong to the so-called model-based approaches (Chapter 6).

We dedicate Chapter 7 to the notion of generalization in RL. Within either a model-based or a model-free approach, we discuss the importance of different elements: (i) feature selection, (ii) function approximator selection, (iii) modifying the objective function and (iv) hierarchical learning. In Chapter 8, we present the main challenges of using RL in the online setting. In particular, we discuss the exploration-exploitation dilemma and the use of a replay memory.

In Chapter 9, we provide an overview of different existing benchmarks for evaluating RL algorithms. Furthermore, we present a set of best practices to ensure consistency and reproducibility of the results obtained on the different benchmarks.

In Chapter 10, we discuss more general settings than MDPs: (i) the Partially Observable Markov Decision Process (POMDP), (ii) the distribution of MDPs (instead of a given MDP) along with the notion of transfer learning, (iii) learning without explicit reward function and (iv) multi-agent systems. We provide descriptions of how deep RL can be used in these settings.
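As a point of reference for the agent-environment interaction described in the motivation (an agent acts, observes the outcome, receives a reward, and uses the accumulated experience to improve its behavior), the following is a minimal illustrative sketch. It is not taken from the manuscript: the Environment and RandomAgent classes are hypothetical placeholders standing in for the MDP and the learning algorithms introduced in Chapters 3 to 6.

```python
# Minimal sketch of the agent-environment loop described in the motivation.
# The Environment and RandomAgent interfaces are illustrative placeholders,
# not an API defined in this manuscript.
import random

class Environment:
    """A toy stochastic environment with two states and two actions."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Stochastic transition: the chosen action biases the next state.
        self.state = action if random.random() < 0.8 else 1 - action
        reward = 1.0 if self.state == 1 else 0.0   # scalar reward signal
        done = random.random() < 0.1               # episode may terminate
        return self.state, reward, done

class RandomAgent:
    """Placeholder agent; a learning agent would improve from experience."""
    def act(self, state):
        return random.choice([0, 1])

    def observe(self, state, action, reward, next_state, done):
        pass  # a learning agent would update its policy/value estimates here

env, agent = Environment(), RandomAgent()
state, total_return = env.reset(), 0.0
for t in range(100):                      # gather experience by interaction
    action = agent.act(state)
    next_state, reward, done = env.step(action)
    agent.observe(state, action, reward, next_state, done)
    total_return += reward                # cumulative reward to be optimized
    state = env.reset() if done else next_state
print("cumulative reward over 100 steps:", total_return)
```

In a deep RL agent, the random action choice would be replaced by a policy and/or value function represented by a deep neural network and updated from the observed transitions, as discussed in Chapters 4 to 6.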