An Introduction to Deep
Reinforcement Learning
Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare and Joelle
Pineau (2018), “An Introduction to Deep Reinforcement Learning”, Foundations and
Trends in Machine Learning: Vol. 11, No. 3-4. DOI: 10.1561/2200000071.
Vincent François-Lavet
McGill University
vincent.francois-lavet@mcgill.ca
Riashat Islam
McGill University
riashat.islam@mail.mcgill.ca
Joelle Pineau
Facebook, McGill University
jpineau@cs.mcgill.ca
Peter Henderson
McGill University
peter.henderson@mail.mcgill.ca
Marc G. Bellemare
Google Brain
bellemare@google.com
arXiv:1811.12560v2 [cs.LG] 3 Dec 2018
Contents

1 Introduction
1.1 Motivation
1.2 Outline

2 Machine learning and deep learning
2.1 Supervised learning and the concepts of bias and overfitting
2.2 Unsupervised learning
2.3 The deep learning approach

3 Introduction to reinforcement learning
3.1 Formal framework
3.2 Different components to learn a policy
3.3 Different settings to learn a policy from data

4 Value-based methods for deep RL
4.1 Q-learning
4.2 Fitted Q-learning
4.3 Deep Q-networks
4.4 Double DQN
4.5 Dueling network architecture
4.6 Distributional DQN
4.7 Multi-step learning
4.8 Combination of all DQN improvements and variants of DQN

5 Policy gradient methods for deep RL
5.1 Stochastic Policy Gradient
5.2 Deterministic Policy Gradient
5.3 Actor-Critic Methods
5.4 Natural Policy Gradients
5.5 Trust Region Optimization
5.6 Combining policy gradient and Q-learning

6 Model-based methods for deep RL
6.1 Pure model-based methods
6.2 Integrating model-free and model-based methods

7 The concept of generalization
7.1 Feature selection
7.2 Choice of the learning algorithm and function approximator selection
7.3 Modifying the objective function
7.4 Hierarchical learning
7.5 How to obtain the best bias-overfitting tradeoff

8 Particular challenges in the online setting
8.1 Exploration/Exploitation dilemma
8.2 Managing experience replay

9 Benchmarking Deep RL
9.1 Benchmark Environments
9.2 Best practices to benchmark deep RL
9.3 Open-source software for Deep RL

10 Deep reinforcement learning beyond MDPs
10.1 Partial observability and the distribution of (related) MDPs
10.2 Transfer learning
10.3 Learning without explicit reward function
10.4 Multi-agent systems

11 Perspectives on deep reinforcement learning
11.1 Successes of deep reinforcement learning
11.2 Challenges of applying reinforcement learning to real-world problems
11.3 Relations between deep RL and neuroscience

12 Conclusion
12.1 Future development of deep RL
12.2 Applications and societal impact of deep RL

Appendices

References
An Introduction to Deep
Reinforcement Learning
Vincent François-Lavet1, Peter Henderson2, Riashat Islam3, Marc
G. Bellemare4 and Joelle Pineau5
1McGill University; vincent.francois-lavet@mcgill.ca
2McGill University; peter.henderson@mail.mcgill.ca
3McGill University; riashat.islam@mail.mcgill.ca
4Google Brain; bellemare@google.com
5Facebook, McGill University; jpineau@cs.mcgill.ca
ABSTRACT
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research
has been able to solve a wide range of complex decision-
making tasks that were previously out of reach for a machine.
Thus, deep RL opens up many new applications in domains
such as healthcare, robotics, smart grids, finance, and many
more. This manuscript provides an introduction to deep
reinforcement learning models, algorithms and techniques.
Particular focus is on the aspects related to generalization
and how deep RL can be used for practical applications. We
assume the reader is familiar with basic machine learning
concepts.
1 Introduction
1.1 Motivation
A core topic in machine learning is that of sequential decision-making.
This is the task of deciding, from experience, the sequence of actions
to perform in an uncertain environment in order to achieve some
goals. Sequential decision-making tasks cover a wide range of possible
applications with the potential to impact many domains, such as
robotics, healthcare, smart grids, finance, self-driving cars, and many
more.
Inspired by behavioral psychology (see e.g., Sutton, 1984), reinforcement learning (RL) proposes a formal framework for this problem.
The main idea is that an artificial agent may learn by interacting with
its environment, similarly to a biological agent. Using the experience
gathered, the artificial agent should be able to optimize some objectives
given in the form of cumulative rewards. This approach applies in principle to any type of sequential decision-making problem relying on past
experience. The environment may be stochastic, the agent may only
observe partial information about the current state, the observations
may be high-dimensional (e.g., frames and time series), the agent may
freely gather experience in the environment or, on the contrary, the data
may be constrained (e.g., no access to an accurate simulator
or limited data).
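To make this interaction loop concrete, the following sketch (in Python) shows an agent gathering experience by acting in an environment and accumulating a discounted sum of rewards. The ToyEnvironment, the random policy and the 0.95 discount factor are hypothetical placeholders used only for illustration; the formal framework and the actual learning algorithms are introduced from Chapter 3 onwards.

```python
# Minimal sketch (illustrative only): an agent interacting with an environment
# and accumulating a discounted sum of rewards. ToyEnvironment, random_policy
# and the 0.95 discount factor are hypothetical placeholders, not part of the
# original text.
import random


class ToyEnvironment:
    """A small stochastic chain of states; reaching the last state ends the episode."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state  # the agent only receives an observation of the state

    def step(self, action):
        # Action 1 tries to move right (succeeds with probability 0.8); action 0 stays.
        if action == 1 and random.random() < 0.8:
            self.state += 1
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(observation):
    """Stand-in for a learned policy: picks one of two actions uniformly at random."""
    return random.choice([0, 1])


def run_episode(env, policy, discount=0.95, max_steps=100):
    """Collect one episode of experience and return the discounted cumulative reward."""
    observation = env.reset()
    cumulative_reward, discounting = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        cumulative_reward += discounting * reward
        discounting *= discount
        if done:
            break
    return cumulative_reward


if __name__ == "__main__":
    print(run_episode(ToyEnvironment(), random_policy))
```

Benchmark environments that expose a similar reset/step interface are discussed in Chapter 9.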
Over the past few years, RL has become increasingly popular due
to its success in addressing challenging sequential decision-making
problems. Several of these achievements are due to the combination of
RL with deep learning techniques (LeCun et al., 2015; Schmidhuber,
2015; Goodfellow et al., 2016). This combination, called deep RL, is
most useful in problems with a high-dimensional state space. Previous RL approaches faced a difficult design issue in the choice of features (Munos and Moore, 2002; Bellemare et al., 2013). However, deep RL has been successful in complicated tasks with less prior knowledge thanks to its
ability to learn different levels of abstractions from data. For instance,
a deep RL agent can successfully learn from visual perceptual inputs
made up of thousands of pixels (Mnih et al., 2015). This opens up the possibility of mimicking some human problem-solving capabilities, even in high-dimensional spaces, which was difficult to conceive only a few years ago.
Several notable works using deep RL in games have stood out for
attaining super-human performance at playing Atari games directly from pixels
(Mnih et al., 2015), mastering Go (Silver et al., 2016a) or beating the
world’s top professionals at the game of Poker (Brown and Sandholm,
2017; Moravčik et al., 2017). Deep RL also has potential for real-world
applications such as robotics (Levine et al., 2016; Gandhi et al., 2017;
Pinto et al., 2017), self-driving cars (You et al., 2017), finance (Deng
et al., 2017) and smart grids (François-Lavet, 2017), to name a few.
Nonetheless, several challenges arise in applying deep RL algorithms.
Among others, exploring the environment efficiently and generalizing good behavior to a slightly different context are not straightforward. Thus, a large array of algorithms has been proposed for the deep RL framework, depending on the particular setting of the sequential decision-making task.
1.2 Outline
The goal of this introduction to deep RL is to guide the reader towards
effective use and understanding of core methods, as well as provide
references for further reading. After reading this introduction, the reader
should be able to understand the key deep RL approaches and algorithms, and should be able to apply them. The reader should also
have enough background to investigate the scientific literature further
and pursue research on deep RL.
In Chapter 2, we introduce the field of machine learning and the deep
learning approach. The goal is to provide the general technical context
and explain briefly where deep learning is situated in the broader field
of machine learning. We assume the reader is familiar with basic notions
of supervised and unsupervised learning; however, we briefly review the
essentials.
In Chapter 3, we provide the general RL framework along with
the case of a Markov Decision Process (MDP). In that context, we
examine the different methodologies that can be used to train a deep
RL agent. On the one hand, learning a value function (Chapter 4)
and/or a direct representation of the policy (Chapter 5) belong to the
so-called model-free approaches. On the other hand, planning algorithms
that can make use of a learned model of the environment belong to the
so-called model-based approaches (Chapter 6).
We dedicate Chapter 7 to the notion of generalization in RL.
Within either a model-based or a model-free approach, we discuss the
importance of different elements: (i) feature selection, (ii) function
approximator selection, (iii) modifying the objective function and
(iv) hierarchical learning. In Chapter 8, we present the main challenges of
using RL in the online setting. In particular, we discuss the exploration-
exploitation dilemma and the use of a replay memory.
In Chapter 9, we provide an overview of different existing benchmarks for the evaluation of RL algorithms. Furthermore, we present a set
of best practices to ensure consistency and reproducibility of the results
obtained on the different benchmarks.
In Chapter 10, we discuss more general settings than MDPs: (i) the
Partially Observable Markov Decision Process (POMDP), (ii) the
distribution of MDPs (instead of a given MDP) along with the notion
of transfer learning, (iii) learning without explicit reward function and
(iv) multi-agent systems. We provide descriptions of how deep RL can
be used in these settings.