Deep Reinforcement Learning Hands-On

Cover
Copyright
Packt upsell
Contributors
Table of Contents
Preface
Chapter 1 - What is Reinforcement Learning?
Learning – supervised, unsupervised, and reinforcement
RL formalisms and relations
Reward
The agent
The environment
Actions
Observations
Markov decision processes
Markov process
Markov reward process
Markov decision process
Summary
Chapter 2 - OpenAI Gym
The anatomy of the agent
Hardware and software requirements
OpenAI Gym API
Action space
Observation space
The environment
Creation of the environment
The CartPole session
The random CartPole agent
The extra Gym functionality – wrappers and monitors
Wrappers
Monitor
Summary
Chapter 3 - Deep Learning with PyTorch
Tensors
Creation of tensors
Scalar tensors
Tensor operations
GPU tensors
Gradients
Tensors and gradients
NN building blocks
Custom layers
Final glue – loss functions and optimizers
Loss functions
Optimizers
Monitoring with TensorBoard
TensorBoard 101
Plotting stuff
Example – GAN on Atari images
Summary
Chapter 4 - The Cross-Entropy Method
Taxonomy of RL methods
Practical cross-entropy
Cross-entropy on CartPole
Cross-entropy on FrozenLake
Theoretical background of the cross-entropy method
Summary
Chapter 5 - Tabular Learning and the Bellman Equation
Value, state, and optimality
The Bellman equation of optimality
Value of action
The value iteration method
Value iteration in practice
Q-learning for FrozenLake
Summary
Chapter 6 - Deep Q-Networks
Real-life value iteration
Tabular Q-learning
Deep Q-learning
Interaction with the environment
SGD optimisation
Correlation between steps
The Markov property
The final form of DQN training
DQN on Pong
Wrappers
DQN model
Training
Running and performance
Your model in action
Summary
Chapter 7 - DQN Extensions
The PyTorch Agent Net library
Agent
Agent's experience
Experience buffer
Gym env wrappers
Basic DQN
N-step DQN
Implementation
Double DQN
Implementation
Results
Noisy networks
Implementation
Results
Prioritized replay buffer
Implementation
Results
Dueling DQN
Implementation
Results
Categorical DQN
Implementation
Results
Combining everything
Implementation
Results
Summary
References
Chapter 8 - Stocks Trading Using RL
Trading
Data
Problem statements and key decisions
The trading environment
Models
Training code
Results
The feed-forward model
The convolution model
Things to try
Summary
Chapter 9 - Policy Gradients – An Alternative
Values and policy
Why policy?
Policy representation
Policy gradients
The REINFORCE method
The CartPole example
Results
Policy-based versus value-based methods
REINFORCE issues
Full episodes are required
High gradients variance
Exploration
Correlation between samples
PG on CartPole
Results
PG on Pong
Results
Summary
Chapter 10 - The Actor-Critic Method
Variance reduction
CartPole variance
Actor-critic
A2C on Pong
A2C on Pong results
Tuning hyperparameters
Learning rate
Entropy beta
Count of environments
Batch size
Summary
Chapter 11 - Asynchronous Advantage Actor-Critic
Correlation and sample efficiency
Adding an extra A to A2C
Multiprocessing in Python
A3C – data parallelism
Results
A3C – gradients parallelism
Results
Summary
Chapter 12 - Chatbots Training with RL
Chatbots overview
Deep NLP basics
Recurrent Neural Networks
Embeddings
Encoder-Decoder
Training of seq2seq
Log-likelihood training
Bilingual evaluation understudy (BLEU) score
RL in seq2seq
Self-critical sequence training
The chatbot example
The example structure
Modules: cornell.py and data.py
BLEU score and utils.py
Model
Training: cross-entropy
Running the training
Checking the data
Playing with the trained model
Training: SCST
Running the SCST training
Results
Telegram bot
Summary
Chapter 13 - Web Navigation
Web navigation
Browser automation and RL
Mini World of Bits benchmark
OpenAI Universe
Installation
Actions and observations
Environment creation
MiniWoB stability
Simple clicking approach
Grid actions
Example overview
Model
Training code
Starting containers
Training process
Checking the learned policy
Issues with simple clicking
Human demonstrations
Recording the demonstrations
Recording format
Training using demonstrations
Results
TicTacToe problem
Adding text description
Results
Things to try
Summary
Chapter 14 - Continuous Action Space
Why a continuous space?
Action space
Environments
The Actor-Critic (A2C) method
Implementation
Results
Using models and recording videos
Deterministic policy gradients
Exploration
Implementation
Results
Recording videos
Distributional policy gradients
Architecture
Implementation
Results
Things to try
Summary
Chapter 15 - Trust Regions – TRPO, PPO and ACKTR
Introduction
Roboschool
A2C baseline
Results
Videos recording
Proximal Policy Optimisation
Implementation
Results
Trust Region Policy Optimisation
Implementation
Results
A2C using ACKTR
Implementation
Results
Summary
Chapter 16 - Black-Box Optimization in RL
Black-box methods
Evolution strategies
ES on CartPole
Results
ES on HalfCheetah
Results
Genetic algorithms
GA on CartPole
Results
GA tweaks
Deep GA
Novelty search
GA on Cheetah
Results
Summary
References
Chapter 17 - Beyond Model-Free – Imagination
Model-based versus model-free
Model imperfections
Imagination-augmented agent
The environment model
The rollout policy
The rollout encoder
Paper results
I2A on Atari Breakout
The baseline A2C agent
EM training
The imagination agent
The I2A model
The Rollout encoder
Training of I2A
Experiment results
The baseline agent
Training EM weights
Training with the I2A model
Summary
References
Chapter 18 - AlphaGo Zero
Board games
The AlphaGo Zero method
Overview
Monte-Carlo Tree Search
Self-play
Training and evaluation
Connect4 bot
Game model
Implementing MCTS
Model
Training
Testing and comparison
Connect4 results
Summary
References
Book summary
Other Books You May Enjoy
Index
Deep Reinforcement Learning Hands-On

Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more

Maxim Lapan

BIRMINGHAM - MUMBAI
Deep Reinforcement Learning Hands-On

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Acquisition Editors: Frank Pohlmann, Suresh Jain
Project Editor: Kishor Rit
Technical Editor: Nidhisha Shetty
Proofreader: Tom Jacob
Indexer: Tejal Daruwale Soni
Graphics: Sandip Tadge
Production Coordinator: Shantanu Zagade

First published: June 2018
Production reference: 1150618

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78883-424-7

www.packtpub.com
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?
• Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
• Learn better with Skill Plans built especially for you
• Get a free eBook or video every month
• Mapt is fully searchable
• Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Contributors

About the author

Maxim Lapan is a deep learning enthusiast and independent researcher. His background and 15 years' experience as a software developer and systems architect range from low-level Linux kernel driver development to performance optimization and the design of distributed applications running on thousands of servers. With vast work experience in big data, machine learning, and large parallel distributed HPC and non-HPC systems, he has a talent for explaining the gist of complicated things in simple words and vivid examples. His current areas of interest lie in practical applications of deep learning, such as deep natural language processing and deep reinforcement learning. Maxim lives in Moscow, Russian Federation, with his family, and he works for an Israeli start-up as a senior NLP developer.

I'd like to thank my family: my wife, Olga, and my kids, Ksenia, Julia, and Fedor, for their patience and support. Writing this book was a challenging time, and it wouldn't have been possible without you, thanks! Julia and Fedor did a great job gathering samples for MiniWoB (Chapter 13, Web Navigation) and testing the ConnectFour agent's playing skills (Chapter 18, AlphaGo Zero).

I also want to thank the technical reviewers, Oleg Vasilev and Mikhail Yurushkin, for their valuable comments and suggestions about the book's contents.
About the reviewers

Basem O. F. Alijla received his Ph.D. degree in intelligent systems from USM, Malaysia, in 2015. He is currently an assistant professor in the Software Development Department at IUG in Palestine. He has authored a number of technical papers published in journals and international conferences. His current research interests include optimization, machine learning, and data mining.

Oleg Vasilev is a professional with a background in computer science and data engineering. His university program is Applied Mathematics and Informatics at NRU HSE, Moscow, with a major in distributed systems. He is a staff member on a Git-course, Practical_RL, and Practical_DL, taught on-campus at HSE and YSDA. Oleg's previous work experience includes working in the Dialog Systems Group at Yandex as a data scientist. He currently holds the position of Vice President of Infrastructure Management at GoTo Lab, an educational corporation, and he works for Digital Contact as a software engineer.

I'd like to thank Alexander Panin (@justheuristic), my mentor, for opening the world of machine learning to me. I am deeply grateful to the other Russian researchers who helped me in mastering computer science: Pavel Shvechikov, Alexander Grishin, Valery Kharitonov, Alexander Fritzler, Pavel Ostyakov, Michail Konobeev, Dmitrii Vetrov, and Alena Ilyna. I also want to thank my friends and family for their kind support.

Mikhail Yurushkin holds a PhD in Applied Mathematics. His areas of research are high-performance computing and optimizing compiler development. He was involved in the development of a state-of-the-art optimizing parallelizing compiler system. Mikhail is a senior lecturer at SFEDU university, Rostov-on-Don, Russia. He teaches advanced DL courses, namely Computer Vision and NLP. Mikhail has worked for over 7 years in cross-platform native C++ development, machine learning, and deep learning. Now he works as an individual consultant in the ML/DL fields.
Packt is Searching for Authors Like You

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.