Title Page
Copyright
Contents
Preface
List of Authors
Part 1
MDPs: Models and Methods
Chapter 1
Markov Decision Processes
1.1. Introduction
1.2. Markov decision problems
1.2.1. Markov decision processes
1.2.2. Action policies
1.2.3. Performance criterion
1.3. Value functions
1.3.1. The finite criterion
1.3.2. The γ-discounted criterion
1.3.3. The total reward criterion
1.3.4. The average reward criterion
1.4. Markov policies
1.4.1. Equivalence of history-dependent and Markov policies
1.4.2. Markov policies and valued Markov chains
1.5. Characterization of optimal policies
1.5.1. The finite criterion
1.5.1.1. Optimality equations
1.5.1.2. Evaluation of a deterministic Markov policy
1.5.2. The discounted criterion
1.5.2.1. Evaluation of a stationary Markov policy
1.5.2.2. Optimality equations
1.5.3. The total reward criterion
1.5.4. The average reward criterion
1.5.4.1. Evaluation of a stationary Markov policy
1.5.4.2. Optimality equations
1.6. Optimization algorithms for MDPs
1.6.1. The finite criterion
1.6.2. The discounted criterion
1.6.2.1. Linear programming
1.6.2.2. The value iteration algorithm
1.6.2.3. The policy iteration algorithm
1.6.3. The total reward criterion
1.6.3.1. Positive MDPs
1.6.3.2. Negative MDPs
1.6.4. The average reward criterion
1.6.4.1. Relative value iteration algorithm
1.6.4.2. Modified policy iteration algorithm
1.7. Conclusion and outlook
1.8. Bibliography
Chapter 2
Reinforcement Learning
2.1. Introduction
2.1.1. Historical overview
2.2. Reinforcement learning: a global view
2.2.1. Reinforcement learning as approximate dynamic programming
2.2.2. Temporal, unsupervised and trial-and-error based learning
2.2.3. Exploration versus exploitation
2.2.4. General preliminaries on estimation methods
2.3. Monte Carlo methods
2.4. From Monte Carlo to temporal difference methods
2.5. Temporal difference methods
2.5.1. The TD(0) algorithm
2.5.2. The SARSA algorithm
2.5.3. The Q-learning algorithm
2.5.4. The TD(λ), SARSA(λ) and Q(λ) algorithms
2.5.5. Eligibility traces and TD(λ)
2.5.6. From TD(λ) to SARSA(λ)
2.5.7. Q(λ)
2.5.8. The R-learning algorithm
2.6. Model-based methods: learning a model
2.6.1. Dyna architectures
2.6.2. The E3 algorithm
2.6.3. The Rmax algorithm
2.7. Conclusion
2.8. Bibliography
Chapter 3
Approximate Dynamic Programming
3.1. Introduction
3.2. Approximate value iteration (AVI)
3.2.1. Sample-based implementation and supervised learning
3.2.2. Analysis of the AVI algorithm
3.2.3. Numerical illustration
3.3. Approximate policy iteration (API)
3.3.1. Analysis in L∞-norm of the API algorithm
3.3.2. Approximate policy evaluation
3.3.3. Linear approximation and least-squares methods
3.3.3.1. TD(λ)
3.3.3.2. Least-squares methods
3.3.3.3. Linear approximation of the state-action value function
3.4. Direct minimization of the Bellman residual
3.5. Towards an analysis of dynamic programming in Lp-norm
3.5.1. Intuition of an Lp analysis in dynamic programming
3.5.2. PAC bounds for RL algorithms
3.6. Conclusions
3.7. Bibliography
Chapter 4
Factored Markov Decision Processes
4.1. Introduction
4.2. Modeling a problem with an FMDP
4.2.1. Representing the state space
4.2.2. The Coffee Robot example
4.2.3. Decomposition and function-specific independence
4.2.3.1. Transition function decomposition
4.2.3.2. Dynamic Bayesian networks in FMDPs
4.2.3.3. Factored model of the transition function in an FMDP
4.2.3.4. Factored model of the reward function
4.2.4. Context-specific independence
4.3. Planning with FMDPs
4.3.1. Structured policy iteration and structured value iteration
4.3.1.1. Decision trees
4.3.1.2. Representation of the transition function
4.3.1.3. Representation of the reward function
4.3.1.4. Representation of a policy
4.3.1.5. Representation of the value function
4.3.1.6. Algorithms
4.3.2. SPUDD: stochastic planning using decision diagrams
4.3.2.1. Representing the functions of an FMDP with ADDs
4.3.2.2. Algorithm
4.3.3. Approximate linear programming in FMDPs
4.3.3.1. Representations
4.3.3.2. Representation of the transition function
4.3.3.3. Representation of the reward function
4.3.3.4. Policy representation
4.3.3.5. Representation of the value function
4.3.3.6. Algorithms
4.4. Perspectives and conclusion
4.5. Bibliography
Chapter 5
Policy-Gradient Algorithms
5.1. Reminder about the notion of gradient
5.1.1. Gradient of a function
5.1.2. Gradient descent
5.2. Optimizing a parameterized policy with a gradient algorithm
5.2.1. Application to MDPs: overview
5.2.1.1. First attempt to define a parameterized policy
5.2.1.2. Example of definition of a parameterized policy
5.2.2. Finite difference method
5.2.3. Gradient estimation of f in an MDP, case of the finite time horizon
5.2.3.1. Time horizon 1
5.2.3.2. Time horizon T
5.2.4. Extension to the infinite time horizon: discounted criterion, average criterion
5.2.4.1. Case of a regenerative process
5.2.4.2. Using a moving window
5.2.4.3. Using a discount factor
5.2.5. Partially observable case
5.2.6. Continuous action spaces
5.3. Actor-critic methods
5.3.1. A gradient estimator using the Q-values
5.3.2. Compatibility with an approximate value function
5.3.2.1. Approximating Qθ
5.3.2.2. Compatibility of the approximators
5.3.3. An actor-critic algorithm
5.4. Complements
5.5. Conclusion
5.6. Bibliography
Chapter 6
Online Resolution Techniques
6.1. Introduction
6.1.1. Exploiting time online
6.1.2. Online search by simulation
6.2. Online algorithms for solving an MDP
6.2.1. Offline algorithms, online algorithms
6.2.2. Problem formalization
6.2.2.1. Forward search over a reasoning horizon
6.2.2.2. Tree or graph?
6.2.2.3. Complexity and efficiency of the forward search
6.2.3. Online heuristic search algorithms for MDPs
6.2.3.1. General principles
6.2.3.2. The RTDP algorithm
6.2.3.3. The LAO* algorithm
6.2.4. Kearns, Mansour and Ng’s simulation-based algorithm
6.2.4.1. Complexity and convergence of Kearns et al.’s algorithm
6.2.4.2. Efficiency and practical considerations
6.2.5. Tesauro and Galperin’s rollout algorithm
6.2.5.1. Complexity and convergence of the rollout algorithm
6.2.5.2. Efficiency of the rollout algorithm
6.3. Controlling the search
6.3.1. Error bounds and pathology of the forward search
6.3.1.1. Pathology of the simulation-based forward search
6.3.1.2. Searching for a good compromise between depth and width
6.3.2. Iterative allocation of simulations
6.3.2.1. Multi-armed bandits and exploration in MDPs
6.3.3. Focused reinforcement learning
6.3.3.1. Global sampling error estimation
6.3.3.2. Organizing the search in successive trajectories
6.3.3.3. Convergence and practical considerations
6.3.4. Controlled rollout
6.3.4.1. Choice of the horizon
6.3.4.2. Iterative allocation of simulations
6.4. Conclusion
6.5. Bibliography
Part 2
Beyond MDPs
Chapter 7
Partially Observable Markov Decision Processes
7.1. Formal definitions for POMDPs
7.1.1. Definition of a POMDP
7.1.2. Performance criteria
7.1.3. Information state
7.1.3.1. Definition
7.1.3.2. Complete information state
7.1.3.3. Sufficient statistics
7.1.3.4. Belief states
7.1.4. Policy
7.1.5. Value function
7.2. Non-Markovian problems: incomplete information
7.2.1. Adapted policies
7.2.2. Discounted reward
7.2.2.1. Adapted stochastic policies
7.2.2.2. Adapted value function
7.2.2.3. Convergence of adapted algorithms
7.2.3. Adapted algorithms and adapted average reward criterion
7.3. Computation of an exact policy on information states
7.3.1. The general case
7.3.1.1. Finite horizon
7.3.1.2. Infinite horizon
7.3.2. Belief states and piecewise linear value function
7.3.2.1. Choice of the θ vectors
7.3.2.2. Infinite horizon
7.4. Exact value iteration algorithms
7.4.1. Steps for the dynamic programming operator
7.4.2. A parsimonious representation of V
7.4.2.1. Region
7.4.2.2. Parsimonious representation
7.4.2.3. Pruning of dominated vectors
7.4.2.4. Pruning
7.4.2.5. Choice of a vector for a belief state
7.4.3. The WITNESS algorithm
7.4.3.1. Neighborhood of a vector
7.4.3.2. The algorithm
7.4.4. Iterative pruning
7.4.4.1. Complete enumeration
7.4.4.2. Incremental enumeration
7.5. Policy iteration algorithms
7.6. Conclusion and perspectives
7.7. Bibliography
Chapter 8
Stochastic Games
8.1. Introduction
8.2. Background on game theory
8.2.1. Some basic definitions
8.2.1.1. Criteria distinguishing different forms of games
8.2.2. Static games of complete information
8.2.2.1. Games in strategic form, pure strategy, mixed strategy
8.2.2.2. Zero-sum game and minimax
8.2.2.3. Equilibrium in dominating strategies
8.2.2.4. Nash equilibrium
8.2.3. Dynamic games of complete information
8.2.3.1. Games in extensive form with perfect information
8.2.3.2. Repeated games
8.3. Stochastic games
8.3.1. Definition and equilibrium of a stochastic game
8.3.2. Solving stochastic games
8.3.2.1. Game model available
8.3.2.2. Game model unavailable, A−i observed, equilibrium learners
8.3.2.3. Game model unavailable, A−i observed, opponent modeling
8.3.2.4. Game model unavailable, A−i not observed
8.3.3. Complexity and scalability of multi-agent learning algorithms
8.3.4. Beyond the search for an equilibrium
8.3.4.1. Efficient play
8.3.4.2. Regret minimization
8.3.4.3. Metastrategies
8.3.5. Discussion
8.4. Conclusion and outlook
8.5. Bibliography
Chapter 9
DEC-MDP/POMDP
9.1. Introduction
9.2. Preliminaries
9.3. Multiagent Markov decision processes
9.4. Decentralized control and local observability
9.4.1. Decentralized Markov decision processes
9.4.2. Multiagent team decision problems
9.4.3. Complexity
9.5. Sub-classes of DEC-POMDPs
9.5.1. Transition and observation independence
9.5.2. Goal oriented DEC-POMDPs
9.5.3. DEC-MDPs with constraints
9.5.3.1. Event-driven DEC-MDPs
9.5.3.2. Opportunity-cost DEC-MDPs
9.5.3.3. Problem statement
9.5.3.4. The OC-DEC-MDP model
9.5.4. Communication and DEC-POMDPs
9.5.4.1. DEC-POMDPs with communication
9.5.4.2. Complexity results
9.5.4.3. Discussion
9.6. Algorithms for solving DEC-POMDPs
9.6.1. Optimal algorithms
9.6.1.1. Dynamic programming for DEC-POMDPs
9.6.1.2. Heuristic search for DEC-POMDPs
9.6.1.3. Optimal algorithms for sub-classes of DEC-POMDPs
9.6.2. Approximate algorithms
9.6.2.1. Heuristics and approximate dynamic programming
9.6.2.2. Bounded memory
9.6.2.3. Co-evolutionary algorithms
9.6.2.4. Gradient descent for policy search
9.6.2.5. Bayesian games
9.6.2.6. Heuristics for communicating agents
9.6.2.7. Approximate solutions for OC-DEC-MDPs
9.7. Application scenario: multirobot exploration
9.8. Conclusion and outlook
9.9. Bibliography
Chapter 10
Non-Standard Criteria
10.1. Introduction
10.2. Multicriteria approaches
10.2.1. Multicriteria decision-making
10.2.2. Multicriteria MDPs
10.2.2.1. Operators for multicriteria decision-making
10.3. Robustness in MDPs
10.4. Possibilistic MDPs
10.4.1. Possibilistic counterpart of expected utility
10.4.2. Possibilistic dynamic programming
10.4.2.1. Finite horizon
10.4.2.2. Value iteration
10.4.2.3. Policy iteration
10.4.3. Extensions of possibilistic MDPs
10.4.3.1. Possibilistic reinforcement learning
10.4.3.2. Possibilistic partially observable MDPs
10.4.3.3. Possibilistic influence diagrams (PID)
10.5. Algebraic MDPs
10.5.1. Background
10.5.1.1. Semirings
10.5.1.2. Plausibility measures
10.5.1.3. Generalized expected utility
10.5.2. Definition of an algebraic MDP
10.5.3. Value function of a policy
10.5.4. Conditions
10.5.5. Examples of AMDPs
10.5.5.1. Probabilistic multicriteria AMDP
10.5.5.2. Possibilistic multicriteria AMDPs
10.5.5.3. AMDPs whose rewards are non-decreasing functions
10.6. Conclusion
10.7. Bibliography
Part 3
Applications
Chapter 11
Online Learning for Micro-Object Manipulation
11.1. Introduction
11.2. Manipulation device
11.2.1. Micro-positioning by pushing
11.2.2. Manipulation device
11.2.3. Control loop
11.2.4. Representation of the manipulation task as an MDP
11.2.4.1. Definition of the state space
11.2.4.2. Definition of the action space
11.2.4.3. Definition of the reward function
11.2.4.4. Definition of an episode
11.3. Choice of the reinforcement learning algorithm
11.3.1. Characteristics of the MDP
11.3.2. A suitable algorithm: STM-Q
11.4. Experimental results
11.4.1. Experimental setup
11.4.2. Results
11.5. Conclusion
11.6. Bibliography
Chapter 12
Conservation of Biodiversity
12.1. Introduction
12.2. When to protect, survey or surrender cryptic endangered species
12.2.1. Surveying and managing the Sumatran tiger
12.2.2. The model
12.2.3. Results
12.2.4. Extension to more than one population
12.3. Can sea otters and abalone co-exist?
12.3.1. Abalone and sea otters: two endangered species
12.3.2. The models
12.3.2.1. Population dynamics of abalone
12.3.2.2. Sea otter population model
12.3.2.3. States
12.3.2.4. Decisions
12.3.2.5. Interaction between sea otters and abalone
12.3.2.6. Multicriteria objective and reward function
12.3.3. Methods
12.3.4. Results
12.3.4.1. Scenario 1: sea otter reintroduction and anti-poaching enforcement
12.3.4.2. Scenario 2: control of sea otters
12.3.4.3. Scenario 3: combined action of sea otter control and anti-poaching
12.3.5. Conclusion
12.4. Other applications in conservation biology and discussions
12.5. Bibliography
Chapter 13
Autonomous Helicopter Searching for a Landing Area in an Uncertain Environment
13.1. Introduction
13.2. Exploration scenario
13.2.1. Planning problem
13.2.2. States and actions
13.2.3. Uncertainties
13.2.4. Optimization criterion
13.2.5. Formalization of the decision problem
13.3. Embedded control and decision architecture
13.3.1. Global view
13.3.2. Multi-thread planning triggered by the supervisor
13.3.2.1. Policy optimization
13.3.2.2. Dialogue with the supervisor
13.4. Incremental stochastic dynamic programming
13.4.1. Obtaining the initial safe policy quickly
13.4.2. Generating the sub-space of reachable states
13.4.3. Local policy optimization
13.4.4. Launching local replanning processes
13.5. Flight tests and return on experience
13.6. Conclusion
13.7. Bibliography
Chapter 14
Resource Consumption Control for an Autonomous Robot
14.1. The rover’s mission
14.2. Progressive processing formalism
14.3. MDP/PRU model
14.3.1. States
14.3.2. Actions
14.3.3. Transition function
14.3.4. Reward function
14.4. Policy calculation
14.4.1. Value function
14.4.2. Propagation algorithm
14.5. How to model a real mission
14.6. Extensions
14.7. Conclusion
14.8. Bibliography
Chapter 15
Operations Planning
15.1. Operations planning
15.1.1. Intuition
15.1.1.1. Problem features
15.1.1.2. Plans
15.1.2. Formal definitions
15.1.2.1. Planning problem, operations
15.1.2.2. Execution
15.1.2.3. Decision epochs, states, plans
15.1.2.4. Objective
15.2. MDP value function approaches
15.2.1. Formalizations, CoMDP
15.2.1.1. States, actions, transitions
15.2.1.2. Rewards, costs
15.2.2. Algorithms
15.2.2.1. LRTDP
15.2.2.2. Memory management
15.2.2.3. Reduction of the number of updates
15.2.2.4. Hybrid algorithms
15.2.2.5. Algorithms with upper bounds
15.2.3. Heuristics
15.2.3.1. Basic heuristics
15.2.3.2. Heuristics obtained by relaxation of the CoMDP
15.2.3.3. Planning graph heuristics
15.3. Reinforcement learning: FPG
15.3.1. Employing approximate methods
15.3.2. Parameterized policy
15.3.2.1. Inputs
15.3.2.2. Outputs
15.3.2.3. Function approximator
15.3.3. Gradient methods
15.3.3.1. Terminating an execution
15.3.3.2. Choice of OLpomdp
15.3.3.3. Optimized criterion
15.3.4. Improving FPG
15.4. Experiments
15.5. Conclusion and outlook
15.6. Bibliography
Index