Title Page
Copyright
Contents
Preface
List of Authors
Part 1
MDPs: Models and Methods
Chapter 1
Markov Decision Processes
1.1. Introduction
1.2. Markov decision problems
1.2.1. Markov decision processes
1.2.2. Action policies
1.2.3. Performance criterion
1.3. Value functions
1.3.1. The finite criterion
1.3.2. The γ-discounted criterion
1.3.3. The total reward criterion
1.3.4. The average reward criterion
1.4. Markov policies
1.4.1. Equivalence of history-dependent and Markov policies
1.4.2. Markov policies and valued Markov chains
1.5. Characterization of optimal policies
1.5.1. The finite criterion
1.5.1.1. Optimality equations
1.5.1.2. Evaluation of a deterministic Markov policy
1.5.2. The discounted criterion
1.5.2.1. Evaluation of a stationary Markov policy
1.5.2.2. Optimality equations
1.5.3. The total reward criterion
1.5.4. The average reward criterion
1.5.4.1. Evaluation of a stationary Markov policy
1.5.4.2. Optimality equations
1.6. Optimization algorithms for MDPs
1.6.1. The finite criterion
1.6.2. The discounted criterion
1.6.2.1. Linear programming
1.6.2.2. The value iteration algorithm
1.6.2.3. The policy iteration algorithm
1.6.3. The total reward criterion
1.6.3.1. Positive MDPs
1.6.3.2. Negative MDPs
1.6.4. The average reward criterion
1.6.4.1. Relative value iteration algorithm
1.6.4.2. Modified policy iteration algorithm
1.7. Conclusion and outlook
1.8. Bibliography
Chapter 2
Reinforcement Learning
2.1. Introduction
2.1.1. Historical overview
2.2. Reinforcement learning: a global view
2.2.1. Reinforcement learning as approximate dynamic programming
2.2.2. Temporal, unsupervised and trial-and-error based learning
2.2.3. Exploration versus exploitation
2.2.4. General preliminaries on estimation methods
2.3. Monte Carlo methods
2.4. From Monte Carlo to temporal difference methods
2.5. Temporal difference methods
2.5.1. The TD(0) algorithm
2.5.2. The SARSA algorithm
2.5.3. The Q-learning algorithm
2.5.4. The TD(λ), SARSA(λ) and Q(λ) algorithms
2.5.5. Eligibility traces and TD(λ)
2.5.6. From TD(λ) to SARSA(λ)
2.5.7. Q(λ)
2.5.8. The R-learning algorithm
2.6. Model-based methods: learning a model
2.6.1. Dyna architectures
2.6.2. The E3 algorithm
2.6.3. The Rmax algorithm
2.7. Conclusion
2.8. Bibliography
Chapter 3
Approximate Dynamic Programming
3.1. Introduction
3.2. Approximate value iteration (AVI)
3.2.1. Sample-based implementation and supervised learning
3.2.2. Analysis of the AVI algorithm
3.2.3. Numerical illustration
3.3. Approximate policy iteration (API)
3.3.1. Analysis in L∞-norm of the API algorithm
3.3.2. Approximate policy evaluation
3.3.3. Linear approximation and least-squares methods
3.3.3.1. TD(λ)
3.3.3.2. Least-squares methods
3.3.3.3. Linear approximation of the state-action value function
3.4. Direct minimization of the Bellman residual
3.5. Towards an analysis of dynamic programming in Lp-norm
3.5.1. Intuition of an Lp analysis in dynamic programming
3.5.2. PAC bounds for RL algorithms
3.6. Conclusions
3.7. Bibliography
Chapter 4
Factored Markov Decision Processes
4.1. Introduction
4.2. Modeling a problem with an FMDP
4.2.1. Representing the state space
4.2.2. The Coffee Robot example
4.2.3. Decomposition and function-specific independence
4.2.3.1. Transition function decomposition
4.2.3.2. Dynamic Bayesian networks in FMDPs
4.2.3.3. Factored model of the transition function in an FMDP
4.2.3.4. Factored model of the reward function
4.2.4. Context-specific independence
4.3. Planning with FMDPs
4.3.1. Structured policy iteration and structured value iteration
4.3.1.1. Decision trees
4.3.1.2. Representation of the transition function
4.3.1.3. Representation of the reward function
4.3.1.4. Representation of a policy
4.3.1.5. Representation of the value function
4.3.1.6. Algorithms
4.3.2. SPUDD: stochastic planning using decision diagrams
4.3.2.1. Representing the functions of an FMDP with ADDs
4.3.2.2. Algorithm
4.3.3. Approximate linear programming in FMDPs
4.3.3.1. Representations
4.3.3.2. Representation of the transition function
4.3.3.3. Representation of the reward function
4.3.3.4. Policy representation
4.3.3.5. Representation of the value function
4.3.3.6. Algorithms
4.4. Perspectives and conclusion
4.5. Bibliography
Chapter 5
Policy-Gradient Algorithms
5.1. Reminder about the notion of gradient
5.1.1. Gradient of a function
5.1.2. Gradient descent
5.2. Optimizing a parameterized policy with a gradient algorithm
5.2.1. Application to MDPs: overview
5.2.1.1. First attempt to define a parameterized policy
5.2.1.2. Example of definition of a parameterized policy
5.2.2. Finite difference method
5.2.3. Gradient estimation of f in an MDP, case of the finite time horizon
5.2.3.1. Time horizon 1
5.2.3.2. Time horizon T
5.2.4. Extension to the infinite time horizon: discounted criterion, average criterion
5.2.4.1. Case of a regenerative process
5.2.4.2. Using a moving window
5.2.4.3. Using a discount factor
5.2.5. Partially observable case
5.2.6. Continuous action spaces
5.3. Actor-critic methods
5.3.1. A gradient estimator using the Q-values
5.3.2. Compatibility with an approximate value function
5.3.2.1. Approximating Qθ
5.3.2.2. Compatibility of the approximators
5.3.3. An actor-critic algorithm
5.4. Complements
5.5. Conclusion
5.6. Bibliography
Chapter 6
Online Resolution Techniques
6.1. Introduction
6.1.1. Exploiting time online
6.1.2. Online search by simulation
6.2. Online algorithms for solving an MDP
6.2.1. Offline algorithms, online algorithms
6.2.2. Problem formalization
6.2.2.1. Forward search over a reasoning horizon
6.2.2.2. Tree or graph?
6.2.2.3. Complexity and efficiency of the forward search
6.2.3. Online heuristic search algorithms for MDPs
6.2.3.1. General principles
6.2.3.2. The RTDP algorithm
6.2.3.3. The LAO* algorithm
6.2.4. Kearns, Mansour and Ng’s simulation-based algorithm
6.2.4.1. Complexity and convergence of Kearns et al.’s algorithm
6.2.4.2. Efficiency and practical considerations
6.2.5. Tesauro and Galperin’s rollout algorithm
6.2.5.1. Complexity and convergence of the rollout algorithm
6.2.5.2. Efficiency of the rollout algorithm
6.3. Controlling the search
6.3.1. Error bounds and pathology of the forward search
6.3.1.1. Pathology of the simulation-based forward search
6.3.1.2. Searching for a good compromise between depth and width
6.3.2. Iterative allocation of simulations
6.3.2.1. Multi-armed bandits and exploration in MDPs
6.3.3. Focused reinforcement learning
6.3.3.1. Global sampling error estimation
6.3.3.2. Organizing the search in successive trajectories
6.3.3.3. Convergence and practical considerations
6.3.4. Controlled rollout
6.3.4.1. Choice of the horizon
6.3.4.2. Iterative allocation of simulations
6.4. Conclusion
6.5. Bibliography
Part 2
Beyond MDPs
Chapter 7
Partially Observable Markov Decision Processes
7.1. Formal definitions for POMDPs
7.1.1. Definition of a POMDP
7.1.2. Performance criteria
7.1.3. Information state
7.1.3.1. Definition
7.1.3.2. Complete information state
7.1.3.3. Sufficient statistics
7.1.3.4. Belief states
7.1.4. Policy
7.1.5. Value function
7.2. Non-Markovian problems: incomplete information
7.2.1. Adapted policies
7.2.2. Discounted reward
7.2.2.1. Adapted stochastic policies
7.2.2.2. Adapted value function
7.2.2.3. Convergence of adapted algorithms
7.2.3. Adapted algorithms and adapted average reward criterion
7.3. Computation of an exact policy on information states
7.3.1. The general case
7.3.1.1. Finite horizon
7.3.1.2. Infinite horizon
7.3.2. Belief states and piecewise linear value function
7.3.2.1. Choice of the θ vectors
7.3.2.2. Infinite horizon
7.4. Exact value iteration algorithms
7.4.1. Steps for the dynamic programming operator
7.4.2. A parsimonious representation of V
7.4.2.1. Region
7.4.2.2. Parsimonious representation
7.4.2.3. Pruning of dominated vectors
7.4.2.4. Pruning
7.4.2.5. Choice of a vector for a belief state
7.4.3. The WITNESS algorithm
7.4.3.1. Neighborhood of a vector
7.4.3.2. The algorithm
7.4.4. Iterative pruning
7.4.4.1. Complete enumeration
7.4.4.2. Incremental enumeration
7.5. Policy iteration algorithms
7.6. Conclusion and perspectives
7.7. Bibliography
Chapter 8
Stochastic Games
8.1. Introduction
8.2. Background on game theory
8.2.1. Some basic definitions
8.2.1.1. Criteria distinguishing different forms of games
8.2.2. Static games of complete information
8.2.2.1. Games in strategic form, pure strategy, mixed strategy
8.2.2.2. Zero-sum game and minimax
8.2.2.3. Equilibrium in dominating strategies
8.2.2.4. Nash equilibrium
8.2.3. Dynamic games of complete information
8.2.3.1. Games in extensive form with perfect information
8.2.3.2. Repeated games
8.3. Stochastic games
8.3.1. Definition and equilibrium of a stochastic game
8.3.2. Solving stochastic games
8.3.2.1. Game model available
8.3.2.2. Game model unavailable, A−i observed, equilibrium learners
8.3.2.3. Game model unavailable, A−i observed, opponent modeling
8.3.2.4. Game model unavailable, A−i not observed
8.3.3. Complexity and scalability of multi-agent learning algorithms
8.3.4. Beyond the search for an equilibrium
8.3.4.1. Efficient play
8.3.4.2. Regret minimization
8.3.4.3. Metastrategies
8.3.5. Discussion
8.4. Conclusion and outlook
8.5. Bibliography
Chapter 9
DEC-MDP/POMDP
9.1. Introduction
9.2. Preliminaries
9.3. Multiagent Markov decision processes
9.4. Decentralized control and local observability
9.4.1. Decentralized Markov decision processes
9.4.2. Multiagent team decision problems
9.4.3. Complexity
9.5. Sub-classes of DEC-POMDPs
9.5.1. Transition and observation independence
9.5.2. Goal oriented DEC-POMDPs
9.5.3. DEC-MDPs with constraints
9.5.3.1. Event-driven DEC-MDPs
9.5.3.2. Opportunity-cost DEC-MDPs
9.5.3.3. Problem statement
9.5.3.4. The OC-DEC-MDP model
9.5.4. Communication and DEC-POMDPs
9.5.4.1. DEC-POMDPs with communication
9.5.4.2. Complexity results
9.5.4.3. Discussion
9.6. Algorithms for solving DEC-POMDPs
9.6.1. Optimal algorithms
9.6.1.1. Dynamic programming for DEC-POMDPs
9.6.1.2. Heuristic search for DEC-POMDPs
9.6.1.3. Optimal algorithms for sub-classes of DEC-POMDPs
9.6.2. Approximate algorithms
9.6.2.1. Heuristics and approximate dynamic programming
9.6.2.2. Bounded memory
9.6.2.3. Co-evolutionary algorithms
9.6.2.4. Gradient descent for policy search
9.6.2.5. Bayesian games
9.6.2.6. Heuristics for communicating agents
9.6.2.7. Approximate solutions for OC-DEC-MDPs
9.7. Application scenario: multirobot exploration
9.8. Conclusion and outlook
9.9. Bibliography
Chapter 10
Non-Standard Criteria
10.1. Introduction
10.2. Multicriteria approaches
10.2.1. Multicriteria decision-making
10.2.2. Multicriteria MDPs
10.2.2.1. Operators for multicriteria decision-making
10.3. Robustness in MDPs
10.4. Possibilistic MDPs
10.4.1. Possibilistic counterpart of expected utility
10.4.2. Possibilistic dynamic programming
10.4.2.1. Finite horizon
10.4.2.2. Value iteration
10.4.2.3. Policy iteration
10.4.3. Extensions of possibilistic MDPs
10.4.3.1. Possibilistic reinforcement learning
10.4.3.2. Possibilistic partially observable MDPs
10.4.3.3. Possibilistic influence diagrams (PID)
10.5. Algebraic MDPs
10.5.1. Background
10.5.1.1. Semirings
10.5.1.2. Plausibility measures
10.5.1.3. Generalized expected utility
10.5.2. Definition of an algebraic MDP
10.5.3. Value function of a policy
10.5.4. Conditions
10.5.5. Examples of AMDPs
10.5.5.1. Probabilistic multicriteria AMDP
10.5.5.2. Possibilistic multicriteria AMDPs
10.5.5.3. AMDPs whose rewards are non-decreasing functions
10.6. Conclusion
10.7. Bibliography
Part 3
Applications
Chapter 11
Online Learning for Micro-Object Manipulation
11.1. Introduction
11.2. Manipulation device
11.2.1. Micro-positioning by pushing
11.2.2. Manipulation device
11.2.3. Control loop
11.2.4. Representation of the manipulation task as an MDP
11.2.4.1. Definition of the state space
11.2.4.2. Definition of the action space
11.2.4.3. Definition of the reward function
11.2.4.4. Definition of an episode
11.3. Choice of the reinforcement learning algorithm
11.3.1. Characteristics of the MDP
11.3.2. A suitable algorithm: STM-Q
11.4. Experimental results
11.4.1. Experimental setup
11.4.2. Results
11.5. Conclusion
11.6. Bibliography
Chapter 12
Conservation of Biodiversity
12.1. Introduction
12.2. When to protect, survey or surrender cryptic endangered species
12.2.1. Surveying and managing the Sumatran tiger
12.2.2. The model
12.2.3. Results
12.2.4. Extension to more than one population
12.3. Can sea otters and abalone co-exist?
12.3.1. Abalone and sea otters: two endangered species
12.3.2. The models
12.3.2.1. Population dynamics of abalone
12.3.2.2. Sea otter population model
12.3.2.3. States
12.3.2.4. Decisions
12.3.2.5. Interaction between sea otters and abalone
12.3.2.6. Multicriteria objective and reward function
12.3.3. Methods
12.3.4. Results
12.3.4.1. Scenario 1: sea otter reintroduction and anti-poaching enforcement
12.3.4.2. Scenario 2: control of sea otters
12.3.4.3. Scenario 3: combined action of sea otter control and anti-poaching
12.3.5. Conclusion
12.4. Other applications in conservation biology and discussions
12.5. Bibliography
Chapter 13
Autonomous Helicopter Searching for a Landing Area in an Uncertain Environment
13.1. Introduction
13.2. Exploration scenario
13.2.1. Planning problem
13.2.2. States and actions
13.2.3. Uncertainties
13.2.4. Optimization criterion
13.2.5. Formalization of the decision problem
13.3. Embedded control and decision architecture
13.3.1. Global view
13.3.2. Multi-thread planning triggered by the supervisor
13.3.2.1. Policy optimization
13.3.2.2. Dialogue with the supervisor
13.4. Incremental stochastic dynamic programming
13.4.1. Obtaining the initial safe policy quickly
13.4.2. Generating the sub-space of reachable states
13.4.3. Local policy optimization
13.4.4. Launching local replanning processes
13.5. Flight tests and return on experience
13.6. Conclusion
13.7. Bibliography
Chapter 14
Resource Consumption Control for an Autonomous Robot
14.1. The rover’s mission
14.2. Progressive processing formalism
14.3. MDP/PRU model
14.3.1. States
14.3.2. Actions
14.3.3. Transition function
14.3.4. Reward function
14.4. Policy calculation
14.4.1. Value function
14.4.2. Propagation algorithm
14.5. How to model a real mission
14.6. Extensions
14.7. Conclusion
14.8. Bibliography
Chapter 15
Operations Planning
15.1. Operations planning
15.1.1. Intuition
15.1.1.1. Problem features
15.1.1.2. Plans
15.1.2. Formal definitions
15.1.2.1. Planning problem, operations
15.1.2.2. Execution
15.1.2.3. Decision epochs, states, plans
15.1.2.4. Objective
15.2. MDP value function approaches
15.2.1. Formalizations, CoMDP
15.2.1.1. States, actions, transitions
15.2.1.2. Rewards, costs
15.2.2. Algorithms
15.2.2.1. LRTDP
15.2.2.2. Memory management
15.2.2.3. Reduction of the number of updates
15.2.2.4. Hybrid algorithms
15.2.2.5. Algorithms with upper bounds
15.2.3. Heuristics
15.2.3.1. Basic heuristics
15.2.3.2. Heuristics obtained by relaxation of the CoMDP
15.2.3.3. Planning graph heuristics
15.3. Reinforcement learning: FPG
15.3.1. Employing approximate methods
15.3.2. Parameterized policy
15.3.2.1. Inputs
15.3.2.2. Outputs
15.3.2.3. Function approximator
15.3.3. Gradient methods
15.3.3.1. Terminating an execution
15.3.3.2. Choice of OLpomdp
15.3.3.3. Optimized criterion
15.3.4. Improving FPG
15.4. Experiments
15.5. Conclusion and outlook
15.6. Bibliography
Index