POMDP Tutorial
Preliminaries: Problem Definition
• Agent model, POMDP, Bayesian RL

[Figure: agent-world loop. The WORLD evolves through transition dynamics and emits observations; the ACTOR maintains a belief b and selects actions through a policy π.]

Markov Decision Process components:
- X: set of states [x_s, x_r] (a state component and a reward component)
- A: set of actions
- T = P(x'|x,a): transition and reward probabilities
- O: observation function
- b: belief / information state
- π: policy
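To make these components concrete, here is a minimal sketch (not from the tutorial) of a discrete POMDP stored as NumPy arrays, together with the standard Bayesian belief update b'(x') ∝ O(o|x',a) Σ_x T(x'|x,a) b(x). The array names, shapes, and toy sizes are illustrative assumptions.

```python
import numpy as np

# Minimal discrete POMDP sketch (illustrative names/shapes, not from the slides):
#   T[a, x, x']  = P(x' | x, a)   transition probabilities
#   O[a, x', o]  = P(o | x', a)   observation probabilities
#   R[a, x]      = expected immediate reward for taking action a in state x
n_states, n_actions, n_obs = 2, 2, 2
rng = np.random.default_rng(0)

T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # (A, X, X')
O = rng.dirichlet(np.ones(n_obs),    size=(n_actions, n_states))   # (A, X', O)
R = rng.normal(size=(n_actions, n_states))                         # (A, X), completes the tuple

def belief_update(b, a, o):
    """Bayes filter: b'(x') ∝ O(o | x', a) * sum_x T(x' | x, a) * b(x)."""
    predicted = b @ T[a]                  # sum_x b(x) T(x'|x,a), shape (X',)
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])                  # uniform initial belief
b = belief_update(b, a=0, o=1)            # belief after taking action 0 and seeing observation 1
print(b)
```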
Sequential decision making with uncertainty in a changing world
• Dynamics – the world and the agent change over time; the agent may have a model of this change process.
• Observation – both the world state and goal achievement are accessed through indirect measurements (sensations of state and reward).
• Beliefs – the agent's understanding of the world state, with uncertainty.
• Goals – encoded by rewards! => Find a policy that maximizes total reward over some duration.
• Value Function – a measure of the goodness of being in a belief.
• Policy – a function that describes how to select an action in each belief state.
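As a toy illustration of the last bullet (a policy over beliefs), the sketch below maps a belief vector to the action with the highest expected immediate reward. The reward table and the myopic rule are assumptions for illustration only; an optimal POMDP policy would also account for long-term value, which is what the value function measures.

```python
import numpy as np

# A policy over beliefs is any map from a belief vector to an action.
# Illustrative myopic (one-step greedy) policy over an assumed 2-state, 2-action reward table.
R = np.array([[ 1.0, -1.0],    # R[a, x]: assumed reward for action a in state x
              [-0.5,  2.0]])

def greedy_policy(belief):
    """Pick the action with the highest expected immediate reward under the belief."""
    expected_reward = R @ belief          # shape (A,): sum_x R[a, x] * b(x)
    return int(np.argmax(expected_reward))

print(greedy_policy(np.array([0.9, 0.1])))   # mostly believes state 0 -> action 0
print(greedy_policy(np.array([0.2, 0.8])))   # mostly believes state 1 -> action 1
```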
Markov Decision Processes (MDPs)
In RL, the environment is modeled as an MDP, defined by
S – set of states of the environment
A(s) – set of actions possible in state s in S
P(s,s',a) – probability of transition from s to s' given a
R(s,s',a) – expected reward on transition from s to s' given a
γ – discount rate for delayed reward
Discrete time, t = 0, 1, 2, ...

[Figure: the interaction unfolds as a trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ...]
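A hedged sketch of how the tuple (S, A, P, R, γ) might be represented and sampled from; the tabular shapes and random numbers are assumptions, not content from the slides.

```python
import numpy as np

# Sketch of the MDP tuple (S, A, P, R, gamma) as arrays; sizes/values are illustrative.
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions, n_states))              # R[s, a, s'] = expected reward
gamma = 0.95                                                      # discount rate

def step(s, a):
    """Sample s' ~ P(.|s,a) and return (s', r), i.e. one tick of discrete time."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a, s_next]

print(step(s=0, a=1))
```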
MDP Model

[Figure: agent-environment loop. The agent receives the state and reward from the environment and sends back an action.]

Process:
• Observe state s_t in S
• Choose action a_t in A(s_t)
• Receive immediate reward r_t
• State changes to s_{t+1}

The resulting trajectory: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...
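The process above can be written as a short rollout loop. This is an illustrative sketch; the toy MDP and the random stand-in policy are assumptions used only to generate a trajectory s_0, a_0, r_0, s_1, a_1, r_1, ...

```python
import numpy as np

# Sketch of the agent-environment loop from the slide, driven by an assumed toy MDP.
n_states, n_actions, horizon = 3, 2, 4
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

s = 0
trajectory = []
for t in range(horizon):
    a = rng.integers(n_actions)                 # choose action a_t (random policy stand-in)
    r = R[s, a]                                 # receive immediate reward r_t
    s_next = rng.choice(n_states, p=P[s, a])    # state changes to s_{t+1}
    trajectory.append((s, a, r))
    s = s_next

print(trajectory)    # [(s0, a0, r0), (s1, a1, r1), ...]
```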
The Objective is to Maximize Long-term Total Discounted Reward

Find a policy π : s ∈ S → a ∈ A(s) (could be stochastic) that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }

and of each state-action pair (s, a):

Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }

These are called value functions (cf. evaluation functions).
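One way to compute V^π from this definition is iterative policy evaluation, which repeatedly applies the Bellman expectation backup V(s) ← Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]. The sketch below assumes a toy tabular MDP and a random stochastic policy; it is an illustration, not the tutorial's code.

```python
import numpy as np

# Policy evaluation sketch: compute V_pi satisfying
#   V_pi(s) = E[ r_{t+1} + gamma * V_pi(s_{t+1}) | s_t = s, pi ]
# by iterating the Bellman expectation backup on an assumed toy MDP.
n_states, n_actions = 3, 2
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = expected immediate reward
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]: stochastic policy
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V               # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    V_new = (pi * Q).sum(axis=1)        # V(s)   = sum_a pi(a|s) Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)   # approximate V_pi; Q above is the corresponding Q_pi
```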
Policy & Value Function
• Which action a should the agent take?
  – In MDPs, a policy is a mapping from state to action, π: S → A
• Value function V_{π,t}(s) given a policy π
  – The expected sum of reward gained from starting in state s and executing the (possibly non-stationary) policy π for t steps.
• Relation
  – The value function is the evaluation of a policy: the long-run value the agent expects to gain from executing that policy.
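For the finite-horizon value V_{π,t}(s), a standard computation is the backward recursion V_0 = 0 and V_k(s) = R(s, π_k(s)) + Σ_{s'} P(s'|s, π_k(s)) V_{k-1}(s'). The sketch below uses an assumed toy MDP and per-step deterministic policies to illustrate this; it is not taken from the slides.

```python
import numpy as np

# Finite-horizon value sketch: V_{pi,t}(s) = expected sum of reward from starting in s
# and following a (possibly non-stationary) policy for t steps, via backward recursion.
n_states, n_actions, horizon = 3, 2, 5
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
policy = rng.integers(n_actions, size=(horizon, n_states))        # policy[k, s]: action with k steps to go

V = np.zeros(n_states)                      # V_0: no steps left, no reward to collect
for k in range(horizon):                    # build V_1, V_2, ..., V_horizon
    a = policy[k]                           # action chosen in each state at this step
    V = R[np.arange(n_states), a] + P[np.arange(n_states), a] @ V

print(V)   # V_{pi, horizon}(s) for each state s
```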