POMDP Tutorial
Preliminaries: Problem Definition
• Agent model, POMDP, Bayesian RL

[Figure: agent-world loop. The WORLD evolves through transition dynamics and emits observations; the ACTOR maintains a belief b and selects actions through a policy π.]

Markov Decision Process components:
- X: set of states [x_s, x_r] (a state component and a reward component)
- A: set of actions
- T = P(x'|x,a): transition and reward probabilities
- O: observation function
- b: belief / information state
- π: policy
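To make these components concrete, here is a minimal sketch (not from the tutorial) of a discrete POMDP stored as NumPy arrays, together with the standard Bayesian belief update b'(x') ∝ O(o|x',a) Σ_x T(x'|x,a) b(x). The array names, shapes, and toy sizes are illustrative assumptions.

```python
import numpy as np

# Minimal discrete POMDP sketch (illustrative names/shapes, not from the slides):
#   T[a, x, x']  = P(x' | x, a)   transition probabilities
#   O[a, x', o]  = P(o | x', a)   observation probabilities
#   R[a, x]      = expected immediate reward for taking action a in state x
n_states, n_actions, n_obs = 2, 2, 2
rng = np.random.default_rng(0)

T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # (A, X, X')
O = rng.dirichlet(np.ones(n_obs),    size=(n_actions, n_states))   # (A, X', O)
R = rng.normal(size=(n_actions, n_states))                         # (A, X), completes the tuple

def belief_update(b, a, o):
    """Bayes filter: b'(x') ∝ O(o | x', a) * sum_x T(x' | x, a) * b(x)."""
    predicted = b @ T[a]                  # sum_x b(x) T(x'|x,a), shape (X',)
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])                  # uniform initial belief
b = belief_update(b, a=0, o=1)            # belief after taking action 0 and seeing observation 1
print(b)
```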
Sequential decision making with uncertainty in a changing world
• Dynamics – the world and the agent change over time; the agent may have a model of this change process.
• Observation – both the world state and goal achievement are accessed through indirect measurements (sensations of state and reward).
• Beliefs – the agent's understanding of the world state, with uncertainty.
• Goals – encoded by rewards! => Find a policy that maximizes total reward over some duration.
• Value Function – a measure of the goodness of being in a belief.
• Policy – a function that describes how to select an action in each belief state.
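As a toy illustration of the last bullet (a policy over beliefs), the sketch below maps a belief vector to the action with the highest expected immediate reward. The reward table and the myopic rule are assumptions for illustration only; an optimal POMDP policy would also account for long-term value, which is what the value function measures.

```python
import numpy as np

# A policy over beliefs is any map from a belief vector to an action.
# Illustrative myopic (one-step greedy) policy over an assumed 2-state, 2-action reward table.
R = np.array([[ 1.0, -1.0],    # R[a, x]: assumed reward for action a in state x
              [-0.5,  2.0]])

def greedy_policy(belief):
    """Pick the action with the highest expected immediate reward under the belief."""
    expected_reward = R @ belief          # shape (A,): sum_x R[a, x] * b(x)
    return int(np.argmax(expected_reward))

print(greedy_policy(np.array([0.9, 0.1])))   # mostly believes state 0 -> action 0
print(greedy_policy(np.array([0.2, 0.8])))   # mostly believes state 1 -> action 1
```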
Markov Decision Processes (MDPs)
In RL, the environment is modeled as an MDP, defined by
S – set of states of the environment
A(s) – set of actions possible in state s in S
P(s,s',a) – probability of transition from s to s' given a
R(s,s',a) – expected reward on transition from s to s' given a
γ – discount rate for delayed reward
Discrete time, t = 0, 1, 2, ...

[Figure: the interaction unfolds as a trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ...]
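A hedged sketch of how the tuple (S, A, P, R, γ) might be represented and sampled from; the tabular shapes and random numbers are assumptions, not content from the slides.

```python
import numpy as np

# Sketch of the MDP tuple (S, A, P, R, gamma) as arrays; sizes/values are illustrative.
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions, n_states))              # R[s, a, s'] = expected reward
gamma = 0.95                                                      # discount rate

def step(s, a):
    """Sample s' ~ P(.|s,a) and return (s', r), i.e. one tick of discrete time."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a, s_next]

print(step(s=0, a=1))
```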
MDP Model

[Figure: agent-environment loop. The agent receives the state and reward from the environment and sends back an action.]

Process:
• Observe state s_t in S
• Choose action a_t in A(s_t)
• Receive immediate reward r_t
• State changes to s_{t+1}

The resulting trajectory: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...
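The process above can be written as a short rollout loop. This is an illustrative sketch; the toy MDP and the random stand-in policy are assumptions used only to generate a trajectory s_0, a_0, r_0, s_1, a_1, r_1, ...

```python
import numpy as np

# Sketch of the agent-environment loop from the slide, driven by an assumed toy MDP.
n_states, n_actions, horizon = 3, 2, 4
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

s = 0
trajectory = []
for t in range(horizon):
    a = rng.integers(n_actions)                 # choose action a_t (random policy stand-in)
    r = R[s, a]                                 # receive immediate reward r_t
    s_next = rng.choice(n_states, p=P[s, a])    # state changes to s_{t+1}
    trajectory.append((s, a, r))
    s = s_next

print(trajectory)    # [(s0, a0, r0), (s1, a1, r1), ...]
```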
The Objective is to Maximize Long-term Total Discounted Reward

Find a policy π : s ∈ S → a ∈ A(s) (could be stochastic) that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }

and of each state-action pair (s, a):

Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }

These are called value functions (cf. evaluation functions).
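One way to compute V^π from this definition is iterative policy evaluation, which repeatedly applies the Bellman expectation backup V(s) ← Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]. The sketch below assumes a toy tabular MDP and a random stochastic policy; it is an illustration, not the tutorial's code.

```python
import numpy as np

# Policy evaluation sketch: compute V_pi satisfying
#   V_pi(s) = E[ r_{t+1} + gamma * V_pi(s_{t+1}) | s_t = s, pi ]
# by iterating the Bellman expectation backup on an assumed toy MDP.
n_states, n_actions = 3, 2
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = expected immediate reward
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]: stochastic policy
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V               # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    V_new = (pi * Q).sum(axis=1)        # V(s)   = sum_a pi(a|s) Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)   # approximate V_pi; Q above is the corresponding Q_pi
```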
Policy & Value Function
• Which action a should the agent take?
  – In MDPs, a policy is a mapping from state to action, π: S → A
• Value function V_{π,t}(s) given a policy π
  – The expected sum of reward gained from starting in state s and executing the (possibly non-stationary) policy π for t steps.
• Relation
  – The value function is the evaluation of a policy: the long-run value the agent expects to gain from executing that policy.
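For the finite-horizon value V_{π,t}(s), a standard computation is the backward recursion V_0 = 0 and V_k(s) = R(s, π_k(s)) + Σ_{s'} P(s'|s, π_k(s)) V_{k-1}(s'). The sketch below uses an assumed toy MDP and per-step deterministic policies to illustrate this; it is not taken from the slides.

```python
import numpy as np

# Finite-horizon value sketch: V_{pi,t}(s) = expected sum of reward from starting in s
# and following a (possibly non-stationary) policy for t steps, via backward recursion.
n_states, n_actions, horizon = 3, 2, 5
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
policy = rng.integers(n_actions, size=(horizon, n_states))        # policy[k, s]: action with k steps to go

V = np.zeros(n_states)                      # V_0: no steps left, no reward to collect
for k in range(horizon):                    # build V_1, V_2, ..., V_horizon
    a = policy[k]                           # action chosen in each state at this step
    V = R[np.arange(n_states), a] + P[np.arange(n_states), a] @ V

print(V)   # V_{pi, horizon}(s) for each state s
```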