Guided Policy Search
葛维
November 28, 2016
Outline
- Reinforcement Learning
- Policy search
- Differential Dynamic Programming
- Importance Sampling
- Guided Policy Search
Reinforcement Learning

RL representation (tuple): $(S, A, P_{sa}, \gamma, R)$

Value Function:
$$V^\pi(s) = E_\pi\Big[\sum_{i=0}^{\infty} \gamma^i r_i \;\Big|\; s_0 = s\Big]$$

Goal:
$$\pi^*(s) = \arg\max_\pi V^\pi(s)$$

Bellman Equation (deterministic policy):
$$V^\pi(s) = E_\pi\big[\, r(s' \mid s, a) + \gamma V^\pi(s') \mid s_0 = s \,\big] = \sum_{s' \in S} p(s' \mid s, \pi(s))\,\big[\, r(s' \mid s, \pi(s)) + \gamma V^\pi(s') \,\big]$$
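To make the Bellman backup concrete, here is a minimal sketch of policy evaluation on a toy tabular MDP. The transition matrix P, reward matrix R, and discount below are made-up placeholders, not from the slides.

import numpy as np

# Minimal sketch: policy evaluation via repeated Bellman backups on a toy
# 3-state MDP under a fixed policy pi. P, R and gamma are illustrative values.
gamma = 0.9
P = np.array([[0.8, 0.2, 0.0],    # P[s, s'] = p(s' | s, pi(s))
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, 1.0, 0.0],    # R[s, s'] = r(s' | s, pi(s))
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])

V = np.zeros(3)
for _ in range(1000):
    # V(s) = sum_{s'} p(s'|s,pi(s)) * [ r(s'|s,pi(s)) + gamma * V(s') ]
    V_new = (P * (R + gamma * V[None, :])).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)   # converges to the fixed point of the Bellman equation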
Policy search

A simple example (two actions, logistic policy):
$$\pi_\theta(s, a_1) = \frac{1}{1 + e^{-\theta^T s}}, \qquad \pi_\theta(s, a_2) = 1 - \frac{1}{1 + e^{-\theta^T s}}$$

with state and parameter vectors
$$s = \begin{bmatrix} 1 \\ x \\ \dot{x} \\ \phi \\ \dot{\phi} \end{bmatrix}, \qquad \theta = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$
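A minimal sketch of this two-action logistic policy follows. The state layout [1, x, x_dot, phi, phi_dot] comes from the slide; the numeric state values are invented for illustration.

import numpy as np

# Two-action logistic policy: pi_theta(s, a1) = sigmoid(theta^T s).
def policy(theta, s):
    p_a1 = 1.0 / (1.0 + np.exp(-theta @ s))
    return np.array([p_a1, 1.0 - p_a1])      # [pi(s, a1), pi(s, a2)]

theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # weights only the angle phi
s = np.array([1.0, 0.2, -0.1, 0.05, 0.0])    # [1, x, x_dot, phi, phi_dot]
print(policy(theta, s))  # phi > 0, so a1 gets probability slightly above 0.5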
Policy search

Goal:
$$\max_\theta \; E_{\pi_\theta, s_0}\big[\, R(s_0, a_0) + \cdots + R(s_T, a_T) \,\big]$$

So how do we deal with multiple actions? Use a softmax policy:
$$\pi_\theta(s, a_i) = \frac{e^{\theta_i^T s}}{\sum_i e^{\theta_i^T s}}$$
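A minimal sketch of the softmax policy over multiple actions; Theta stacks one parameter vector theta_i per action, and all numbers are illustrative.

import numpy as np

# Softmax policy: pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s).
def softmax_policy(Theta, s):
    logits = Theta @ s            # one logit theta_i^T s per action
    logits -= logits.max()        # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

Theta = np.array([[0.0, 0.0, 0.0,  1.0, 0.0],   # theta_1
                  [0.0, 0.0, 0.0, -1.0, 0.0],   # theta_2
                  [0.0, 0.0, 0.0,  0.0, 1.0]])  # theta_3
s = np.array([1.0, 0.2, -0.1, 0.05, 0.0])
print(softmax_policy(Theta, s))   # probabilities over the 3 actions, sum to 1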
Differential Dynamic Programming

DDP = backward pass + forward pass
1. Backward pass along the nominal trajectory to generate a new control sequence.
2. Forward pass to compute and evaluate a new nominal trajectory.
Differential Dynamic Programming

State equation:
$$x_{i+1} = f(x_i, u_i)$$

Bellman equation:
$$V(x, i) = \min_u \big[\, l(x, u) + V(f(x, u), i+1) \,\big]$$

The argument of the min operator is $l(x, u) + V(f(x, u), i+1)$. Let $Q$ be the variation of this quantity around the $i$-th $(x, u)$ pair:
$$Q(\delta x, \delta u) = l(x + \delta x, u + \delta u) + V(f(x + \delta x, u + \delta u), i+1) - l(x, u) - V(f(x, u), i+1)$$
Differential Dynamic Programming

Expand $Q$ to second order:
$$Q(\delta x, \delta u) \approx \frac{1}{2} \begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}^T \begin{bmatrix} 0 & Q_x^T & Q_u^T \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{bmatrix} \begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}$$

where, writing $V' = V(i+1)$:
$$\begin{aligned}
Q_x &= l_x + f_x^T V'_x \\
Q_u &= l_u + f_u^T V'_x \\
Q_{xx} &= l_{xx} + f_x^T V'_{xx} f_x + V'_x \cdot f_{xx} \\
Q_{uu} &= l_{uu} + f_u^T V'_{xx} f_u + V'_x \cdot f_{uu} \\
Q_{ux} &= l_{ux} + f_u^T V'_{xx} f_x + V'_x \cdot f_{ux}
\end{aligned}$$
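These formulas translate almost line for line into a backward pass. Below is a minimal sketch; for brevity it drops the second-order dynamics terms (V'_x · f_xx, etc.), which is the common iLQR simplification of DDP, and it assumes the per-step derivatives of l and f along the nominal trajectory are precomputed and passed in as lists. All names here are illustrative.

import numpy as np

# Sketch of a DDP/iLQR backward pass built from the Q terms above. The
# second-order dynamics terms are omitted (iLQR approximation). Inputs are
# lists of per-step derivatives along the nominal trajectory; Vx_T and Vxx_T
# are the gradient and Hessian of the final cost at the last state.
def backward_pass(lx, lu, lxx, luu, lux, fx, fu, Vx_T, Vxx_T):
    T = len(lx)
    Vx, Vxx = Vx_T, Vxx_T
    k = [None] * T   # feedforward terms
    K = [None] * T   # feedback gains
    for i in reversed(range(T)):
        Qx  = lx[i]  + fx[i].T @ Vx
        Qu  = lu[i]  + fu[i].T @ Vx
        Qxx = lxx[i] + fx[i].T @ Vxx @ fx[i]
        Quu = luu[i] + fu[i].T @ Vxx @ fu[i]
        Qux = lux[i] + fu[i].T @ Vxx @ fx[i]
        # minimize the quadratic model over delta_u: delta_u = k[i] + K[i] @ delta_x
        Quu_inv = np.linalg.inv(Quu)
        k[i] = -Quu_inv @ Qu
        K[i] = -Quu_inv @ Qux
        # propagate the quadratic value-function approximation back to step i
        Vx  = Qx  + K[i].T @ Quu @ k[i] + K[i].T @ Qu + Qux.T @ k[i]
        Vxx = Qxx + K[i].T @ Quu @ K[i] + K[i].T @ Qux + Qux.T @ K[i]
    return k, K

The forward pass then rolls out the dynamics with the updated controls, applying k[i] as a feedforward correction and K[i] as feedback on the deviation from the nominal state, and the new trajectory becomes the nominal one for the next iteration.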