Guided Policy Search
葛维
November 28, 2016
	
Outline
- Reinforcement Learning
- Policy Search
- Differential Dynamic Programming
- Importance Sampling
- Guided Policy Search
Reinforcement	Learning	
RL representation (tuple): (S, A, P_{sa}, \gamma, R)

Value Function:
V^\pi(s) = E_\pi\left[ \sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 = s \right]

Goal:
\pi^*(s) = \arg\max_\pi V^\pi(s)

Bellman Equation (deterministic):
V^\pi(s) = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left[ r(s' \mid s, \pi(s)) + \gamma V^\pi(s') \right]
         = E_\pi\left[ r(s' \mid s, a) + \gamma V^\pi(s') \,\Big|\, s_0 = s \right]
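As a concrete illustration of the Bellman equation, the sketch below iterates the backup V(s) ← Σ_{s'} p(s'|s,π(s))[r(s'|s,π(s)) + γV(s')] to evaluate a fixed policy on a small tabular MDP; the transition matrix, rewards, and discount are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi:
# P[s, s'] = p(s' | s, pi(s)),  R[s, s'] = r(s' | s, pi(s)),  discount gamma.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 0.0]])
gamma = 0.9

# Iterate the Bellman backup V(s) <- sum_s' p(s'|s,pi(s)) [ r(s'|s,pi(s)) + gamma V(s') ]
V = np.zeros(3)
for _ in range(1000):
    V_new = (P * (R + gamma * V[None, :])).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # V^pi(s) for each state under the fixed policy
```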
Policy	search	
A simple example:

\pi_\theta(s, a_1) = \frac{1}{1 + e^{-\theta^T s}}

\pi_\theta(s, a_2) = 1 - \frac{1}{1 + e^{-\theta^T s}}

S = \begin{bmatrix} 1 \\ x \\ \dot{x} \\ \phi \\ \dot{\phi} \end{bmatrix}, \quad
\theta = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
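A minimal sketch of this two-action logistic policy; the feature values in s below are hypothetical, while θ = [0, 0, 0, 1, 0]^T is the example parameter vector from the slide.

```python
import numpy as np

def pi_theta(s, theta):
    """Two-action logistic policy: returns (p(a1 | s), p(a2 | s))."""
    p_a1 = 1.0 / (1.0 + np.exp(-theta @ s))
    return p_a1, 1.0 - p_a1

# Feature vector [1, x, x_dot, phi, phi_dot] (hypothetical values) and the slide's theta.
s = np.array([1.0, 0.5, -0.1, 0.05, 0.0])
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

print(pi_theta(s, theta))  # with this theta, p(a1 | s) depends only on the angle phi
```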
Policy	search	
Goal:
\max_\theta \; E\left[ R(s, a_0) + \cdots + R(s, a_T) \,\Big|\, \pi_\theta, s_0 \right]

So how do we deal with multiple actions?

\pi_\theta(s, a_i) = \frac{e^{\theta_i^T s}}{\sum_i e^{\theta_i^T s}}
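A sketch of this softmax policy over K actions, with one parameter vector θ_i per action; the state and parameter values below are hypothetical, and sampling an action from the resulting distribution is shown for completeness.

```python
import numpy as np

def softmax_policy(s, theta):
    """theta is (K, dim(s)), one row per action; returns p(a_i | s) for all i."""
    logits = theta @ s
    logits -= logits.max()          # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
s = np.array([1.0, 0.5, -0.1, 0.05, 0.0])   # hypothetical state features
theta = rng.normal(size=(3, 5))             # hypothetical parameters for 3 actions
probs = softmax_policy(s, theta)
action = rng.choice(len(probs), p=probs)    # sample an action from pi_theta(s, .)
print(probs, action)
```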
Differential Dynamic Programming
DDP = backward pass + forward pass

1. Backward pass on the nominal trajectory to generate a new control sequence.
2. Forward pass to compute and evaluate a new nominal trajectory.
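The sketch below shows the two passes on a toy scalar linear-quadratic problem. The dynamics, cost, horizon, and the control update δu = k + K δx that minimizes the quadratic model are assumptions made for illustration, not the slides' exact derivation; because the toy dynamics are linear, the second-order dynamics terms (f_xx, f_uu, f_ux) vanish.

```python
import numpy as np

# Toy scalar system (hypothetical): x_{i+1} = f(x_i, u_i) = x_i + dt * u_i,
# running cost l(x, u) = x^2 + 0.1 * u^2, no terminal cost, horizon T.
T, dt = 50, 0.1

def f(x, u):  return x + dt * u
def l(x, u):  return x**2 + 0.1 * u**2

def rollout(x0, us):
    """Forward-integrate the state equation along a control sequence; return trajectory and cost."""
    xs, cost = [x0], 0.0
    for u in us:
        cost += l(xs[-1], u)
        xs.append(f(xs[-1], u))
    return np.array(xs), cost

def backward_pass(xs, us):
    """Backward pass along the nominal trajectory: build the Q terms and a new control law (k, K)."""
    Vx, Vxx = 0.0, 0.0                        # value-function derivatives at the final step
    ks, Ks = np.zeros(T), np.zeros(T)
    for i in reversed(range(T)):
        x, u = xs[i], us[i]
        fx, fu = 1.0, dt                      # derivatives of the (linear) dynamics
        lx, lu, lxx, luu, lux = 2*x, 0.2*u, 2.0, 0.2, 0.0
        Qx  = lx + fx * Vx
        Qu  = lu + fu * Vx
        Qxx = lxx + fx * Vxx * fx             # f_xx = 0 here, so the V'_x f_xx term drops
        Quu = luu + fu * Vxx * fu
        Qux = lux + fu * Vxx * fx
        k, K = -Qu / Quu, -Qux / Quu          # minimizer of the quadratic model in du
        Vx  = Qx + K * Quu * k + K * Qu + Qux * k
        Vxx = Qxx + K * Quu * K + 2 * K * Qux
        ks[i], Ks[i] = k, K
    return ks, Ks

def forward_pass(x0, xs, us, ks, Ks):
    """Forward pass: apply the updated controls to obtain a new nominal trajectory."""
    new_us, x = np.zeros(T), x0
    for i in range(T):
        new_us[i] = us[i] + ks[i] + Ks[i] * (x - xs[i])
        x = f(x, new_us[i])
    return new_us

x0, us = 1.0, np.zeros(T)
for it in range(5):
    xs, cost = rollout(x0, us)
    ks, Ks = backward_pass(xs, us)
    us = forward_pass(x0, xs, us, ks, Ks)
print(rollout(x0, us)[1])                     # trajectory cost decreases across iterations
```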
Differential Dynamic Programming
State equation:
x_{i+1} = f(x_i, u_i)

Bellman equation:
V(x, i) = \min_u \left[ \, l(x, u) + V(f(x, u), i+1) \, \right]

The argument of the \min[\cdot] operator:
l(x, u) + V(f(x, u), i+1)

Let Q be the variation of this quantity around the i-th (x, u) pair:
Q(\delta x, \delta u) = l(x + \delta x, u + \delta u) + V(f(x + \delta x, u + \delta u), i+1) - l(x, u) - V(f(x, u), i+1)
Differential Dynamic Programming
Expand Q(\delta x, \delta u) to second order:

Q(\delta x, \delta u) \approx \frac{1}{2}
\begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}^T
\begin{bmatrix} 0 & Q_x^T & Q_u^T \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{bmatrix}
\begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}
where

Q_x = l_x + f_x^T V'_x
Q_u = l_u + f_u^T V'_x
Q_{xx} = l_{xx} + f_x^T V'_{xx} f_x + V'_x f_{xx}
Q_{uu} = l_{uu} + f_u^T V'_{xx} f_u + V'_x f_{uu}
Q_{ux} = l_{ux} + f_u^T V'_{xx} f_x + V'_x f_{ux}

V' = V(i+1)
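As a quick numerical sanity check on these formulas, the sketch below builds Q_x, Q_u, Q_xx, Q_uu, Q_ux for a hypothetical scalar l, f, and next-step value V', then compares the resulting quadratic model against the exact variation Q(δx, δu) defined on the previous slide. All specific functions and numbers are made up for the example; in the scalar case the transposes are trivial.

```python
import numpy as np

# Hypothetical scalar example: running cost l, dynamics f, and next-step value V' = V(., i+1).
def l(x, u):  return x**2 + 0.5 * u**2 + 0.3 * x * u
def f(x, u):  return x + 0.1 * u + 0.05 * x * u        # mildly nonlinear dynamics
def Vnext(x): return 2.0 * x**2 + x                    # stand-in for V(., i+1)

x, u = 0.7, -0.4                                       # nominal (x, u) pair at step i

# Analytic derivatives of l, f, and V' at the nominal point.
lx, lu        = 2*x + 0.3*u, u + 0.3*x
lxx, luu, lux = 2.0, 1.0, 0.3
fx, fu        = 1 + 0.05*u, 0.1 + 0.05*x
fxx, fuu, fux = 0.0, 0.0, 0.05
Vx, Vxx       = 4.0 * f(x, u) + 1.0, 4.0               # V'_x, V'_xx evaluated at the next state

# Coefficients from the slide (scalar case).
Qx  = lx + fx * Vx
Qu  = lu + fu * Vx
Qxx = lxx + fx * Vxx * fx + Vx * fxx
Quu = luu + fu * Vxx * fu + Vx * fuu
Qux = lux + fu * Vxx * fx + Vx * fux

def Q_exact(dx, du):
    """Exact variation from the previous slide's definition of Q."""
    return l(x+dx, u+du) + Vnext(f(x+dx, u+du)) - l(x, u) - Vnext(f(x, u))

def Q_quad(dx, du):
    """Second-order model assembled from the Q coefficients above."""
    return Qx*dx + Qu*du + 0.5*(Qxx*dx*dx + Quu*du*du) + Qux*du*dx

dx, du = 1e-3, -2e-3
print(Q_exact(dx, du), Q_quad(dx, du))   # agree up to second order in (dx, du)
```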