Guided Policy Search
葛 维
November 28, 2016
Outline
• Reinforcement Learning
• Policy search
• Differential Dynamic Programming
• Importance Sampling
• Guided Policy Search
Reinforcement Learning
RL representation (tuple): $(S, A, P_{sa}, \gamma, R)$

Value Function:
$$V^\pi(s) = E_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_i \;\middle|\; s_0 = s\right]$$

Goal:
$$\pi^*(s) = \arg\max_\pi V^\pi(s)$$

Bellman Equation (deterministic):
$$V^\pi(s) = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left[ r(s' \mid s, \pi(s)) + \gamma V^\pi(s') \right]
           = E_\pi\left[ r(s' \mid s, a) + \gamma V^\pi(s') \;\middle|\; s_0 = s \right]$$
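As a concrete illustration (not from the slides), here is a minimal NumPy sketch of iterative policy evaluation that applies the Bellman backup above to a small tabular MDP; the arrays P and R and the policy array pi are hypothetical inputs:

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    # P[s, a, s'] : transition probabilities p(s' | s, a)
    # R[s, a, s'] : rewards r(s' | s, a)
    # pi[s]       : action chosen by the deterministic policy in state s
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup for every state under the fixed policy pi.
        V_new = np.array([P[s, pi[s]] @ (R[s, pi[s]] + gamma * V)
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new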
Policy search
A simple example with two actions:

$$\pi_\theta(s, a_1) = \frac{1}{1 + e^{-\theta^T s}}, \qquad
\pi_\theta(s, a_2) = 1 - \frac{1}{1 + e^{-\theta^T s}}$$

where

$$s = \begin{bmatrix} 1 \\ x \\ \dot{x} \\ \phi \\ \dot{\phi} \end{bmatrix}, \qquad
\theta = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$
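A minimal sketch of this two-action sigmoid policy in NumPy; theta is the parameter vector shown on the slide, while the numeric state below is hypothetical:

import numpy as np

def pi_theta(s, theta):
    # pi_theta(s, a1) = 1 / (1 + exp(-theta^T s)); a2 gets the remaining probability.
    p_a1 = 1.0 / (1.0 + np.exp(-theta @ s))
    return np.array([p_a1, 1.0 - p_a1])

theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])   # parameter vector from the slide
s = np.array([1.0, 0.2, -0.1, 0.05, 0.0])     # hypothetical state [1, x, x_dot, phi, phi_dot]
print(pi_theta(s, theta))                     # [P(a1 | s), P(a2 | s)]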
Policy search
Goal:
$$\max_\theta \; E\left[ R(s_0, a_0) + \cdots + R(s_T, a_T) \;\middle|\; \pi_\theta, s_0 \right]$$

So how do we deal with multiple actions? Use a softmax over one parameter vector per action:

$$\pi_\theta(s, a_i) = \frac{e^{\theta_i^T s}}{\sum_j e^{\theta_j^T s}}$$
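A minimal sketch of this softmax policy, assuming the parameters are stacked as a matrix Theta with one row theta_i per action:

import numpy as np

def softmax_policy(s, Theta):
    # pi_theta(s, a_i) = exp(theta_i^T s) / sum_j exp(theta_j^T s)
    logits = Theta @ s
    logits -= logits.max()        # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()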
Differential Dynamic Programming

DDP = backward pass + forward pass:
1. Backward pass along the nominal trajectory to generate a new control sequence.
2. Forward pass to compute and evaluate a new nominal trajectory (see the sketch below).
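A structural sketch of this alternation; backward_pass and forward_pass are hypothetical callables standing in for the two phases, not implementations:

def ddp(x_nominal, u_nominal, backward_pass, forward_pass, n_iters=50):
    for _ in range(n_iters):
        # 1. Backward pass along the nominal trajectory -> new control sequence.
        controls = backward_pass(x_nominal, u_nominal)
        # 2. Forward pass -> compute and evaluate the new nominal trajectory.
        x_nominal, u_nominal = forward_pass(x_nominal, controls)
    return x_nominal, u_nominal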
Differential Dynamic Programming

State equation:
$$x_{i+1} = f(x_i, u_i)$$

Bellman equation:
$$V(x, i) = \min_u \left[\, l(x, u) + V(f(x, u), i+1) \,\right]$$

The argument of the $\min[\cdot]$ operator:
$$l(x, u) + V(f(x, u), i+1)$$

Let $Q$ be the variation of this quantity around the $i$-th $(x, u)$ pair:
$$Q(\delta x, \delta u) = l(x + \delta x, u + \delta u) + V(f(x + \delta x, u + \delta u), i+1) - l(x, u) - V(f(x, u), i+1)$$
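A minimal sketch that evaluates Q(δx, δu) directly from this definition, assuming hypothetical callables l (stage cost), f (dynamics), and V_next (the value function at step i+1):

def Q_variation(x, u, dx, du, l, f, V_next):
    # Variation of l(x,u) + V(f(x,u), i+1) around the nominal pair (x, u).
    return (l(x + dx, u + du) + V_next(f(x + dx, u + du))
            - l(x, u) - V_next(f(x, u)))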
Differential Dynamic Programming

Expand to second order:

$$Q(\delta x, \delta u) \approx \frac{1}{2}
\begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}^T
\begin{bmatrix} 0 & Q_x^T & Q_u^T \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{bmatrix}
\begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}$$

where

$$Q_x = l_x + f_x^T V'_x$$
$$Q_u = l_u + f_u^T V'_x$$
$$Q_{xx} = l_{xx} + f_x^T V'_{xx} f_x + V'_x f_{xx}$$
$$Q_{uu} = l_{uu} + f_u^T V'_{xx} f_u + V'_x f_{uu}$$
$$Q_{ux} = l_{ux} + f_u^T V'_{xx} f_x + V'_x f_{ux}$$

and $V' = V(\cdot, i+1)$ denotes the value function at the next time step.
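A minimal NumPy sketch (assumed array shapes, hypothetical function name) that evaluates these five terms from the cost derivatives, the dynamics derivatives, and the next-step value expansion V'_x, V'_xx; the second-derivative tensors f_xx, f_uu, f_ux are contracted against V'_x with tensordot:

import numpy as np

def q_terms(lx, lu, lxx, luu, lux, fx, fu, fxx, fuu, fux, Vx, Vxx):
    # Vx, Vxx : gradient and Hessian of V' = V(., i+1) at the next state.
    # fxx, fuu, fux : rank-3 tensors of second derivatives of the dynamics f.
    Qx  = lx + fx.T @ Vx
    Qu  = lu + fu.T @ Vx
    Qxx = lxx + fx.T @ Vxx @ fx + np.tensordot(Vx, fxx, axes=1)
    Quu = luu + fu.T @ Vxx @ fu + np.tensordot(Vx, fuu, axes=1)
    Qux = lux + fu.T @ Vxx @ fx + np.tensordot(Vx, fux, axes=1)
    return Qx, Qu, Qxx, Quu, Qux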