Guided Policy Search
葛维
November 28, 2016
	
Outline
- Reinforcement Learning
- Policy Search
- Differential Dynamic Programming
- Importance Sampling
- Guided Policy Search
Reinforcement	Learning	
RL representation (tuple): (S, A, P_{sa}, \gamma, R)

Value Function:
V^\pi(s) = E_\pi\left[ \sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 = s \right]

Goal:
\pi^*(s) = \arg\max_\pi V^\pi(s)

Bellman Equation (deterministic):
V^\pi(s) = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left[ r(s' \mid s, \pi(s)) + \gamma V^\pi(s') \right]
         = E_\pi\left[ r(s' \mid s, a) + \gamma V^\pi(s') \,\Big|\, s_0 = s \right]
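As a concrete illustration of the Bellman equation, the sketch below iterates the backup V(s) ← Σ_{s'} p(s'|s,π(s))[r(s'|s,π(s)) + γV(s')] to evaluate a fixed policy on a small tabular MDP; the transition matrix, rewards, and discount are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi:
# P[s, s'] = p(s' | s, pi(s)),  R[s, s'] = r(s' | s, pi(s)),  discount gamma.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 0.0]])
gamma = 0.9

# Iterate the Bellman backup V(s) <- sum_s' p(s'|s,pi(s)) [ r(s'|s,pi(s)) + gamma V(s') ]
V = np.zeros(3)
for _ in range(1000):
    V_new = (P * (R + gamma * V[None, :])).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # V^pi(s) for each state under the fixed policy
```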
Policy	search	
A simple example:

\pi_\theta(s, a_1) = \frac{1}{1 + e^{-\theta^T s}}

\pi_\theta(s, a_2) = 1 - \frac{1}{1 + e^{-\theta^T s}}

S = \begin{bmatrix} 1 \\ x \\ \dot{x} \\ \phi \\ \dot{\phi} \end{bmatrix}, \quad
\theta = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
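A minimal sketch of this two-action logistic policy; the feature values in s below are hypothetical, while θ = [0, 0, 0, 1, 0]^T is the example parameter vector from the slide.

```python
import numpy as np

def pi_theta(s, theta):
    """Two-action logistic policy: returns (p(a1 | s), p(a2 | s))."""
    p_a1 = 1.0 / (1.0 + np.exp(-theta @ s))
    return p_a1, 1.0 - p_a1

# Feature vector [1, x, x_dot, phi, phi_dot] (hypothetical values) and the slide's theta.
s = np.array([1.0, 0.5, -0.1, 0.05, 0.0])
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

print(pi_theta(s, theta))  # with this theta, p(a1 | s) depends only on the angle phi
```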
Policy	search	
Goal:
\max_\theta \; E\left[ R(s, a_0) + \cdots + R(s, a_T) \,\Big|\, \pi_\theta, s_0 \right]

So how do we deal with multiple actions?

\pi_\theta(s, a_i) = \frac{e^{\theta_i^T s}}{\sum_i e^{\theta_i^T s}}
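A sketch of this softmax policy over K actions, with one parameter vector θ_i per action; the state and parameter values below are hypothetical, and sampling an action from the resulting distribution is shown for completeness.

```python
import numpy as np

def softmax_policy(s, theta):
    """theta is (K, dim(s)), one row per action; returns p(a_i | s) for all i."""
    logits = theta @ s
    logits -= logits.max()          # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
s = np.array([1.0, 0.5, -0.1, 0.05, 0.0])   # hypothetical state features
theta = rng.normal(size=(3, 5))             # hypothetical parameters for 3 actions
probs = softmax_policy(s, theta)
action = rng.choice(len(probs), p=probs)    # sample an action from pi_theta(s, .)
print(probs, action)
```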
Differential Dynamic Programming
DDP = backward pass + forward pass

1. Backward pass on the nominal trajectory to generate a new control sequence.
2. Forward pass to compute and evaluate a new nominal trajectory.
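The sketch below shows the two passes on a toy scalar linear-quadratic problem. The dynamics, cost, horizon, and the control update δu = k + K δx that minimizes the quadratic model are assumptions made for illustration, not the slides' exact derivation; because the toy dynamics are linear, the second-order dynamics terms (f_xx, f_uu, f_ux) vanish.

```python
import numpy as np

# Toy scalar system (hypothetical): x_{i+1} = f(x_i, u_i) = x_i + dt * u_i,
# running cost l(x, u) = x^2 + 0.1 * u^2, no terminal cost, horizon T.
T, dt = 50, 0.1

def f(x, u):  return x + dt * u
def l(x, u):  return x**2 + 0.1 * u**2

def rollout(x0, us):
    """Forward-integrate the state equation along a control sequence; return trajectory and cost."""
    xs, cost = [x0], 0.0
    for u in us:
        cost += l(xs[-1], u)
        xs.append(f(xs[-1], u))
    return np.array(xs), cost

def backward_pass(xs, us):
    """Backward pass along the nominal trajectory: build the Q terms and a new control law (k, K)."""
    Vx, Vxx = 0.0, 0.0                        # value-function derivatives at the final step
    ks, Ks = np.zeros(T), np.zeros(T)
    for i in reversed(range(T)):
        x, u = xs[i], us[i]
        fx, fu = 1.0, dt                      # derivatives of the (linear) dynamics
        lx, lu, lxx, luu, lux = 2*x, 0.2*u, 2.0, 0.2, 0.0
        Qx  = lx + fx * Vx
        Qu  = lu + fu * Vx
        Qxx = lxx + fx * Vxx * fx             # f_xx = 0 here, so the V'_x f_xx term drops
        Quu = luu + fu * Vxx * fu
        Qux = lux + fu * Vxx * fx
        k, K = -Qu / Quu, -Qux / Quu          # minimizer of the quadratic model in du
        Vx  = Qx + K * Quu * k + K * Qu + Qux * k
        Vxx = Qxx + K * Quu * K + 2 * K * Qux
        ks[i], Ks[i] = k, K
    return ks, Ks

def forward_pass(x0, xs, us, ks, Ks):
    """Forward pass: apply the updated controls to obtain a new nominal trajectory."""
    new_us, x = np.zeros(T), x0
    for i in range(T):
        new_us[i] = us[i] + ks[i] + Ks[i] * (x - xs[i])
        x = f(x, new_us[i])
    return new_us

x0, us = 1.0, np.zeros(T)
for it in range(5):
    xs, cost = rollout(x0, us)
    ks, Ks = backward_pass(xs, us)
    us = forward_pass(x0, xs, us, ks, Ks)
print(rollout(x0, us)[1])                     # trajectory cost decreases across iterations
```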
Differential Dynamic Programming
State equation:
x_{i+1} = f(x_i, u_i)

Bellman equation:
V(x, i) = \min_u \left[ \, l(x, u) + V(f(x, u), i+1) \, \right]

The argument of the \min[\cdot] operator:
l(x, u) + V(f(x, u), i+1)

Let Q be the variation of this quantity around the i-th (x, u) pair:
Q(\delta x, \delta u) = l(x + \delta x, u + \delta u) + V(f(x + \delta x, u + \delta u), i+1) - l(x, u) - V(f(x, u), i+1)
Differential Dynamic Programming
Expand Q(\delta x, \delta u) to second order:

Q(\delta x, \delta u) \approx \frac{1}{2}
\begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}^T
\begin{bmatrix} 0 & Q_x^T & Q_u^T \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{bmatrix}
\begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}
where

Q_x = l_x + f_x^T V'_x
Q_u = l_u + f_u^T V'_x
Q_{xx} = l_{xx} + f_x^T V'_{xx} f_x + V'_x f_{xx}
Q_{uu} = l_{uu} + f_u^T V'_{xx} f_u + V'_x f_{uu}
Q_{ux} = l_{ux} + f_u^T V'_{xx} f_x + V'_x f_{ux}

V' = V(i+1)
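As a quick numerical sanity check on these formulas, the sketch below builds Q_x, Q_u, Q_xx, Q_uu, Q_ux for a hypothetical scalar l, f, and next-step value V', then compares the resulting quadratic model against the exact variation Q(δx, δu) defined on the previous slide. All specific functions and numbers are made up for the example; in the scalar case the transposes are trivial.

```python
import numpy as np

# Hypothetical scalar example: running cost l, dynamics f, and next-step value V' = V(., i+1).
def l(x, u):  return x**2 + 0.5 * u**2 + 0.3 * x * u
def f(x, u):  return x + 0.1 * u + 0.05 * x * u        # mildly nonlinear dynamics
def Vnext(x): return 2.0 * x**2 + x                    # stand-in for V(., i+1)

x, u = 0.7, -0.4                                       # nominal (x, u) pair at step i

# Analytic derivatives of l, f, and V' at the nominal point.
lx, lu        = 2*x + 0.3*u, u + 0.3*x
lxx, luu, lux = 2.0, 1.0, 0.3
fx, fu        = 1 + 0.05*u, 0.1 + 0.05*x
fxx, fuu, fux = 0.0, 0.0, 0.05
Vx, Vxx       = 4.0 * f(x, u) + 1.0, 4.0               # V'_x, V'_xx evaluated at the next state

# Coefficients from the slide (scalar case).
Qx  = lx + fx * Vx
Qu  = lu + fu * Vx
Qxx = lxx + fx * Vxx * fx + Vx * fxx
Quu = luu + fu * Vxx * fu + Vx * fuu
Qux = lux + fu * Vxx * fx + Vx * fux

def Q_exact(dx, du):
    """Exact variation from the previous slide's definition of Q."""
    return l(x+dx, u+du) + Vnext(f(x+dx, u+du)) - l(x, u) - Vnext(f(x, u))

def Q_quad(dx, du):
    """Second-order model assembled from the Q coefficients above."""
    return Qx*dx + Qu*du + 0.5*(Qxx*dx*dx + Quu*du*du) + Qux*du*dx

dx, du = 1e-3, -2e-3
print(Q_exact(dx, du), Q_quad(dx, du))   # agree up to second order in (dx, du)
```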