Michael Margaliot, Tel Aviv University, Israel

Adaptive Dynamic Programming: An Introduction

Fei-Yue Wang, Chinese Academy of Sciences, CHINA and University of Arizona, USA
Huaguang Zhang, Northeastern University, CHINA
Derong Liu, Chinese Academy of Sciences, CHINA

Digital Object Identifier 10.1109/MCI.2009.932261

Abstract: In this article, we introduce some recent research trends within the field of adaptive/approximate dynamic programming (ADP), including the variations on the structure of ADP schemes, the development of ADP algorithms, and applications of ADP schemes. For ADP algorithms, the point of focus is that iterative algorithms of ADP can be sorted into two classes: one class is the iterative algorithm with an initial stable policy; the other is the one without the requirement of an initial stable policy. It is generally believed that the latter requires less computation, at the cost of losing the guarantee of system stability during the iteration process. In addition, many recent papers have provided convergence analysis associated with the algorithms developed. Furthermore, we point out some topics for future studies.
Introduction

As is well known, there are many methods for designing stable controls for nonlinear systems. However, stability is only a bare minimum requirement in a system design; ensuring optimality additionally guarantees the stability of the nonlinear system. Dynamic programming is a very useful tool in solving optimization and optimal control problems by employing the principle of optimality. In [16], the principle of optimality is expressed as: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." Dynamic programming comes in several varieties: one can consider discrete-time or continuous-time systems, linear or nonlinear systems, time-invariant or time-varying systems, deterministic or stochastic systems, etc. We first take a look at nonlinear discrete-time (time-varying) dynamical (deterministic) systems: time-varying nonlinear systems cover most of the application areas, and discrete time is the basic consideration for digital computation. Suppose that one is given a discrete-time nonlinear (time-varying) dynamical system

$x(k+1) = F[x(k), u(k), k], \quad k = 0, 1, \ldots$  (1)

where $x \in \mathbb{R}^n$ represents the state vector of the system, $u \in \mathbb{R}^m$ denotes the control action, and $F$ is the system function. Suppose that one associates with this system the performance index (or cost)

$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i} U[x(k), u(k), k]$  (2)

where $U$ is called the utility function and $\gamma$ is the discount factor with $0 < \gamma \le 1$. Note that the function $J$ is dependent on the initial time $i$ and the initial state $x(i)$, and it is referred to as the cost-to-go of state $x(i)$. The objective of the dynamic programming problem is to choose a control sequence $u(k)$, $k = i, i+1, \ldots$, so that the function $J$ (i.e., the cost) in (2) is minimized. According to Bellman, the optimal cost from time $k$ onward is equal to

$J^*(x(k)) = \min_{u(k)} \{ U(x(k), u(k)) + \gamma J^*(x(k+1)) \}$.  (3)

The optimal control $u^*(k)$ at time $k$ is the $u(k)$ which achieves this minimum, i.e.,

$u^*(k) = \arg\min_{u(k)} \{ U(x(k), u(k)) + \gamma J^*(x(k+1)) \}$.  (4)

Equation (3) is the principle of optimality for discrete-time systems. Its importance lies in the fact that it allows one to optimize over only one control vector at a time by working backward in time.

In the nonlinear continuous-time case, the system can be described by

$\dot{x}(t) = F[x(t), u(t), t], \quad t \ge t_0$.  (5)

The cost in this case is defined as

$J(x(t)) = \int_t^{\infty} U(x(\tau), u(\tau)) \, d\tau$.  (6)

For continuous-time systems, Bellman's principle of optimality can be applied, too. The optimal cost $J^*(x_0) = \min J(x_0, u(t))$ will satisfy the Hamilton-Jacobi-Bellman equation

$-\dfrac{\partial J^*(x(t))}{\partial t} = \min_{u \in U} \left\{ U(x(t), u(t), t) + \left( \dfrac{\partial J^*(x(t))}{\partial x(t)} \right)^T F(x(t), u(t), t) \right\} = U(x(t), u^*(t), t) + \left( \dfrac{\partial J^*(x(t))}{\partial x(t)} \right)^T F(x(t), u^*(t), t)$.  (7)

Equations (3) and (7) are called the optimality equations of dynamic programming and are the basis for the implementation of dynamic programming. In the above, if the function $F$ in (1) or (5) and the cost function $J$ in (2) or (6) are known, finding $u(k)$ becomes a simple optimization problem. If the system is modeled by linear dynamics and the cost function to be minimized is quadratic in the state and control, then the optimal control is a linear feedback of the states, where the gains are obtained by solving a standard Riccati equation [47].
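To make the backward recursion in (3) concrete, the following minimal sketch applies it to a toy problem; the scalar dynamics, quadratic utility, grids, and horizon are illustrative assumptions, not taken from the article. The nested loops over the state and control grids also hint at why the procedure scales poorly with state dimension.

```python
import numpy as np

# Backward value recursion per (3): J_k(x) = min_u [ U(x,u) + gamma * J_{k+1}(x') ]
# Toy problem (illustrative): x(k+1) = 0.9 x(k) + u(k), U = x^2 + u^2.
X = np.linspace(-2.0, 2.0, 81)    # discretized state grid
Uc = np.linspace(-1.0, 1.0, 41)   # discretized control grid
N, gamma = 50, 0.95               # horizon and discount factor

J = np.zeros_like(X)              # terminal cost J_N(x) = 0
for k in range(N - 1, -1, -1):    # work backward in time
    J_next = J.copy()
    J = np.empty_like(J_next)
    for i, x in enumerate(X):
        x_next = 0.9 * x + Uc     # successor states for all candidate controls
        # np.interp evaluates J_{k+1} off-grid (and clamps at the grid edges)
        cost = x**2 + Uc**2 + gamma * np.interp(x_next, X, J_next)
        J[i] = cost.min()         # optimize over one control vector at a time
```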
On the other hand, if the system is modeled by nonlinear dynamics or the cost function is nonquadratic, the optimal state feedback control will depend upon the solution of the Hamilton-Jacobi-Bellman (HJB) equation [48], which is generally a nonlinear partial differential equation or difference equation. However, it is often computationally untenable to run true dynamic programming, due to the backward numerical process required for its solution, i.e., as a result of the well-known "curse of dimensionality" [16], [28]. In [69], three curses are identified in resource management and control problems, showing that the cost function J, the theoretical solution of the Hamilton-Jacobi-Bellman equation, is very difficult to obtain except for systems satisfying some very favorable conditions. Over the years, progress has been made to circumvent the "curse of dimensionality" by building a system, called a "critic", to approximate the cost function in dynamic programming (cf. [10], [60], [61], [63], [70], [78], [92], [94], [95]). The idea is to approximate dynamic programming solutions by using a function approximation structure, such as a neural network, to approximate the cost function.

The Basic Structures of ADP

In recent years, adaptive/approximate dynamic programming (ADP) has gained much attention from many researchers seeking approximate solutions of the HJB equation,
cf. [2], [3], [5], [8], [11]-[13], [21], [22], [25], [30], [31], [34], [35], [40], [46], [49], [52], [54], [55], [63], [70], [76], [80], [83], [95], [96], [99], [100]. In 1977, Werbos [91] introduced an approach for ADP that was later called adaptive critic designs (ACDs). ACDs were proposed in [91], [94], [97] as a way of solving dynamic programming problems forward in time. In the literature, there are several synonyms used for "Adaptive Critic Designs" [10], [24], [39], [43], [54], [70], [71], [87], including "Approximate Dynamic Programming" [69], [82], [95], "Asymptotic Dynamic Programming" [75], "Adaptive Dynamic Programming" [63], [64], "Heuristic Dynamic Programming" [46], [93], "Neuro-Dynamic Programming" [17], "Neural Dynamic Programming" [82], [101], and "Reinforcement Learning" [84].

Bertsekas and Tsitsiklis gave an overview of neuro-dynamic programming in their book [17]. They provided the background, gave a detailed introduction to dynamic programming, discussed the neural network architectures and methods for training them, and developed general convergence theorems for stochastic approximation methods as the foundation for the analysis of various neuro-dynamic programming algorithms. They provided the core neuro-dynamic programming methodology, including many mathematical results and methodological insights, and suggested many useful methodologies for applications of neuro-dynamic programming, such as Monte Carlo simulation, online and offline temporal difference methods, the Q-learning algorithm, optimistic policy iteration methods, Bellman error methods, approximate linear programming, and approximate dynamic programming with cost-to-go function approximation. A particularly impressive success that greatly motivated subsequent research was the development of a backgammon playing program by Tesauro [85]. Here a neural network was trained to approximate the optimal cost-to-go function of the game of backgammon by using simulation, that is, by letting the program play against itself. Unlike chess programs, this program did not use lookahead over many steps, so its success can be attributed primarily to the use of a properly trained approximation of the optimal cost-to-go function.

To implement the ADP algorithm, Werbos [95] proposed a means to get around this numerical complexity by using "approximate dynamic programming" formulations. His methods approximate the original problem with a discrete formulation; the solution of the ADP formulation is obtained through a neural network based adaptive critic approach. The main idea of ADP is shown in Fig. 1. He proposed two basic versions, heuristic dynamic programming (HDP) and dual heuristic programming (DHP).

FIGURE 1 Learn from the environment. (The agent's action network produces the control for the dynamic system; a critic evaluates the resulting state through a performance index function, feeding a reward/penalty signal back to the agent.)

Heuristic Dynamic Programming (HDP)

HDP is the most basic and widely applied structure of ADP [13], [38], [72], [79], [90], [93], [104], [106]. The structure of HDP is shown in Fig. 2. HDP is a method for estimating the cost function: estimating the cost function for a given policy only requires samples from the instantaneous utility function $U$, while models of the environment and the instantaneous reward are needed to find the cost function corresponding to the optimal policy. In HDP, the output of the critic network is $\hat J$, which is the estimate of $J$ in equation (2). Training is done by minimizing the following error measure over time:

$\|E_h\| = \sum_k E_h(k) = \sum_k \left[ \hat J(k) - U(k) - \gamma \hat J(k+1) \right]^2$,  (8)

where $\hat J(k) = \hat J[x(k), u(k), k, W_C]$ and $W_C$ represents the parameters of the critic network. When $E_h = 0$ for all $k$, (8) implies that

$\hat J(k) = U(k) + \gamma \hat J(k+1)$,  (9)

and hence $\hat J(k) = \sum_{i=k}^{\infty} \gamma^{i-k} U(i)$, which is the same as (2).

FIGURE 2 The HDP structure. (An action network maps $x(k)$ to $u(k)$, a model network predicts $x(k+1)$, and the critic network evaluates $\hat J$ at both $x(k)$ and $x(k+1)$; the latter is discounted by $\gamma$ and combined with $U(k)$ for training.)
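As a concrete illustration of (8)-(9), the following minimal sketch trains a critic by stochastic gradient descent on the squared temporal-difference error for a fixed policy. The quadratic features, two-state dynamics, utility, and policy are hypothetical choices for illustration, and the critic here depends on the state only, a common simplification.

```python
import numpy as np

# Minimal HDP-style critic training on the error measure (8).
# Assumed for illustration: linear-in-features critic J_hat(x) = w^T phi(x),
# a fixed stabilizing policy, and simple two-state dynamics with utility U.
rng = np.random.default_rng(0)

def phi(x):                           # quadratic features for the critic
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def policy(x):                        # fixed (hypothetical) policy being evaluated
    return -0.5 * x[0]

def step(x, u):                       # hypothetical dynamics and utility U(k)
    x_next = np.array([0.9*x[0] + 0.1*x[1], 0.8*x[1] + u])
    return x_next, x @ x + u**2

w, gamma, lr = np.zeros(3), 0.95, 0.01
for episode in range(200):
    x = rng.uniform(-1.0, 1.0, size=2)
    for k in range(50):
        u = policy(x)
        x_next, U = step(x, u)
        # e(k) = J_hat(k) - U(k) - gamma * J_hat(k+1), cf. (8)
        e = w @ phi(x) - U - gamma * (w @ phi(x_next))
        w -= lr * e * phi(x)          # semi-gradient step on e(k)^2
        x = x_next
```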
Dual Heuristic Programming (DHP)

Dual heuristic programming is a method for estimating the gradient of the cost function, rather than $J$ itself. To do this, a function is needed to describe the gradient of the instantaneous cost function with respect to the state of the system. In the DHP structure, the action network remains the same as the one for HDP, but the second network, called the critic network, takes the costate as its output and the state variables as its inputs.
The critic network's training is more complicated than that in HDP, since we need to take into account all relevant pathways of backpropagation. Training is done by minimizing the following error measure over time:

$\|E_D\| = \sum_k E_D(k) = \sum_k \left[ \dfrac{\partial \hat J(k)}{\partial x(k)} - \dfrac{\partial U(k)}{\partial x(k)} - \gamma \dfrac{\partial \hat J(k+1)}{\partial x(k)} \right]^2$,  (10)

where $\partial \hat J(k)/\partial x(k) = \partial \hat J[x(k), u(k), k, W_C]/\partial x(k)$ and $W_C$ represents the parameters of the critic network. When $E_D = 0$ for all $k$, (10) implies that

$\dfrac{\partial \hat J(k)}{\partial x(k)} = \dfrac{\partial U(k)}{\partial x(k)} + \gamma \dfrac{\partial \hat J(k+1)}{\partial x(k)}$.  (11)

FIGURE 3 The DHP structure. (As in Fig. 2, but the critic networks output the costates $\partial \hat J(x(k))/\partial x(k)$ and $\partial \hat J(x(k+1))/\partial x(k+1)$ instead of cost estimates.)
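A minimal sketch of the costate recursion (11) follows, reusing the illustrative two-state example of the HDP sketch above. The fixed policy is absorbed into the closed-loop matrix A, so the backpropagation pathways through the action and model networks reduce to a single multiplication by A^T; a full DHP implementation must propagate through those networks explicitly.

```python
import numpy as np

# Minimal DHP-style costate training on (10)-(11) for a fixed linear policy.
# Assumed for illustration: closed-loop x(k+1) = A x(k), U(k) = x^T Q x,
# and a linear critic lambda_hat(x) = W x estimating dJ/dx (the costate).
A = np.array([[0.9, 0.1], [-0.5, 0.8]])   # stable closed-loop dynamics
Q = np.eye(2)
W = np.zeros((2, 2))
gamma, lr = 0.95, 0.01
rng = np.random.default_rng(0)

for episode in range(500):
    x = rng.uniform(-1.0, 1.0, size=2)
    for k in range(30):
        x_next = A @ x
        # Target per (11): dU/dx + gamma * (dx(k+1)/dx(k))^T lambda_hat(k+1)
        target = 2.0 * Q @ x + gamma * A.T @ (W @ x_next)
        e = W @ x - target                 # costate TD error, cf. (10)
        W -= lr * np.outer(e, x)           # gradient step on ||e||^2
        x = x_next
```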
Theoretical Developments

In [82], Si et al. summarize the cross-disciplinary theoretical developments of ADP, give an overview of DP and ADP, and discuss their relations to artificial intelligence, approximation theory, control theory, operations research, and statistics. In [69], Powell shows how ADP, when coupled with mathematical programming, can solve (approximately) deterministic or stochastic optimization problems that are far larger than anything that could be solved using existing techniques, and points out directions for improving ADP.

The Development of Structures

In [95], Werbos further gave two other versions called "action-dependent critics," namely ADHDP (also known as Q-learning [89]) and ADDHP. In these two ADP structures, the control is also an input of the critic networks. In 1997, Prokhorov and Wunsch [70] presented more algorithms in the family of ACDs. They discussed the design families of HDP, DHP, and globalized dual heuristic programming (GDHP), and suggested some new improvements to the original GDHP design that promise to be useful for many engineering applications in the areas of optimization and optimal control. Based on one of these modifications, they presented a unified approach to all ACDs, which leads to a generalized training procedure for ACDs. In [26], a realization of ADHDP was suggested: a least squares support vector machine (SVM) regressor is used for generating the control actions, while an SVM-based tree-type neural network (NN) is used as the critic. The GDHP or ADGDHP structure minimizes the error with respect to both the cost and its derivatives; while it is more complex to do this simultaneously, the resulting behavior is expected to be superior. Thus, in [102], GDHP serves as a reconfigurable controller to deal with both abrupt and incipient changes in the plant dynamics due to faults. A novel fault tolerant control (FTC) supervisor is combined with GDHP for the purpose of improving the performance of GDHP for fault tolerant control. When the plant is affected by a known abrupt fault, the new initial conditions of GDHP are loaded from a dynamic model bank (DMB). On the other hand, if the fault is incipient, the reconfigurable controller maintains performance by continuously modifying itself without supervisor intervention. It is noted that the three networks used to implement the GDHP are trained online, utilizing two distinct networks to implement the critic: the first critic network is trained at every iteration, while the second one is updated with a copy of the first at a given period of iterations.

All the ADP structures can realize the same function, that is, to obtain the optimal control policy, but they differ in computation precision and running time. Generally speaking, the computation burden of HDP is low but its computation precision is also low, while GDHP has better precision at the cost of a longer computation time; a detailed comparison can be found in [70]. In [30], [33] and [83], the schematic of direct heuristic dynamic programming is developed; using the approach of [83], the model network in Fig. 1 is no longer needed. Reference [101] makes significant contributions to model-free adaptive critic designs and includes several practical examples for demonstration, including the single inverted pendulum and the triple inverted pendulum. A reinforcement learning-based controller design for nonlinear discrete-time systems with input constraints is presented in [36], where the nonlinear tracking control is implemented with filtered tracking error using direct HDP designs; similar work can be found in [37]. Reference [54] is also about model-free adaptive critic designs. Two approaches for the training of the critic network are provided in [54]: a forward-in-time approach and a backward-in-time approach.
Fig. 4 shows the diagram of the forward-in-time approach. In this approach, we view $\hat J(k)$ in (8) as the output of the critic network to be trained and choose $U(k) + \gamma \hat J(k+1)$ as the training target. Note that $\hat J(k)$ and $\hat J(k+1)$ are obtained using state variables at different time instances. Fig. 5 shows the diagram of the backward-in-time approach. In this approach, we view $\hat J(k+1)$ in (8) as the output of the critic network to be trained and choose $(\hat J(k) - U(k))/\gamma$ as the training target. The training approach of [101] can be considered a backward-in-time approach. In Fig. 4 and Fig. 5, $x(k+1)$ is the output of the model network.

FIGURE 4 Forward-in-time approach. (The critic evaluated at $x(k)$ is trained toward the target $U(k) + \gamma \hat J(k+1)$, with $\hat J(k+1)$ computed from $x(k+1)$.)

FIGURE 5 Backward-in-time approach. (The critic evaluated at $x(k+1)$ is trained toward the target $(\hat J(k) - U(k))/\gamma$.)

An improvement and modification to the two-network architecture, called the "single network adaptive critic (SNAC)," was presented in [65], [66]. This approach eliminates the action network. As a consequence, the SNAC architecture offers three potential advantages: a simpler architecture, a lower computational load (about half of that of the dual network algorithms), and no approximation error from the action network, since that network is eliminated. The SNAC approach is applicable to a wide class of nonlinear systems where the optimal control (stationary) equation can be explicitly expressed in terms of the state and the costate variables. Most of the problems in aerospace, automobile, robotics, and other engineering disciplines can be characterized by nonlinear control-affine equations that yield such a relation. SNAC-based controllers yield excellent tracking performance in applications to microelectromechanical systems, chemical reactors, and high-speed reentry problems. Padhi et al. [65] have proved that for linear systems (where the mapping between the costate at stage $k+1$ and the state at stage $k$ is linear), the solution obtained by the algorithm based on the SNAC structure converges to the solution of the discrete Riccati equation.

Algorithms and Convergence Analysis

The exact solution of the HJB equation is generally impossible to obtain for nonlinear systems. To overcome the difficulty in solving the HJB equation, recursive methods are employed to obtain its solution indirectly. In 1983, Barto et al. [12] developed a neural-computation-based adaptive critic learning method. They divided the state space into boxes and stored the learned information for each box. The algorithm works well, but the number of boxes can be very large for a complicated system. In 1991, Lin and Kim [51] integrated the cerebellar model articulation controller technique [1] with the box-based scheme, mapping the large state space into a smaller physical memory space. With this distributed information storage, there is no need to reserve memory for useless boxes, which makes the structure applicable to problems of larger size. Kleinman [42] pointed out that the solution of the Riccati equation can be obtained by successively solving a sequence of Lyapunov equations; each Lyapunov equation is linear in the cost function of the system and is thus easier to solve than the Riccati equation, which is nonlinear in the cost function.
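Kleinman's procedure is concrete enough to sketch directly. The following assumes the standard continuous-time LQR setting; the system matrices and the initial stabilizing gain are illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Kleinman's iteration [42]: solve the Riccati equation
# A^T P + P A - P B R^-1 B^T P + Q = 0 via a sequence of linear
# Lyapunov equations. Matrices and the initial gain are illustrative.
A = np.array([[0.0, 1.0], [-1.0, 2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.0, 5.0]])           # must make A - B K stable (Hurwitz)

for i in range(20):
    Ac = A - B @ K                   # closed-loop matrix for current policy
    # Policy evaluation: solve Ac^T P + P Ac = -(Q + K^T R K), linear in P
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement: K <- R^-1 B^T P
    K = np.linalg.solve(R, B.T @ P)

# P converges monotonically to the stabilizing Riccati solution.
```

Each pass performs a policy evaluation (one linear Lyapunov solve) followed by a policy improvement (a gain update); the extension by Saridis and Lee discussed next plays the same game for nonlinear dynamics, with the GHJB equation in place of the Lyapunov equation.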
Saridis and Lee [77] extended this idea to the case of nonlinear continuous-time systems, where a recursive method is used to obtain the optimal control of a continuous system by successively solving the generalized Hamilton-Jacobi-Bellman (GHJB) equation and then updating the control action, provided an admissible initial control is given. Although the GHJB equation is linear and easier to solve than the HJB equation, no general solution for the GHJB equation is available. Therefore, successful application of the successive approximation method was limited until the novel work of Beard et al. [15], who used a Galerkin spectral approximation method at each iteration to find approximate solutions to the GHJB equations. Beard and Saridis [14] then employed a series of polynomial functions as basis functions to solve the approximate GHJB equation in continuous time, but this method requires the computation of a large number of integrals, and it is not obvious how to handle explicit constraints on the controls. In [79], HJB equations are motivated and proven on time scales; the authors connected the calculus of time scales and stochastic control via the ADP algorithm and further pointed out three significant directions for the investigation of ADP on time scales. Park [68] employed interpolating wavelets as the basis functions.
On the other hand, Lewis and Abu-Khalaf presented in [9] how to formulate the associated Hamilton-Jacobi-Isaacs (HJI) equation using special nonquadratic supply rates to obtain the nonlinear state feedback control. Next, the fixed-final-time-constrained optimal control of nonlinear systems is studied in [22], [23], based on the neural network solution of the GHJB equation. In order to enhance learning speed and final performance, Wiering and Hasselt combined multiple different reinforcement learning algorithms to design and implement four different ensemble methods in [98]. In [41], a new algorithm for the closed-loop parallel optimal control of weakly coupled nonlinear systems is developed using the successive Galerkin approximation. In [45], the author inspired researchers to develop the experience-based approach, which selects a controller appropriate to the current situation from a repository of existing controller solutions.

Although many papers have discussed the GHJB method for continuous-time systems, there is very little work available on the GHJB method for discrete-time nonlinear systems. A discrete-time version of the approximate GHJB-equation-based control is important, since all controllers are typically implemented using embedded digital hardware. In [21], a successive approximation method using the GHJB equation is proposed to solve the near-optimal control problem for affine nonlinear discrete-time systems; it requires a small perturbation assumption and an initially stable policy. The theory of GHJB in discrete time has also been applied to the linear discrete-time case, which indicates that the optimal control is nothing but the solution of the standard Riccati equation.

On the other hand, in [19], Bradtke et al. implemented a Q-learning policy iteration method for the discrete-time linear quadratic optimal control problem, which requires an initial stable policy. Furthermore, Landelius [44] applied HDP, DHP, ADHDP and ADDHP techniques to the discrete-time linear quadratic optimal control problem without the initial stability condition and discussed their convergence. Based on the work of [44], the improvement of ADP for the discrete-time linear quadratic zero-sum game appearing in the $H_\infty$ optimal control problem is considered in [2], [4]. The optimal strategies for discrete-time quadratic zero-sum games related to the $H_\infty$ optimal control problem are solved in forward time. The idea is to solve for an action-dependent cost function $Q(x, u, w)$ of the zero-sum game instead of solving for the state-dependent cost function $J(x)$, which satisfies a corresponding game algebraic Riccati equation (GARE). Using the Kronecker method, two action networks and one critic network are used that are adaptively tuned in forward time using adaptive critic methods, without the system model information. The convergence analysis is also given to guarantee that the cost function reaches the saddle point of the game.
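To make the Kronecker-based parameterization concrete: in the linear quadratic zero-sum setting the action-dependent cost is an exact quadratic form, so its unknown kernel enters linearly in a Kronecker-product regressor. This identity is standard and is sketched here for illustration rather than quoted from [2], [4]:

```latex
% Quadratic action-dependent cost with augmented vector z = [x^T, u^T, w^T]^T
% and symmetric kernel H:
Q(x,u,w) = z^{\mathsf{T}} H z = (z \otimes z)^{\mathsf{T}} \operatorname{vec}(H)
```

Since $\operatorname{vec}(H)$ appears linearly, it can be estimated online by (recursive) least squares from measured data alone, and the control and disturbance policies are then read off from the partitioned blocks of $H$; this is what makes model-free tuning in forward time possible.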
Moreover, in [3], a greedy HDP iteration scheme is proposed for solving the optimal control problem for nonlinear discrete-time systems with a known mathematical model, which does not require an initially stable policy. The discrete-time system can be written as

$x(k+1) = f(x(k)) + g(x(k)) u(k)$,  (12)

with the cost function

$J(x(k)) = \sum_{i=k}^{\infty} \left( x^T(i) Q x(i) + u^T(i) R u(i) \right)$,  (13)

where $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ are positive definite matrices. Similar to [78], an iterative process, referred to as heuristic dynamic programming (HDP, cf. [95]), is proposed to obtain the optimal control law. Starting from $V_0(x) = 0$, define

$u_i(x(k)) = -\frac{1}{2} R^{-1} g^T(x(k)) \left[ \dfrac{\partial V_i(x(k+1))}{\partial x(k+1)} \right]^T$,
$V_{i+1}(x(k)) = x^T(k) Q x(k) + u_i^T(x(k)) R u_i(x(k)) + V_i(x(k+1))$,  (14)

where $x(k+1) = f(x(k)) + g(x(k)) u_i(x(k))$. Al-Tamimi and Lewis [3] also provided a proof that the cost function sequence converges to the optimal one satisfying the discrete-time Hamilton-Jacobi-Bellman (DT HJB) equation. Zhang, Wei and Luo [104] applied the greedy iterative HDP algorithm to solve the optimal tracking problem; a new performance index is introduced in [104] to obtain better tracking results. Using a system transformation, the optimal tracking problem is changed into an optimal regulation problem, and then the greedy iterative HDP algorithm is introduced to obtain the optimal control for the transformed system. In [105], Zhang and Luo proposed an iteration scheme without the requirement of an initial stable policy, and proved that the cost function sequence converges to the optimal cost function as the number of iteration steps goes to infinity; moreover, the critic and action networks can be tuned adaptively via a model network, without the system plant information. On the other hand, the optimal control problem for linear continuous-time systems without an initial stable policy is studied in [88].
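The essence of (14) is a value iteration that needs no stabilizing initial policy because it starts from $V_0 = 0$. The sketch below implements the greedy step by direct minimization over a control grid instead of the closed-form control update in (14); the scalar $f$, $g$, weights, and grids are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the greedy HDP (value) iteration (14) on a scalar example.
# f, g, weights, and grids are illustrative; the closed-form control update
# of (14) is replaced here by a direct grid search over u.
X = np.linspace(-1.0, 1.0, 101)      # state grid
Uc = np.linspace(-1.0, 1.0, 201)     # control grid for the greedy minimization
Q, R = 1.0, 1.0
f = lambda x: 0.8 * np.sin(x)        # drift term f(x(k))
g = lambda x: 1.0                    # input gain g(x(k))

V = np.zeros_like(X)                 # V_0 = 0: no stabilizing initial policy needed
for i in range(100):
    V_new = np.empty_like(V)
    for j, x in enumerate(X):
        x_next = np.clip(f(x) + g(x) * Uc, X[0], X[-1])
        # Greedy step of (14): V_{i+1}(x) = min_u [ Q x^2 + R u^2 + V_i(x') ]
        V_new[j] = np.min(Q * x**2 + R * Uc**2 + np.interp(x_next, X, V))
    V = V_new
# V now approximates the optimal cost J*, consistent with the proof in [3];
# the greedy minimizer at each state recovers the control u_i.
```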
Murray et al. [63] proposed an iterative ADP scheme for continuous-time nonlinear systems with respect to a quadratic cost function and succeeded in improving the autolanding control of aircraft. The iteration is required to begin with an initial stable policy, and after each iteration the cost function is updated, so the iterative policy is also called "cost iteration". The system is described by the following continuous-time differential equation

$\dot{x} = F(x) + B(x)u, \quad x(t_0) = x_0$,  (15)

with the cost function

$J(x) = \int_{t_0}^{\infty} U(x(t), u(t)) \, dt$,  (16)

where $U(x, u) = q(x) + u^T r(x) u$ is a nonnegative function and $r(x) > 0$. Similar to [78], an iterative process is proposed to obtain the control law. In this case, the optimal control can be simplified to

$u^*(x) = -\frac{1}{2} r^{-1}(x) B^T(x) \left[ \dfrac{dJ^*(x)}{dx} \right]^T$.  (17)

Starting from any stable Lyapunov function $J_0$ (or, alternatively, starting from an arbitrary stable controller $u_0$) and replacing $J^*$ by $J_i$, (17) becomes

$u_i(x) = -\frac{1}{2} r^{-1}(x) B^T(x) \left[ \dfrac{dJ_i(x)}{dx} \right]^T$,  (18)

where $J_i = \int_{t_0}^{\infty} U(x_{i-1}, u_{i-1}) \, dt$ is the cost of the trajectory $x_{i-1}(t)$ of plant (15) under the input $u(t) = u_{i-1}(t)$. Furthermore, Murray et al. gave the convergence analysis of the iterative ADP scheme and the stability proof of the system; before that, most ADP analysis was based on the Riccati equation for linear systems. In [8], based on the work of Lyshevski [58], [59], an iterative ADP method is used to obtain an approximate solution of the cost function of the HJB equation using neural networks (NNs). A monotonic odd function is introduced to change the saturating actuators into nonsaturating ones. This in turn results in a nearly optimal constrained-input state feedback controller suitable for saturated actuators. Different from the iterative ADP scheme in [63], the iterative scheme in [8] adopts policy iteration, which means that after each iteration the policy (or control) function is updated. The convergence and stability analysis can also be found in [8]. Vrabie et al. [88] proposed a new policy iteration technique to solve online the continuous-time LQR problem for a partially model-free system (internal dynamics unknown). They presented an online adaptive critic algorithm in which the actor performs continuous-time control, whereas the critic's correction of the actor's behavior is discrete in time, until the best performance is obtained. The critic evaluates the actor's performance over a period of time and formulates it in a parameterized form. The policy update is a function of the critic's evaluation of the actor. Convergence of the proposed algorithm is established by proving equivalence with an established algorithm [42]. Numerical results using the short-period dynamics of an F-16 aircraft are presented. In [32], a novel linear parameter-varying (LPV) approach for designing ADP neural network controllers is presented. The control performance and the closed-loop stability of the LPV regime are formulated as a set of design equations that are linear with respect to matrix functions of the NN parameters.
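The equivalence with Kleinman's algorithm [42] invoked above is easy to see in the linear quadratic special case; the following worked specialization of (17) is added for illustration and is not part of the original text:

```latex
% LQ specialization of (15)-(17): F(x) = Ax, B(x) = B,
% q(x) = x^T Q x, r(x) = R, and J^*(x) = x^T P x, so dJ^*/dx = 2 P x:
u^*(x) = -\tfrac{1}{2} R^{-1} B^{\mathsf{T}} (2 P x) = -R^{-1} B^{\mathsf{T}} P x
```

Here $P$ is the stabilizing solution of $A^{\mathsf{T}} P + P A - P B R^{-1} B^{\mathsf{T}} P + Q = 0$, and iterating (18) from a stabilizing $u_0$ generates exactly the sequence of Lyapunov solutions in the Kleinman sketch shown earlier.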
Applications

As for industrial applications of ADP algorithms, the focus has been on missile systems [18], autopilots [31], [50], generators [67], power systems [62], [72], communication systems [55], biochemical processes [57], and so on. In [103], improved reinforcement learning methods are proposed to perform navigation in dynamic environments. The difficulties of traditional reinforcement learning in autonomous navigation are presented, and three effective solutions are proposed to overcome them: forgetting Q-learning, feature-based Q-learning, and hierarchical Q-learning. Forgetting Q-learning is proposed to improve performance in a dynamic environment by maintaining possible navigation paths that would be considered unacceptable by traditional Q-learning. Hierarchical Q-learning is proposed as a method of subdividing the problem domain into a set of more manageable ones. Feature-based Q-learning is proposed as a method of enhancing hierarchical Q-learning. In [27], an incoherent control scheme is proposed for accomplishing the state control of a class of quantum systems that have wavefunction-controllable subspaces. This incoherent control scheme provides an alternative quantum engineering strategy for locally controllable quantum systems. In the scheme, the initial states can be unknown identical states, and the controlled system is not necessarily initially controllable.

Applications of adaptive critics in the continuous-time domain were mainly done by using discretization and well-established discrete-time results (e.g., [86]). Various schemes of continuous-time dynamic reinforcement learning were discussed in Campos and Lewis [20] and Rovithakis [74], where the derivative of the Lyapunov function is approximated. Lu, Si and Xie [56] applied direct heuristic dynamic programming (direct HDP) to a large power system stability control problem. A direct HDP controller learns to cope with model deficiencies for nonlinearities and uncertainties on the basis of real system responses instead of a system model. Ray et al. [73] reported a comparison of adaptive critic-based and classical wide-area controllers for power systems. Liu et al. [53] demonstrated good engine torque and exhaust air-fuel ratio (AFR) control with adaptive critic techniques for an engine application. The design was based on neural networks to automatically learn the inherent dynamics, and it advanced the development of a virtual powertrain to improve performance during actual vehicle operations. Enns and Si [29] presented an article on a model-free approach to helicopter control. Jagannathan [81] has extended stability proofs for systems with observers in the feedback loop and applied them to spark engine EGR operation on the basis of reinforcement learning dual control [37]. Al-Tamimi, Abu-Khalaf and Lewis [2] used HDP and DHP structures to solve problems formulated with game-theoretic notions. Their formulation leads to a forward-in-time reinforcement learning algorithm that converges to the Nash equilibrium of the corresponding zero-sum game, and they have provided performance comparisons with an F-16 autopilot problem.
Al-Tamimi et al. [6], [7] extended these results to a model-free environment for linear systems for the control of a power system generator. In these papers, they presented online model-free adaptive critic schemes to solve optimal control problems in both the discrete-time and continuous-time domains for linear systems with unknown dynamics. In the discrete-time case, the solution process leads to solving the underlying game algebraic Riccati equation (GARE) of the corresponding optimal control problem or zero-sum game. In the continuous-time domain, the ADP scheme solves the underlying ARE of the optimal control problem, and it is shown that the continuous-time ADP scheme is nothing but a quasi-Newton method to solve the ARE. In both the continuous-time and discrete-time domains, the adaptive critic algorithms are easy to implement owing to the fact that initial policies are not required to be stabilizing. For the model-based paper, the authors have proved the convergence of the presented algorithm.

Concluding Remarks

In this article, we presented the variations on the structure of ADP schemes, stated the development of the iterative ADP algorithms, and finally summarized industrial applications of ADP schemes. In the future, the study of ADP algorithms for nonlinear continuous-time systems without the requirement of an initially stable policy is important. How to extend ADP algorithms to time-varying and time-delay uncertain nonlinear systems with stability guarantees is another interesting topic. In addition, practical applications of ADP with significant economic impact are in great demand.

Acknowledgment

The authors would like to thank Dr. Ning Jin and Dr. Qinglai Wei for their help in preparing this manuscript.

References
[1] J. S. Albus, “A new approach to manipulator control: The cerebellar model articulation controller (CMAC),” Trans. ASME, J. Dyn. Syst., Meas., Control, vol. 97, pp. 220–227, Sept. 1975.
[2] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, “Adaptive critic designs for discrete-time zero-sum games with application to H∞ control,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 1, pp. 240–247, Feb. 2007.
[3] A. Al-Tamimi and F. L. Lewis, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” in Proc. IEEE Int. Symp. Approximate Dynamic Programming and Reinforcement Learning, Honolulu, HI, Apr. 2007, pp. 38–43.
[4] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control,” Automatica, vol. 43, no. 3, pp. 473–481, 2007.
[5] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 943–949, Aug. 2008.
[6] A. Al-Tamimi, F. L. Lewis, and Y. Wang, “Model-free H-infinity load-frequency controller design for power systems,” in Proc. IEEE Int. Symp. Intelligent Control, 2007, pp. 118–125.
[7] A. Al-Tamimi, D. Vrabie, M. Abu-Khalaf, and F. L. Lewis, “Model-free approximate dynamic programming schemes for linear systems,” in Proc. IJCNN, 2007, pp. 371–378.
[8] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[9] M. Abu-Khalaf, F. L. Lewis, and J.
Huang, “Policy iterations on the Hamilton-Jacobi-Isaacs equation for state feedback control with input saturation,” IEEE Trans. Automat. Contr., vol. 51, no. 12, pp. 1989–1995, Dec. 2006.
[10] S. N. Balakrishnan and V. Biega, “Adaptive-critic-based neural networks for aircraft optimal control,” J. Guid. Control Dyn., vol. 19, pp. 893–898, July 1996.
[11] S. N. Balakrishnan, J. Ding, and F. L. Lewis, “Issues on stability of ADP feedback controllers for dynamical systems,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 913–917, Aug. 2008.
[12] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, pp. 835–846, 1983.
[13] R. W. Beard, “Improving the closed-loop performance of nonlinear systems,” Ph.D. dissertation, Elect. Eng. Dept., Rensselaer Polytech. Inst., Troy, NY, 1995.
[14] R. W. Beard and G. N. Saridis, “Approximate solutions to the time-invariant Hamilton-Jacobi-Bellman equation,” J. Optim. Theory Appl., vol. 96, no. 3, pp. 589–626, 1998.
[15] R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
[16] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
[17] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[18] D. P. Bertsekas, M. L. Homer, D. A. Logan, S. D. Patek, and N. R. Sandell, “Missile defense and interceptor allocation by neuro-dynamic programming,” IEEE Trans. Syst., Man, Cybern. A, vol. 30, no. 1, pp. 42–51, Jan. 2000.
[19] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic control using policy iteration,” in Proc. American Control Conf., Baltimore, MD, June 1994, pp. 3475–3476.
[20] J. Campos and F. L. Lewis, “Adaptive critic neural network for feedforward compensation,” in Proc. American Control Conf., San Diego, CA, June 1999, pp. 2813–2818.
[21] Z. Chen and S. Jagannathan, “Generalized Hamilton-Jacobi-Bellman formulation-based neural network control of affine nonlinear discrete-time systems,” IEEE Trans. Neural Networks, vol. 19, no. 1, pp. 90–106, Jan. 2008.
[22] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “Fixed-final-time-constrained optimal control of nonlinear systems using neural network HJB approach,” IEEE Trans. Neural Networks, vol. 18, no. 6, pp. 1725–1736, Nov. 2007.
[23] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “A neural network solution for fixed-final time optimal control of nonlinear systems,” Automatica, vol. 43, no. 3, pp. 482–490, 2007.
[24] J. Dalton and S. N. Balakrishnan, “A neighboring optimal adaptive critic for missile guidance,” Math. Comput. Model., vol. 23, pp. 175–188, Jan. 1996.
[25] J. Dankert, Y. Lei, and J. Si, “A performance gradient perspective on approximate dynamic programming and its application to partially observable Markov decision processes,” in Proc. Int. Symp. Intelligent Control, Munich, Oct. 2006, pp. 458–463.
[26] A. K. Deb, Jayadeva, M. Gopal, and S. Chandra, “SVM-based tree-type neural networks as a critic in adaptive critic designs for control,” IEEE Trans. Neural Networks, vol. 18, no. 4, pp. 1016–1030, July 2007.
[27] D. Dong, C. Chen, T. J. Tarn, A. Pechen, and H. Rabitz, “Incoherent control of quantum systems with wavefunction-controllable subspaces via quantum reinforcement learning,” IEEE Trans.
Syst., Man, Cybern. B, vol. 38, no. 4, pp. 957–962, Aug. 2008.
[28] S. E. Dreyfus and A. M. Law, The Art and Theory of Dynamic Programming. New York, NY: Academic, 1977.
[29] R. Enns and J. Si, “Apache helicopter stabilization using neural dynamic programming,” J. Guid. Control Dyn., vol. 25, no. 1, pp. 19–25, 2002.
[30] R. Enns and J. Si, “Helicopter trimming and tracking control using direct neural dynamic programming,” IEEE Trans. Neural Networks, vol. 14, no. 4, pp. 929–939, July 2003.
[31] S. Ferrari and R. F. Stengel, “Online adaptive critic flight control,” J. Guid. Control Dyn., vol. 27, no. 5, pp. 777–786, 2004.
[32] S. Ferrari, J. E. Steck, and R. Chandramohan, “Adaptive feedback control by constrained approximate dynamic programming,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 982–987, Aug. 2008.
[33] J. Govindhassamy, S. Mcloone, and G. Irwin, “Second-order training of adaptive critics for online process control,” IEEE Trans. Syst., Man, Cybern. B, vol. 35, no. 2, pp. 381–385, Apr. 2005.
[34] T. Hanselmann, L. Noakes, and A. Zaknich, “Continuous-time adaptive critics,” IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 631–647, May 2007.
[35] R. Havira and J. Lewis, “Computation of quantized controls using differential dynamic programming,” IEEE Trans. Automat. Contr., vol. 17, no. 2, pp. 191–196, Apr. 1972.
[36] P. He and S. Jagannathan, “Reinforcement learning-based output feedback control of nonlinear systems with input constraints,” IEEE Trans. Syst., Man, Cybern. B, vol. 35, no. 1, pp. 150–154, Jan. 2005.
[37] P. He and S. Jagannathan, “Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 2, pp. 425–436, Apr. 2007.
[38] Z. G. Hou and C. P. Wu, “A dynamic programming neural network for large-scale optimization problems,” Acta Autom. Sin., vol. 25, no. 1, pp. 46–51, 2005.
[39] H. Javaherian, D. Liu, Y. Zhang, and O. Kovalenko, “Adaptive critic learning techniques for automotive engine control,” in Proc. American Control Conf., Boston, MA, June 2004, pp. 4066–4071.
[40] N. Jin, D. Liu, T. Huang, and Z. Pang, “Discrete-time adaptive dynamic programming using wavelet basis function neural networks,” in Proc. IEEE Int. Symp. Approximate Dynamic Programming and Reinforcement Learning, Honolulu, HI, Apr. 2007, pp. 135–142.
[41] Y. J. Kim and M. T. Lim, “Parallel optimal control for weakly coupled nonlinear systems using successive Galerkin approximation,” IEEE Trans. Automat. Contr., vol. 53, no. 6, pp. 1542–1547, July 2008.
[42] D. Kleinman, “On an iterative technique for Riccati equation computations,” IEEE Trans. Automat. Contr., vol. 13, no. 1, pp. 114–115, Feb. 1968.
[43] N. V. Kulkarni and K. KrishnaKumar, “Intelligent engine control using an adaptive critic,” IEEE Trans. Contr. Syst. Technol., vol. 11, pp. 164–173, Mar. 2003.
[44] T. Landelius, “Reinforcement learning and distributed local model synthesis,” Ph.D. dissertation, Linköping University, Sweden, 1997.