Matrix Calculus: Derivation and Simple Application

HU, Pili∗

March 30, 2012†

Abstract

Matrix calculus [3] is a very useful tool in many engineering problems. The basic rules of matrix calculus are nothing more than the ordinary calculus rules covered in undergraduate courses. However, using matrix calculus, the derivation process is more compact. This document is adapted from the notes of a course the author recently attended. It builds matrix calculus from scratch; the only prerequisites are basic calculus notions and linear algebra operations. For a quick executive guide, please refer to the cheat sheet in section(4). To see how matrix calculus simplifies the process of derivation, please refer to the application in section(3.4).

∗hupili [at] ie [dot] cuhk [dot] edu [dot] hk
†Last compile: April 24, 2012
Contents

1 Introductory Example
2 Derivation
  2.1 Organization of Elements
  2.2 Deal with Inner Product
  2.3 Properties of Trace
  2.4 Deal with Generalized Inner Product
  2.5 Define Matrix Differential
  2.6 Matrix Differential Properties
  2.7 Schema of Handling Scalar Function
  2.8 Determinant
  2.9 Vector Function and Vector Variable
  2.10 Vector Function Differential
  2.11 Chain Rule
3 Application
  3.1 The 2nd Induced Norm of Matrix
  3.2 General Multivariate Gaussian Distribution
  3.3 Maximum Likelihood Estimation of Gaussian
  3.4 Least Square Error Inference: a Comparison
4 Cheat Sheet
  4.1 Definition
  4.2 Schema for Scalar Function
  4.3 Schema for Vector Function
  4.4 Properties
  4.5 Frequently Used Formula
  4.6 Chain Rule
Acknowledgements
References
Appendix
1 Introductory Example

We start with a one-variable linear function:
\[ f(x) = ax \tag{1} \]
To be coherent, we abuse the partial derivative notation:
\[ \frac{\partial f}{\partial x} = a \tag{2} \]
Extending this function to be multivariate, we have:
\[ f(x) = \sum_i a_i x_i = a^{\mathrm{T}} x \tag{3} \]
where $a = [a_1, a_2, \dots, a_n]^{\mathrm{T}}$ and $x = [x_1, x_2, \dots, x_n]^{\mathrm{T}}$. We first compute the partial derivatives directly:
\[ \frac{\partial f}{\partial x_k} = \frac{\partial \big(\sum_i a_i x_i\big)}{\partial x_k} = a_k \tag{4} \]
for all $k = 1, 2, \dots, n$. Then we organize the $n$ partial derivatives in the following way:
\[
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{bmatrix}
= a \tag{5}
\]
The first equality is by definition and the rest follow from ordinary calculus rules. Eqn(5) is analogous to eqn(2), except that the variable changes from a scalar to a vector. We would thus like to claim the result of eqn(5) directly, without the intermediate steps of solving for the partial derivatives separately. Indeed, we'll soon see that eqn(5) plays a core role in matrix calculus.

The following sections are organized as follows:

• Section(2) builds commonly used matrix calculus rules from ordinary calculus and linear algebra. Necessary and important properties of linear algebra are also proved along the way. This section is not organized beforehand: all results are proved as we need them.

• Section(3) shows some applications using matrix calculus. Table(1) shows the correspondence between Section(2) and Section(3).

• Section(4) gives a cheat sheet of matrix calculus. Note that this cheat sheet may differ from others; users need to work out some basic definitions before applying the rules.
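Eqn(5) is easy to verify numerically. The following NumPy sketch (not part of the original notes; the vector size and random seed are arbitrary choices) assembles the partial derivatives of $f(x) = a^{\mathrm{T}} x$ by central finite differences and checks that the resulting vector equals $a$:

```python
import numpy as np

# Sanity check of eqn(5): for f(x) = a^T x, the vector of partial
# derivatives organized as in the text should equal a itself.
rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
x = rng.standard_normal(n)

def f(v):
    return a @ v  # a^T x

eps = 1e-6
grad = np.empty(n)
for k in range(n):
    e = np.zeros(n)
    e[k] = eps
    # Central difference approximates ∂f/∂x_k, which eqn(4) says is a_k.
    grad[k] = (f(x + e) - f(x - e)) / (2 * eps)

assert np.allclose(grad, a, atol=1e-6)
```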
Table 1: Derivation and Application Correspondence

Derivation    Application
2.1–2.7       3.1
2.9, 2.10     3.2
2.8, 2.11     3.3

2 Derivation

2.1 Organization of Elements

From the introductory example, we already see that matrix calculus does not differ from ordinary calculus in its fundamental rules. However, with a better organization of elements and by proving useful properties, we can simplify the derivation process in real problems. The author would like to adopt the following definition:

Definition 1. For a scalar-valued function $f(x)$, the result $\frac{\partial f}{\partial x}$ has the same size as $x$. That is,
\[
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\
\frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial x_{m1}} & \frac{\partial f}{\partial x_{m2}} & \cdots & \frac{\partial f}{\partial x_{mn}}
\end{bmatrix} \tag{6}
\]

In eqn(2), $x$ is a 1-by-1 matrix and the result $\frac{\partial f}{\partial x} = a$ is also a 1-by-1 matrix. In eqn(5), $x$ is a column vector (known as an n-by-1 matrix) and the result $\frac{\partial f}{\partial x} = a$ has the same size.

Example 1. By this definition, we have:
\[ \frac{\partial f}{\partial x^{\mathrm{T}}} = \left(\frac{\partial f}{\partial x}\right)^{\mathrm{T}} = a^{\mathrm{T}} \tag{7} \]
Note that we only use the organization definition in this example. Later we'll show that, with some matrix properties, this formula can be derived without using $\frac{\partial f}{\partial x}$ as a bridge.

2.2 Deal with Inner Product

Theorem 1. For a multivariate scalar function $f(x) = a^{\mathrm{T}} x$, we have $\frac{\partial f}{\partial x} = a$.
Proof. See the introductory example.

Since $a^{\mathrm{T}} x$ is a scalar, we can equivalently write it as its own trace. Thus:

Proposition 2. For a multivariate scalar function $f(x) = \mathrm{Tr}[a^{\mathrm{T}} x]$, we have $\frac{\partial f}{\partial x} = a$.

Tr[•] is the operator that sums up the diagonal elements of a matrix. In the next section, we'll explore more properties of the trace. As long as we can transform our target function into the form of theorem(1) or proposition(2), the result can be written out directly. Notice that in proposition(2), $a$ and $x$ are both vectors. We'll show later that it also holds for matrix $a$ and $x$, as long as their sizes agree.

2.3 Properties of Trace

Definition 2. The trace of a square matrix is defined as:
\[ \mathrm{Tr}[A] = \sum_i A_{ii} \]

Example 2. Using definitions(1,2), it is very easy to show:
\[ \frac{\partial\, \mathrm{Tr}[A]}{\partial A} = I \tag{8} \]
since only the diagonal elements are kept by the trace operator.

Theorem 3. The matrix trace has the following properties:

• (1) Tr[A + B] = Tr[A] + Tr[B]
• (2) Tr[cA] = c Tr[A]
• (3) Tr[AB] = Tr[BA]
• (4) Tr[A_1 A_2 … A_n] = Tr[A_n A_1 … A_{n−1}]
• (5) Tr[A^T B] = Σ_i Σ_j A_ij B_ij
• (6) Tr[A] = Tr[A^T]

where A, B are matrices of proper sizes, and c is a scalar value.

Proof. See Wikipedia [5] for the proofs.

Here we explain the intuition behind each property, to make it easier to remember. Properties (1) and (2) show the linearity of the trace. Property (3) means that the multiplication of two matrices inside the trace operator is commutative. Note that matrix multiplication without the trace is not commutative, and the commutative property inside the trace does not hold
for more than 2 matrices. Property (4) is a consequence of property (3), obtained by considering $A_1 A_2 \dots A_{n-1}$ as a whole. It is known as the cyclic property: you can rotate the matrices inside a trace operator. Property (5) shows a way to express the sum of element-by-element products using matrix product and trace. Note that the inner product of two vectors is also a sum of element-by-element products; property (5) resembles the vector inner product in form ($A^{\mathrm{T}} B$). The author regards property (5) as the extension of the inner product to matrices (a generalized inner product).

2.4 Deal with Generalized Inner Product

Theorem 4. For a multivariate scalar function $f(x) = \mathrm{Tr}[A^{\mathrm{T}} x]$, we have $\frac{\partial f}{\partial x} = A$. ($A$, $x$ can be matrices.)

Proof. Using property (5) of the trace, we can write $f$ as:
\[ f(x) = \mathrm{Tr}[A^{\mathrm{T}} x] = \sum_{ij} A_{ij} x_{ij} \tag{9} \]
It's easy to show:
\[ \frac{\partial f}{\partial x_{ij}} = \frac{\partial \big(\sum_{ij} A_{ij} x_{ij}\big)}{\partial x_{ij}} \tag{10} \]
\[ = A_{ij} \tag{11} \]
Organizing the elements using definition(1), the theorem is proved.

With this theorem and the properties of the trace, we revisit example(1).

Example 3. For vectors $a$, $x$ and the function $f(x) = a^{\mathrm{T}} x$:
\begin{align}
\frac{\partial f}{\partial x^{\mathrm{T}}} &= \frac{\partial (a^{\mathrm{T}} x)}{\partial x^{\mathrm{T}}} \tag{12} \\
\text{($f$ is scalar)} &= \frac{\partial\, \mathrm{Tr}[a^{\mathrm{T}} x]}{\partial x^{\mathrm{T}}} \tag{13} \\
\text{(property (3))} &= \frac{\partial\, \mathrm{Tr}[x a^{\mathrm{T}}]}{\partial x^{\mathrm{T}}} \tag{14} \\
\text{(property (6))} &= \frac{\partial\, \mathrm{Tr}[a x^{\mathrm{T}}]}{\partial x^{\mathrm{T}}} \tag{15} \\
\text{(property of transpose)} &= \frac{\partial\, \mathrm{Tr}[(a^{\mathrm{T}})^{\mathrm{T}} x^{\mathrm{T}}]}{\partial x^{\mathrm{T}}} \tag{16} \\
\text{(theorem(4))} &= a^{\mathrm{T}} \tag{17}
\end{align}
The result is the same as in example(1), where we used the basic definition. The above example demonstrates the usual way of handling a matrix derivative problem.
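Both the trace properties and theorem(4) can be checked numerically. The NumPy sketch below (an illustration with arbitrary matrix sizes and seed, not part of the original notes) verifies properties (3), (5), (6) on random matrices, then recovers $\frac{\partial\, \mathrm{Tr}[A^{\mathrm{T}} X]}{\partial X} = A$ by finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))
X = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 2))

# Trace properties on random matrices.
assert np.isclose(np.trace((A.T @ A) @ B), np.trace(B @ (A.T @ A)))  # (3) cyclic
assert np.isclose(np.trace(A.T @ X), np.sum(A * X))                  # (5) generalized inner product
assert np.isclose(np.trace(B), np.trace(B.T))                        # (6) transpose-invariance

# Theorem 4: for f(X) = Tr[A^T X], finite differences recover ∂f/∂X = A.
def f(M):
    return np.trace(A.T @ M)

eps = 1e-6
grad = np.empty_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad, A, atol=1e-6)
```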
2.5 Define Matrix Differential

Although we want the matrix derivative most of the time, it turns out that the matrix differential is easier to operate with, due to the form-invariance property of the differential. The matrix differential inherits this property as a natural consequence of the following definition.

Definition 3. Define the matrix differential:
\[
dA =
\begin{bmatrix}
dA_{11} & dA_{12} & \cdots & dA_{1n} \\
dA_{21} & dA_{22} & \cdots & dA_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
dA_{m1} & dA_{m2} & \cdots & dA_{mn}
\end{bmatrix} \tag{18}
\]

Theorem 5. The differential operator is distributive through the trace operator:
\[ d\,\mathrm{Tr}[A] = \mathrm{Tr}[dA] \]

Proof.
\[ \mathrm{LHS} = d\Big(\sum_i A_{ii}\Big) = \sum_i dA_{ii} \tag{19} \]
\[
\mathrm{RHS} = \mathrm{Tr}\begin{bmatrix}
dA_{11} & dA_{12} & \cdots & dA_{1n} \\
dA_{21} & dA_{22} & \cdots & dA_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
dA_{m1} & dA_{m2} & \cdots & dA_{mn}
\end{bmatrix} \tag{20}
\]
\[ = \sum_i dA_{ii} = \mathrm{LHS} \tag{21} \]

Now that the matrix differential is well defined, we want to relate it back to the matrix derivative. The scalar-version differential and derivative can be related as follows:
\[ df = \frac{\partial f}{\partial x}\, dx \tag{22} \]
So far, we're dealing with a scalar function $f$ and a matrix variable $x$. Both $\frac{\partial f}{\partial x}$ and $dx$ are matrices according to the definitions. In order to make the two sides of eqn(22) equal, we must figure out a way to make the RHS a scalar. It's not surprising that the trace is what we want.

Theorem 6.
\[ df = \mathrm{Tr}\left[\left(\frac{\partial f}{\partial x}\right)^{\mathrm{T}} dx\right] \tag{23} \]
for a scalar function $f$ and an arbitrarily sized $x$.
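Theorem(6) can be illustrated numerically with a small perturbation standing in for $dx$. The sketch below (an illustration on the function $f(X) = \mathrm{Tr}[A^{\mathrm{T}} X]$, whose derivative $A$ we already know from theorem(4); sizes and seed are arbitrary) checks that the actual change in $f$ matches $\mathrm{Tr}[(\frac{\partial f}{\partial X})^{\mathrm{T}} dX]$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))
dX = 1e-7 * rng.standard_normal((3, 3))  # small perturbation playing the role of dx

def f(M):
    return np.trace(A.T @ M)

# Theorem(6): df = Tr[(∂f/∂X)^T dX], and here ∂f/∂X = A by theorem(4).
df_actual = f(X + dX) - f(X)
df_predicted = np.trace(A.T @ dX)
assert np.isclose(df_actual, df_predicted, rtol=1e-6, atol=1e-12)
```

Because this particular $f$ is linear, the match is exact up to floating-point rounding; for a nonlinear $f$ the agreement would hold to first order in the perturbation.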
Proof.
\[ \mathrm{LHS} = df \tag{24} \]
\[ \text{(definition of scalar differential)} = \sum_{ij} \frac{\partial f}{\partial x_{ij}}\, dx_{ij} \tag{25} \]
\[ \mathrm{RHS} = \mathrm{Tr}\left[\left(\frac{\partial f}{\partial x}\right)^{\mathrm{T}} dx\right] \tag{26} \]
\[ \text{(trace property (5))} = \sum_{ij} \left(\frac{\partial f}{\partial x}\right)_{ij} (dx)_{ij} \tag{27} \]
\[ \text{(definition(3))} = \sum_{ij} \left(\frac{\partial f}{\partial x}\right)_{ij} dx_{ij} \tag{28} \]
\[ \text{(definition(1))} = \sum_{ij} \frac{\partial f}{\partial x_{ij}}\, dx_{ij} = \mathrm{LHS} \tag{29} \]

Theorem(6) is the bridge between the matrix derivative and the matrix differential. We'll see in later applications that the matrix differential is more convenient to manipulate: after certain manipulations we can reach the form of theorem(6), and then we can directly write out the matrix derivative using this theorem.

2.6 Matrix Differential Properties

Theorem 7. We claim the following properties of the matrix differential:

• d(cA) = c dA
• d(A + B) = dA + dB
• d(AB) = (dA)B + A(dB)

Proof. They're all natural consequences of definition(3). We only show the third one in this document. Note that the equality holds iff the LHS and RHS are equal element by element. We consider the (ij)-th element:
\[ \mathrm{LHS}_{ij} = d\Big(\sum_k A_{ik} B_{kj}\Big) = \sum_k \big(dA_{ik}\, B_{kj} + A_{ik}\, dB_{kj}\big) \]
\[ \mathrm{RHS}_{ij} = (dA\, B)_{ij} + (A\, dB)_{ij} = \sum_k dA_{ik}\, B_{kj} + \sum_k A_{ik}\, dB_{kj} = \mathrm{LHS}_{ij} \]
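The product rule $d(AB) = (dA)B + A(dB)$ can also be seen numerically: perturbing $A$ and $B$ by small increments, the first-order change of the product matches $(dA)B + A(dB)$ up to the second-order term $dA\,dB$. A short NumPy illustration (arbitrary sizes and seed, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
dA = 1e-7 * rng.standard_normal((3, 3))
dB = 1e-7 * rng.standard_normal((3, 3))

# (A + dA)(B + dB) - AB = dA·B + A·dB + dA·dB; the last term is second
# order (~1e-14 here) and negligible relative to the first-order terms.
lhs = (A + dA) @ (B + dB) - A @ B
rhs = dA @ B + A @ dB
assert np.allclose(lhs, rhs, atol=1e-12)
```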