Notes and Solutions for: Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas.

John L. Weatherwax∗

January 19, 2006

Introduction

Here you'll find some notes that I wrote up as I worked through this excellent book. I've worked hard to make these notes as good as I can, but I have no illusions that they are perfect. If you feel that there is a better way to accomplish or explain an exercise or derivation presented in these notes, or that one or more of the explanations is unclear, incomplete, or misleading, please tell me. If you find an error of any kind (technical, grammatical, typographical, whatever), please tell me that, too. I'll gladly add to the acknowledgments in later printings the name of the first person to bring each problem to my attention.

Acknowledgments

Special thanks to (most recent comments are listed first): Karonny F and Mohammad Heshajin for helping improve these notes and solutions.

All comments (no matter how small) are much appreciated. In fact, if you find these notes useful I would appreciate a contribution in the form of a solution to a problem that I did not work, a mathematical derivation of a statement or comment made in the book that was unclear, a piece of code that implements one of the algorithms discussed, or a correction to a typo (spelling, grammar, etc.) in these notes. Sort of a "take a penny, leave a penny" type of approach. Remember: pay it forward.

∗ wax@alum.mit.edu
Classifiers Based on Bayes Decision Theory

Notes on the text

Minimizing the average risk

The symbol r_k is the expected risk associated with observing an object from class k. This risk is divided up into parts that depend on what we then do when an object from class k with feature vector x is observed. Now we only observe the feature vector x and not the true class label k. Since we must still perform an action when we observe x, let λ_{ki} represent the loss associated with the event that the object is truly from class k and we decide that it is from class i. Define r_k as the expected loss when an object of type k is presented to us. Then

    r_k = \sum_{i=1}^{M} \lambda_{ki} P(\text{we classify this object as a member of class } i)
        = \sum_{i=1}^{M} \lambda_{ki} \int_{R_i} p(x|\omega_k) \, dx ,

which is the book's equation 2.14. Thus the total risk r is the expected value of the class-dependent risks r_k, taking into account how likely each class is, or

    r = \sum_{k=1}^{M} r_k P(\omega_k)
      = \sum_{k=1}^{M} \sum_{i=1}^{M} \lambda_{ki} \int_{R_i} p(x|\omega_k) P(\omega_k) \, dx
      = \sum_{i=1}^{M} \int_{R_i} \left( \sum_{k=1}^{M} \lambda_{ki} p(x|\omega_k) P(\omega_k) \right) dx .   (1)

The decision rule that leads to the smallest total risk is obtained by selecting R_i to be the region of feature space in which the integrand above is as small as possible. That is, R_i should be defined as the values of x such that

    \sum_{k=1}^{M} \lambda_{ki} p(x|\omega_k) P(\omega_k) < \sum_{k=1}^{M} \lambda_{kj} p(x|\omega_k) P(\omega_k) \quad \forall j .

In words, the index i, when put into the sum above, gives the smallest value when compared to all other possible choices. For these values of x we should select class ω_i as our classification decision.
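To make this rule concrete, here is a minimal Python sketch (my own illustration, not from the book; the two Gaussian class-conditional densities, the priors, and the loss matrix are made-up values) that evaluates the integrand \sum_k \lambda_{ki} p(x|\omega_k) P(\omega_k) at a point x for each candidate class i and returns the class with the smallest value:

    import math

    def gauss_pdf(x, mu, sigma):
        """One-dimensional normal density N(mu, sigma^2)."""
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    # Hypothetical two-class example: made-up densities, priors and losses.
    priors = [0.6, 0.4]                                # P(omega_k)
    densities = [lambda x: gauss_pdf(x, 0.0, 1.0),     # p(x|omega_1)
                 lambda x: gauss_pdf(x, 2.0, 1.5)]     # p(x|omega_2)
    lam = [[0.0, 1.0],                                 # lam[k][i]: loss of deciding class i
           [2.0, 0.0]]                                 #   when the true class is k

    def bayes_min_risk_class(x):
        # risk_i(x) = sum_k lam[k][i] * p(x|omega_k) * P(omega_k); pick the smallest.
        joint = [densities[k](x) * priors[k] for k in range(len(priors))]
        risks = [sum(lam[k][i] * joint[k] for k in range(len(priors)))
                 for i in range(len(priors))]
        return min(range(len(risks)), key=risks.__getitem__)

    print([bayes_min_risk_class(x) for x in (-1.0, 0.5, 1.0, 3.0)])

With the asymmetric losses chosen here the boundary between the two decision regions is pulled away from the simple maximum a posteriori boundary, exactly as Equation 1 predicts.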
Bayesian classification with normal distributions

When the covariance matrices for two classes are the same and diagonal, i.e. Σ_i = Σ_j = σ²I, then the discriminant functions g_{ij}(x) are given by

    g_{ij}(x) = w^T (x - x_0) = (\mu_i - \mu_j)^T (x - x_0) ,   (2)

since the vector w is w = µ_i − µ_j in this case. Note that the point x_0 is on the decision hyperplane, i.e. satisfies g_{ij}(x_0) = 0, since g_{ij}(x_0) = w^T(x_0 − x_0) = 0. Let x be another point on the decision hyperplane; then x − x_0 is a vector in the decision hyperplane. Since x is a point on the decision hyperplane it also must satisfy g_{ij}(x) = 0, and from the functional form for g_{ij}(·) and the definition of w this means that

    w^T (x - x_0) = (\mu_i - \mu_j)^T (x - x_0) = 0 .

This is the statement that the line connecting µ_i and µ_j is orthogonal to the decision hyperplane. In the same way, when the covariance matrices of each class are not diagonal but are nevertheless equal, Σ_i = Σ_j = Σ, the same logic used above shows that the decision hyperplane is again orthogonal to the vector w, which in this case is Σ^{-1}(µ_i − µ_j).

The magnitude of P(ω_i) relative to P(ω_j) influences how close the decision hyperplane is to the respective class means µ_i or µ_j, in the sense that the class with the larger a priori probability will have a "larger" region of R^l assigned to it for classification. For example, if P(ω_i) < P(ω_j) then ln(P(ω_i)/P(ω_j)) < 0, so the point x_0, which in the case Σ_i = Σ_j = Σ is given by

    x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left( \frac{P(\omega_i)}{P(\omega_j)} \right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2} ,   (3)

can be written as

    x_0 = \frac{1}{2}(\mu_i + \mu_j) + \alpha (\mu_i - \mu_j) ,

with the value of α > 0. Since µ_i − µ_j is a vector from µ_j to µ_i, the expression for x_0 above starts at the midpoint (1/2)(µ_i + µ_j) and moves closer to µ_i. This means that the amount of R^l assigned to class ω_j is "larger" than the amount assigned to class ω_i. This is expected since the prior probability of class ω_j is larger than that of ω_i.
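The following Python sketch illustrates these statements numerically (my own example, not from the book; the means, shared covariance, and priors are made-up values). It computes x_0 from Equation 3, checks that x_0 really lies on the Bayes decision boundary of the two equal-covariance Gaussian discriminants, and checks that w = Σ^{-1}(µ_i − µ_j) is the normal direction of the hyperplane:

    import numpy as np

    # Made-up example: two Gaussian classes sharing a non-diagonal covariance.
    mu_i = np.array([1.0, 1.0])
    mu_j = np.array([3.0, 2.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    P_i, P_j = 0.3, 0.7                       # P(omega_i) < P(omega_j)

    S_inv = np.linalg.inv(Sigma)
    w = S_inv @ (mu_i - mu_j)                 # normal direction of the hyperplane
    diff = mu_i - mu_j
    norm2 = diff @ S_inv @ diff               # ||mu_i - mu_j||^2 in the Sigma^{-1} norm
    x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) * diff / norm2    # Equation 3

    # Discriminants for equal covariances (the quadratic terms cancel on the boundary).
    def g(x, mu, P):
        return np.log(P) - 0.5 * (x - mu) @ S_inv @ (x - mu)

    print(np.isclose(g(x0, mu_i, P_i), g(x0, mu_j, P_j)))   # True: x0 is on the boundary
    v = np.array([-w[1], w[0]])               # any v with w @ v == 0 stays in the hyperplane
    print(np.isclose(g(x0 + v, mu_i, P_i), g(x0 + v, mu_j, P_j)))  # True: w is the normal
    alpha = -np.log(P_i / P_j) / norm2
    print(alpha > 0)   # True: x0 is shifted from the midpoint toward mu_i, the rarer class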
Notes on Example 2.2

To see the final lengths of the principal axes we start with the transformed equation of constant Mahalanobis distance of d_m = √2.952, or

    \frac{(x_1' - \mu_{11}')^2}{\lambda_1} + \frac{(x_2' - \mu_{12}')^2}{\lambda_2} = (\sqrt{2.952})^2 = 2.952 .

Since we want the principal axes about (0, 0) we have µ'_{11} = µ'_{12} = 0, and λ_1 and λ_2 are the eigenvalues given by solving |Σ − λI| = 0. In this case we get λ_1 = 1 (in the direction v_1) and λ_2 = 2 (in the direction v_2). Then the above becomes, in "standard form" for a conic section,

    \frac{(x_1')^2}{2.952 \lambda_1} + \frac{(x_2')^2}{2.952 \lambda_2} = 1 .

From this expression we can read off the lengths of the principal axes:

    2\sqrt{2.952 \lambda_1} = 2\sqrt{2.952} = 3.43627
    2\sqrt{2.952 \lambda_2} = 2\sqrt{2.952 (2)} = 4.85962 .

Maximum A Posteriori (MAP) Estimation: Example 2.4

We will derive the MAP estimate of the population mean µ when given N samples x_k distributed as p(x|µ) and a normal prior on µ, i.e. N(µ_0, σ_µ² I). Then the estimate of the population mean µ given the sample X ≡ {x_k}_{k=1}^N is proportional to

    p(\mu|X) \propto p(\mu) p(X|\mu) = p(\mu) \prod_{k=1}^{N} p(x_k|\mu) .

Note that we have written p(µ) on the outside of the product terms since it should only appear once and not N times, as might be inferred had we written the product as \prod_{k=1}^{N} p(\mu) p(x_k|\mu). To find the value of µ that maximizes this we begin by taking the natural log of the expression above, taking the µ derivative, and setting the resulting expression equal to zero. The natural log of the above is given by

    \ln(p(\mu)) + \sum_{k=1}^{N} \ln(p(x_k|\mu)) = -\frac{1}{2} \frac{\|\mu - \mu_0\|^2}{\sigma_\mu^2} - \frac{1}{2} \sum_{k=1}^{N} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu) .

Then taking the derivative with respect to µ, setting the result equal to zero, and calling that solution \hat{µ} gives

    -\frac{1}{\sigma_\mu^2} (\hat{\mu} - \mu_0) + \frac{1}{\sigma^2} \sum_{k=1}^{N} (x_k - \hat{\mu}) = 0 ,

where we have assumed that the density p(x|µ) is N(µ, Σ) with Σ = σ²I. When we solve for \hat{µ} in the above we get

    \hat{\mu} = \frac{ \frac{\mu_0}{\sigma_\mu^2} + \frac{1}{\sigma^2} \sum_{k=1}^{N} x_k }{ \frac{1}{\sigma_\mu^2} + \frac{N}{\sigma^2} } = \frac{ \mu_0 + \frac{\sigma_\mu^2}{\sigma^2} \sum_{k=1}^{N} x_k }{ 1 + \frac{\sigma_\mu^2}{\sigma^2} N } .   (4)
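As a quick numerical illustration of Equation 4 (a sketch with made-up parameter values, not part of the book), the following Python snippet draws scalar samples, forms the MAP estimate, and shows that it lies between the prior mean µ_0 and the sample mean, approaching the sample mean as N grows or as the prior variance σ_µ² grows:

    import random

    random.seed(0)

    # Made-up scalar example: x_k ~ N(mu_true, sigma^2), prior mu ~ N(mu0, sigma_mu^2).
    mu_true, sigma, mu0, sigma_mu, N = 2.0, 1.0, 0.0, 0.5, 20
    x = [random.gauss(mu_true, sigma) for _ in range(N)]

    # MAP estimate from Equation 4.
    ratio = sigma_mu ** 2 / sigma ** 2
    mu_map = (mu0 + ratio * sum(x)) / (1.0 + ratio * N)

    sample_mean = sum(x) / N
    print(mu_map, sample_mean)   # mu_map lies between mu0 and the sample mean
    # As sigma_mu -> infinity (a flat prior) or N -> infinity, mu_map -> sample mean.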
Maximum Entropy Estimation

As another method to determine distribution parameters we seek to maximize the entropy H, or

    H = -\int_{\mathcal{X}} p(x) \ln(p(x)) \, dx .   (5)

This is equivalent to minimizing its negative, \int_{\mathcal{X}} p(x) \ln(p(x)) \, dx. To incorporate the constraint that the density must integrate to one, we form the entropy Lagrangian

    H_L = \int_{x_1}^{x_2} p(x) \ln(p(x)) \, dx - \lambda \left( \int_{x_1}^{x_2} p(x) \, dx - 1 \right) ,

where we have assumed that our density is non-zero only over [x_1, x_2]. The negative of the above is equivalent to

    -H_L = -\int_{x_1}^{x_2} p(x) (\ln(p(x)) - \lambda) \, dx - \lambda .

Taking the p(x) derivative and setting it equal to zero gives

    \frac{\partial (-H_L)}{\partial p} = -\int_{x_1}^{x_2} \left[ (\ln(p) - \lambda) + p \cdot \frac{1}{p} \right] dx = -\int_{x_1}^{x_2} \left[ \ln(p) - \lambda + 1 \right] dx = 0 .

Solving for the integral of ln(p(x)) we get

    \int_{x_1}^{x_2} \ln(p(x)) \, dx = (\lambda - 1)(x_2 - x_1) .

Taking the x_2 derivative of this expression we find

    \ln(p(x_2)) = \lambda - 1 \quad \Rightarrow \quad p(x_2) = e^{\lambda - 1} .

To find the value of λ we put this expression into our constraint \int_{x_1}^{x_2} p(x) \, dx = 1 to get

    e^{\lambda - 1} (x_2 - x_1) = 1 ,

or λ − 1 = ln(1/(x_2 − x_1)), and thus

    p(x) = \exp\left( \ln\left( \frac{1}{x_2 - x_1} \right) \right) = \frac{1}{x_2 - x_1} ,

a uniform distribution.
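The claim that the constrained maximizer is the uniform density can be sanity-checked numerically. The short Python sketch below (my own illustration, not from the text) compares the differential entropy of the uniform density on [0, 1] with that of one other made-up density supported on the same interval:

    import math

    # Numerical check: on [0, 1] the uniform density has larger differential
    # entropy than the other density compared here.
    x1, x2, n = 0.0, 1.0, 50000
    dx = (x2 - x1) / n
    xs = [x1 + (k + 0.5) * dx for k in range(n)]

    def entropy(p):
        # Midpoint-rule approximation of -int p(x) ln p(x) dx over [x1, x2].
        return -sum(p(x) * math.log(p(x)) * dx for x in xs if p(x) > 0.0)

    uniform = lambda x: 1.0 / (x2 - x1)
    triangular = lambda x: 2.0 * x               # integrates to 1 on [0, 1]
    print(entropy(uniform))      # ~0.0 = ln(x2 - x1)
    print(entropy(triangular))   # ~0.5 - ln(2) ≈ -0.193, smaller than the uniform's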
Problem Solutions

Problem 2.1 (the Bayes' rule minimizes the probability of error)

Following the hint in the book, the probability of correct classification P_c is given by

    P_c = \sum_{i=1}^{M} P(x \in R_i, \omega_i) ,

since in order to be correct when x ∈ R_i the sample that generated x must come from the class ω_i. Now this joint probability is given by

    P(x \in R_i, \omega_i) = P(x \in R_i | \omega_i) P(\omega_i) = \left( \int_{R_i} p(x|\omega_i) \, dx \right) P(\omega_i) .

So the expression for P_c then becomes

    P_c = \sum_{i=1}^{M} \left( \int_{R_i} p(x|\omega_i) \, dx \right) P(\omega_i) = \sum_{i=1}^{M} \int_{R_i} p(x|\omega_i) P(\omega_i) \, dx .   (6)

Since this is a sum of M different terms, to maximize P_c we will define R_i to be the region of x where

    p(x|\omega_i) P(\omega_i) > p(x|\omega_j) P(\omega_j) \quad \forall j \neq i .   (7)

If we do this, then since Equation 7 holds in R_i, each term \int_{R_i} p(x|\omega_i) P(\omega_i) \, dx will be as large as possible, and as Equation 6 is the sum of such terms we will have also maximized P_c. Now dividing both sides of Equation 7 by p(x) and using Bayes' rule we have

    P(\omega_i|x) > P(\omega_j|x) \quad \forall j \neq i ,

as the multi-class decision boundary, which is what we were to show.

Problem 2.2 (finding the decision boundary)

Using the book's notation where λ_{ki} is the loss associated with deciding an object is from class i when it is in fact from class k, we need to compare the expressions given by Equation 1. Since this is a two-class problem, M = 2, and we need to compare

    l_1 = \lambda_{11} p(x|\omega_1) P(\omega_1) + \lambda_{21} p(x|\omega_2) P(\omega_2)
    l_2 = \lambda_{12} p(x|\omega_1) P(\omega_1) + \lambda_{22} p(x|\omega_2) P(\omega_2) .

With λ_{11} = λ_{22} = 0 (no loss for correct decisions) these become

    l_1 = \lambda_{21} p(x|\omega_2) P(\omega_2)
    l_2 = \lambda_{12} p(x|\omega_1) P(\omega_1) .

When l_1 < l_2 we will classify x as from class ω_1, and as from class ω_2 otherwise. The decision boundary will be the point x_0 where l_1 = l_2. This latter equation (if we solve for the likelihood ratio) is

    \frac{p(x|\omega_1)}{p(x|\omega_2)} = \frac{\lambda_{21} P(\omega_2)}{\lambda_{12} P(\omega_1)} .   (8)

If we assume that p(x|ω_1) ∼ N(0, σ²) and p(x|ω_2) ∼ N(1, σ²), then

    \frac{p(x|\omega_1)}{p(x|\omega_2)} = \frac{ e^{-\frac{1}{2} \frac{x^2}{\sigma^2}} }{ e^{-\frac{1}{2} \frac{(x-1)^2}{\sigma^2}} } = \exp\left( -\frac{1}{2\sigma^2} (2x - 1) \right) .

Setting this equal to \frac{\lambda_{21} P(\omega_2)}{\lambda_{12} P(\omega_1)} and solving for x gives

    x = \frac{1}{2} - \sigma^2 \ln\left( \frac{\lambda_{21} P(\omega_2)}{\lambda_{12} P(\omega_1)} \right) ,

as we were to show.
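The boundary just derived is easy to verify numerically. The Python sketch below (with made-up values for σ, the priors, and the losses, which are not taken from the book) checks that the likelihood ratio evaluated at x = 1/2 − σ² ln(λ_{21}P(ω_2)/(λ_{12}P(ω_1))) equals the threshold in Equation 8:

    import math

    def gauss_pdf(x, mu, sigma):
        """One-dimensional normal density N(mu, sigma^2)."""
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    # Made-up parameter values for the two-class problem of this exercise.
    sigma, P1, P2, lam21, lam12 = 0.8, 0.6, 0.4, 1.0, 2.0

    # Decision boundary derived above.
    x_star = 0.5 - sigma ** 2 * math.log(lam21 * P2 / (lam12 * P1))

    # At x_star the likelihood ratio equals the threshold of Equation 8.
    lhs = gauss_pdf(x_star, 0.0, sigma) / gauss_pdf(x_star, 1.0, sigma)
    rhs = lam21 * P2 / (lam12 * P1)
    print(x_star, lhs, rhs)      # lhs and rhs agree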
Problem 2.3 (an expression for the average risk)

We are told to define the errors ε_1 and ε_2 as the class conditional errors, or

    \varepsilon_1 = P(x \in R_2 | \omega_1) = \int_{R_2} p(x|\omega_1) \, dx
    \varepsilon_2 = P(x \in R_1 | \omega_2) = \int_{R_1} p(x|\omega_2) \, dx .

Using these definitions we can manipulate the average risk as

    r = P(\omega_1) \left( \lambda_{11} \int_{R_1} p(x|\omega_1) \, dx + \lambda_{12} \int_{R_2} p(x|\omega_1) \, dx \right)
      + P(\omega_2) \left( \lambda_{21} \int_{R_1} p(x|\omega_2) \, dx + \lambda_{22} \int_{R_2} p(x|\omega_2) \, dx \right)
      = P(\omega_1) \left( \lambda_{11} \left( 1 - \int_{R_2} p(x|\omega_1) \, dx \right) + \lambda_{12} \int_{R_2} p(x|\omega_1) \, dx \right)
      + P(\omega_2) \left( \lambda_{21} \int_{R_1} p(x|\omega_2) \, dx + \lambda_{22} \left( 1 - \int_{R_1} p(x|\omega_2) \, dx \right) \right)
      = \lambda_{11} P(\omega_1) - \lambda_{11} \varepsilon_1 P(\omega_1) + \lambda_{12} \varepsilon_1 P(\omega_1) + \lambda_{21} \varepsilon_2 P(\omega_2) + \lambda_{22} P(\omega_2) - \lambda_{22} \varepsilon_2 P(\omega_2)
      = \lambda_{11} P(\omega_1) + \lambda_{22} P(\omega_2) + (\lambda_{12} - \lambda_{11}) \varepsilon_1 P(\omega_1) + (\lambda_{21} - \lambda_{22}) \varepsilon_2 P(\omega_2) ,

resulting in the desired expression. A small numerical check of this identity is given after Problem 2.4 below.

Problem 2.4 (bounding the probability of error)

We desire to show that

    P_e \leq 1 - \frac{1}{M} .

To do this recall that since \sum_{i=1}^{M} P(\omega_i|x) = 1, at least one P(ω_i|x) must be at least 1/M, since otherwise the sum \sum_{i=1}^{M} P(\omega_i|x) would have to be less than one. Now let P(ω_{i^*}|x) be the a posteriori probability used in the Bayes' classification decision. That is,

    P(\omega_{i^*}|x) = \max_i P(\omega_i|x) .

From the above discussion P(ω_{i^*}|x) ≥ 1/M. From this we see that

    P_e = 1 - \max_i P(\omega_i|x) \leq 1 - \frac{1}{M} = \frac{M - 1}{M} ,

the desired expression. Since this bound holds for every x, it also holds for the error probability averaged over x.
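Returning to Problem 2.3, here is the promised check: a Python sketch (with made-up one-dimensional densities, priors, losses, and an arbitrary threshold decision rule, none of which come from the book) that evaluates the average risk both directly from its definition and from the derived ε_1, ε_2 expression:

    import math

    def Phi(z):
        """Standard normal CDF."""
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    # Made-up setting: p(x|omega_1) = N(0,1), p(x|omega_2) = N(2,1),
    # arbitrary priors and losses, and R1 = (-inf, t), R2 = [t, inf) for any t.
    P1, P2, t = 0.7, 0.3, 0.9
    lam11, lam12, lam21, lam22 = 0.0, 1.0, 3.0, 0.5

    eps1 = 1.0 - Phi(t - 0.0)          # P(x in R2 | omega_1)
    eps2 = Phi(t - 2.0)                # P(x in R1 | omega_2)

    # Direct evaluation of the average risk ...
    r_direct = (P1 * (lam11 * Phi(t) + lam12 * (1.0 - Phi(t)))
                + P2 * (lam21 * Phi(t - 2.0) + lam22 * (1.0 - Phi(t - 2.0))))
    # ... and the rearranged expression derived above.
    r_formula = (lam11 * P1 + lam22 * P2
                 + (lam12 - lam11) * eps1 * P1 + (lam21 - lam22) * eps2 * P2)
    print(r_direct, r_formula)         # the two agree for any choice of t

Because the identity is an algebraic rearrangement, the agreement holds for any decision regions, not just the Bayes-optimal ones.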
Problem 2.5 (classification with Gaussians of the same mean)

Since this is a two-class problem we can use the results from the book. We compute the likelihood ratio l_{12},

    l_{12} = \frac{p(x|\omega_1)}{p(x|\omega_2)} = \frac{\sigma_2^2}{\sigma_1^2} \exp\left( -\frac{x^2}{2} \left( \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2} \right) \right) ,   (9)

and compare this to the threshold t defined by

    t \equiv \frac{P(\omega_2)}{P(\omega_1)} \, \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} .

If l_{12} > t, then we classify x as a member of ω_1. In the same way, if l_{12} < t, then we classify x as a member of ω_2. The decision point x_0 where we switch classification is given by l_{12}(x_0) = t, or

    \exp\left( -\frac{x_0^2}{2} \, \frac{\sigma_2^2 - \sigma_1^2}{\sigma_1^2 \sigma_2^2} \right) = \frac{\sigma_1^2}{\sigma_2^2} \, t .

Solving for x_0 we get

    x_0 = \pm \sqrt{ \frac{2 \sigma_1^2 \sigma_2^2}{\sigma_2^2 - \sigma_1^2} \ln\left( \frac{\sigma_2^2 P(\omega_1) (\lambda_{12} - \lambda_{11})}{\sigma_1^2 P(\omega_2) (\lambda_{21} - \lambda_{22})} \right) } .

For specific values of the parameters in this problem, σ_i², P(ω_i), and λ_{ij}, the two values for x_0 above can be evaluated. These two values of x_0 differ only in their sign and have the same magnitude. For these given values of x_0 we see that if |x| ≤ |x_0| one class is selected as the classification decision, while if |x| > |x_0| the other class is selected. Which class is selected depends on the relative magnitudes of the parameters σ_i², P(ω_i), and λ_{ij} and seems difficult to determine a priori. To determine the class once we are given a fixed specification of these numbers, we can evaluate l_{12} in Equation 9 for a specific value of x such that |x| ≤ |x_0| (say x = 0) and determine whether l_{12} > t or not. If so, the region of x's given by |x| ≤ |x_0| will be classified as members of class ω_1, while the region of x's where |x| > |x_0| will be classified as members of ω_2. A small numerical sketch of this procedure is given after the notes on Problem 2.6 below.

Problem 2.6 (the Neyman-Pearson classification criterion)

Warning: I was not able to solve this problem. Below are a few notes I made while attempting it. If anyone sees a way to proceed please contact me.

For this problem we want to fix the value of the error probability for class one at a particular value, say ε_1 = ε, and then we want to minimize the probability of error we make when classifying the other class. Now recall the definitions of the error probabilities ε_i:

    \varepsilon_1 = \int_{R_2} p(x|\omega_1) P(\omega_1) \, dx ,
    \varepsilon_2 = \int_{R_1} p(x|\omega_2) P(\omega_2) \, dx .
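Here is the numerical sketch promised in Problem 2.5 (my own illustration with made-up variances, priors, and a zero-one loss; it simply implements the expressions for l_{12}, t, and x_0 derived above):

    import math

    # Made-up parameter values for Problem 2.5: zero-mean classes, different variances.
    s1sq, s2sq = 1.0, 4.0                    # sigma_1^2 < sigma_2^2
    P1, P2 = 0.5, 0.5
    lam11, lam12, lam21, lam22 = 0.0, 1.0, 1.0, 0.0

    t = (P2 / P1) * (lam21 - lam22) / (lam12 - lam11)

    def l12(x):
        # Likelihood ratio of Equation 9.
        return (s2sq / s1sq) * math.exp(-0.5 * x ** 2 * (1.0 / s1sq - 1.0 / s2sq))

    x0 = math.sqrt(2.0 * s1sq * s2sq / (s2sq - s1sq)
                   * math.log(s2sq * P1 * (lam12 - lam11) / (s1sq * P2 * (lam21 - lam22))))

    print(x0, l12(x0), t)                    # l12(x0) equals t at the switching point
    # Evaluate l12 at x = 0 to see which class owns the inner region |x| <= x0.
    inner_class = 1 if l12(0.0) > t else 2
    print("class for |x| <= x0:", inner_class)

For these made-up values the smaller-variance class ω_1 claims the inner region |x| ≤ x_0, which matches the intuition that its density is the larger of the two near the common mean.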