

Contents

1 Theory
    1.1 Evidence Approximation Theory
    1.2 Evidence Approximation Procedure
    1.3 Automatic Relevance Determination
2 Application
Relevance Vector Machines Explained

Tristan Fletcher
www.cs.ucl.ac.uk/staff/T.Fletcher/

October 19, 2010
Introduction

This document has been written in an attempt to make Tipping's [1] Relevance Vector Machines (RVM) as simple to understand as possible for those with minimal experience of Machine Learning. It assumes knowledge of probability in the areas of Bayes' theorem and Gaussian distributions, including marginal and conditional Gaussian distributions. It also assumes familiarity with matrix differentiation, the vector representation of regression and kernel (basis) functions. These latter two areas are briefly covered in the author's similar paper on Support Vector Machines, which can be found through the URL on the cover page.

The document has been split into two main sections. The first introduces the problem that needs to be solved, namely maximising the posterior probability of regression target values over some hyperparameters. It then proceeds to derive the equations required to do this. Aside from the areas mentioned above where knowledge is assumed, every mathematical step is gone through. It is therefore hoped that there is no ambiguity over any of the details in this explanation, though this does make the description a little cumbersome. The second section therefore explains from an algorithmic viewpoint the iterations that would be required to actually apply the technique.

Aside from Tipping [1], the majority of this document is based on work by MacKay [2], [3], [4], [5] and Bishop [6].

Notation

With the view of making this description as explicit (in the mathematical sense) as possible, it is worth introducing some of the notation that will be used:

• P(A|B, C) is the probability of A given B and C. Note that using this notation, different representations of the parameters will not alter this probability, e.g. P(A|B) ≡ P(A|B⁻¹). Furthermore, for the sake of simplicity, occasionally some of the elements in the probabilistic notation will be omitted where they are not relevant, e.g. P(A|B) ≡ P(A|B, C, D etc.).
• X ∼ N(µ, σ²) is used to signify that X is normally distributed with mean µ and variance σ².

• Bold font is used to represent vectors and matrices.
1 Theory

1.1 Evidence Approximation Theory

Linear regression problems are generally based on finding the parameter vector w and the offset c so that we can predict y for an unknown input x (x ∈ ℝᴹ):

    y = wᵀx + c

In practice we usually incorporate the offset c into w. If there is a non-linear relationship between x and y then a basis function can be used:

    y = wᵀφ(x)

where x → φ(x) is a non-linear mapping (i.e. basis function).

When attempting to calculate w from our training examples, we assume that each target tᵢ is representative of the true model yᵢ, but with the addition of noise εᵢ:

    tᵢ = yᵢ + εᵢ = wᵀφ(xᵢ) + εᵢ

where the εᵢ are assumed to be independent samples from a Gaussian noise process with zero mean and variance σ², i.e. εᵢ ∼ N(0, σ²) ∀i. This means that:

    P(tᵢ|xᵢ, w, σ²) = N(tᵢ|yᵢ, σ²)
                    = (2πσ²)^(−1/2) exp{−(tᵢ − yᵢ)²/(2σ²)}
                    = (2πσ²)^(−1/2) exp{−(tᵢ − wᵀφ(xᵢ))²/(2σ²)}

Looking at all N training points simultaneously, so that the vector t represents all the individual training targets tᵢ and the N × M design matrix Φ is constructed such that the ith row represents the vector φ(xᵢ), we have:

    P(t|x, w, σ²) = ∏ᵢ₌₁ᴺ N(tᵢ|wᵀφ(xᵢ), σ²)
                  = ∏ᵢ₌₁ᴺ (2πσ²)^(−1/2) exp{−(tᵢ − wᵀφ(xᵢ))²/(2σ²)}
                  = (2πσ²)^(−N/2) exp{−‖t − Φw‖²/(2σ²)}
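As a concrete illustration of this noise model, the snippet below builds a design matrix from Gaussian (RBF) basis functions centred on the training inputs and evaluates the log of the likelihood above. The sinc target function, the noise level and the basis width r are assumptions made purely for this sketch, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (an assumption for illustration): t = sin(x)/x + noise
N = 50
x = rng.uniform(-10, 10, N)
sigma = 0.1
t = np.sinc(x / np.pi) + rng.normal(0.0, sigma, N)

def design_matrix(x, centres, r=2.0):
    """N x M design matrix of Gaussian (RBF) basis functions."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / r ** 2)

Phi = design_matrix(x, x)            # basis functions centred on the inputs
w = rng.normal(0.0, 1.0, Phi.shape[1])  # an arbitrary weight vector

def log_likelihood(t, Phi, w, sigma2):
    """log P(t|w, sigma^2) = -(N/2) log(2 pi sigma2) - ||t - Phi w||^2 / (2 sigma2)"""
    n = len(t)
    resid = t - Phi @ w
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

print(log_likelihood(t, Phi, w, sigma**2))
```

Because the Gaussians are centred on the training points themselves, Φ is square here; in general the number of basis functions M need not equal N.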
When attempting to learn the relationship between x and y, we wish to constrain complexity and hence the growth of the weights w, and do this by defining an explicit prior probability distribution on w. Our preference for smoother and therefore less complex functions is encoded by using a zero-mean Gaussian prior over each weight wᵢ:

    P(wᵢ|αᵢ) = N(wᵢ|0, αᵢ⁻¹)

where we have used αᵢ to describe the inverse variance (i.e. precision) of each wᵢ. If we look at all M weights simultaneously, so that the ith element in the vector α represents αᵢ, we have:

    P(w|α) = ∏ᵢ₌₁ᴹ N(wᵢ|0, αᵢ⁻¹)

This means that there is an individual hyperparameter αᵢ associated with each weight, modifying the strength of the prior thereon.

The posterior probability over all the unknown parameters, given the data, is expressed as P(w, α, σ²|t). We are trying to find the w, α and σ² which maximise this posterior probability. We can decompose the posterior:

    P(w, α, σ²|t) = P(w|t, α, σ²) P(α, σ²|t)                           (1.1)

Substituting β⁻¹ for σ² to make the maths appear less cluttered, the first part of (1.1) can be expressed:

    P(w|t, α, β) = N(w|m, Σ)                                           (1.2)

where the mean m and the covariance Σ are given by:

    m = βΣΦᵀt                                                          (1.3)
    Σ = (A + βΦᵀΦ)⁻¹    where A = diag(α)                              (1.4)

The method for arriving at (1.2), (1.3) and (1.4), relating to conditional Gaussian distributions, lies outside the scope of this document.

In order to evaluate m and Σ we need to find the hyperparameters (α and β) which maximise the second part of (1.1), which we decompose:

    P(α, σ²|t) ∝ P(t|α, σ²) P(α) P(σ²)

We will assume uniform hyperpriors and hence ignore P(α) and P(σ²). Our problem is now to maximise the evidence:

    P(t|α, σ²) ≡ P(t|α, β) = ∫ P(t|w, β) P(w|α) dw                     (1.5)
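Equations (1.3) and (1.4) translate directly into a few lines of linear algebra. The following is a minimal numpy sketch; the design matrix, targets and hyperparameter values are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy quantities standing in for a real design matrix and targets (assumptions)
N, M = 30, 10
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = np.ones(M)   # one precision hyperparameter per weight
beta = 4.0           # noise precision, beta = 1 / sigma^2

def posterior(Phi, t, alpha, beta):
    """Posterior mean m and covariance Sigma of the weights, from (1.3) and (1.4)."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # (1.4)
    m = beta * Sigma @ Phi.T @ t                    # (1.3)
    return m, Sigma

m, Sigma = posterior(Phi, t, alpha, beta)
```

Inverting Σ⁻¹ explicitly is fine for a sketch of this size; in a careful implementation one would prefer a Cholesky-based solve for numerical stability.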
Looking at the first component of the equation:

    P(t|w, β) = N(t|Φw, β⁻¹I)
              = (β/2π)^(N/2) exp{−(β/2)‖t − Φw‖²}                      (1.6)

And then at the second (where M is the dimensionality of x):

    P(w|α) = ∏ᵢ₌₁ᴹ N(wᵢ|0, αᵢ⁻¹)
           = ∏ᵢ₌₁ᴹ (2παᵢ⁻¹)^(−1/2) exp{−(1/2)αᵢwᵢ²}
           = (2π)^(−M/2) ∏ᵢ₌₁ᴹ αᵢ^(1/2) exp{−(1/2)wᵀAw}                (1.7)

Substituting (1.6) and (1.7) into (1.5):

    P(t|α, β) = (β/2π)^(N/2) (2π)^(−M/2) ∏ᵢ₌₁ᴹ αᵢ^(1/2)
                    ∫ exp{−(β/2)‖t − Φw‖² − (1/2)wᵀAw} dw              (1.8)

In order to simplify (1.8), we create a definition to represent the (negative) exponent of the integrand:

    E(w) = (β/2)‖t − Φw‖² + (1/2)wᵀAw                                  (1.9)

This means that:

    P(t|α, β) = (β/2π)^(N/2) (2π)^(−M/2) ∏ᵢ₌₁ᴹ αᵢ^(1/2) ∫ exp{−E(w)} dw

Expanding out (1.9) we get:

    E(w) = (β/2)(tᵀt − 2tᵀΦw + wᵀΦᵀΦw) + (1/2)wᵀAw
         = (1/2)(βtᵀt − 2βtᵀΦw + wᵀ(βΦᵀΦ + A)w)
Substituting in (1.4):

    E(w) = (1/2)(βtᵀt − 2βtᵀΦw + wᵀΣ⁻¹w)

and then using I = ΣΣ⁻¹:

    E(w) = (1/2)(βtᵀt − 2βtᵀΦΣΣ⁻¹w + wᵀΣ⁻¹w)

Substituting in (1.3), so that βtᵀΦΣ = mᵀ (Σ being symmetric), and completing the square:

    E(w) = (1/2)(βtᵀt − 2mᵀΣ⁻¹w + wᵀΣ⁻¹w + mᵀΣ⁻¹m − mᵀΣ⁻¹m)
         = E(t) + (1/2)(w − m)ᵀΣ⁻¹(w − m)

where

    E(t) = (1/2)(βtᵀt − mᵀΣ⁻¹m)

Our integral from (1.8) now becomes a standard Gaussian integral:

    ∫ exp{−E(w)} dw = exp{−E(t)} (2π)^(M/2) |Σ|^(1/2)

Substituting this back in gives us:

    P(t|α, β) = (β/2π)^(N/2) (2π)^(−M/2) ∏ᵢ₌₁ᴹ αᵢ^(1/2) exp{−E(t)} (2π)^(M/2) |Σ|^(1/2)

This is known as our marginal likelihood, and taking logs gives us our log marginal likelihood:

    ln P(t|α, β) = (N/2) ln β − E(t) + (1/2) ln|Σ| − (N/2) ln(2π) + (1/2) ∑ᵢ₌₁ᴹ ln αᵢ     (1.10)

It is this equation we need to maximise with respect to α and β, a process known as the evidence approximation procedure.

1.2 Evidence Approximation Procedure

In order to maximise our log marginal likelihood, we start by taking derivatives of (1.10) with respect to αᵢ and setting these to zero:

    (d/dαᵢ) ln P(t|α, β) = 1/(2αᵢ) − (1/2)mᵢ² − (1/2)Σᵢᵢ = 0
    ⇒ αᵢ = (1 − αᵢΣᵢᵢ)/mᵢ²
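The log marginal likelihood (1.10) can be verified numerically: for a linear-Gaussian model of this form, the evidence is also expressible in closed form as N(t|0, C) with C = β⁻¹I + ΦA⁻¹Φᵀ (a standard result for marginal Gaussian distributions), so assembling (1.10) term by term should give the same number. All numerical values below are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quantities (all values are assumptions, for illustration only)
N, M = 25, 8
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = rng.uniform(0.5, 2.0, M)
beta = 3.0

A = np.diag(alpha)
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # (1.4)
m = beta * Sigma @ Phi.T @ t                    # (1.3)

# Log marginal likelihood assembled term by term from (1.10);
# note m^T Sigma^-1 m = m^T (A + beta Phi^T Phi) m
E_t = 0.5 * (beta * t @ t - m @ (A + beta * Phi.T @ Phi) @ m)
logML = (0.5 * N * np.log(beta) - E_t + 0.5 * np.linalg.slogdet(Sigma)[1]
         - 0.5 * N * np.log(2.0 * np.pi) + 0.5 * np.sum(np.log(alpha)))

# Independent check: the evidence is also N(t | 0, C), C = beta^-1 I + Phi A^-1 Phi^T
C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
logML_check = -0.5 * (N * np.log(2.0 * np.pi) + np.linalg.slogdet(C)[1]
                      + t @ np.linalg.solve(C, t))

print(logML, logML_check)   # the two values should agree
```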
Substituting in γᵢ = 1 − αᵢΣᵢᵢ, our recursive definition for the αᵢ which maximise (1.10) can be expressed more elegantly as:

    αᵢ = γᵢ/mᵢ²

We now need to differentiate (1.10) with respect to β and set these derivatives to zero:

    (d/dβ) ln P(t|α, β) = (1/2)(N/β − ‖t − Φm‖² − Tr[ΣΦᵀΦ]) = 0        (1.11)

In order to solve this, we first simplify the argument of the trace operator Tr[•]:

    ΣΦᵀΦ = ΣΦᵀΦ + β⁻¹ΣA − β⁻¹ΣA
         = Σ(βΦᵀΦ + A)β⁻¹ − β⁻¹ΣA
         = (A + βΦᵀΦ)⁻¹(βΦᵀΦ + A)β⁻¹ − β⁻¹ΣA
         = (I − ΣA)β⁻¹

Since Tr[ΣA] = Tr[AΣ], substituting this back into (1.11) gives:

    N/β − ‖t − Φm‖² − β⁻¹ Tr[I − AΣ] = 0
    ⇒ (N − Tr[I − AΣ])/β = ‖t − Φm‖²
    ⇒ 1/β = ‖t − Φm‖²/(N − Tr[I − AΣ])
    ⇒ β = (N − ∑ᵢ γᵢ)/‖t − Φm‖²

where we have used Tr[I − AΣ] = ∑ᵢ(1 − αᵢΣᵢᵢ) = ∑ᵢ γᵢ.

The αᵢ and β which maximise our marginal likelihood are then found iteratively by setting α and β to initial values, finding values for m and Σ from (1.3) and (1.4), using these to calculate new estimates for α and β, and repeating this process until a convergence criterion is met. We will then be left with values for α and β which maximise our marginal likelihood and which we can use to evaluate our predictive distribution over t for a new input x:

    P(t|x, α, β) = ∫ P(t|w, β) P(w|t, α, β) dw = N(mᵀφ(x), σ²(x))
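The iterative procedure just described can be sketched as follows. The toy sparse regression problem, the fixed iteration count and the numerical safeguards (capping α and flooring mᵢ²) are assumptions for this sketch, not part of the original formulation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy sparse problem (an assumption): only two basis functions are relevant
N, M = 40, 12
Phi = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[2], w_true[7] = 1.5, -2.0
t = Phi @ w_true + rng.normal(0.0, 0.1, N)

alpha = np.ones(M)   # initial hyperparameter values (a choice, not prescribed)
beta = 1.0

for _ in range(300):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)   # (1.4)
    m = beta * Sigma @ Phi.T @ t                                 # (1.3)
    gamma = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)      # gamma_i = 1 - alpha_i Sigma_ii
    # alpha_i = gamma_i / m_i^2, capped for numerical safety (cap is an assumption)
    alpha = np.minimum(gamma / np.maximum(m**2, 1e-20), 1e6)
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)

print(np.round(m, 3))   # weights of irrelevant basis functions are driven towards zero
```

A fixed iteration budget is used here for simplicity; in practice one would stop once the change in the (log) hyperparameters falls below a tolerance.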
This means that our estimate for t is the mean of the above distribution, mᵀφ(x). Our confidence in our prediction is determined by the variance of this distribution, σ²(x), which is given by:

    σ²(x) = β⁻¹ + φ(x)ᵀΣφ(x)

1.3 Automatic Relevance Determination

Whilst carrying out the evidence approximation procedure described above, many of the αᵢ will tend to infinity. This has implications for the corresponding elements of the covariance Σ = (A + βΦᵀΦ)⁻¹ and mean m = βΣΦᵀt of the posterior distribution over the weights in (1.2): as the ith diagonal entry of A diverges,

    lim(αᵢ→∞) Σᵢᵢ = 0   ⇒   lim(αᵢ→∞) mᵢ = 0

This means that each wᵢ that such αᵢ relate to will be distributed as wᵢ ∼ N(0, 0), i.e. will be equal to zero. The corresponding basis functions φ(xᵢ) should therefore be pruned from the design matrix Φ at each iteration. The xᵢ corresponding to the remaining non-zero weights after pruning are called relevance vectors and are analogous to the support vectors of an SVM.
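Pruning can be folded into the iterative procedure: basis functions whose αᵢ diverges are removed from Φ each iteration, and the surviving columns identify the relevance vectors. The divergence threshold, toy data and iteration count below are assumptions made for this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sparse problem (assumptions as before): two relevant basis functions
N, M = 40, 12
Phi = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[1], w_true[5] = 2.0, -1.0
t = Phi @ w_true + rng.normal(0.0, 0.1, N)

alpha, beta = np.ones(M), 1.0
keep = np.arange(M)                  # indices of surviving basis functions

for _ in range(300):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    gamma = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)
    alpha = gamma / np.maximum(m**2, 1e-20)
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    # Prune basis functions whose alpha has diverged (threshold is an assumption)
    live = alpha < 1e8
    Phi, alpha, keep = Phi[:, live], alpha[live], keep[live]

# Final posterior over the surviving ("relevant") weights
Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ t

# Predictive mean and variance for a hypothetical new basis vector phi_new
phi_new = rng.normal(size=len(keep))
pred_mean = m @ phi_new
pred_var = 1.0 / beta + phi_new @ Sigma @ phi_new

print(keep)   # indices of the basis functions that survive pruning
```

The predictive variance is the sum of the noise term β⁻¹ and the term φ(x)ᵀΣφ(x) expressing our remaining uncertainty in the weights, so it can never fall below the noise floor.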