Information Theory for Machine Learning

Massimiliano Tomassoli (reverse(5102mnhuik)@gmail.com)

05/22/16

Contents

1 Introduction

Part I  The theory

2 Mean or Expected Value
3 Entropy
4 Entropy and Codes
   4.1 Can we do better?
5 Logarithm and probabilities
6 Jensen's Inequality
   6.1 Convexity
   6.2 Definition and proof of Jensen's Inequality
7 Maximum Entropy
8 Specifying the distribution
9 Chain rule
10 Cross Entropy
11 Kullback-Leibler Divergence
12 Mutual Information
   12.1 Definition
   12.2 Alternative formulation and interpretation
   12.3 Mutual Information VS Correlation
   12.4 Linear Regression
13 A Set Theoretic view of Information Theory
   13.1 The Inclusion Exclusion Principle
   13.2 Cardinality of Intersections
   13.3 Multiple Mutual Information
   13.4 KL Divergence and Total Correlation

Part II  Applications of Information Theory to Machine Learning

14 Loss function: from KL Divergence to Cross Entropy to Log Likelihood
   14.1 Supervised Learning
   14.2 Density Estimation
15 Feature selection
   15.1 Information Gain
   15.2 Joint Mutual Information
16 EM Algorithm
   16.1 Bonus
   16.2 Definition
   16.3 Example
17 EM Algorithm: why does it work?
   17.1 The MM Algorithm
   17.2 EM is an instance of MM
   17.3 Information Theory interpretation

1 Introduction

Students of Machine Learning are usually introduced to Information Theory through brief tutorials which are too superficial to really understand what's going on. One could always read a specialized book, but this might prove too much of an effort for someone who doesn't intend to become an expert in Information Theory. This tutorial aims to be a reasoned exposition where everything follows logically from what precedes it. I hope I succeeded in that. Because of the intended audience, some theorems are presented without proof but are thoroughly explained to give a solid intuition of why they're true.
To make this tutorial self-contained, I took the liberty of proving some classical results, such as the Inclusion Exclusion Principle and Jensen's Inequality, as I needed them to prove other results. Moreover, I decided to conclude this tutorial with two sections about the celebrated EM Algorithm, which I think everyone should know even if they're just interested in, say, Deep Learning. I didn't throw in the EM Algorithm just for the fun of it: there's an interesting, if not strong, connection with Information Theory. A warning: every derivation is my own, so keep your eyes open and let me know if you find any mistakes!

Part I  The theory

2 Mean or Expected Value

Before we dive into Information Theory, we'd better review the definition and some important properties of the mean or expected value. We'll be focusing on the case of finite discrete random variables here, i.e. random variables which take only a finite number of values. In the infinite discrete case there are some subtleties about convergence, while for the continuous case one just has to replace all the sums with integrals.

Definition 1. Let X be a discrete random variable with distribution p. The mean of X is defined as

    $E_{X \sim p}[X] = \sum_x p(x)\,x.$

Remark 1. When there is no ambiguity, we can drop the distribution, the variable, or both:

    $E_{X \sim p}[X] = E_X[X] = E_p[X] = E[X].$

Definition 2. With a slight abuse of notation, if X is a random variable, we also see X as the set of values which X can take, so that we can write, for instance, $\forall x \in X,\ p(x) \in [0, 1]$.

Definition 3. If X is a random variable and f a function, then Z = f(X) is a random variable such that, for all z,

    $p(z) = P(Z = z) = P(f(X) = z) = P(X \in f^{-1}(z)) = \sum_{x : f(x) = z} p(x).$

Proposition 1. Let X be a random variable of distribution p and let f be a function. The mean of the random variable f(X) can be evaluated as follows:

    $E_{f(X)}[f(X)] = E_X[f(X)] = \sum_x p(x) f(x).$
Proof. Let Z = f(X) and let q be the distribution of Z. Then,

    $E_{f(X)}[f(X)] = E_Z[Z] = \sum_z q(z)\,z = \sum_z \Big( \sum_{x : f(x) = z} p(x) \Big) z = \sum_z \sum_{x : f(x) = z} p(x)\,z = \sum_z \sum_{x : f(x) = z} p(x) f(x) = \sum_x p(x) f(x) = E_X[f(X)].$

Note that the double sum can be replaced with a single sum over x because

    $\{x_1, x_2, \ldots, x_n\} = \bigcup_z \{x \mid f(x) = z\}.$

Definition 4. We can generalize definition 1 by considering a list of random variables $X_1, \ldots, X_n$:

    $E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)] = \sum_{x_1} \cdots \sum_{x_n} p(x_1, \ldots, x_n) f(x_1, \ldots, x_n).$

Definition 5. If $X_1, \ldots, X_n$ are random variables, we can define a corresponding vector random variable as

    $X = (X_1, \ldots, X_n) = [X_1 \cdots X_n]^t.$

Proposition 2. If $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_m)$ are vector random variables, i.e. vectors of random variables, then

    $p(\ldots, x, \ldots) = p(\ldots, x_1, \ldots, x_n, \ldots)$
    $p(\ldots, x, \ldots \mid \ldots, y, \ldots) = p(\ldots, x_1, \ldots, x_n, \ldots \mid \ldots, y_1, \ldots, y_m, \ldots),$

where $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$.

Proof. Let's try not to be too technical. We'll indicate events by writing predicates between curly braces. For instance, the event "X takes value x" is written as $\{X = x\}$. Now note that

    $p(x, y) = P(X = x, Y = y) = P(\{X = x\} \cap \{Y = y\}),$

which means that p(x, y) is really the probability of the intersection of the two events $\{X = x\}$ and $\{Y = y\}$. Therefore, since

    $X = x \iff X_i = x_i, \quad i = 1, \ldots, n,$
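The computations in definitions 1 and 3 and in proposition 1 are easy to check numerically. The following is a minimal Python sketch (the distribution p and the non-injective f are arbitrary examples, not taken from the text): it builds the distribution q of Z = f(X) by grouping the x's by their value z = f(x), exactly as in the proof above, and verifies that $E_Z[Z] = \sum_x p(x) f(x)$.

```python
from collections import defaultdict
from fractions import Fraction

# Uniform p on {-2, -1, 1, 2} (arbitrary example).
p = {x: Fraction(1, 4) for x in (-2, -1, 1, 2)}

# Definition 1: E_{X~p}[X] = sum_x p(x) x.
mean_x = sum(px * x for x, px in p.items())

# f(x) = x^2 is not injective: several x's collapse onto the same z.
f = lambda x: x * x

# Definition 3: q(z) = sum of p(x) over {x : f(x) = z}.
q = defaultdict(Fraction)
for x, px in p.items():
    q[f(x)] += px

# Proposition 1: the grouped sum E_Z[Z] equals sum_x p(x) f(x).
lhs = sum(qz * z for z, qz in q.items())
rhs = sum(px * f(x) for x, px in p.items())
assert lhs == rhs
print(mean_x, lhs)  # 0 5/2
```

Using exact fractions rather than floats makes the two sides compare exactly equal, so the grouping argument is verified without any numerical tolerance.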
then

    $\{X = x\} = \{X_1 = x_1\} \cap \cdots \cap \{X_n = x_n\},$

which means that

    $p(\ldots, x, \ldots) = P(\cdots \cap \{X = x\} \cap \cdots) = P(\cdots \cap \{X_1 = x_1\} \cap \cdots \cap \{X_n = x_n\} \cap \cdots) = P(\ldots, x_1, \ldots, x_n, \ldots).$

So, basically, a comma in these expressions means "and" or "intersection". The proof for $p(\ldots, x, \ldots \mid \ldots, y, \ldots)$ is analogous.

Remark 2. Thanks to proposition 2, we can always simplify notation by writing just "X" or "x" instead of "$X_1, \ldots, X_n$" or "$x_1, \ldots, x_n$", respectively, and we can generalize results by doing the opposite. For instance,

    $E_X[f(X)] = \sum_x p(x) f(x)$

implies the generalization

    $E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)] = \sum_{x_1} \cdots \sum_{x_n} p(x_1, \ldots, x_n) f(x_1, \ldots, x_n).$

Note that, to be precise, the f in the generalization is not exactly the same as the f in the simple case, but we won't be afraid of abusing notation when convenient.

Definition 6. The mean of a vector random variable $X = (X_1, \ldots, X_n)$ is defined as

    $E[X] = (E[X_1], \ldots, E[X_n]).$

In general, if M is an $m \times n$ matrix random variable, then

    $[E[M]]_{ij} = E[M_{ij}], \quad i = 1, \ldots, m,\ j = 1, \ldots, n.$

Proposition 3. The mean of a constant is equal to the constant itself.

Proof. A constant c can be seen as a random variable X which takes the value c with probability 1. Thus,

    $E[c] = E_X[X] = \sum_x p(x)\,x = c.$

Proposition 4. $E[\cdot]$ is linear.
Proof. If X is a random variable, a, b two constants, and $f(x) = a + bx$, then

    $E_X[a + bX] = E_X[f(X)] = \sum_x p(x) f(x) = \sum_x p(x)(a + bx) = a \sum_x p(x) + b \sum_x p(x)\,x = a + b\,E_X[X],$

where the last step uses $\sum_x p(x) = 1$.

Proposition 5. Let X, Y be two random variables. Then

    $E_{X,Y}[f(X)] = E_X[f(X)].$

In general, we can drop any random variable the argument to the mean doesn't depend on.

Proof. This is easy:

    $E_{X,Y}[f(X)] = \sum_x \sum_y p(x, y) f(x) = \sum_x \sum_y p(x) p(y \mid x) f(x) = \sum_x p(x) f(x) \sum_y p(y \mid x) = \sum_x p(x) f(x) = E_X[f(X)].$

For the general case, we can consider an arbitrary sequence $X_1, \ldots, X_n$ of random variables and prove that we can drop the first variable which doesn't appear in the argument to the mean. Since n is finite, by repeating this process we must be left with a sequence $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ of random variables which all appear in the argument to the mean.

Remark 3. Thanks to proposition 5, we can adopt the convention that if $X_1, \ldots, X_n$ are random variables and f is a function which depends on all of them and them alone, then we can write

    $E[f(X_1, \ldots, X_n)] = E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)].$
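Propositions 4 and 5 can be checked together on a concrete joint distribution. In this sketch (the joint distribution and the affine f are arbitrary illustrative choices, not from the text), $E_{X,Y}[f(X)]$ is computed once over the joint distribution and once over the marginal of X, and linearity is then verified for $f(x) = 3x + 1$.

```python
from fractions import Fraction

# Joint distribution p(x, y) of a dependent pair (arbitrary example).
joint = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 8),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4)}

f = lambda x: 3 * x + 1

# E_{X,Y}[f(X)]: sum over the joint distribution; f depends on x only.
lhs = sum(pxy * f(x) for (x, y), pxy in joint.items())

# Marginal p(x) = sum_y p(x, y), then E_X[f(X)] over the marginal.
marginal = {}
for (x, y), pxy in joint.items():
    marginal[x] = marginal.get(x, Fraction(0)) + pxy
rhs = sum(px * f(x) for x, px in marginal.items())

# Proposition 5: the unused variable Y can be dropped.
assert lhs == rhs

# Proposition 4 (linearity): E[3X + 1] = 3 E[X] + 1.
ex = sum(px * x for x, px in marginal.items())
assert rhs == 3 * ex + 1
print(lhs)  # 17/8
```

Here $p(X=0) = 5/8$, $p(X=1) = 3/8$, so $E[X] = 3/8$ and both sides equal $3 \cdot 3/8 + 1 = 17/8$.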
Remark 4. Given the notation and the conventions we established, we can easily conclude that, if L is linear, then $E[\cdot] \circ L = L \circ E[\cdot]$. Indeed,

    $(E[\cdot] \circ L)(X) = E[L(X)] = \sum_x p(x) L(x) = L\Big(\sum_x p(x)\,x\Big) = L(E[X]) = (L \circ E[\cdot])(X).$

For instance, if $L(X_1, \ldots, X_n) = \sum_{i=1}^n a_i X_i$, where $X_1, \ldots, X_n$ are random variables and $a_1, \ldots, a_n$ constants, then

    $(E[\cdot] \circ L)(X_1, \ldots, X_n) = E\Big[\sum_{i=1}^n a_i X_i\Big] = \sum_{i=1}^n a_i E[X_i] = (L \circ E[\cdot])(X_1, \ldots, X_n),$

by defining $E[\cdot](X_1, \ldots, X_n) = (E[X_1], \ldots, E[X_n])$. We can clean things up by using vectors. If $X = [X_1, \ldots, X_n]^T$ and $a = [a_1, \ldots, a_n]^T$, then our example becomes

    $(E[\cdot] \circ L)(X) = E[a^T X] = a^T E[X] = (L \circ E[\cdot])(X).$

Definition 7. Let X, Y be two random variables and f a function. The conditional mean of f(X, y) with respect to X given Y is defined as

    $E_{X|Y}[f(X, y)] = \sum_x p(x \mid y) f(x, y),$

where y is any fixed real number and not a random variable. We could, however, define a new random variable as $Z = E_{X|Y}[f(X, Y)]$. Note that we didn't write y, but Y this time. A more explicit way to write it would be $Z = E_{X|Y}[f(X, \cdot)](Y)$, which makes it clear that we're transforming the variable Y through the function $y \mapsto E_{X|Y}[f(X, y)]$.

Proposition 6. If X, Y are two random variables, then

    $E_{X,Y}[f(X, Y)] = E_Y\big[E_{X|Y}[f(X, Y)]\big].$
Proof. The proof is easy:

    $E_{X,Y}[f(X, Y)] = \sum_x \sum_y p(x, y) f(x, y) = \sum_x \sum_y p(x \mid y) p(y) f(x, y) = \sum_y p(y) \sum_x p(x \mid y) f(x, y) = E_Y\big[E_{X|Y}[f(X, Y)]\big].$

Proposition 7. If X, Y, C are random variables, then, for any fixed c,

    $E_{X,Y|C}[f(X, Y, c)] = E_{Y|C}\big[E_{X|Y,C}[f(X, Y, c)]\big].$

This is just a generalization of proposition 6.

Proof. This proof can be derived from the proof of proposition 6 just by adding "|C", "|c" and ", c" in the right places:

    $E_{X,Y|C}[f(X, Y, c)] = \sum_x \sum_y p(x, y \mid c) f(x, y, c) = \sum_x \sum_y p(x \mid y, c) p(y \mid c) f(x, y, c) = \sum_y p(y \mid c) \sum_x p(x \mid y, c) f(x, y, c) = E_{Y|C}\big[E_{X|Y,C}[f(X, Y, c)]\big].$

Corollary 1. If X, Y are independent random variables, then

    $E_{X,Y}[XY] = E_X[X]\,E_Y[Y].$

Proof. First of all, because X, Y are independent, in general, for any fixed y,

    $E_{X|Y}[f(X, y)] = \sum_x p(x \mid y) f(x, y) = \sum_x p(x) f(x, y) = E_X[f(X, y)].$

Now we can conclude the proof:

    $E_{X,Y}[XY] = E_Y\big[E_{X|Y}[XY]\big] = E_Y\big[E_X[XY]\big] = E_Y\big[Y\,E_X[X]\big] = E_X[X]\,E_Y[Y].$
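Proposition 6 and corollary 1 can both be verified on a small product distribution (a hedged sketch: the independent pair X, Y below is an arbitrary example). The dictionary `inner` tabulates the map $y \mapsto E_{X|Y}[XY]$, i.e. exactly the random variable $Z = E_{X|Y}[XY]$ of definition 7, and the outer mean over Y recovers $E_{X,Y}[XY]$.

```python
from fractions import Fraction
from itertools import product

# An independent pair: p(x, y) = p(x) p(y) (arbitrary small example).
px = {1: Fraction(1, 3), 2: Fraction(2, 3)}
py = {0: Fraction(1, 2), 3: Fraction(1, 2)}
joint = {(x, y): px[x] * py[y] for x, y in product(px, py)}

# Proposition 6: E_{X,Y}[XY] = E_Y[ E_{X|Y}[XY] ].
lhs = sum(p * x * y for (x, y), p in joint.items())
# inner[y] = E_{X|Y}[XY] for fixed y, with p(x|y) = p(x, y) / p(y).
inner = {y: sum(joint[(x, y)] / py[y] * x * y for x in px) for y in py}
rhs = sum(py[y] * inner[y] for y in py)
assert lhs == rhs

# Corollary 1: for independent X, Y, E[XY] = E[X] E[Y].
ex = sum(p * x for x, p in px.items())
ey = sum(p * y for y, p in py.items())
assert lhs == ex * ey
print(lhs)  # 5/2
```

With $E[X] = 5/3$ and $E[Y] = 3/2$, all three expressions agree at $5/2$, as the corollary predicts for an independent pair.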