Statistical Foundations of Machine Learning.pdf

发布时间：2022-06-08 发布人：admin 分类：说明书资料大小：2.16M 资料格式：pdf 举报版权申诉

第1页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第2页.png

第2页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第3页.png

第3页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第4页.png

第4页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第5页.png

第5页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第6页.png

第6页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第7页.png

第7页 / 共235页

cuijianzhu-4544807-Statistical Foundations of Machine Learning.pdf-第8页.png

第8页 / 共235页

文本预览

SYLLABUS Statistical foundations of machine learning Gianluca Bontempi Machine Learning Group Computer Science Department Universite Libre de Bruxelles, ULB Belgique

Contents Index 1 Introduction 1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Foundations of probability 2.1 The random model of uncertainty . . . . . . . . . . . . . . . . . . . . 2.1.1 Axiomatic deﬁnition of probability . . . . . . . . . . . . . . . 2.1.2 Symmetrical deﬁnition of probability . . . . . . . . . . . . . . 2.1.3 Frequentist deﬁnition of probability . . . . . . . . . . . . . . 2.1.4 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . 2.1.5 Independence and conditional probability . . . . . . . . . . . 2.1.6 Combined experiments . . . . . . . . . . . . . . . . . . . . . . 2.1.7 The law of total probability and the Bayes’ theorem . . . . . 2.1.8 Array of joint/marginal probabilities . . . . . . . . . . . . . . 2.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Parametric probability function . . . . . . . . . . . . . . . . . 2.3.2 Expected value, variance and standard deviation of a discrete r.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Moments of a discrete r.v. . . . . . . . . . . . . . . . . . . . . 2.3.4 Entropy and relative entropy . . . . . . . . . . . . . . . . . . 2.4 Continuous random variable . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Mean, variance, moments of a continuous r.v. . . . . . . . . . 2.5 Joint probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Marginal and conditional probability . . . . . . . . . . . . . . 2.6 Common discrete probability functions . . . . . . . . . . . . . . . . . 2.6.1 The Bernoulli trial . . . . . . . . . . . . . . . . . . . . . . . . 2.6.2 The Binomial probability function . . . . . . . . . . . . . . . 2.6.3 The Geometric probability function . . . . . . . . . . . . . . 2.6.4 The Poisson probability function . . . . . . . . . . . . . . . . 2.7 Common continuous distributions . . . . . . . . . . . . . . . . . . . . 2.7.1 Uniform distribution . . . . . . . . . . . . . . . . . . . . . . . 2.7.2 Exponential distribution . . . . . . . . . . . . . . . . . . . . . 2.7.3 The Gamma distribution . . . . . . . . . . . . . . . . . . . . 2.7.4 Normal distribution: the scalar case . . . . . . . . . . . . . . 2.7.5 The chi-squared distribution . . . . . . . . . . . . . . . . . . 2.7.6 Student’s t-distribution . . . . . . . . . . . . . . . . . . . . . 2.7.7 F-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7.8 Bivariate continuous distribution . . . . . . . . . . . . . . . . 2.7.9 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.8 Normal distribution: the multivariate case . . . . . . . . . . . . . . . 3 2 1 7 11 11 13 13 14 14 15 16 18 18 20 21 22 22 24 24 25 26 26 27 28 28 28 29 29 30 30 31 31 31 33 33 33 35 36 37

4 CONTENTS 2.8.1 2.9 Linear combinations of r.v. Bivariate normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.9.1 The sum of i.i.d. random variables . . . . . . . . . . . . . . . 2.10 Transformation of random variables . . . . . . . . . . . . . . . . . . 2.11 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . 2.12 The Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . 3 Classical parametric estimation 38 39 40 40 41 41 43 43 45 45 46 47 47 47 48 48 49 50 51 52 52 53 53 54 54 56 57 58 59 59 61 61 62 63 63 63 64 65 68 68 69 71 71 72 73 73 74 74 3.3.1 3.3.2 3.1 Classical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Empirical distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Plug-in principle to deﬁne an estimator . . . . . . . . . . . . . . . . Sample average . . . . . . . . . . . . . . . . . . . . . . . . . . Sample variance . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Sampling distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 The assessment of an estimator . . . . . . . . . . . . . . . . . . . . . 3.5.1 Bias and variance . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.2 Bias and variance of ˆµ . . . . . . . . . . . . . . . . . . . . . . 3.5.3 Bias of the estimator ˆσ2 . . . . . . . . . . . . . . . . . . . . . 3.5.4 Bias/variance decomposition of MSE . . . . . . . . . . . . . . 3.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.6 Eﬃciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Suﬃciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.7 3.6 The Hoeﬀding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . 3.7 Sampling distributions for Gaussian r.v.s . . . . . . . . . . . . . . . . 3.8 The principle of maximum likelihood . . . . . . . . . . . . . . . . . . 3.8.1 Maximum likelihood computation . . . . . . . . . . . . . . . 3.8.2 Properties of m.l. estimators . . . . . . . . . . . . . . . . . . 3.8.3 Cramer-Rao lower bound . . . . . . . . . . . . . . . . . . . . Interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.9.1 Conﬁdence interval of µ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.10.1 Combination of m estimators . . . . . . . . . . . . . . . . . . 3.11 Testing hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11.1 Types of hypothesis 3.11.2 Types of statistical test . . . . . . . . . . . . . . . . . . . . . 3.11.3 Pure signiﬁcance test . . . . . . . . . . . . . . . . . . . . . . . 3.11.4 Tests of signiﬁcance . . . . . . . . . . . . . . . . . . . . . . . 3.11.5 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . 3.11.6 Receiver Operating Characteristic curve . . . . . . . . . . . . 3.11.7 Choice of test . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11.8 UMP level-α test . . . . . . . . . . . . . . . . . . . . . . . . . 3.11.9 Likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.12.1 z-test (single and one-sided) . . . . . . . . . . . . . . . . . . . 3.12.2 t-test: single sample and two-sided . . . . . . . . . . . . . . . 3.12.3 χ2-test: single sample and two-sided . . . . . . . . . . . . . . 3.12.4 t-test: two samples, two sided . . . . . . . . . . . . . . . . . . 3.12.5 F-test: two samples, two sided . . . . . . . . . . . . . . . . . 3.10 Combination of two estimators 3.12 Parametric tests 3.9

CONTENTS 4 Nonparametric estimation and testing 4.3.1 4.1 Nonparametric methods . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Estimation of arbitrary statistics . . . . . . . . . . . . . . . . . . . . 4.3 Jacknife . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jacknife estimation . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Bootstrap sampling 4.4.2 Bootstrap estimate of the variance . . . . . . . . . . . . . . . 4.4.3 Bootstrap estimate of bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 The bootstrap principle . . . . . . . . . . . . . . . . . . . . . 4.6 Randomization tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.1 Randomization and bootstrap . . . . . . . . . . . . . . . . . . 4.7 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8 Considerations on nonparametric tests . . . . . . . . . . . . . . . . . 4.5 Bootstrap conﬁdence interval 5 75 75 76 77 77 79 79 79 80 81 81 82 83 84 84 5 A statistical framework of supervised learning 5.3.1 5.4.1 An illustrative example 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Estimating dependencies . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 The problem of classiﬁcation . . . . . . . . . . . . . . . . . . . . . . Inverse conditional distribution . . . . . . . . . . . . . . . . . 5.4 The problem of regression estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 87 90 92 94 96 96 . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.5.1 The decomposition of the generalization error in regression . 100 5.5.2 The decomposition of the generalization error in classiﬁcation 101 5.6 The supervised learning procedure . . . . . . . . . . . . . . . . . . . 102 5.7 Validation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.7.1 The resampling methods . . . . . . . . . . . . . . . . . . . . . 104 . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.5 Generalization error 5.8 Concluding remarks 6 The machine learning procedure 107 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.3 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.4 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.5 The dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.6 Parametric identiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.6.1 Error functions . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.6.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . 110 6.7 Structural identiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . 114 6.7.1 Model generation . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.7.2 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.7.3 Model selection criteria . . . . . . . . . . . . . . . . . . . . . 120 . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.8 Concluding remarks 7 Linear approaches 123 7.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.1.1 The univariate linear model . . . . . . . . . . . . . . . . . . . 123 7.1.2 Least-squares estimation . . . . . . . . . . . . . . . . . . . . . 124 7.1.3 Maximum likelihood estimation . . . . . . . . . . . . . . . . . 126 7.1.4 Partitioning the variability . . . . . . . . . . . . . . . . . . . 126 7.1.5 Test of hypotheses on the regression model . . . . . . . . . . 126 . . . . . . . . . . . . . . . . . . . . . . 127 7.1.6 Interval of conﬁdence

6 CONTENTS 7.1.7 Variance of the response . . . . . . . . . . . . . . . . . . . . . 128 7.1.8 Coeﬃcient of determination . . . . . . . . . . . . . . . . . . . 129 7.1.9 Multiple linear dependence . . . . . . . . . . . . . . . . . . . 129 7.1.10 The multiple linear regression model . . . . . . . . . . . . . . 129 7.1.11 The least-squares solution . . . . . . . . . . . . . . . . . . . . 130 7.1.12 Variance of the prediction . . . . . . . . . . . . . . . . . . . . 131 7.1.13 The HAT matrix . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.1.14 Generalization error of the linear model . . . . . . . . . . . . 131 7.1.15 The expected empirical error . . . . . . . . . . . . . . . . . . 131 7.1.16 The PSE and the FPE . . . . . . . . . . . . . . . . . . . . . 133 7.2 The PRESS statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.3 The weighted least-squares . . . . . . . . . . . . . . . . . . . . . . . 138 7.3.1 Recursive least-squares . . . . . . . . . . . . . . . . . . . . . . 139 7.4 Discriminant functions for classiﬁcation . . . . . . . . . . . . . . . . 141 7.4.1 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 7.4.2 Support vector machines . . . . . . . . . . . . . . . . . . . . . 147 8 Nonlinear approaches 153 8.1 Nonlinear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 8.1.1 Artiﬁcial neural networks . . . . . . . . . . . . . . . . . . . . 156 8.1.2 From global modeling to divide-and-conquer . . . . . . . . . . 162 8.1.3 Classiﬁcation and Regression Trees . . . . . . . . . . . . . . . 163 8.1.4 Basis Function Networks . . . . . . . . . . . . . . . . . . . . . 167 8.1.5 Radial Basis Functions . . . . . . . . . . . . . . . . . . . . . . 167 8.1.6 Local Model Networks . . . . . . . . . . . . . . . . . . . . . . 168 8.1.7 Neuro-Fuzzy Inference Systems . . . . . . . . . . . . . . . . . 170 8.1.8 Learning in Basis Function Networks . . . . . . . . . . . . . . 171 8.1.9 From modular techniques to local modeling . . . . . . . . . . 174 8.1.10 Local modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 175 8.2 Nonlinear classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . 184 . . . . . . . . . . . . . . . . . . . . . . 186 SVM for nonlinear classiﬁcation . . . . . . . . . . . . . . . . . 187 8.2.1 Naive Bayes classiﬁer 8.2.2 9 Model averaging approaches 189 9.1 Stacked regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 9.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 9.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 9.3.1 The Ada Boost algorithm . . . . . . . . . . . . . . . . . . . . 193 9.3.2 The arcing algorithm . . . . . . . . . . . . . . . . . . . . . . . 195 9.3.3 Bagging and boosting . . . . . . . . . . . . . . . . . . . . . . 195 10 Conclusions 197 10.1 Causality and dependencies . . . . . . . . . . . . . . . . . . . . . . . 198 A Unsupervised learning 201 A.1 Probability density estimation . . . . . . . . . . . . . . . . . . . . . . 201 A.1.1 Nonparametric density estimation . . . . . . . . . . . . . . . 201 A.1.2 Semi-parametric density estimation . . . . . . . . . . . . . . . 203 A.2 K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 A.3 Fuzzy clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 A.4 Fuzzy c-ellyptotypes . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

CONTENTS 7 B Some statistical notions 211 B.1 Useful relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 B.2 Convergence of random variables . . . . . . . . . . . . . . . . . . . . 211 B.3 Limits and probability . . . . . . . . . . . . . . . . . . . . . . . . . . 212 B.4 Expected value of a quadratic form . . . . . . . . . . . . . . . . . . . 212 B.5 The matrix inversion formula . . . . . . . . . . . . . . . . . . . . . . 213 B.6 Proof of Eq. (5.20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 B.7 Biasedness of the quadratic empirical risk . . . . . . . . . . . . . . . 213 C Kernel functions 215 D Datasets 217 D.1 USPS dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 D.2 Golub dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 E R scripts 219

8 CONTENTS

分享到：

赞收藏

资料库

Statistical Foundations of Machine Learning.pdf

相关推荐

课程资源

热门标签

最新资料