
Mathematics for Machine Learning【Marc Peter Deisenroth】.pdf

Contents
Table of Symbols
Foreword
1. Introduction and Motivation
1.1 Finding Words for Intuitions
1.2 Two Ways to Read this Book
1.3 Exercises and Feedback
2. Linear Algebra
2.1 Systems of Linear Equations
2.2 Matrices
2.2.1 Matrix Addition and Multiplication
2.2.2 Inverse and Transpose
2.2.3 Multiplication by a Scalar
2.2.4 Compact Representations of Systems of Linear Equations
2.3 Solving Systems of Linear Equations
2.3.1 Particular and General Solution
2.3.2 Elementary Transformations
2.3.3 The Minus-1 Trick
2.3.4 Algorithms for Solving a System of Linear Equations
2.4 Vector Spaces
2.4.1 Groups
2.4.2 Vector Spaces
2.4.3 Vector Subspaces
2.5 Linear Independence
2.6 Basis and Rank
2.6.1 Generating Set and Basis
2.6.2 Rank
2.7 Linear Mappings
2.7.1 Matrix Representation of Linear Mappings
2.7.2 Basis Change
2.7.3 Image and Kernel
2.8 Affine Spaces
2.8.1 Affine Subspaces
2.8.2 Affine Mappings
3. Analytic Geometry
3.1 Norms
3.2 Inner Products
3.2.1 Dot Product
3.2.2 General Inner Products
3.2.3 Symmetric, Positive Definite Matrices
3.3 Lengths and Distance
3.4 Angles and Orthogonality
3.5 Orthonormal Basis
3.6 Inner Product of Functions
3.7 Orthogonal Projections
3.7.1 Projection onto 1-Dimensional Subspaces (Lines)
3.7.2 Projection onto General Subspaces
3.7.3 Projection onto Affine Subspaces
3.8 Rotations
3.8.1 Rotations in R2
3.8.2 Rotations in R3
3.8.3 Rotations in n Dimensions
3.8.4 Properties of Rotations
3.9 Further Reading
4. Matrix Decompositions
4.1 Determinant and Trace
4.2 Eigenvalues and Eigenvectors
4.3 Cholesky Decomposition
4.4 Eigendecomposition and Diagonalization
4.5 Singular Value Decomposition
4.5.1 Geometric Intuition for the SVD
4.5.2 Existence and Construction of the SVD
4.5.3 Eigenvalue Decomposition vs SVD
4.6 Matrix Approximation
4.7 Matrix Phylogeny
4.8 Further Reading
5. Vector Calculus
5.1 Differentiation of Univariate Functions
5.1.1 Taylor Series
5.1.2 Differentiation Rules
5.2 Partial Differentiation and Gradients
5.2.1 Basic Rules of Partial Differentiation
5.2.2 Chain Rule
5.3 Gradients of Vector-Valued Functions
5.4 Gradients of Matrices
5.5 Useful Identities for Computing Gradients
5.6 Backpropagation and Automatic Differentiation
5.6.1 Gradients in a Deep Network
5.6.2 Automatic Differentiation
5.7 Higher-order Derivatives
5.8 Linearization and Multivariate Taylor Series
5.9 Further Reading
6. Probability and Distributions
6.1 Construction of a Probability Space
6.1.1 Philosophical Issues
6.1.2 Probability and Random Variables
6.1.3 Statistics
6.2 Discrete and Continuous Probabilities
6.2.1 Discrete Probabilities
6.2.2 Continuous Probabilities
6.2.3 Contrasting Discrete and Continuous Distributions
6.3 Sum Rule, Product Rule and Bayes' Theorem
6.4 Summary Statistics and Independence
6.4.1 Means and Covariances
6.4.2 Empirical Means and Covariances
6.4.3 Three Expressions for the Variance
6.4.4 Sums and Transformations of Random Variables
6.4.5 Statistical Independence
6.4.6 Inner Products of Random Variables
6.5 Gaussian Distribution
6.5.1 Marginals and Conditionals of Gaussians are Gaussians
6.5.2 Product of Gaussian Densities
6.5.3 Sums and Linear Transformations
6.5.4 Sampling from Multivariate Gaussian Distributions
6.6 Conjugacy and the Exponential Family
6.6.1 Conjugacy
6.6.2 Sufficient Statistics
6.6.3 Exponential Family
6.7 Change of Variables/Inverse Transform
6.7.1 Distribution Function Technique
6.7.2 Change of Variables
6.8 Further Reading
7. Continuous Optimization
7.1 Optimization using Gradient Descent
7.1.1 Stepsize
7.1.2 Gradient Descent with Momentum
7.1.3 Stochastic Gradient Descent
7.2 Constrained Optimization and Lagrange Multipliers
7.3 Convex Optimization
7.3.1 Linear Programming
7.3.2 Quadratic Programming
7.3.3 Legendre-Fenchel Transform and Convex Conjugate
7.4 Further Reading
8. When Models meet Data
8.1 Empirical Risk Minimization
8.1.1 Hypothesis Class of Functions
8.1.2 Loss Functions for Training
8.1.3 Regularization to Reduce Overfitting
8.1.4 Cross Validation to Assess the Generalization Performance
8.2 Parameter Estimation
8.2.1 Maximum Likelihood Estimation
8.2.2 Maximum A Posteriori Estimation
8.3 Probabilistic Modeling
8.3.1 MLE, MAP, and Bayesian Inference
8.3.2 Latent Variables
8.4 Directed Graphical Models
8.4.1 Graph Semantics
8.4.2 Conditional Independence and d-Separation
8.5 Model Selection
8.5.1 Nested Cross Validation
8.5.2 Bayesian Model Selection
8.5.3 Bayes Factors for Model Comparison
9. Linear Regression
9.1 Problem Formulation
9.2 Parameter Estimation
9.2.1 Maximum Likelihood Estimation
9.2.2 Overfitting in Linear Regression
9.2.3 Maximum A Posteriori Estimation
9.2.4 MAP Estimation as Regularization
9.3 Bayesian Linear Regression
9.3.1 Model
9.3.2 Prior Predictions
9.3.3 Posterior Distribution
9.3.4 Posterior Predictions
9.3.5 Computing the Marginal Likelihood
9.4 Maximum Likelihood as Orthogonal Projection
9.5 Further Reading
10. Dimensionality Reduction with Principal Component Analysis
10.1 Problem Setting
10.2 Maximum Variance Perspective
10.2.1 Direction with Maximal Variance
10.2.2 M-dimensional Subspace with Maximal Variance
10.3 Projection Perspective
10.3.1 Setting and Objective
10.3.2 Finding Optimal Coordinates
10.3.3 Finding the Basis of the Principal Subspace
10.4 Eigenvector Computation and Low-Rank Approximations
10.4.1 PCA using Low-rank Matrix Approximations
10.4.2 Practical Aspects
10.5 PCA in High Dimensions
10.6 Key Steps of PCA in Practice
10.7 Latent Variable Perspective
10.7.1 Generative Process and Probabilistic Model
10.7.2 Likelihood and Joint Distribution
10.7.3 Posterior Distribution
10.8 Further Reading
11. Density Estimation with Gaussian Mixture Models
11.1 Gaussian Mixture Model
11.2 Parameter Learning via Maximum Likelihood
11.2.1 Responsibilities
11.2.2 Updating the Means
11.2.3 Updating the Covariances
11.2.4 Updating the Mixture Weights
11.3 EM Algorithm
11.4 Latent Variable Perspective
11.4.1 Generative Process and Probabilistic Model
11.4.2 Likelihood
11.4.3 Posterior Distribution
11.4.4 Extension to a Full Dataset
11.4.5 EM Algorithm Revisited
11.5 Further Reading
12. Classification with Support Vector Machines
12.1 Separating Hyperplanes
12.2 Primal Support Vector Machine
12.2.1 Concept of the Margin
12.2.2 Traditional Derivation of the Margin
12.2.3 Why we can set the Margin to 1
12.2.4 Soft Margin SVM: Geometric View
12.2.5 Soft Margin SVM: Loss Function View
12.3 Dual Support Vector Machine
12.3.1 Convex Duality via Lagrange Multipliers
12.3.2 Dual SVM: Convex Hull View
12.4 Kernels
12.5 Numerical Solution
12.6 Further Reading
Index
References
Table of Symbols

Symbol : Meaning
a, b, c, α, β, γ : Scalars are lowercase
x, y, z : Vectors are bold lowercase
A, B, C : Matrices are bold uppercase
x^T, A^T : Transpose of a vector or matrix
A^-1 : Inverse of a matrix
⟨x, y⟩ : Inner product of x and y
x^T y : Dot product of x and y
[x, y, z] : Matrix of column vectors stacked horizontally
(x, y, z) : (Ordered) tuple
R^n : n-dimensional vector space of real numbers
a := b : a is defined as b
a =: b : b is defined as a
a ∝ b : a is proportional to b, i.e., a = const. · b
g ∘ f : Function composition; "g after f"
∇ : Gradient
L : Lagrangian
L : Negative log-likelihood
dim : Dimensionality of vector space
rk(A) : Rank of matrix A
Im(·) : Image of a linear mapping
ker(·) : Kernel (null space) of a linear mapping
span[b1] : Span (generating set) of b1
det(A) : Determinant of A
tr(A) : Trace of A
|·| : Absolute value
‖·‖ : Norm; Euclidean unless specified
A, C : Sets
a ∈ A : a is an element of the set A
B : Basis set
∅ : Empty set
λ : Eigenvalue
E_λ : Eigenspace of eigenvalue λ
e_i : Standard/canonical vector (where i is the component that is 1)
D : Number of dimensions; indexed by d = 1, ..., D
N : Number of data points; indexed by n = 1, ..., N
θ : Parameter vector
I_m : Identity matrix of size m × m
0_{m,n} : Matrix of zeros of size m × n
1_{m,n} : Matrix of ones of size m × n
(n k) : Binomial coefficient, n choose k
V[·] : Variance of the argument
E[·] : Expectation of the argument
Cov[·] : Covariance of the argument
N(µ, Σ) : Gaussian distribution with mean µ and covariance Σ
Ber(µ) : Bernoulli distribution with parameter µ
Bin(N, µ) : Binomial distribution with parameters N, µ
x ∼ p(θ) : Random variable x is distributed according to p(θ)