
Mathematics for Machine Learning【Marc Peter Deisenroth】.pdf

Contents
Table of Symbols
Foreword
1. Introduction and Motivation
1.1 Finding Words for Intuitions
1.2 Two Ways to Read this Book
1.3 Exercises and Feedback
2. Linear Algebra
2.1 Systems of Linear Equations
2.2 Matrices
2.2.1 Matrix Addition and Multiplication
2.2.2 Inverse and Transpose
2.2.3 Multiplication by a Scalar
2.2.4 Compact Representations of Systems of Linear Equations
2.3 Solving Systems of Linear Equations
2.3.1 Particular and General Solution
2.3.2 Elementary Transformations
2.3.3 The Minus-1 Trick
2.3.4 Algorithms for Solving a System of Linear Equations
2.4 Vector Spaces
2.4.1 Groups
2.4.2 Vector Spaces
2.4.3 Vector Subspaces
2.5 Linear Independence
2.6 Basis and Rank
2.6.1 Generating Set and Basis
2.6.2 Rank
2.7 Linear Mappings
2.7.1 Matrix Representation of Linear Mappings
2.7.2 Basis Change
2.7.3 Image and Kernel
2.8 Affine Spaces
2.8.1 Affine Subspaces
2.8.2 Affine Mappings
3. Analytic Geometry
3.1 Norms
3.2 Inner Products
3.2.1 Dot Product
3.2.2 General Inner Products
3.2.3 Symmetric, Positive Definite Matrices
3.3 Lengths and Distance
3.4 Angles and Orthogonality
3.5 Orthonormal Basis
3.6 Inner Product of Functions
3.7 Orthogonal Projections
3.7.1 Projection onto 1-Dimensional Subspaces (Lines)
3.7.2 Projection onto General Subspaces
3.7.3 Projection onto Affine Subspaces
3.8 Rotations
3.8.1 Rotations in R2
3.8.2 Rotations in R3
3.8.3 Rotations in n Dimensions
3.8.4 Properties of Rotations
3.9 Further Reading
4. Matrix Decompositions
4.1 Determinant and Trace
4.2 Eigenvalues and Eigenvectors
4.3 Cholesky Decomposition
4.4 Eigendecomposition and Diagonalization
4.5 Singular Value Decomposition
4.5.1 Geometric Intuition for the SVD
4.5.2 Existence and Construction of the SVD
4.5.3 Eigenvalue Decomposition vs SVD
4.6 Matrix Approximation
4.7 Matrix Phylogeny
4.8 Further Reading
5. Vector Calculus
5.1 Differentiation of Univariate Functions
5.1.1 Taylor Series
5.1.2 Differentiation Rules
5.2 Partial Differentiation and Gradients
5.2.1 Basic Rules of Partial Differentiation
5.2.2 Chain Rule
5.3 Gradients of Vector-Valued Functions
5.4 Gradients of Matrices
5.5 Useful Identities for Computing Gradients
5.6 Backpropagation and Automatic Differentiation
5.6.1 Gradients in a Deep Network
5.6.2 Automatic Differentiation
5.7 Higher-order Derivatives
5.8 Linearization and Multivariate Taylor Series
5.9 Further Reading
6. Probability and Distributions
6.1 Construction of a Probability Space
6.1.1 Philosophical Issues
6.1.2 Probability and Random Variables
6.1.3 Statistics
6.2 Discrete and Continuous Probabilities
6.2.1 Discrete Probabilities
6.2.2 Continuous Probabilities
6.2.3 Contrasting Discrete and Continuous Distributions
6.3 Sum Rule, Product Rule and Bayes' Theorem
6.4 Summary Statistics and Independence
6.4.1 Means and Covariances
6.4.2 Empirical Means and Covariances
6.4.3 Three Expressions for the Variance
6.4.4 Sums and Transformations of Random Variables
6.4.5 Statistical Independence
6.4.6 Inner Products of Random Variables
6.5 Gaussian Distribution
6.5.1 Marginals and Conditionals of Gaussians are Gaussians
6.5.2 Product of Gaussian Densities
6.5.3 Sums and Linear Transformations
6.5.4 Sampling from Multivariate Gaussian Distributions
6.6 Conjugacy and the Exponential Family
6.6.1 Conjugacy
6.6.2 Sufficient Statistics
6.6.3 Exponential Family
6.7 Change of Variables/Inverse Transform
6.7.1 Distribution Function Technique
6.7.2 Change of Variables
6.8 Further Reading
7. Continuous Optimization
7.1 Optimization using Gradient Descent
7.1.1 Stepsize
7.1.2 Gradient Descent with Momentum
7.1.3 Stochastic Gradient Descent
7.2 Constrained Optimization and Lagrange Multipliers
7.3 Convex Optimization
7.3.1 Linear Programming
7.3.2 Quadratic Programming
7.3.3 Legendre-Fenchel Transform and Convex Conjugate
7.4 Further Reading
8. When Models meet Data
8.1 Empirical Risk Minimization
8.1.1 Hypothesis Class of Functions
8.1.2 Loss Functions for Training
8.1.3 Regularization to Reduce Overfitting
8.1.4 Cross Validation to Assess the Generalization Performance
8.2 Parameter Estimation
8.2.1 Maximum Likelihood Estimation
8.2.2 Maximum A Posteriori Estimation
8.3 Probabilistic Modeling
8.3.1 MLE, MAP, and Bayesian Inference
8.3.2 Latent Variables
8.4 Directed Graphical Models
8.4.1 Graph Semantics
8.4.2 Conditional Independence and d-Separation
8.5 Model Selection
8.5.1 Nested Cross Validation
8.5.2 Bayesian Model Selection
8.5.3 Bayes Factors for Model Comparison
9. Linear Regression
9.1 Problem Formulation
9.2 Parameter Estimation
9.2.1 Maximum Likelihood Estimation
9.2.2 Overfitting in Linear Regression
9.2.3 Maximum A Posteriori Estimation
9.2.4 MAP Estimation as Regularization
9.3 Bayesian Linear Regression
9.3.1 Model
9.3.2 Prior Predictions
9.3.3 Posterior Distribution
9.3.4 Posterior Predictions
9.3.5 Computing the Marginal Likelihood
9.4 Maximum Likelihood as Orthogonal Projection
9.5 Further Reading
10. Dimensionality Reduction with Principal Component Analysis
10.1 Problem Setting
10.2 Maximum Variance Perspective
10.2.1 Direction with Maximal Variance
10.2.2 M-dimensional Subspace with Maximal Variance
10.3 Projection Perspective
10.3.1 Setting and Objective
10.3.2 Finding Optimal Coordinates
10.3.3 Finding the Basis of the Principal Subspace
10.4 Eigenvector Computation and Low-Rank Approximations
10.4.1 PCA using Low-rank Matrix Approximations
10.4.2 Practical Aspects
10.5 PCA in High Dimensions
10.6 Key Steps of PCA in Practice
10.7 Latent Variable Perspective
10.7.1 Generative Process and Probabilistic Model
10.7.2 Likelihood and Joint Distribution
10.7.3 Posterior Distribution
10.8 Further Reading
11. Density Estimation with Gaussian Mixture Models
11.1 Gaussian Mixture Model
11.2 Parameter Learning via Maximum Likelihood
11.2.1 Responsibilities
11.2.2 Updating the Means
11.2.3 Updating the Covariances
11.2.4 Updating the Mixture Weights
11.3 EM Algorithm
11.4 Latent Variable Perspective
11.4.1 Generative Process and Probabilistic Model
11.4.2 Likelihood
11.4.3 Posterior Distribution
11.4.4 Extension to a Full Dataset
11.4.5 EM Algorithm Revisited
11.5 Further Reading
12. Classification with Support Vector Machines
12.1 Separating Hyperplanes
12.2 Primal Support Vector Machine
12.2.1 Concept of the Margin
12.2.2 Traditional Derivation of the Margin
12.2.3 Why we can set the Margin to 1
12.2.4 Soft Margin SVM: Geometric View
12.2.5 Soft Margin SVM: Loss Function View
12.3 Dual Support Vector Machine
12.3.1 Convex Duality via Lagrange Multipliers
12.3.2 Dual SVM: Convex Hull View
12.4 Kernels
12.5 Numerical Solution
12.6 Further Reading
Index
References
Table of Symbols

Symbol : Meaning
a, b, c, α, β, γ : Scalars are lowercase
x, y, z : Vectors are bold lowercase
A, B, C : Matrices are bold uppercase
x^T, A^T : Transpose of a vector or matrix
A^-1 : Inverse of a matrix
⟨x, y⟩ : Inner product of x and y
x^T y : Dot product of x and y
[x, y, z] : Matrix of column vectors stacked horizontally
(x, y, z) : (Ordered) tuple
R^n : n-dimensional vector space of real numbers
a := b : a is defined as b
a =: b : b is defined as a
a ∝ b : a is proportional to b, i.e., a = const. · b
g ∘ f : Function composition; "g after f"
∇ : Gradient
L : Lagrangian
L : Negative log-likelihood
dim : Dimensionality of vector space
rk(A) : Rank of matrix A
Im(·) : Image of a linear mapping
ker(·) : Kernel (null space) of a linear mapping
span[b1] : Span (generating set) of b1
det(A) : Determinant of A
tr(A) : Trace of A
|·| : Absolute value
‖·‖ : Norm; Euclidean unless specified
A, C : Sets
a ∈ A : a is an element of the set A
B : Basis set
∅ : Empty set
λ : Eigenvalue
E_λ : Eigenspace of eigenvalue λ
e_i : Standard/canonical vector (where i is the component that is 1)
D : Number of dimensions; indexed by d = 1, ..., D
N : Number of data points; indexed by n = 1, ..., N
θ : Parameter vector
I_m : Identity matrix of size m × m
0_{m,n} : Matrix of zeros of size m × n
1_{m,n} : Matrix of ones of size m × n
(n k) : Binomial coefficient, n choose k
V[·] : Variance of the argument
E[·] : Expectation of the argument
Cov[·] : Covariance of the argument
N(µ, Σ) : Gaussian distribution with mean µ and covariance Σ
Ber(µ) : Bernoulli distribution with parameter µ
Bin(N, µ) : Binomial distribution with parameters N, µ
x ∼ p(θ) : Random variable x is distributed according to p(θ)