Contents
Table of Symbols
Foreword
1. Introduction and Motivation
1.1 Finding Words for Intuitions
1.2 Two Ways to Read this Book
1.3 Exercises and Feedback
2. Linear Algebra
2.1 Systems of Linear Equations
2.2 Matrices
2.2.1 Matrix Addition and Multiplication
2.2.2 Inverse and Transpose
2.2.3 Multiplication by a Scalar
2.2.4 Compact Representations of Systems of Linear Equations
2.3 Solving Systems of Linear Equations
2.3.1 Particular and General Solution
2.3.2 Elementary Transformations
2.3.3 The Minus-1 Trick
2.3.4 Algorithms for Solving a System of Linear Equations
2.4 Vector Spaces
2.4.1 Groups
2.4.2 Vector Spaces
2.4.3 Vector Subspaces
2.5 Linear Independence
2.6 Basis and Rank
2.6.1 Generating Set and Basis
2.6.2 Rank
2.7 Linear Mappings
2.7.1 Matrix Representation of Linear Mappings
2.7.2 Basis Change
2.7.3 Image and Kernel
2.8 Affine Spaces
2.8.1 Affine Subspaces
2.8.2 Affine Mappings
3. Analytic Geometry
3.1 Norms
3.2 Inner Products
3.2.1 Dot Product
3.2.2 General Inner Products
3.2.3 Symmetric, Positive Definite Matrices
3.3 Lengths and Distances
3.4 Angles and Orthogonality
3.5 Orthonormal Basis
3.6 Inner Product of Functions
3.7 Orthogonal Projections
3.7.1 Projection onto 1-Dimensional Subspaces (Lines)
3.7.2 Projection onto General Subspaces
3.7.3 Projection onto Affine Subspaces
3.8 Rotations
3.8.1 Rotations in ℝ²
3.8.2 Rotations in ℝ³
3.8.3 Rotations in n Dimensions
3.8.4 Properties of Rotations
3.9 Further Reading
4. Matrix Decompositions
4.1 Determinants and Trace
4.2 Eigenvalues and Eigenvectors
4.3 Cholesky Decomposition
4.4 Eigendecomposition and Diagonalization
4.5 Singular Value Decomposition
4.5.1 Geometric Intuition for the SVD
4.5.2 Existence and Construction of the SVD
4.5.3 Eigenvalue Decomposition vs. SVD
4.6 Matrix Approximation
4.7 Matrix Phylogeny
4.8 Further Reading
5. Vector Calculus
5.1 Differentiation of Univariate Functions
5.1.1 Taylor Series
5.1.2 Differentiation Rules
5.2 Partial Differentiation and Gradients
5.2.1 Basic Rules of Partial Differentiation
5.2.2 Chain Rule
5.3 Gradients of Vector-Valued Functions
5.4 Gradients of Matrices
5.5 Useful Identities for Computing Gradients
5.6 Backpropagation and Automatic Differentiation
5.6.1 Gradients in a Deep Network
5.6.2 Automatic Differentiation
5.7 Higher-order Derivatives
5.8 Linearization and Multivariate Taylor Series
5.9 Further Reading
6. Probability and Distributions
6.1 Construction of a Probability Space
6.1.1 Philosophical Issues
6.1.2 Probability and Random Variables
6.1.3 Statistics
6.2 Discrete and Continuous Probabilities
6.2.1 Discrete Probabilities
6.2.2 Continuous Probabilities
6.2.3 Contrasting Discrete and Continuous Distributions
6.3 Sum Rule, Product Rule, and Bayes' Theorem
6.4 Summary Statistics and Independence
6.4.1 Mean and Covariances
6.4.2 Empirical Means and Covariances
6.4.3 Three Expressions for the Variance
6.4.4 Sums and Transformations of Random Variables
6.4.5 Statistical Independence
6.4.6 Inner Products of Random Variables
6.5 Gaussian Distribution
6.5.1 Marginals and Conditionals of Gaussians are Gaussians
6.5.2 Product of Gaussian Densities
6.5.3 Sums and Linear Transformations
6.6 Conjugacy and the Exponential Family
6.6.1 Conjugacy
6.6.2 Sufficient Statistics
6.6.3 Exponential Family
6.7 Change of Variables/Inverse Transform
6.7.1 Distribution Function Technique
6.7.2 Change of Variables
6.8 Further Reading
7. Continuous Optimization
7.1 Optimization using Gradient Descent
7.1.1 Stepsize
7.1.2 Gradient Descent with Momentum
7.1.3 Stochastic Gradient Descent
7.2 Constrained Optimization and Lagrange Multipliers
7.3 Convex Optimization
7.3.1 Linear Programming
7.3.2 Quadratic Programming
7.3.3 Legendre-Fenchel Transform and Convex Conjugate
7.4 Further Reading
8. When Models Meet Data
8.1 Empirical Risk Minimization
8.1.1 Hypothesis Class of Functions
8.1.2 Loss Functions for Training
8.1.3 Regularization to Reduce Overfitting
8.1.4 Cross-Validation to Assess the Generalization Performance
8.2 Parameter Estimation
8.2.1 Maximum Likelihood Estimation
8.2.2 Maximum A Posteriori Estimation
8.3 Probabilistic Modeling
8.3.1 MLE, MAP, and Bayesian Inference
8.3.2 Latent Variables
8.4 Directed Graphical Models
8.4.1 Graph Semantics
8.4.2 Conditional Independence and D-Separation
8.5 Model Selection
8.5.1 Nested Cross-Validation
8.5.2 Bayesian Model Selection
8.5.3 Bayes Factors for Model Comparison
9. Linear Regression
9.1 Problem Formulation
9.2 Parameter Estimation
9.2.1 Maximum Likelihood Estimation
9.2.2 Overfitting in Linear Regression
9.2.3 Maximum A Posteriori Estimation
9.2.4 MAP Estimation as Regularization
9.3 Bayesian Linear Regression
9.3.1 Model
9.3.2 Prior Predictions
9.3.3 Posterior Distribution
9.3.4 Posterior Predictions
9.3.5 Computing the Marginal Likelihood
9.4 Maximum Likelihood as Orthogonal Projection
9.5 Further Reading
10. Dimensionality Reduction with Principal Component Analysis
10.1 Problem Setting
10.2 Maximum Variance Perspective
10.2.1 Direction with Maximal Variance
10.2.2 M-dimensional Subspace with Maximal Variance
10.3 Projection Perspective
10.3.1 Setting and Objective
10.3.2 Finding Optimal Coordinates
10.3.3 Finding the Basis of the Principal Subspace
10.4 Eigenvector Computation and Low-Rank Approximations
10.4.1 PCA Using Low-Rank Matrix Approximations
10.4.2 Practical Aspects
10.5 PCA in High Dimensions
10.6 Key Steps of PCA in Practice
10.7 Latent Variable Perspective
10.7.1 Generative Process and Probabilistic Model
10.7.2 Likelihood and Joint Distribution
10.7.3 Posterior Distribution
10.8 Further Reading
11. Density Estimation with Gaussian Mixture Models
11.1 Gaussian Mixture Model
11.2 Parameter Learning via Maximum Likelihood
11.2.1 Responsibilities
11.2.2 Updating the Means
11.2.3 Updating the Covariances
11.2.4 Updating the Mixture Weights
11.3 EM Algorithm
11.4 Latent Variable Perspective
11.4.1 Generative Process and Probabilistic Model
11.4.2 Likelihood
11.4.3 Posterior Distribution
11.4.4 Extension to a Full Dataset
11.4.5 EM Algorithm Revisited
11.5 Further Reading
12. Classification with Support Vector Machines
12.1 Separating Hyperplanes
12.2 Primal Support Vector Machine
12.2.1 Concept of the Margin
12.2.2 Traditional Derivation of the Margin
12.2.3 Why We Can Set the Margin to 1
12.2.4 Soft Margin SVM: Geometric View
12.2.5 Soft Margin SVM: Loss Function View
12.3 Dual Support Vector Machine
12.3.1 Convex Duality via Lagrange Multipliers
12.3.2 Dual SVM: Convex Hull View
12.4 Kernels
12.5 Numerical Solution
12.6 Further Reading
References
Index