
Understanding Machine Learning - From Theory to Algorithms.pdf

Cover
Halftitle
Title
Copyright
Dedication
Contents
Preface
1 Introduction
1.1 What Is Learning?
1.2 When Do We Need Machine Learning?
1.3 Types of Learning
1.4 Relations to Other Fields
1.5 How to Read This Book
1.5.1 Possible Course Plans Based on This Book
1.6 Notation
Part 1 Foundations
2 A Gentle Start
2.1 A Formal Model – The Statistical Learning Framework
2.2 Empirical Risk Minimization
2.2.1 Something May Go Wrong – Overfitting
2.3 Empirical Risk Minimization with Inductive Bias
2.3.1 Finite Hypothesis Classes
2.4 Exercises
3 A Formal Learning Model
3.1 PAC Learning
3.2 A More General Learning Model
3.2.1 Releasing the Realizability Assumption – Agnostic PAC Learning
3.2.2 The Scope of Learning Problems Modeled
3.3 Summary
3.4 Bibliographic Remarks
3.5 Exercises
4 Learning via Uniform Convergence
4.1 Uniform Convergence Is Sufficient for Learnability
4.2 Finite Classes Are Agnostic PAC Learnable
4.3 Summary
4.4 Bibliographic Remarks
4.5 Exercises
5 The Bias-Complexity Tradeoff
5.1 The No-Free-Lunch Theorem
5.1.1 No-Free-Lunch and Prior Knowledge
5.2 Error Decomposition
5.3 Summary
5.4 Bibliographic Remarks
5.5 Exercises
6 The VC-Dimension
6.1 Infinite-Size Classes Can Be Learnable
6.2 The VC-Dimension
6.3 Examples
6.3.1 Threshold Functions
6.3.2 Intervals
6.3.3 Axis Aligned Rectangles
6.3.4 Finite Classes
6.3.5 VC-Dimension and the Number of Parameters
6.4 The Fundamental Theorem of PAC Learning
6.5 Proof of Theorem 6.7
6.5.1 Sauer's Lemma and the Growth Function
6.5.2 Uniform Convergence for Classes of Small Effective Size
6.6 Summary
6.7 Bibliographic Remarks
6.8 Exercises
7 Nonuniform Learnability
7.1 Nonuniform Learnability
7.1.1 Characterizing Nonuniform Learnability
7.2 Structural Risk Minimization
7.3 Minimum Description Length and Occam's Razor
7.3.1 Occam's Razor
7.4 Other Notions of Learnability – Consistency
7.5 Discussing the Different Notions of Learnability
7.5.1 The No-Free-Lunch Theorem Revisited
7.6 Summary
7.7 Bibliographic Remarks
7.8 Exercises
8 The Runtime of Learning
8.1 Computational Complexity of Learning
8.1.1 Formal Definition*
8.2 Implementing the ERM Rule
8.2.1 Finite Classes
8.2.2 Axis Aligned Rectangles
8.2.3 Boolean Conjunctions
8.2.4 Learning 3-Term DNF
8.3 Efficiently Learnable, but Not by a Proper ERM
8.4 Hardness of Learning*
8.5 Summary
8.6 Bibliographic Remarks
8.7 Exercises
Part 2 From Theory to Algorithms
9 Linear Predictors
9.1 Halfspaces
9.1.1 Linear Programming for the Class of Halfspaces
9.1.2 Perceptron for Halfspaces
9.1.3 The VC Dimension of Halfspaces
9.2 Linear Regression
9.2.1 Least Squares
9.2.2 Linear Regression for Polynomial Regression Tasks
9.3 Logistic Regression
9.4 Summary
9.5 Bibliographic Remarks
9.6 Exercises
10 Boosting
10.1 Weak Learnability
10.1.1 Efficient Implementation of ERM for Decision Stumps
10.2 AdaBoost
10.3 Linear Combinations of Base Hypotheses
10.3.1 The VC-Dimension of L(B,T)
10.4 AdaBoost for Face Recognition
10.5 Summary
10.6 Bibliographic Remarks
10.7 Exercises
11 Model Selection and Validation
11.1 Model Selection Using SRM
11.2 Validation
11.2.1 Hold Out Set
11.2.2 Validation for Model Selection
11.2.3 The Model-Selection Curve
11.2.4 k-Fold Cross Validation
11.2.5 Train-Validation-Test Split
11.3 What to Do If Learning Fails
11.4 Summary
11.5 Exercises
12 Convex Learning Problems
12.1 Convexity, Lipschitzness, and Smoothness
12.1.1 Convexity
12.1.2 Lipschitzness
12.1.3 Smoothness
12.2 Convex Learning Problems
12.2.1 Learnability of Convex Learning Problems
12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems
12.3 Surrogate Loss Functions
12.4 Summary
12.5 Bibliographic Remarks
12.6 Exercises
13 Regularization and Stability
13.1 Regularized Loss Minimization
13.1.1 Ridge Regression
13.2 Stable Rules Do Not Overfit
13.3 Tikhonov Regularization as a Stabilizer
13.3.1 Lipschitz Loss
13.3.2 Smooth and Nonnegative Loss
13.4 Controlling the Fitting-Stability Tradeoff
13.5 Summary
13.6 Bibliographic Remarks
13.7 Exercises
14 Stochastic Gradient Descent
14.1 Gradient Descent
14.1.1 Analysis of GD for Convex-Lipschitz Functions
14.2 Subgradients
14.2.1 Calculating Subgradients
14.2.2 Subgradients of Lipschitz Functions
14.2.3 Subgradient Descent
14.3 Stochastic Gradient Descent (SGD)
14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions
14.4 Variants
14.4.1 Adding a Projection Step
14.4.2 Variable Step Size
14.4.3 Other Averaging Techniques
14.4.4 Strongly Convex Functions*
14.5 Learning with SGD
14.5.1 SGD for Risk Minimization
14.5.2 Analyzing SGD for Convex-Smooth Learning Problems
14.5.3 SGD for Regularized Loss Minimization
14.6 Summary
14.7 Bibliographic Remarks
14.8 Exercises
15 Support Vector Machines
15.1 Margin and Hard-SVM
15.1.1 The Homogeneous Case
15.1.2 The Sample Complexity of Hard-SVM
15.2 Soft-SVM and Norm Regularization
15.2.1 The Sample Complexity of Soft-SVM
15.2.2 Margin and Norm-Based Bounds versus Dimension
15.2.3 The Ramp Loss*
15.3 Optimality Conditions and "Support Vectors"*
15.4 Duality*
15.5 Implementing Soft-SVM Using SGD
15.6 Summary
15.7 Bibliographic Remarks
15.8 Exercises
16 Kernel Methods
16.1 Embeddings into Feature Spaces
16.2 The Kernel Trick
16.2.1 Kernels as a Way to Express Prior Knowledge
16.2.2 Characterizing Kernel Functions*
16.3 Implementing Soft-SVM with Kernels
16.4 Summary
16.5 Bibliographic Remarks
16.6 Exercises
17 Multiclass, Ranking, and Complex Prediction Problems
17.1 One-versus-All and All-Pairs
17.2 Linear Multiclass Predictors
17.2.1 How to Construct Ψ
17.2.2 Cost-Sensitive Classification
17.2.3 ERM
17.2.4 Generalized Hinge Loss
17.2.5 Multiclass SVM and SGD
17.3 Structured Output Prediction
17.4 Ranking
17.4.1 Linear Predictors for Ranking
17.5 Bipartite Ranking and Multivariate Performance Measures
17.5.1 Linear Predictors for Bipartite Ranking
17.6 Summary
17.7 Bibliographic Remarks
17.8 Exercises
18 Decision Trees
18.1 Sample Complexity
18.2 Decision Tree Algorithms
18.2.1 Implementations of the Gain Measure
18.2.2 Pruning
18.2.3 Threshold-Based Splitting Rules for Real-Valued Features
18.3 Random Forests
18.4 Summary
18.5 Bibliographic Remarks
18.6 Exercises
19 Nearest Neighbor
19.1 k Nearest Neighbors
19.2 Analysis
19.2.1 A Generalization Bound for the 1-NN Rule
19.2.2 The "Curse of Dimensionality"
19.3 Efficient Implementation*
19.4 Summary
19.5 Bibliographic Remarks
19.6 Exercises
20 Neural Networks
20.1 Feedforward Neural Networks
20.2 Learning Neural Networks
20.3 The Expressive Power of Neural Networks
20.3.1 Geometric Intuition
20.4 The Sample Complexity of Neural Networks
20.5 The Runtime of Learning Neural Networks
20.6 SGD and Backpropagation
20.7 Summary
20.8 Bibliographic Remarks
20.9 Exercises
Part 3 Additional Learning Models
21 Online Learning
21.1 Online Classification in the Realizable Case
21.1.1 Online Learnability
21.2 Online Classification in the Unrealizable Case
21.2.1 Weighted-Majority
21.3 Online Convex Optimization
21.4 The Online Perceptron Algorithm
21.5 Summary
21.6 Bibliographic Remarks
21.7 Exercises
22 Clustering
22.1 Linkage-Based Clustering Algorithms
22.2 k-Means and Other Cost Minimization Clusterings
22.2.1 The k-Means Algorithm
22.3 Spectral Clustering
22.3.1 Graph Cut
22.3.2 Graph Laplacian and Relaxed Graph Cuts
22.3.3 Unnormalized Spectral Clustering
22.4 Information Bottleneck*
22.5 A High Level View of Clustering
22.6 Summary
22.7 Bibliographic Remarks
22.8 Exercises
23 Dimensionality Reduction
23.1 Principal Component Analysis (PCA)
23.1.1 A More Efficient Solution for the Case d ≫ m
23.1.2 Implementation and Demonstration
23.2 Random Projections
23.3 Compressed Sensing
23.3.1 Proofs*
23.4 PCA or Compressed Sensing?
23.5 Summary
23.6 Bibliographic Remarks
23.7 Exercises
24 Generative Models
24.1 Maximum Likelihood Estimator
24.1.1 Maximum Likelihood Estimation for Continuous Random Variables
24.1.2 Maximum Likelihood and Empirical Risk Minimization
24.1.3 Generalization Analysis
24.2 Naive Bayes
24.3 Linear Discriminant Analysis
24.4 Latent Variables and the EM Algorithm
24.4.1 EM as an Alternate Maximization Algorithm
24.4.2 EM for Mixture of Gaussians (Soft k-Means)
24.5 Bayesian Reasoning
24.6 Summary
24.7 Bibliographic Remarks
24.8 Exercises
25 Feature Selection and Generation
25.1 Feature Selection
25.1.1 Filters
25.1.2 Greedy Selection Approaches
25.1.3 Sparsity-Inducing Norms
25.2 Feature Manipulation and Normalization
25.2.1 Examples of Feature Transformations
25.3 Feature Learning
25.3.1 Dictionary Learning Using Auto-Encoders
25.4 Summary
25.5 Bibliographic Remarks
25.6 Exercises
Part 4 Advanced Theory
26 Rademacher Complexities
26.1 The Rademacher Complexity
26.1.1 Rademacher Calculus
26.2 Rademacher Complexity of Linear Classes
26.3 Generalization Bounds for SVM
26.4 Generalization Bounds for Predictors with Low ℓ1 Norm
26.5 Bibliographic Remarks
27 Covering Numbers
27.1 Covering
27.1.1 Properties
27.2 From Covering to Rademacher Complexity via Chaining
27.3 Bibliographic Remarks
28 Proof of the Fundamental Theorem of Learning Theory
28.1 The Upper Bound for the Agnostic Case
28.2 The Lower Bound for the Agnostic Case
28.2.1 Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε²
28.2.2 Showing That m(ε, 1/8) ≥ 8d/ε²
28.3 The Upper Bound for the Realizable Case
28.3.1 From ε-Nets to PAC Learnability
29 Multiclass Learnability
29.1 The Natarajan Dimension
29.2 The Multiclass Fundamental Theorem
29.2.1 On the Proof of Theorem 29.3
29.3 Calculating the Natarajan Dimension
29.3.1 One-vs.-All Based Classes
29.3.2 General Multiclass-to-Binary Reductions
29.3.3 Linear Multiclass Predictors
29.4 On Good and Bad ERMs
29.5 Bibliographic Remarks
29.6 Exercises
30 Compression Bounds
30.1 Compression Bounds
30.2 Examples
30.2.1 Axis Aligned Rectangles
30.2.2 Halfspaces
30.2.3 Separating Polynomials
30.2.4 Separation with Margin
30.3 Bibliographic Remarks
31 PAC-Bayes
31.1 PAC-Bayes Bounds
31.2 Bibliographic Remarks
31.3 Exercises
A Technical Lemmas
B Measure Concentration
B.1 Markov's Inequality
B.2 Chebyshev's Inequality
B.3 Chernoff's Bounds
B.4 Hoeffding's Inequality
B.5 Bennett's and Bernstein's Inequalities
B.5.1 Application
B.6 Slud's Inequality
B.7 Concentration of χ² Variables
C Linear Algebra
C.1 Basic Definitions
C.2 Eigenvalues and Eigenvectors
C.3 Positive Definite Matrices
C.4 Singular Value Decomposition (SVD)
References
Index
Understanding Machine Learning

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and nonexpert readers in statistics, computer science, mathematics, and engineering.

Shai Shalev-Shwartz is an Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel.

Shai Ben-David is a Professor in the School of Computer Science at the University of Waterloo, Canada.
UNDERSTANDING MACHINE LEARNING
From Theory to Algorithms

Shai Shalev-Shwartz
The Hebrew University, Jerusalem

Shai Ben-David
University of Waterloo, Canada
32 Avenue of the Americas, New York, NY 10013-2473, USA

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107057135

© Shai Shalev-Shwartz and Shai Ben-David 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2014
Printed in the United States of America

A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication Data

ISBN 978-1-107-05713-5 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication, and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Triple-S dedicates the book to triple-M