
Understanding Machine Learning - From Theory to Algorithms.pdf

Cover
Halftitle
Title
Copyright
Dedication
Contents
Preface
1 Introduction
1.1 What Is Learning?
1.2 When Do We Need Machine Learning?
1.3 Types of Learning
1.4 Relations to Other Fields
1.5 How to Read This Book
1.5.1 Possible Course Plans Based on This Book
1.6 Notation
Part 1 Foundations
2 A Gentle Start
2.1 A Formal Model – The Statistical Learning Framework
2.2 Empirical Risk Minimization
2.2.1 Something May Go Wrong – Overfitting
2.3 Empirical Risk Minimization with Inductive Bias
2.3.1 Finite Hypothesis Classes
2.4 Exercises
3 A Formal Learning Model
3.1 PAC Learning
3.2 A More General Learning Model
3.2.1 Releasing the Realizability Assumption – Agnostic PAC Learning
3.2.2 The Scope of Learning Problems Modeled
3.3 Summary
3.4 Bibliographic Remarks
3.5 Exercises
4 Learning via Uniform Convergence
4.1 Uniform Convergence Is Sufficient for Learnability
4.2 Finite Classes Are Agnostic PAC Learnable
4.3 Summary
4.4 Bibliographic Remarks
4.5 Exercises
5 The Bias-Complexity Tradeoff
5.1 The No-Free-Lunch Theorem
5.1.1 No-Free-Lunch and Prior Knowledge
5.2 Error Decomposition
5.3 Summary
5.4 Bibliographic Remarks
5.5 Exercises
6 The VC-Dimension
6.1 Infinite-Size Classes Can Be Learnable
6.2 The VC-Dimension
6.3 Examples
6.3.1 Threshold Functions
6.3.2 Intervals
6.3.3 Axis Aligned Rectangles
6.3.4 Finite Classes
6.3.5 VC-Dimension and the Number of Parameters
6.4 The Fundamental Theorem of PAC Learning
6.5 Proof of Theorem 6.7
6.5.1 Sauer's Lemma and the Growth Function
6.5.2 Uniform Convergence for Classes of Small Effective Size
6.6 Summary
6.7 Bibliographic Remarks
6.8 Exercises
7 Nonuniform Learnability
7.1 Nonuniform Learnability
7.1.1 Characterizing Nonuniform Learnability
7.2 Structural Risk Minimization
7.3 Minimum Description Length and Occam's Razor
7.3.1 Occam's Razor
7.4 Other Notions of Learnability – Consistency
7.5 Discussing the Different Notions of Learnability
7.5.1 The No-Free-Lunch Theorem Revisited
7.6 Summary
7.7 Bibliographic Remarks
7.8 Exercises
8 The Runtime of Learning
8.1 Computational Complexity of Learning
8.1.1 Formal Definition*
8.2 Implementing the ERM Rule
8.2.1 Finite Classes
8.2.2 Axis Aligned Rectangles
8.2.3 Boolean Conjunctions
8.2.4 Learning 3-Term DNF
8.3 Efficiently Learnable, but Not by a Proper ERM
8.4 Hardness of Learning*
8.5 Summary
8.6 Bibliographic Remarks
8.7 Exercises
Part 2 From Theory to Algorithms
9 Linear Predictors
9.1 Halfspaces
9.1.1 Linear Programming for the Class of Halfspaces
9.1.2 Perceptron for Halfspaces
9.1.3 The VC Dimension of Halfspaces
9.2 Linear Regression
9.2.1 Least Squares
9.2.2 Linear Regression for Polynomial Regression Tasks
9.3 Logistic Regression
9.4 Summary
9.5 Bibliographic Remarks
9.6 Exercises
10 Boosting
10.1 Weak Learnability
10.1.1 Efficient Implementation of ERM for Decision Stumps
10.2 AdaBoost
10.3 Linear Combinations of Base Hypotheses
10.3.1 The VC-Dimension of L(B,T)
10.4 AdaBoost for Face Recognition
10.5 Summary
10.6 Bibliographic Remarks
10.7 Exercises
11 Model Selection and Validation
11.1 Model Selection Using SRM
11.2 Validation
11.2.1 Hold Out Set
11.2.2 Validation for Model Selection
11.2.3 The Model-Selection Curve
11.2.4 k-Fold Cross Validation
11.2.5 Train-Validation-Test Split
11.3 What to Do If Learning Fails
11.4 Summary
11.5 Exercises
12 Convex Learning Problems
12.1 Convexity, Lipschitzness, and Smoothness
12.1.1 Convexity
12.1.2 Lipschitzness
12.1.3 Smoothness
12.2 Convex Learning Problems
12.2.1 Learnability of Convex Learning Problems
12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems
12.3 Surrogate Loss Functions
12.4 Summary
12.5 Bibliographic Remarks
12.6 Exercises
13 Regularization and Stability
13.1 Regularized Loss Minimization
13.1.1 Ridge Regression
13.2 Stable Rules Do Not Overfit
13.3 Tikhonov Regularization as a Stabilizer
13.3.1 Lipschitz Loss
13.3.2 Smooth and Nonnegative Loss
13.4 Controlling the Fitting-Stability Tradeoff
13.5 Summary
13.6 Bibliographic Remarks
13.7 Exercises
14 Stochastic Gradient Descent
14.1 Gradient Descent
14.1.1 Analysis of GD for Convex-Lipschitz Functions
14.2 Subgradients
14.2.1 Calculating Subgradients
14.2.2 Subgradients of Lipschitz Functions
14.2.3 Subgradient Descent
14.3 Stochastic Gradient Descent (SGD)
14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions
14.4 Variants
14.4.1 Adding a Projection Step
14.4.2 Variable Step Size
14.4.3 Other Averaging Techniques
14.4.4 Strongly Convex Functions*
14.5 Learning with SGD
14.5.1 SGD for Risk Minimization
14.5.2 Analyzing SGD for Convex-Smooth Learning Problems
14.5.3 SGD for Regularized Loss Minimization
14.6 Summary
14.7 Bibliographic Remarks
14.8 Exercises
15 Support Vector Machines
15.1 Margin and Hard-SVM
15.1.1 The Homogeneous Case
15.1.2 The Sample Complexity of Hard-SVM
15.2 Soft-SVM and Norm Regularization
15.2.1 The Sample Complexity of Soft-SVM
15.2.2 Margin and Norm-Based Bounds versus Dimension
15.2.3 The Ramp Loss*
15.3 Optimality Conditions and "Support Vectors"*
15.4 Duality*
15.5 Implementing Soft-SVM Using SGD
15.6 Summary
15.7 Bibliographic Remarks
15.8 Exercises
16 Kernel Methods
16.1 Embeddings into Feature Spaces
16.2 The Kernel Trick
16.2.1 Kernels as a Way to Express Prior Knowledge
16.2.2 Characterizing Kernel Functions*
16.3 Implementing Soft-SVM with Kernels
16.4 Summary
16.5 Bibliographic Remarks
16.6 Exercises
17 Multiclass, Ranking, and Complex Prediction Problems
17.1 One-versus-All and All-Pairs
17.2 Linear Multiclass Predictors
17.2.1 How to Construct Ψ
17.2.2 Cost-Sensitive Classification
17.2.3 ERM
17.2.4 Generalized Hinge Loss
17.2.5 Multiclass SVM and SGD
17.3 Structured Output Prediction
17.4 Ranking
17.4.1 Linear Predictors for Ranking
17.5 Bipartite Ranking and Multivariate Performance Measures
17.5.1 Linear Predictors for Bipartite Ranking
17.6 Summary
17.7 Bibliographic Remarks
17.8 Exercises
18 Decision Trees
18.1 Sample Complexity
18.2 Decision Tree Algorithms
18.2.1 Implementations of the Gain Measure
18.2.2 Pruning
18.2.3 Threshold-Based Splitting Rules for Real-Valued Features
18.3 Random Forests
18.4 Summary
18.5 Bibliographic Remarks
18.6 Exercises
19 Nearest Neighbor
19.1 k Nearest Neighbors
19.2 Analysis
19.2.1 A Generalization Bound for the 1-NN Rule
19.2.2 The "Curse of Dimensionality"
19.3 Efficient Implementation*
19.4 Summary
19.5 Bibliographic Remarks
19.6 Exercises
20 Neural Networks
20.1 Feedforward Neural Networks
20.2 Learning Neural Networks
20.3 The Expressive Power of Neural Networks
20.3.1 Geometric Intuition
20.4 The Sample Complexity of Neural Networks
20.5 The Runtime of Learning Neural Networks
20.6 SGD and Backpropagation
20.7 Summary
20.8 Bibliographic Remarks
20.9 Exercises
Part 3 Additional Learning Models
21 Online Learning
21.1 Online Classification in the Realizable Case
21.1.1 Online Learnability
21.2 Online Classification in the Unrealizable Case
21.2.1 Weighted-Majority
21.3 Online Convex Optimization
21.4 The Online Perceptron Algorithm
21.5 Summary
21.6 Bibliographic Remarks
21.7 Exercises
22 Clustering
22.1 Linkage-Based Clustering Algorithms
22.2 k-Means and Other Cost Minimization Clusterings
22.2.1 The k-Means Algorithm
22.3 Spectral Clustering
22.3.1 Graph Cut
22.3.2 Graph Laplacian and Relaxed Graph Cuts
22.3.3 Unnormalized Spectral Clustering
22.4 Information Bottleneck*
22.5 A High Level View of Clustering
22.6 Summary
22.7 Bibliographic Remarks
22.8 Exercises
23 Dimensionality Reduction
23.1 Principal Component Analysis (PCA)
23.1.1 A More Efficient Solution for the Case d ≫ m
23.1.2 Implementation and Demonstration
23.2 Random Projections
23.3 Compressed Sensing
23.3.1 Proofs*
23.4 PCA or Compressed Sensing?
23.5 Summary
23.6 Bibliographic Remarks
23.7 Exercises
24 Generative Models
24.1 Maximum Likelihood Estimator
24.1.1 Maximum Likelihood Estimation for Continuous Random Variables
24.1.2 Maximum Likelihood and Empirical Risk Minimization
24.1.3 Generalization Analysis
24.2 Naive Bayes
24.3 Linear Discriminant Analysis
24.4 Latent Variables and the EM Algorithm
24.4.1 EM as an Alternate Maximization Algorithm
24.4.2 EM for Mixture of Gaussians (Soft k-Means)
24.5 Bayesian Reasoning
24.6 Summary
24.7 Bibliographic Remarks
24.8 Exercises
25 Feature Selection and Generation
25.1 Feature Selection
25.1.1 Filters
25.1.2 Greedy Selection Approaches
25.1.3 Sparsity-Inducing Norms
25.2 Feature Manipulation and Normalization
25.2.1 Examples of Feature Transformations
25.3 Feature Learning
25.3.1 Dictionary Learning Using Auto-Encoders
25.4 Summary
25.5 Bibliographic Remarks
25.6 Exercises
Part 4 Advanced Theory
26 Rademacher Complexities
26.1 The Rademacher Complexity
26.1.1 Rademacher Calculus
26.2 Rademacher Complexity of Linear Classes
26.3 Generalization Bounds for SVM
26.4 Generalization Bounds for Predictors with Low ℓ1 Norm
26.5 Bibliographic Remarks
27 Covering Numbers
27.1 Covering
27.1.1 Properties
27.2 From Covering to Rademacher Complexity via Chaining
27.3 Bibliographic Remarks
28 Proof of the Fundamental Theorem of Learning Theory
28.1 The Upper Bound for the Agnostic Case
28.2 The Lower Bound for the Agnostic Case
28.2.1 Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε²
28.2.2 Showing That m(ε, 1/8) ≥ 8d/ε²
28.3 The Upper Bound for the Realizable Case
28.3.1 From ε-Nets to PAC Learnability
29 Multiclass Learnability
29.1 The Natarajan Dimension
29.2 The Multiclass Fundamental Theorem
29.2.1 On the Proof of Theorem 29.3
29.3 Calculating the Natarajan Dimension
29.3.1 One-vs.-All Based Classes
29.3.2 General Multiclass-to-Binary Reductions
29.3.3 Linear Multiclass Predictors
29.4 On Good and Bad ERMs
29.5 Bibliographic Remarks
29.6 Exercises
30 Compression Bounds
30.1 Compression Bounds
30.2 Examples
30.2.1 Axis Aligned Rectangles
30.2.2 Halfspaces
30.2.3 Separating Polynomials
30.2.4 Separation with Margin
30.3 Bibliographic Remarks
31 PAC-Bayes
31.1 PAC-Bayes Bounds
31.2 Bibliographic Remarks
31.3 Exercises
A Technical Lemmas
B Measure Concentration
B.1 Markov's Inequality
B.2 Chebyshev's Inequality
B.3 Chernoff's Bounds
B.4 Hoeffding's Inequality
B.5 Bennett's and Bernstein's Inequalities
B.5.1 Application
B.6 Slud's Inequality
B.7 Concentration of χ² Variables
C Linear Algebra
C.1 Basic Definitions
C.2 Eigenvalues and Eigenvectors
C.3 Positive Definite Matrices
C.4 Singular Value Decomposition (SVD)
References
Index
Understanding Machine Learning

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and nonexpert readers in statistics, computer science, mathematics, and engineering.

Shai Shalev-Shwartz is an Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel.

Shai Ben-David is a Professor in the School of Computer Science at the University of Waterloo, Canada.
UNDERSTANDING MACHINE LEARNING
From Theory to Algorithms

Shai Shalev-Shwartz
The Hebrew University, Jerusalem

Shai Ben-David
University of Waterloo, Canada
32 Avenue of the Americas, New York, NY 10013-2473, USA

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107057135

© Shai Shalev-Shwartz and Shai Ben-David 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2014
Printed in the United States of America

A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication Data

ISBN 978-1-107-05713-5 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication, and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Triple-S dedicates the book to triple-M