Preface
Festschrift
Credits
Contents
List of Contributors
Acronyms
Part I History of Statistical Learning Theory
Reference
1 In Hindsight: Doklady Akademii Nauk SSSR, 181(4), 1968
References
2 On the Uniform Convergence of the Frequencies of Occurrence of Events to Their Probabilities
2.1 Introduction
2.2 Statement of the Problem
2.3 Some Additional Definitions
2.4 A Property of the Growth Function
2.5 Sufficient Conditions for Uniform Convergence Not Depending on Properties of the Distribution
2.6 Necessary and Sufficient Conditions for the Uniform Convergence of Frequencies to Probabilities
Reference
3 Early History of Support Vector Machines
3.1 The "Generalised Portrait" Method for Pattern Recognition (1962)
3.1.1 Initial Heuristic Ideas
3.1.2 Reducing to a Convex (Quadratic) Programming Problem
3.1.3 Generalisation of the Idea
3.2 The Kuhn–Tucker Theorem and Its Application to the "Generalised Portrait" Method
3.3 Searching for Coefficients Instead of Coordinates
3.4 Optimum (Largest Margin) Hyperplane
3.5 Lagrangian Dual Function
3.6 Generalisation Properties
3.7 Support Vector Machines
References
Part II Theory and Practice of Statistical Learning Theory
4 Some Remarks on the Statistical Analysis of SVMs and Related Methods
4.1 Introduction
4.2 Mathematical Prerequisites
4.3 Universal Consistency
4.3.1 Binary Classification
4.3.2 Least Squares Regression
4.4 Learning Rates
4.4.1 Binary Classification
4.4.2 Least Squares Regression
References
5 Explaining AdaBoost
5.1 Introduction
5.2 Direct Application of VC Theory
5.3 The Margins Explanation
5.4 Loss Minimization
5.5 Regularization
5.6 Inherently Unpredictable Data
5.7 Conclusions
References
6 On the Relations and Differences Between Popper Dimension, Exclusion Dimension and VC-Dimension
6.1 Introduction
6.2 Setting
6.3 Definitions with Logical Quantifiers
6.4 Relations
6.5 Discussion
References
7 On Learnability, Complexity and Stability
7.1 Introduction
7.2 Supervised Learning, Consistency and Learnability
7.2.1 Consistency and Learnability
7.3 Learnability of a Hypothesis Space
7.3.1 Complexity and Learnability
7.3.2 Stability and Learnability
7.4 Learnability in the General Learning Setting
7.4.1 Vapnik's Approach and Non-trivial Consistency
7.4.2 Stability and Learnability for General Learning
7.5 Discussion
References
8 Loss Functions
8.1 Introduction
8.2 Proper Losses and Their Representations
8.3 Divergences and the Bridge to Proper Losses
8.4 Surrogate Losses
8.5 Extension to Multiclass Losses and Divergences
8.6 Parameterisations, Links and Convexity
8.7 Effect of Losses on Statistical Convergence
8.8 Conclusion
References
9 Statistical Learning Theory in Practice
9.1 Introduction
9.2 Practical Issues with the Classical Approach
9.3 A Short Review of Existing Learning Models
9.3.1 Linear Models
9.3.2 Embedding Models
9.3.3 Nearest Neighbour
9.3.4 SVMs and Kernel Methods
9.3.5 Neural Networks
9.3.6 Further Methods
9.4 Practical Challenges on Real Datasets
References
10 PAC-Bayesian Theory
10.1 Introduction
10.2 Basic PAC-Bayesian Theorems
10.3 Catoni's Basic Inequality
10.4 Catoni's Localization Theorem
10.5 Discussion
References
11 Kernel Ridge Regression
11.1 Introduction
11.2 Kernel Ridge Regression
11.3 Kernel Ridge Regression Without Probability
11.3.1 Ordinary Ridge Regression
11.3.2 Kernel Ridge Regression
11.4 Kernel Ridge Regression in Conformal Prediction
References
12 Multi-task Learning for Computational Biology: Overview and Outlook
12.1 Introduction
12.2 Multi-task Learning
12.2.1 Relation to Transfer Learning
12.2.2 Regularization-Based Multi-task Learning
12.2.2.1 Common Approaches
12.2.3 Learning Task Similarities
12.2.3.1 A Simple Approach
12.2.3.2 Multi-task Relationship Learning (MTRL)
12.2.3.3 Multi-task Multiple Kernel Learning (MT-MKL)
12.2.3.4 Hierarchical MT-MKL
12.2.4 When Does MTL Pay Off?
12.3 Application in Computational Biology
12.4 Conclusion
References
13 Semi-supervised Learning in Causal and Anticausal Settings
13.1 Introduction
13.2 Causal Inference
13.3 Semi-supervised Learning
13.4 Empirical Results
13.5 Conclusion
References
14 Strong Universal Consistent Estimate of the Minimum Mean Squared Error
14.1 Introduction
14.2 Strong Universal Consistency
14.3 Rate of Convergence
References
15 The Median Hypothesis
15.1 Introduction
15.2 Relation to Previous Work
15.3 The Hypothesis Depth: Definitions and Properties
15.3.1 Depth for Linear Classifiers
15.3.2 Breakdown Point
15.4 Measuring Depth
15.4.1 Finding the Median
15.4.2 Implementation Issues
15.5 Discussion
References
16 Efficient Transductive Online Learning via Randomized Rounding
16.1 Introduction
16.2 The Minimax Forecaster
16.3 The R2 Forecaster
16.4 Application 1: Transductive Online Learning
16.5 Application 2: Online Collaborative Filtering
Appendix: Derivation of the Minimax Forecaster
References
17 Pivotal Estimation in High-Dimensional Regression via Linear Programming
17.1 Introduction
17.2 Notation
17.3 The Estimator
17.4 Sensitivity Characteristics
17.5 Bounds on the Estimation and Prediction Errors
References
18 On Sparsity Inducing Regularization Methods for Machine Learning
18.1 Introduction
18.2 Fixed Point Algorithms Based on Proximity Operators
18.2.1 Notation and Problem Formulation
18.2.2 Computation of a Generalized Proximity Operator with a Fixed Point Method
18.2.3 Connection to the Forward-Backward Algorithm
18.3 Examples of Composite Functions
18.4 Application to Support Vector Machines
18.5 Conclusion
References
19 Sharp Oracle Inequalities in Low Rank Estimation
19.1 Main Result
19.2 Proof of Theorem 19.1
References
20 On the Consistency of the Bootstrap Approach for Support Vector Machines and Related Kernel-Based Methods
20.1 Introduction
20.2 Support Vector Machines
20.3 Consistency of Bootstrap SVMs
20.4 Proofs
20.4.1 Tools for the Proof of Theorem 20.1
20.4.2 Proof of Theorem 20.1
References
21 Kernels, Pre-images and Optimization
21.1 Introduction
21.2 The Kernel Trick
21.3 Kernel PCA
21.4 Pre-images
21.5 Pre-images for Gradient-Based Optimization
21.5.1 Example: Shape Optimization
21.5.2 Origin of the "Noise"
21.5.3 Optimization Constrained to the Data Manifold
21.5.4 Applications in Density Functional Theory
21.6 Conclusion
References
22 Efficient Learning of Sparse Ranking Functions
22.1 Introduction
22.2 Problem Setting
22.3 An Efficient Coordinate Descent Algorithm
22.4 Extensions
22.5 Experiments
22.6 Conclusions
References
23 Direct Approximation of Divergences Between Probability Distributions
23.1 Introduction
23.2 Divergence Measures
23.3 Direct Divergence Approximation
23.4 Usage of Divergence Estimators in Machine Learning
23.5 Conclusions
References
Index