Statistical Pattern Recognition
Contents
Preface
Notation
1 Introduction to Statistical Pattern Recognition
1.1 Statistical Pattern Recognition
1.1.1 Introduction
1.1.2 The Basic Model
1.2 Stages in a Pattern Recognition Problem
1.3 Issues
1.4 Approaches to Statistical Pattern Recognition
1.5 Elementary Decision Theory
1.5.1 Bayes’ Decision Rule for Minimum Error
1.5.2 Bayes’ Decision Rule for Minimum Error – Reject Option
1.5.3 Bayes’ Decision Rule for Minimum Risk
1.5.4 Bayes’ Decision Rule for Minimum Risk – Reject Option
1.5.5 Neyman–Pearson Decision Rule
1.5.6 Minimax Criterion
1.5.7 Discussion
1.6 Discriminant Functions
1.6.1 Introduction
1.6.2 Linear Discriminant Functions
1.6.3 Piecewise Linear Discriminant Functions
1.6.4 Generalised Linear Discriminant Function
1.6.5 Summary
1.7 Multiple Regression
1.8 Outline of Book
1.9 Notes and References
Exercises
2 Density Estimation – Parametric
2.1 Introduction
2.2 Estimating the Parameters of the Distributions
2.2.1 Estimative Approach
2.2.2 Predictive Approach
2.3 The Gaussian Classifier
2.3.1 Specification
2.3.2 Derivation of the Gaussian Classifier Plug-In Estimates
2.3.3 Example Application Study
2.4 Dealing with Singularities in the Gaussian Classifier
2.4.1 Introduction
2.4.2 Naïve Bayes
2.4.3 Projection onto a Subspace
2.4.4 Linear Discriminant Function
2.4.5 Regularised Discriminant Analysis
2.4.6 Example Application Study
2.4.7 Further Developments
2.4.8 Summary
2.5 Finite Mixture Models
2.5.1 Introduction
2.5.2 Mixture Models for Discrimination
2.5.3 Parameter Estimation for Normal Mixture Models
2.5.4 Normal Mixture Model Covariance Matrix Constraints
2.5.5 How Many Components?
2.5.6 Maximum Likelihood Estimation via EM
2.5.7 Example Application Study
2.5.8 Further Developments
2.5.9 Summary
2.6 Application Studies
2.7 Summary and Discussion
2.8 Recommendations
2.9 Notes and References
Exercises
3 Density Estimation – Bayesian
3.1 Introduction
3.1.1 Basics
3.1.2 Recursive Calculation
3.1.3 Proportionality
3.2 Analytic Solutions
3.2.1 Conjugate Priors
3.2.2 Estimating the Mean of a Normal Distribution with Known Variance
3.2.3 Estimating the Mean and the Covariance Matrix of a Multivariate Normal Distribution
3.2.4 Unknown Prior Class Probabilities
3.2.5 Summary
3.3 Bayesian Sampling Schemes
3.3.1 Introduction
3.3.2 Summarisation
3.3.3 Sampling Version of the Bayesian Classifier
3.3.4 Rejection Sampling
3.3.5 Ratio of Uniforms
3.3.6 Importance Sampling
3.4 Markov Chain Monte Carlo Methods
3.4.1 Introduction
3.4.2 The Gibbs Sampler
3.4.3 Metropolis–Hastings Algorithm
3.4.4 Data Augmentation
3.4.5 Reversible Jump Markov Chain Monte Carlo
3.4.6 Slice Sampling
3.4.7 MCMC Example – Estimation of Noisy Sinusoids
3.4.8 Summary
3.4.9 Notes and References
3.5 Bayesian Approaches to Discrimination
3.5.1 Labelled Training Data
3.5.2 Unlabelled Training Data
3.6 Sequential Monte Carlo Samplers
3.6.1 Introduction
3.6.2 Basic Methodology
3.6.3 Summary
3.7 Variational Bayes
3.7.1 Introduction
3.7.2 Description
3.7.3 Factorised Variational Approximation
3.7.4 Simple Example
3.7.5 Use of the Procedure for Model Selection
3.7.6 Further Developments and Applications
3.7.7 Summary
3.8 Approximate Bayesian Computation
3.8.1 Introduction
3.8.2 ABC Rejection Sampling
3.8.3 ABC MCMC Sampling
3.8.4 ABC Population Monte Carlo Sampling
3.8.5 Model Selection
3.8.6 Summary
3.9 Example Application Study
3.10 Application Studies
3.11 Summary and Discussion
3.12 Recommendations
3.13 Notes and References
Exercises
4 Density Estimation – Nonparametric
4.1 Introduction
4.1.1 Basic Properties of Density Estimators
4.2 k-Nearest-Neighbour Method
4.2.1 k-Nearest-Neighbour Classifier
4.2.2 Derivation
4.2.3 Choice of Distance Metric
4.2.4 Properties of the Nearest-Neighbour Rule
4.2.5 Linear Approximating and Eliminating Search Algorithm
4.2.6 Branch and Bound Search Algorithms: kd-Trees
4.2.7 Branch and Bound Search Algorithms: Ball-Trees
4.2.8 Editing Techniques
4.2.9 Example Application Study
4.2.10 Further Developments
4.2.11 Summary
4.3 Histogram Method
4.3.1 Data Adaptive Histograms
4.3.2 Independence Assumption (Naïve Bayes)
4.3.3 Lancaster Models
4.3.4 Maximum Weight Dependence Trees
4.3.5 Bayesian Networks
4.3.6 Example Application Study – Naïve Bayes Text Classification
4.3.7 Summary
4.4 Kernel Methods
4.4.1 Biasedness
4.4.2 Multivariate Extension
4.4.3 Choice of Smoothing Parameter
4.4.4 Choice of Kernel
4.4.5 Example Application Study
4.4.6 Further Developments
4.4.7 Summary
4.5 Expansion by Basis Functions
4.6 Copulas
4.6.1 Introduction
4.6.2 Mathematical Basis
4.6.3 Copula Functions
4.6.4 Estimating Copula Probability Density Functions
4.6.5 Simple Example
4.6.6 Summary
4.7 Application Studies
4.7.1 Comparative Studies
4.8 Summary and Discussion
4.9 Recommendations
4.10 Notes and References
Exercises
5 Linear Discriminant Analysis
5.1 Introduction
5.2 Two-Class Algorithms
5.2.1 General Ideas
5.2.2 Perceptron Criterion
5.2.3 Fisher’s Criterion
5.2.4 Least Mean-Squared-Error Procedures
5.2.5 Further Developments
5.2.6 Summary
5.3 Multiclass Algorithms
5.3.1 General Ideas
5.3.2 Error-Correction Procedure
5.3.3 Fisher’s Criterion – Linear Discriminant Analysis
5.3.4 Least Mean-Squared-Error Procedures
5.3.5 Regularisation
5.3.6 Example Application Study
5.3.7 Further Developments
5.3.8 Summary
5.4 Support Vector Machines
5.4.1 Introduction
5.4.2 Linearly Separable Two-Class Data
5.4.3 Linearly Nonseparable Two-Class Data
5.4.4 Multiclass SVMs
5.4.5 SVMs for Regression
5.4.6 Implementation
5.4.7 Example Application Study
5.4.8 Summary
5.5 Logistic Discrimination
5.5.1 Two-Class Case
5.5.2 Maximum Likelihood Estimation
5.5.3 Multiclass Logistic Discrimination
5.5.4 Example Application Study
5.5.5 Further Developments
5.5.6 Summary
5.6 Application Studies
5.7 Summary and Discussion
5.8 Recommendations
5.9 Notes and References
Exercises
6 Nonlinear Discriminant Analysis – Kernel and Projection Methods
6.1 Introduction
6.2 Radial Basis Functions
6.2.1 Introduction
6.2.2 Specifying the Model
6.2.3 Specifying the Functional Form
6.2.4 The Positions of the Centres
6.2.5 Smoothing Parameters
6.2.6 Calculation of the Weights
6.2.7 Model Order Selection
6.2.8 Simple RBF
6.2.9 Motivation
6.2.10 RBF Properties
6.2.11 Example Application Study
6.2.12 Further Developments
6.2.13 Summary
6.3 Nonlinear Support Vector Machines
6.3.1 Introduction
6.3.2 Binary Classification
6.3.3 Types of Kernel
6.3.4 Model Selection
6.3.5 Multiclass SVMs
6.3.6 Probability Estimates
6.3.7 Nonlinear Regression
6.3.8 Example Application Study
6.3.9 Further Developments
6.3.10 Summary
6.4 The Multilayer Perceptron
6.4.1 Introduction
6.4.2 Specifying the MLP Structure
6.4.3 Determining the MLP Weights
6.4.4 Modelling Capacity of the MLP
6.4.5 Logistic Classification
6.4.6 Example Application Study
6.4.7 Bayesian MLP Networks
6.4.8 Projection Pursuit
6.4.9 Summary
6.5 Application Studies
6.6 Summary and Discussion
6.7 Recommendations
6.8 Notes and References
Exercises
7 Rule and Decision Tree Induction
7.1 Introduction
7.2 Decision Trees
7.2.1 Introduction
7.2.2 Decision Tree Construction
7.2.3 Selection of the Splitting Rule
7.2.4 Terminating the Splitting Procedure
7.2.5 Assigning Class Labels to Terminal Nodes
7.2.6 Decision Tree Pruning – Worked Example
7.2.7 Decision Tree Construction Methods
7.2.8 Other Issues
7.2.9 Example Application Study
7.2.10 Further Developments
7.2.11 Summary
7.3 Rule Induction
7.3.1 Introduction
7.3.2 Generating Rules from a Decision Tree
7.3.3 Rule Induction Using a Sequential Covering Algorithm
7.3.4 Example Application Study
7.3.5 Further Developments
7.3.6 Summary
7.4 Multivariate Adaptive Regression Splines
7.4.1 Introduction
7.4.2 Recursive Partitioning Model
7.4.3 Example Application Study
7.4.4 Further Developments
7.4.5 Summary
7.5 Application Studies
7.6 Summary and Discussion
7.7 Recommendations
7.8 Notes and References
Exercises
8 Ensemble Methods
8.1 Introduction
8.2 Characterising a Classifier Combination Scheme
8.2.1 Feature Space
8.2.2 Level
8.2.3 Degree of Training
8.2.4 Form of Component Classifiers
8.2.5 Structure
8.2.6 Optimisation
8.3 Data Fusion
8.3.1 Architectures
8.3.2 Bayesian Approaches
8.3.3 Neyman–Pearson Formulation
8.3.4 Trainable Rules
8.3.5 Fixed Rules
8.4 Classifier Combination Methods
8.4.1 Product Rule
8.4.2 Sum Rule
8.4.3 Min, Max and Median Combiners
8.4.4 Majority Vote
8.4.5 Borda Count
8.4.6 Combiners Trained on Class Predictions
8.4.7 Stacked Generalisation
8.4.8 Mixture of Experts
8.4.9 Bagging
8.4.10 Boosting
8.4.11 Random Forests
8.4.12 Model Averaging
8.4.13 Summary of Methods
8.4.14 Example Application Study
8.4.15 Further Developments
8.5 Application Studies
8.6 Summary and Discussion
8.7 Recommendations
8.8 Notes and References
Exercises
9 Performance Assessment
9.1 Introduction
9.2 Performance Assessment
9.2.1 Performance Measures
9.2.2 Discriminability
9.2.3 Reliability
9.2.4 ROC Curves for Performance Assessment
9.2.5 Population and Sensor Drift
9.2.6 Example Application Study
9.2.7 Further Developments
9.2.8 Summary
9.3 Comparing Classifier Performance
9.3.1 Which Technique is Best?
9.3.2 Statistical Tests
9.3.3 Comparing Rules When Misclassification Costs are Uncertain
9.3.4 Example Application Study
9.3.5 Further Developments
9.3.6 Summary
9.4 Application Studies
9.5 Summary and Discussion
9.6 Recommendations
9.7 Notes and References
Exercises
10 Feature Selection and Extraction
10.1 Introduction
10.2 Feature Selection
10.2.1 Introduction
10.2.2 Characterisation of Feature Selection Approaches
10.2.3 Evaluation Measures
10.2.4 Search Algorithms for Feature Subset Selection
10.2.5 Complete Search – Branch and Bound
10.2.6 Sequential Search
10.2.7 Random Search
10.2.8 Markov Blanket
10.2.9 Stability of Feature Selection
10.2.10 Example Application Study
10.2.11 Further Developments
10.2.12 Summary
10.3 Linear Feature Extraction
10.3.1 Principal Components Analysis
10.3.2 Karhunen–Loève Transformation
10.3.3 Example Application Study
10.3.4 Further Developments
10.3.5 Summary
10.4 Multidimensional Scaling
10.4.1 Classical Scaling
10.4.2 Metric MDS
10.4.3 Ordinal Scaling
10.4.4 Algorithms
10.4.5 MDS for Feature Extraction
10.4.6 Example Application Study
10.4.7 Further Developments
10.4.8 Summary
10.5 Application Studies
10.6 Summary and Discussion
10.7 Recommendations
10.8 Notes and References
Exercises
11 Clustering
11.1 Introduction
11.2 Hierarchical Methods
11.2.1 Single-Link Method
11.2.2 Complete-Link Method
11.2.3 Sum-of-Squares Method
11.2.4 General Agglomerative Algorithm
11.2.5 Properties of a Hierarchical Classification
11.2.6 Example Application Study
11.2.7 Summary
11.3 Quick Partitions
11.4 Mixture Models
11.4.1 Model Description
11.4.2 Example Application Study
11.5 Sum-of-Squares Methods
11.5.1 Clustering Criteria
11.5.2 Clustering Algorithms
11.5.3 Vector Quantisation
11.5.4 Example Application Study
11.5.5 Further Developments
11.5.6 Summary
11.6 Spectral Clustering
11.6.1 Elementary Graph Theory
11.6.2 Similarity Matrices
11.6.3 Application to Clustering
11.6.4 Spectral Clustering Algorithm
11.6.5 Forms of Graph Laplacian
11.6.6 Example Application Study
11.6.7 Further Developments
11.6.8 Summary
11.7 Cluster Validity
11.7.1 Introduction
11.7.2 Statistical Tests
11.7.3 Absence of Class Structure
11.7.4 Validity of Individual Clusters
11.7.5 Hierarchical Clustering
11.7.6 Validation of Individual Clusterings
11.7.7 Partitions
11.7.8 Relative Criteria
11.7.9 Choosing the Number of Clusters
11.8 Application Studies
11.9 Summary and Discussion
11.10 Recommendations
11.11 Notes and References
Exercises
12 Complex Networks
12.1 Introduction
12.1.1 Characteristics
12.1.2 Properties
12.1.3 Questions to Address
12.1.4 Descriptive Features
12.1.5 Outline
12.2 Mathematics of Networks
12.2.1 Graph Matrices
12.2.2 Connectivity
12.2.3 Distance Measures
12.2.4 Weighted Networks
12.2.5 Centrality Measures
12.2.6 Random Graphs
12.3 Community Detection
12.3.1 Clustering Methods
12.3.2 Girvan–Newman Algorithm
12.3.3 Modularity Approaches
12.3.4 Local Modularity
12.3.5 Clique Percolation
12.3.6 Example Application Study
12.3.7 Further Developments
12.3.8 Summary
12.4 Link Prediction
12.4.1 Approaches to Link Prediction
12.4.2 Example Application Study
12.4.3 Further Developments
12.5 Application Studies
12.6 Summary and Discussion
12.7 Recommendations
12.8 Notes and References
Exercises
13 Additional Topics
13.1 Model Selection
13.1.1 Separate Training and Test Sets
13.1.2 Cross-Validation
13.1.3 The Bayesian Viewpoint
13.1.4 Akaike’s Information Criterion
13.1.5 Minimum Description Length
13.2 Missing Data
13.3 Outlier Detection and Robust Procedures
13.4 Mixed Continuous and Discrete Variables
13.5 Structural Risk Minimisation and the Vapnik–Chervonenkis Dimension
13.5.1 Bounds on the Expected Risk
13.5.2 The VC Dimension
References
Index