Preface
Contents
Symbols
Introduction
Motivation
Data and Knowledge
Tycho Brahe and Johannes Kepler
Intelligent Data Analysis
The Data Analysis Process
Methods, Tasks, and Tools
Problem Categories
Catalog of Methods
Available Tools
How to Read This Book
References
Practical Data Analysis: An Example
The Setup
Disclaimer
The Data
The Analysts
Data Understanding and Pattern Finding
The Naive Approach
The Sound Approach
Explanation Finding
The Naive Approach
The Sound Approach
Predicting the Future
The Naive Approach
The Sound Approach
Concluding Remarks
Project Understanding
Determine the Project Objective
Assess the Situation
Determine Analysis Goals
Further Reading
References
Data Understanding
Attribute Understanding
Data Quality
Data Visualization
Methods for One and Two Attributes
Methods for Higher-Dimensional Data
Principal Component Analysis
Projection Pursuit
Multidimensional Scaling
Variations of PCA and MDS
Parallel Coordinates
Radar and Star Plots
Correlation Analysis
Outlier Detection
Outlier Detection for Single Attributes
Outlier Detection for Multidimensional Data
Missing Values
A Checklist for Data Understanding
Data Understanding in Practice
Data Understanding in KNIME
Data Loading
Data Types
Visualization
Data Understanding in R
Histograms
Boxplots
Scatter Plots
Principal Component Analysis
Multidimensional Scaling
Parallel Coordinates, Radar, and Star Plots
Correlation Coefficients
Grubbs' Test for Outlier Detection
References
Principles of Modeling
Model Classes
Fitting Criteria and Score Functions
Error Functions for Classification Problems
Measures of Interestingness
Algorithms for Model Fitting
Closed Form Solutions
Gradient Method
Combinatorial Optimization
Random Search, Greedy Strategies, and Other Heuristics
Types of Errors
Experimental Error
Bayes Error
ROC Curves and Confusion Matrices
Sample Error
Model Error
Algorithmic Error
Machine Learning Bias and Variance
Learning Without Bias?
Model Validation
Training and Test Data
Cross-Validation
Bootstrapping
Measures for Model Complexity
The Minimum Description Length Principle
Akaike's and the Bayesian Information Criterion
Model Errors and Validation in Practice
Errors and Validation in KNIME
Validation in R
Further Reading
References
Data Preparation
Select Data
Feature Selection
Selecting the k Top-Ranked Features
Selecting the Top-Ranked Subset
Dimensionality Reduction
Record Selection
Clean Data
Improve Data Quality
Missing Values
Ignorance/Deletion
Imputation
Explicit Value or Variable
Construct Data
Provide Operability
Scale Conversion
Dynamic Domains
Problem Reformulation
Assure Impartiality
Maximize Efficiency
Complex Data Types
Text Data Analysis
Graph Data Analysis
Image Data Analysis
Other Data Types
Data Integration
Vertical Data Integration
Horizontal Data Integration
Data Preparation in Practice
Data Preparation in KNIME
Data Preparation in R
Dimensionality Reduction
Missing Values
Normalization and Scaling
References
Finding Patterns
Clustering
Hierarchical Clustering
Prototype-Based Clustering
Density-Based Clustering
Self-organizing Maps
Association Rules
Deviation Analysis
Hierarchical Clustering
Overview
Construction
Variations and Issues
Cluster-to-Cluster Distance
Divisive Clustering
Categorical Data
Notion of (Dis-)Similarity
Numerical Attributes
Nonisotropic Distances
Text Data and Time Series
Binary Attributes
Nominal Attributes
Ordinal Attributes
Attribute Weighting
The Curse of Dimensionality
Prototype- and Model-Based Clustering
Overview
Construction
Minimization of Cluster Variance (k-Means Model)
Minimization of Cluster Variance (Fuzzy c-Means Model)
Gaussian Mixture Decomposition (GMD)
Variations and Issues
How many clusters?
Cluster Shape
Noise Clustering
Density-Based Clustering
Overview
Construction
Variations and Issues
Relaxing the density threshold
Subspace Clustering
Self-organizing Maps
Overview
Construction
Frequent Pattern Mining and Association Rules
Overview
Construction
Variations and Issues
Perfect Extension Pruning
Reducing the Output
Assessing Association Rules
Frequent Subgraph Mining
Deviation Analysis
Overview
Construction
Variations and Issues
Finding Patterns in Practice
Finding Patterns with KNIME
Finding Patterns in R
Hierarchical Clustering
Prototype-Based Clustering
Self-organizing Maps
Association Rules
Further Reading
References
Finding Explanations
Decision trees
Bayes classifiers
Regression models
Rule models
Decision Trees
Overview
Construction
Variations and Issues
Numerical Attributes
Numerical Target Attributes: Regression Trees
Pruning
Missing Values
Additional Notes
Bayes Classifiers
Overview
Construction
Variations and Issues
Performance
Linear Discriminant Analysis
Augmented Naive Bayes Classifiers
Missing Values
Misclassification Costs
Regression
Overview
Construction
Variations and Issues
Transformations
Gradient Descent
Robust Regression
Function Selection
Two Class Problems
Rule Learning
Propositional Rules
Extracting Rules from Decision Trees
Extracting Propositional Rules
Other Types of Propositional Rule Learners
Inductive Logic Programming or First-Order Rules
Finding Explanations in Practice
Finding Explanations with KNIME
Using Explanations with R
Decision Trees
Naive Bayes Classifiers
Regression
Further Reading
References
Finding Predictors
Nearest-Neighbor Predictors
Artificial Neural Networks
Support Vector Machines
Ensemble Methods
Nearest-Neighbor Predictors
Overview
Construction
Variations and Issues
Weighting with Kernel Functions
Locally Weighted Polynomial Regression
Feature Weights
Data Set Reduction and Prototype Building
Artificial Neural Networks
Overview
Construction
Variations and Issues
Backpropagation Variants
Weight Decay
Sensitivity Analysis
Support Vector Machines
Overview
Dual Representation
Kernel Functions
Support Vectors and Margin of Error
Construction
Variations and Issues
Slack Variables
Multiclass Support Vector Machines
Support Vector Regression
Ensemble Methods
Overview
Construction
Bayesian Voting
Bagging
Random Subspace Selection
Injecting Randomness
Boosting
Mixture of Experts
Stacking
Further Reading
Finding Predictors in Practice
Finding Predictors with KNIME
Using Predictors in R
Nearest Neighbor Classifiers
Neural Networks
Support Vector Machines
Ensemble Methods
References
Evaluation and Deployment
Evaluation
Documentation
Testbed Evaluation
Deployment and Monitoring
References
Appendix A Statistics
Terms and Notation
Descriptive Statistics
Tabular Representations
Graphical Representations
Characteristic Measures for One-Dimensional Data
Location Measures
Mode
Median (Central Value)
Quantiles
Mean
Dispersion Measures
Range
Interquantile Range
Mean Absolute Deviation
Variance and Standard Deviation
Shape Measures
Skewness
Kurtosis
Box Plots
Characteristic Measures for Multidimensional Data
Mean
Covariance and Correlation
Principal Component Analysis
Example of a Principal Component Analysis
Probability Theory
Probability
Intuitive Notions of Probability
The Formal Definition of Probability
Basic Methods and Theorems
Combinatorial Methods
Geometric Probabilities
Conditional Probability and Independent Events
Total Probability and Bayes' Rule
Bernoulli's Law of Large Numbers
Random Variables
Real-Valued Random Variables
Discrete Random Variables
Continuous Random Variables
Random Vectors
Characteristic Measures of Random Variables
Expected Value
Properties of the Expected Value
Variance and Standard Deviation
Properties of the Variance
Quantiles
Some Special Distributions
The Binomial Distribution
The Polynomial Distribution
The Geometric Distribution
The Hypergeometric Distribution
The Poisson Distribution
The Uniform Distribution
The Normal Distribution
The χ² Distribution
The Exponential Distribution
Inferential Statistics
Random Samples
Parameter Estimation
Point Estimation
Point Estimation Examples
Maximum Likelihood Estimation
Maximum Likelihood Estimation Example
Maximum A Posteriori Estimation
Maximum A Posteriori Estimation Example
Interval Estimation
Interval Estimation Examples
Hypothesis Testing
Error Types and Significance Level
Parameter Test
Parameter Test Example
Goodness-of-Fit Test
Goodness-of-Fit Test Example
(In)Dependence Test
Appendix B The R Project
Installation and Overview
Reading Files and R Objects
R Functions and Commands
Libraries/Packages
R Workspace
Finding Help
Further Reading
Appendix C KNIME
Installation and Overview
Building Workflows
Example Flow
R Integration
References
Index