Preface
Contents
Symbols
Introduction
Motivation
Data and Knowledge
Tycho Brahe and Johannes Kepler
Intelligent Data Analysis
The Data Analysis Process
Methods, Tasks, and Tools
Problem Categories
Catalog of Methods
Available Tools
How to Read This Book
References
Practical Data Analysis: An Example
The Setup
Disclaimer
The Data
The Analysts
Data Understanding and Pattern Finding
The Naive Approach
The Sound Approach
Explanation Finding
The Naive Approach
The Sound Approach
Predicting the Future
The Naive Approach
The Sound Approach
Concluding Remarks
Project Understanding
Determine the Project Objective
Assess the Situation
Determine Analysis Goals
Further Reading
References
Data Understanding
Attribute Understanding
Data Quality
Data Visualization
Methods for One and Two Attributes
Methods for Higher-Dimensional Data
Principal Component Analysis
Projection Pursuit
Multidimensional Scaling
Variations of PCA and MDS
Parallel Coordinates
Radar and Star Plots
Correlation Analysis
Outlier Detection
Outlier Detection for Single Attributes
Outlier Detection for Multidimensional Data
Missing Values
A Checklist for Data Understanding
Data Understanding in Practice
Data Understanding in KNIME
Data Loading
Data Types
Visualization
Data Understanding in R
Histograms
Boxplots
Scatter Plots
Principal Component Analysis
Multidimensional Scaling
Parallel Coordinates, Radar, and Star Plots
Correlation Coefficients
Grubbs' Test for Outlier Detection
References
Principles of Modeling
Model Classes
Fitting Criteria and Score Functions
Error Functions for Classification Problems
Measures of Interestingness
Algorithms for Model Fitting
Closed Form Solutions
Gradient Method
Combinatorial Optimization
Random Search, Greedy Strategies, and Other Heuristics
Types of Errors
Experimental Error
Bayes Error
ROC Curves and Confusion Matrices
Sample Error
Model Error
Algorithmic Error
Machine Learning Bias and Variance
Learning Without Bias?
Model Validation
Training and Test Data
Cross-Validation
Bootstrapping
Measures for Model Complexity
The Minimum Description Length Principle
Akaike's and the Bayesian Information Criterion
Model Errors and Validation in Practice
Errors and Validation in KNIME
Validation in R
Further Reading
References
Data Preparation
Select Data
Feature Selection
Selecting the k Top-Ranked Features
Selecting the Top-Ranked Subset
Dimensionality Reduction
Record Selection
Clean Data
Improve Data Quality
Missing Values
Ignorance/Deletion
Imputation
Explicit Value or Variable
Construct Data
Provide Operability
Scale Conversion
Dynamic Domains
Problem Reformulation
Assure Impartiality
Maximize Efficiency
Complex Data Types
Text Data Analysis
Graph Data Analysis
Image Data Analysis
Other Data Types
Data Integration
Vertical Data Integration
Horizontal Data Integration
Data Preparation in Practice
Data Preparation in KNIME
Data Preparation in R
Dimensionality Reduction
Missing Values
Normalization and Scaling
References
Finding Patterns
Clustering
Hierarchical Clustering
Prototype-Based Clustering
Density-Based Clustering
Self-organizing Maps
Association Rules
Deviation Analysis
Hierarchical Clustering
Overview
Construction
Variations and Issues
Cluster-to-Cluster Distance
Divisive Clustering
Categorical Data
Notion of (Dis-)Similarity
Numerical Attributes
Nonisotropic Distances
Text Data and Time Series
Binary Attributes
Nominal Attributes
Ordinal Attributes
Attribute Weighting
The Curse of Dimensionality
Prototype- and Model-Based Clustering
Overview
Construction
Minimization of Cluster Variance (k-Means Model)
Minimization of Cluster Variance (Fuzzy c-Means Model)
Gaussian Mixture Decomposition (GMD)
Variations and Issues
How many clusters?
Cluster Shape
Noise Clustering
Density-Based Clustering
Overview
Construction
Variations and Issues
Relaxing the density threshold
Subspace Clustering
Self-organizing Maps
Overview
Construction
Frequent Pattern Mining and Association Rules
Overview
Construction
Variations and Issues
Perfect Extension Pruning
Reducing the Output
Assessing Association Rules
Frequent Subgraph Mining
Deviation Analysis
Overview
Construction
Variations and Issues
Finding Patterns in Practice
Finding Patterns with KNIME
Finding Patterns in R
Hierarchical Clustering
Prototype-Based Clustering
Self-organizing Maps
Association Rules
Further Reading
References
Finding Explanations
Decision Trees
Bayes Classifiers
Regression Models
Rule Models
Decision Trees
Overview
Construction
Variations and Issues
Numerical Attributes
Numerical Target Attributes: Regression Trees
Pruning
Missing Values
Additional Notes
Bayes Classifiers
Overview
Construction
Variations and Issues
Performance
Linear Discriminant Analysis
Augmented Naive Bayes Classifiers
Missing Values
Misclassification Costs
Regression
Overview
Construction
Variations and Issues
Transformations
Gradient Descent
Robust Regression
Function Selection
Two Class Problems
Rule Learning
Propositional Rules
Extracting Rules from Decision Trees
Extracting Propositional Rules
Other Types of Propositional Rule Learners
Inductive Logic Programming or First-Order Rules
Finding Explanations in Practice
Finding Explanations with KNIME
Using Explanations with R
Decision Trees
Naive Bayes Classifiers
Regression
Further Reading
References
Finding Predictors
Nearest-Neighbor Predictors
Artificial Neural Networks
Support Vector Machines
Ensemble Methods
Nearest-Neighbor Predictors
Overview
Construction
Variations and Issues
Weighting with Kernel Functions
Locally Weighted Polynomial Regression
Feature Weights
Data Set Reduction and Prototype Building
Artificial Neural Networks
Overview
Construction
Variations and Issues
Backpropagation Variants
Weight Decay
Sensitivity Analysis
Support Vector Machines
Overview
Dual Representation
Kernel Functions
Support Vectors and Margin of Error
Construction
Variations and Issues
Slack Variables
Multiclass Support Vector Machines
Support Vector Regression
Ensemble Methods
Overview
Construction
Bayesian Voting
Bagging
Random Subspace Selection
Injecting Randomness
Boosting
Mixture of Experts
Stacking
Further Reading
Finding Predictors in Practice
Finding Predictors with KNIME
Using Predictors in R
Nearest Neighbor Classifiers
Neural Networks
Support Vector Machines
Ensemble Methods
References
Evaluation and Deployment
Evaluation
Documentation
Testbed Evaluation
Deployment and Monitoring
References
Appendix A Statistics
Terms and Notation
Descriptive Statistics
Tabular Representations
Graphical Representations
Characteristic Measures for One-Dimensional Data
Location Measures
Mode
Median (Central Value)
Quantiles
Mean
Dispersion Measures
Range
Interquartile Range
Mean Absolute Deviation
Variance and Standard Deviation
Shape Measures
Skewness
Kurtosis
Box Plots
Characteristic Measures for Multidimensional Data
Mean
Covariance and Correlation
Principal Component Analysis
Example of a Principal Component Analysis
Probability Theory
Probability
Intuitive Notions of Probability
The Formal Definition of Probability
Basic Methods and Theorems
Combinatorial Methods
Geometric Probabilities
Conditional Probability and Independent Events
Total Probability and Bayes' Rule
Bernoulli's Law of Large Numbers
Random Variables
Real-Valued Random Variables
Discrete Random Variables
Continuous Random Variables
Random Vectors
Characteristic Measures of Random Variables
Expected Value
Properties of the Expected Value
Variance and Standard Deviation
Properties of the Variance
Quantiles
Some Special Distributions
The Binomial Distribution
The Polynomial Distribution
The Geometric Distribution
The Hypergeometric Distribution
The Poisson Distribution
The Uniform Distribution
The Normal Distribution
The χ² Distribution
The Exponential Distribution
Inferential Statistics
Random Samples
Parameter Estimation
Point Estimation
Point Estimation Examples
Maximum Likelihood Estimation
Maximum Likelihood Estimation Example
Maximum A Posteriori Estimation
Maximum A Posteriori Estimation Example
Interval Estimation
Interval Estimation Examples
Hypothesis Testing
Error Types and Significance Level
Parameter Test
Parameter Test Example
Goodness-of-Fit Test
Goodness-of-Fit Test Example
(In)Dependence Test
Appendix B The R Project
Installation and Overview
Reading Files and R Objects
R Functions and Commands
Libraries/Packages
R Workspace
Finding Help
Further Reading
Appendix C KNIME
Installation and Overview
Building Workflows
Example Flow
R Integration
References
Index
Texts in Computer Science
Series Editors: David Gries, Fred B. Schneider
For further volumes: http://www.springer.com/series/3191
Michael R. Berthold · Christian Borgelt · Frank Höppner · Frank Klawonn

Guide to Intelligent Data Analysis
How to Intelligently Make Sense of Real Data
Prof. Dr. Michael R. Berthold
FB Informatik und Informationswissenschaft
Universität Konstanz
78457 Konstanz, Germany
Michael.Berthold@uni-konstanz.de

Dr. Christian Borgelt
Intelligent Data Analysis & Graphical Models Research Unit
European Centre for Soft Computing
C/ Gonzalo Gutiérrez Quirós s/n
Edificio Científico-Technológico, Campus Mieres, 3a Planta
33600 Mieres, Asturias, Spain
christian.borgelt@softcomputing.es

Prof. Dr. Frank Höppner
FB Wirtschaft, Ostfalia University of Applied Sciences
Robert-Koch-Platz 10-14
38440 Wolfsburg, Germany
f.hoeppner@ostfalia.de

Prof. Dr. Frank Klawonn
FB Informatik, Ostfalia University of Applied Sciences
Salzdahlumer Str. 46/48
38302 Wolfenbüttel, Germany
f.klawonn@ostfalia.de

Series Editors:
David Gries, Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853-7501, USA
Fred B. Schneider, Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853-7501, USA

ISSN 1868-0941, e-ISSN 1868-095X
ISBN 978-1-84882-259-7, e-ISBN 978-1-84882-260-3
DOI 10.1007/978-1-84882-260-3
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2010930517

© Springer-Verlag London Limited 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: VTeX, Vilnius
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The main motivation for writing this book came from the difficulty we had in finding suitable material for a textbook that would really help us teach the practical aspects of data analysis together with the necessary theoretical underpinnings. Many books tackle one or the other of these aspects (and, especially for the latter, there are some fantastic textbooks available), but a book providing a good combination of both was nowhere to be found. The idea to write our own book to address this shortcoming arose in two different places at the same time: when one of the authors was asked to review the book proposal of the others, we quickly realized that it would be much better to join forces than to pursue our projects independently.

We hope that this book helps others learn what kinds of challenges data analysts face in the real world and at the same time provides them with solid knowledge of the processes, algorithms, and theory needed to tackle these problems successfully. We have put a lot of effort into balancing the practical aspects of applying data analysis techniques while making sure, at the same time, that we did not forget to explain the statistical and mathematical underpinnings of the algorithms beneath all of this.

There are many people to be thanked, and we will not attempt to list them all. However, we do want to single out Iris Adä, who was a tremendous help with the generation of the data sets used in this book. She and Martin Horn also deserve our thanks for an intense last-minute round of proofreading.

Konstanz, Germany: Michael R. Berthold
Oviedo, Spain: Christian Borgelt
Braunschweig, Germany: Frank Höppner and Frank Klawonn