Preface
Contents
Symbols
Introduction
Motivation
Data and Knowledge
Tycho Brahe and Johannes Kepler
Intelligent Data Analysis
The Data Analysis Process
Methods, Tasks, and Tools
Problem Categories
Catalog of Methods
Available Tools
How to Read This Book
References
Practical Data Analysis: An Example
The Setup
Disclaimer
The Data
The Analysts
Data Understanding and Pattern Finding
The Naive Approach
The Sound Approach
Explanation Finding
The Naive Approach
The Sound Approach
Predicting the Future
The Naive Approach
The Sound Approach
Concluding Remarks
Project Understanding
Determine the Project Objective
Assess the Situation
Determine Analysis Goals
Further Reading
References
Data Understanding
Attribute Understanding
Data Quality
Data Visualization
Methods for One and Two Attributes
Methods for Higher-Dimensional Data
Principal Component Analysis
Projection Pursuit
Multidimensional Scaling
Variations of PCA and MDS
Parallel Coordinates
Radar and Star Plots
Correlation Analysis
Outlier Detection
Outlier Detection for Single Attributes
Outlier Detection for Multidimensional Data
Missing Values
A Checklist for Data Understanding
Data Understanding in Practice
Data Understanding in KNIME
Data Loading
Data Types
Visualization
Data Understanding in R
Histograms
Boxplots
Scatter Plots
Principal Component Analysis
Multidimensional Scaling
Parallel Coordinates, Radar, and Star Plots
Correlation Coefficients
Grubbs' Test for Outlier Detection
References
Principles of Modeling
Model Classes
Fitting Criteria and Score Functions
Error Functions for Classification Problems
Measures of Interestingness
Algorithms for Model Fitting
Closed Form Solutions
Gradient Method
Combinatorial Optimization
Random Search, Greedy Strategies, and Other Heuristics
Types of Errors
Experimental Error
Bayes Error
ROC Curves and Confusion Matrices
Sample Error
Model Error
Algorithmic Error
Machine Learning Bias and Variance
Learning Without Bias?
Model Validation
Training and Test Data
Cross-Validation
Bootstrapping
Measures for Model Complexity
The Minimum Description Length Principle
Akaike's and the Bayesian Information Criterion
Model Errors and Validation in Practice
Errors and Validation in KNIME
Validation in R
Further Reading
References
Data Preparation
Select Data
Feature Selection
Selecting the k Top-Ranked Features
Selecting the Top-Ranked Subset
Dimensionality Reduction
Record Selection
Clean Data
Improve Data Quality
Missing Values
Ignorance/Deletion
Imputation
Explicit Value or Variable
Construct Data
Provide Operability
Scale Conversion
Dynamic Domains
Problem Reformulation
Assure Impartiality
Maximize Efficiency
Complex Data Types
Text Data Analysis
Graph Data Analysis
Image Data Analysis
Other Data Types
Data Integration
Vertical Data Integration
Horizontal Data Integration
Data Preparation in Practice
Data Preparation in KNIME
Data Preparation in R
Dimensionality Reduction
Missing Values
Normalization and Scaling
References
Finding Patterns
Clustering
Hierarchical Clustering
Prototype-Based Clustering
Density-Based Clustering
Self-organizing Maps
Association Rules
Deviation Analysis
Hierarchical Clustering
Overview
Construction
Variations and Issues
Cluster-to-Cluster Distance
Divisive Clustering
Categorical Data
Notion of (Dis-)Similarity
Numerical Attributes
Nonisotropic Distances
Text Data and Time Series
Binary Attributes
Nominal Attributes
Ordinal Attributes
Attribute Weighting
The Curse of Dimensionality
Prototype- and Model-Based Clustering
Overview
Construction
Minimization of Cluster Variance (k-Means Model)
Minimization of Cluster Variance (Fuzzy c-Means Model)
Gaussian Mixture Decomposition (GMD)
Variations and Issues
How many clusters?
Cluster Shape
Noise Clustering
Density-Based Clustering
Overview
Construction
Variations and Issues
Relaxing the density threshold
Subspace Clustering
Self-organizing Maps
Overview
Construction
Frequent Pattern Mining and Association Rules
Overview
Construction
Variations and Issues
Perfect Extension Pruning
Reducing the Output
Assessing Association Rules
Frequent Subgraph Mining
Deviation Analysis
Overview
Construction
Variations and Issues
Finding Patterns in Practice
Finding Patterns with KNIME
Finding Patterns in R
Hierarchical Clustering
Prototype-Based Clustering
Self-organizing Maps
Association Rules
Further Reading
References
Finding Explanations
Decision Trees
Bayes Classifiers
Regression Models
Rule Models
Decision Trees
Overview
Construction
Variations and Issues
Numerical Attributes
Numerical Target Attributes: Regression Trees
Pruning
Missing Values
Additional Notes
Bayes Classifiers
Overview
Construction
Variations and Issues
Performance
Linear Discriminant Analysis
Augmented Naive Bayes Classifiers
Missing Values
Misclassification Costs
Regression
Overview
Construction
Variations and Issues
Transformations
Gradient Descent
Robust Regression
Function Selection
Two Class Problems
Rule Learning
Propositional Rules
Extracting Rules from Decision Trees
Extracting Propositional Rules
Other Types of Propositional Rule Learners
Inductive Logic Programming or First-Order Rules
Finding Explanations in Practice
Finding Explanations with KNIME
Using Explanations with R
Decision Trees
Naive Bayes Classifiers
Regression
Further Reading
References
Finding Predictors
Nearest-Neighbor Predictors
Artificial Neural Networks
Support Vector Machines
Ensemble Methods
Nearest-Neighbor Predictors
Overview
Construction
Variations and Issues
Weighting with Kernel Functions
Locally Weighted Polynomial Regression
Feature Weights
Data Set Reduction and Prototype Building
Artificial Neural Networks
Overview
Construction
Variations and Issues
Backpropagation Variants
Weight Decay
Sensitivity Analysis
Support Vector Machines
Overview
Dual Representation
Kernel Functions
Support Vectors and Margin of Error
Construction
Variations and Issues
Slack Variables
Multiclass Support Vector Machines
Support Vector Regression
Ensemble Methods
Overview
Construction
Bayesian Voting
Bagging
Random Subspace Selection
Injecting Randomness
Boosting
Mixture of Experts
Stacking
Further Reading
Finding Predictors in Practice
Finding Predictors with KNIME
Using Predictors in R
Nearest Neighbor Classifiers
Neural Networks
Support Vector Machines
Ensemble Methods
References
Evaluation and Deployment
Evaluation
Documentation
Testbed Evaluation
Deployment and Monitoring
References
Appendix A Statistics
Terms and Notation
Descriptive Statistics
Tabular Representations
Graphical Representations
Characteristic Measures for One-Dimensional Data
Location Measures
Mode
Median (Central Value)
Quantiles
Mean
Dispersion Measures
Range
Interquartile Range
Mean Absolute Deviation
Variance and Standard Deviation
Shape Measures
Skewness
Kurtosis
Box Plots
Characteristic Measures for Multidimensional Data
Mean
Covariance and Correlation
Principal Component Analysis
Example of a Principal Component Analysis
Probability Theory
Probability
Intuitive Notions of Probability
The Formal Definition of Probability
Basic Methods and Theorems
Combinatorial Methods
Geometric Probabilities
Conditional Probability and Independent Events
Total Probability and Bayes' Rule
Bernoulli's Law of Large Numbers
Random Variables
Real-Valued Random Variables
Discrete Random Variables
Continuous Random Variables
Random Vectors
Characteristic Measures of Random Variables
Expected Value
Properties of the Expected Value
Variance and Standard Deviation
Properties of the Variance
Quantiles
Some Special Distributions
The Binomial Distribution
The Polynomial Distribution
The Geometric Distribution
The Hypergeometric Distribution
The Poisson Distribution
The Uniform Distribution
The Normal Distribution
The χ² Distribution
The Exponential Distribution
Inferential Statistics
Random Samples
Parameter Estimation
Point Estimation
Point Estimation Examples
Maximum Likelihood Estimation
Maximum Likelihood Estimation Example
Maximum A Posteriori Estimation
Maximum A Posteriori Estimation Example
Interval Estimation
Interval Estimation Examples
Hypothesis Testing
Error Types and Significance Level
Parameter Test
Parameter Test Example
Goodness-of-Fit Test
Goodness-of-Fit Test Example
(In)Dependence Test
Appendix B The R Project
Installation and Overview
Reading Files and R Objects
R Functions and Commands
Libraries/Packages
R Workspace
Finding Help
Further Reading
Appendix C KNIME
Installation and Overview
Building Workflows
Example Flow
R Integration
References
Index
Texts in Computer Science
Series Editors: David Gries, Fred B. Schneider
For further volumes: http://www.springer.com/series/3191
Michael R. Berthold · Christian Borgelt · Frank Höppner · Frank Klawonn

Guide to Intelligent Data Analysis
How to Intelligently Make Sense of Real Data
Prof. Dr. Michael R. Berthold
FB Informatik und Informationswissenschaft
Universität Konstanz
78457 Konstanz, Germany
Michael.Berthold@uni-konstanz.de

Dr. Christian Borgelt
Intelligent Data Analysis & Graphical Models Research Unit
European Centre for Soft Computing
C/ Gonzalo Gutiérrez Quirós s/n
Edificio Científico-Technológico, Campus Mieres, 3a Planta
33600 Mieres, Asturias, Spain
christian.borgelt@softcomputing.es

Prof. Dr. Frank Höppner
FB Wirtschaft, Ostfalia University of Applied Sciences
Robert-Koch-Platz 10-14
38440 Wolfsburg, Germany
f.hoeppner@ostfalia.de

Prof. Dr. Frank Klawonn
FB Informatik, Ostfalia University of Applied Sciences
Salzdahlumer Str. 46/48
38302 Wolfenbüttel, Germany
f.klawonn@ostfalia.de

Series Editors:
David Gries, Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853-7501, USA
Fred B. Schneider, Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853-7501, USA

ISSN 1868-0941, e-ISSN 1868-095X
ISBN 978-1-84882-259-7, e-ISBN 978-1-84882-260-3
DOI 10.1007/978-1-84882-260-3
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2010930517

© Springer-Verlag London Limited 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: VTeX, Vilnius
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The main motivation for writing this book came from the difficulty we had in finding suitable material for a textbook that would really help us teach the practical aspects of data analysis together with the necessary theoretical underpinnings. Many books tackle one or the other of these aspects (and, especially for the latter, there are some fantastic textbooks available), but a book providing a good combination of both was nowhere to be found. The idea to write our own book to address this shortcoming arose in two different places at the same time: when one of the authors was asked to review the book proposal of the others, we quickly realized that it would be much better to join forces than to pursue our projects independently.

We hope that this book helps others learn what kinds of challenges data analysts face in the real world and at the same time provides them with solid knowledge of the processes, algorithms, and theory needed to tackle these problems successfully. We have put a lot of effort into balancing the practical aspects of applying data analysis techniques while making sure, at the same time, that we did not forget to explain the statistical and mathematical underpinnings of the algorithms beneath all of this.

There are many people to be thanked, and we will not attempt to list them all. However, we do want to single out Iris Adä, who was a tremendous help with the generation of the data sets used in this book. She and Martin Horn also deserve our thanks for an intense last-minute round of proofreading.

Konstanz, Germany: Michael R. Berthold
Oviedo, Spain: Christian Borgelt
Braunschweig, Germany: Frank Höppner and Frank Klawonn