Front Cover
Data Mining with R: Learning with Case Studies
Contents
1 Introduction
2 Predicting Algae Blooms
3 Predicting Stock Market Returns
4 Detecting Fraudulent Transactions
5 Classifying Microarray Samples
Bibliography
Preface
Acknowledgments
List of Figures
List of Tables
Chapter 1: Introduction
1.1 How to Read This Book?
1.2 A Short Introduction to R
1.2.1 Starting with R
1.2.2 R Objects
1.2.3 Vectors
1.2.4 Vectorization
1.2.5 Factors
1.2.6 Generating Sequences
1.2.7 Sub-Setting
1.2.8 Matrices and Arrays
1.2.9 Lists
1.2.10 Data Frames
1.2.11 Creating New Functions
1.2.12 Objects, Classes, and Methods
1.2.13 Managing Your Sessions
1.3 A Short Introduction to MySQL
Chapter 2: Predicting Algae Blooms
2.1 Problem Description and Objectives
2.2 Data Description
2.3 Loading the Data into R
2.4 Data Visualization and Summarization
2.5 Unknown Values
2.5.1 Removing the Observations with Unknown Values
2.5.2 Filling in the Unknowns with the Most Frequent Values
2.5.3 Filling in the Unknown Values by Exploring Correlations
2.5.4 Filling in the Unknown Values by Exploring Similarities between Cases
2.6 Obtaining Prediction Models
2.6.1 Multiple Linear Regression
2.6.2 Regression Trees
2.7 Model Evaluation and Selection
2.8 Predictions for the Seven Algae
2.9 Summary
Chapter 3: Predicting Stock Market Returns
3.1 Problem Description and Objectives
3.2 The Available Data
3.2.1 Handling Time-Dependent Data in R
3.2.2 Reading the Data from the CSV File
3.2.3 Getting the Data from the Web
3.2.4 Reading the Data from a MySQL Database
3.2.4.1 Loading the Data into R Running on Windows
3.2.4.2 Loading the Data into R Running on Linux
3.3 Defining the Prediction Tasks
3.3.1 What to Predict?
FIGURE 3.1
3.3.2 Which Predictors?
FIGURE 3.2
3.3.3 The Prediction Tasks
3.3.4 Evaluation Criteria
TABLE 3.1 (true values vs. predictions)
3.4 The Prediction Models
3.4.1 How Will the Training Data Be Used?
FIGURE 3.3
3.4.2 The Modeling Tools
3.4.2.1 Artificial Neural Networks
3.4.2.2 Support Vector Machines
FIGURE 3.4
3.4.2.3 Multivariate Adaptive Regression Splines
FIGURE 3.5
3.5 From Predictions into Actions
3.5.1 How Will the Predictions Be Used?
3.5.2 Trading-Related Evaluation Criteria
3.5.3 Putting Everything Together: A Simulated Trader
FIGURE 3.6
3.6 Model Evaluation and Selection
3.6.1 Monte Carlo Estimates
FIGURE 3.7
3.6.2 Experimental Comparisons
3.6.3 Results Analysis
FIGURE 3.8
3.7 The Trading System
3.7.1 Evaluation of the Final Test Data
FIGURE 3.9
FIGURE 3.10
FIGURE 3.11
3.7.2 An Online Trading System
3.8 Summary
Chapter 4: Detecting Fraudulent Transactions
4.1 Problem Description and Objectives
4.2 The Available Data
4.2.1 Loading the Data into R
4.2.2 Exploring the Dataset
4.2.3 Data Problems
4.2.3.1 Unknown Values
4.2.3.2 Few Transactions of Some Products
4.3 Defining the Data Mining Tasks
4.3.1 Different Approaches to the Problem
4.3.1.1 Unsupervised Techniques
4.3.1.2 Supervised Techniques
4.3.1.3 Semi-Supervised Techniques
4.3.2 Evaluation Criteria
4.3.2.1 Precision and Recall
4.3.2.2 Lift Charts and Precision/Recall Curves
4.3.2.3 Normalized Distance to Typical Price
4.3.3 Experimental Methodology
4.4 Obtaining Outlier Rankings
4.4.1 Unsupervised Approaches
4.4.1.1 The Modified Box Plot Rule
FIGURE 4.7
4.4.1.2 Local Outlier Factors (LOF)
FIGURE 4.8
4.4.1.3 Clustering-Based Outlier Rankings (ORh)
4.4.2 Supervised Approaches
FIGURE 4.9
4.4.2.1 The Class Imbalance Problem
FIGURE 4.10
4.4.2.2 Naive Bayes
FIGURE 4.11
FIGURE 4.12
4.4.2.3 AdaBoost
FIGURE 4.13
4.4.3 Semi-Supervised Approaches
FIGURE 4.14
FIGURE 4.15
4.5 Summary
Chapter 5: Classifying Microarray Samples
5.1 Problem Description and Objectives
5.1.1 Brief Background on Microarray Experiments
5.1.2 The ALL Dataset
5.2 The Available Data
5.2.1 Exploring the Dataset
5.3 Gene (Feature) Selection
5.3.1 Simple Filters Based on Distribution Properties
FIGURE 5.2
5.3.2 ANOVA Filters
FIGURE 5.3
5.3.3 Filtering Using Random Forests
5.3.4 Filtering Using Feature Clustering Ensembles
FIGURE 5.4
5.4 Predicting Cytogenetic Abnormalities
5.4.1 Defining the Prediction Task
5.4.2 The Evaluation Metric
5.4.3 The Experimental Procedure
5.4.4 The Modeling Techniques
5.4.4.1 Random Forests
5.4.4.2 k-Nearest Neighbors
5.4.5 Comparing the Models
5.5 Summary
Bibliography
Subject Index
Index of Data Mining Topics
Index of R Functions
Back Cover