logo资料库

MachineLearningwithR.pdf 英文原版

第1页 / 共396页
第2页 / 共396页
第3页 / 共396页
第4页 / 共396页
第5页 / 共396页
第6页 / 共396页
第7页 / 共396页
第8页 / 共396页
资料共396页,剩余部分请下载后查看
Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Ethical considerations
How do machines learn?
Abstraction and knowledge representation
Generalization
Assessing the success of learning
Steps to apply machine learning to your data
Choosing a machine learning algorithm
Thinking about the input data
Thinking about types of machine learning algorithms
Matching your data to an appropriate algorithm
Using R for machine learning
Installing and loading R packages
Installing an R package
Installing a package using the point-and-click interface
Loading an R package
Summary
Chapter 2: Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrixes and arrays
Managing data with R
Saving and loading R data structures
Importing and saving data from CSV files
Importing data from SQL databases
Exploring and understanding data
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
Summary
Chapter 3: Lazy Learning – Classification using Nearest Neighbors
Understanding classification using nearest neighbors
The kNN algorithm
Calculating distance
Choosing an appropriate k
Preparing data for use with kNN
Why is the kNN algorithm lazy?
Diagnosing breast cancer with the kNN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
Summary
Chapter 4: Probabilistic Learning – Classification using Naive Bayes
Understanding naive Bayes
Basic concepts of Bayesian methods
Probability
Joint probability
Conditional probability with Bayes' theorem
The naive Bayes algorithm
The naive Bayes classification
The Laplace estimator
Using numeric features with naive Bayes
Example – filtering mobile phone spam with the naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – processing text data for analysis
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 5: Divide and Conquer – Classification using Decision Trees and Rules
Understanding decision trees
Divide-and-conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes more costly than others
Understanding classification rules
Separate-and-conquer
The One Rule algorithm
The RIPPER algorithm
Rules from decision trees
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 6: Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – correlation matrix
Visualizing relationships among features – scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding non-linear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with mean absolute error
Step 5 – improving model performance
Summary
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
Finding the maximum margin
The case of linearly separable data
The case of non-linearly separable data
Using kernels for non-linear spaces
Performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 8: Finding Patterns – Market Basket Analysis using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
Summary
Chapter 9: Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
The k-means algorithm for clustering
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 10: Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
Kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance tradeoffs
ROC curves
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
Summary
Chapter 11: Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance
Summary
Chapter 12: Specialized Machine Learning Topics
Working with specialized data
Getting data from the Web with RCurl
Reading and writing XML with 'XML'
Reading and writing JSON with rjson
Reading and writing Microsoft Excel spreadsheets using xlsx
Working with bioinformatics data
Working with social network or graph data
Improving the performance of R
Managing very large datasets
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with foreach
Using a multitasking operating system with multicore
Networking multiple workstations with snow and snowfall
Parallel cloud computing with MapReduce and Hadoop
GPU computing
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing bigger and faster random forests with bigrf
Training and evaluating models in parallel with caret
Summary
Index
http://freepdf-books.com
Machine Learning with R Learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications Brett Lantz BIRMINGHAM - MUMBAI
Machine Learning with R Copyright © 2013 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: October 2013 Production Reference: 1211013 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78216-214-8 www.packtpub.com Cover Image by Abhishek Pandey (abhishek.pandey1210@gmail.com) http://freepdf-books.com
Credits Author Brett Lantz Reviewers Jia Liu Mzabalazo Z. Ngwenya Abhinav Upadhyay Acquisition Editor James Jones Lead Technical Editor Azharuddin Sheikh Technical Editors Pooja Arondekar Pratik More Anusri Ramchandran Harshad Vairat Project Coordinator Anugya Khurana Proofreaders Simran Bhogal Ameesha Green Paul Hindle Indexer Tejal Soni Graphics Ronak Dhruv Production Coordinator Nilesh R. Mohite Cover Work Nilesh R. Mohite http://freepdf-books.com
About the Author Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data. This book could not have been written without the support of my family and friends. In particular, my wife Jessica deserves many thanks for her patience and encouragement throughout the past year. My son Will (who was born while Chapter 10 was underway), also deserves special mention for his role in the writing process; without his gracious ability to sleep through the night, I could not have strung together a coherent sentence the next morning. I dedicate this book to him in the hope that one day he is inspired to follow his curiosity wherever it may lead. I am also indebted to many others who supported this book indirectly. My interactions with educators, peers, and collaborators at the University of Michigan, the University of Notre Dame, and the University of Central Florida seeded many of the ideas I attempted to express in the text. Additionally, without the work of researchers who shared their expertise in publications, lectures, and source code, this book might not exist at all. Finally, I appreciate the efforts of the R team and all those who have contributed to R packages, whose work ultimately brought machine learning to the masses. http://freepdf-books.com
About the Reviewers Jia Liu holds a Master's degree in Statistics from the University of Maryland, Baltimore County, and is presently a PhD candidate in statistics from Iowa State University. Her research interests include mixed-effects model, Bayesian method, Boostrap method, reliability, design of experiments, machine learning and data mining. She has two year's experience as a student consultant in statistics and two year's internship experience in agriculture and pharmaceutical industry. Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and currently works as a biometrician. He holds an MSc in Mathematical Statistics from the University of Cape Town and is at present studying for a PhD (at the School of Information Technology, University of Pretoria), in the field of Computational Intelligence. His research interests include statistical computing, machine learning, and spatial statistics. Previously, he was involved in reviewing Learning RStudio for R Statistical Computing (Van de Loo and de Jong, 2012), and R Statistical Application Development by Example beginner's guide (Prabhanjan Narayanachar Tattar , 2013). http://freepdf-books.com
Abhinav Upadhyay finished his Bachelor's degree in 2011 with a major in Information Technology. His main areas of interest include machine learning and information retrieval. In 2011, he worked for the NetBSD Foundation as part of the Google Summer of Code program. During that period, he wrote a search engine for Unix manual pages. This project resulted in a new implementation of the apropos utility for NetBSD. Currently, he is working as a Development Engineer for SocialTwist. His day-to-day work involves writing system level tools and frameworks to manage the product infrastructure. He is also an open source enthusiast and quite active in the community. In his free time, he maintains and contributes to several open source projects. http://freepdf-books.com
www.PacktPub.com Support files, eBooks, discount offers and more You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. TM http://PacktLib.PacktPub.com Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. Why Subscribe? • Fully searchable across every book published by Packt • Copy and paste, print and bookmark content • On demand and accessible via web browser Free Access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access. http://freepdf-books.com
分享到:
收藏