Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Ethical considerations
How do machines learn?
Abstraction and knowledge representation
Generalization
Assessing the success of learning
Steps to apply machine learning to your data
Choosing a machine learning algorithm
Thinking about the input data
Thinking about types of machine learning algorithms
Matching your data to an appropriate algorithm
Using R for machine learning
Installing and loading R packages
Installing an R package
Installing a package using the point-and-click interface
Loading an R package
Summary
Chapter 2: Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrices and arrays
Managing data with R
Saving and loading R data structures
Importing and saving data from CSV files
Importing data from SQL databases
Exploring and understanding data
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
Summary
Chapter 3: Lazy Learning – Classification using Nearest Neighbors
Understanding classification using nearest neighbors
The kNN algorithm
Calculating distance
Choosing an appropriate k
Preparing data for use with kNN
Why is the kNN algorithm lazy?
Diagnosing breast cancer with the kNN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
Summary
Chapter 4: Probabilistic Learning – Classification using Naive Bayes
Understanding naive Bayes
Basic concepts of Bayesian methods
Probability
Joint probability
Conditional probability with Bayes' theorem
The naive Bayes algorithm
The naive Bayes classification
The Laplace estimator
Using numeric features with naive Bayes
Example – filtering mobile phone spam with the naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – processing text data for analysis
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 5: Divide and Conquer – Classification using Decision Trees and Rules
Understanding decision trees
Divide-and-conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes more costly than others
Understanding classification rules
Separate-and-conquer
The One Rule algorithm
The RIPPER algorithm
Rules from decision trees
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 6: Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – correlation matrix
Visualizing relationships among features – scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding non-linear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with mean absolute error
Step 5 – improving model performance
Summary
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
Finding the maximum margin
The case of linearly separable data
The case of non-linearly separable data
Using kernels for non-linear spaces
Performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 8: Finding Patterns – Market Basket Analysis using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
Summary
Chapter 9: Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
The k-means algorithm for clustering
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Summary
Chapter 10: Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
Kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance tradeoffs
ROC curves
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
Summary
Chapter 11: Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance
Summary
Chapter 12: Specialized Machine Learning Topics
Working with specialized data
Getting data from the Web with RCurl
Reading and writing XML with 'XML'
Reading and writing JSON with rjson
Reading and writing Microsoft Excel spreadsheets using xlsx
Working with bioinformatics data
Working with social network or graph data
Improving the performance of R
Managing very large datasets
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with foreach
Using a multitasking operating system with multicore
Networking multiple workstations with snow and snowfall
Parallel cloud computing with MapReduce and Hadoop
GPU computing
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing bigger and faster random forests with bigrf
Training and evaluating models in parallel with caret
Summary
Index