An Introduction to Statistical Learning with Applications in R

Front Matter
Preface
Contents
1 Introduction
2 Statistical Learning
2.1 What Is Statistical Learning?
2.1.1 Why Estimate f?
2.1.2 How Do We Estimate f?
2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
2.1.4 Supervised Versus Unsupervised Learning
2.1.5 Regression Versus Classification Problems
2.2 Assessing Model Accuracy
2.2.1 Measuring the Quality of Fit
2.2.2 The Bias-Variance Trade-Off
2.2.3 The Classification Setting
2.3 Lab: Introduction to R
2.3.1 Basic Commands
2.3.2 Graphics
2.3.3 Indexing Data
2.3.4 Loading Data
2.3.5 Additional Graphical and Numerical Summaries
2.4 Exercises
3 Linear Regression
3.1 Simple Linear Regression
3.1.1 Estimating the Coefficients
3.1.2 Assessing the Accuracy of the Coefficient Estimates
3.1.3 Assessing the Accuracy of the Model
Residual Standard Error
R² Statistic
3.2 Multiple Linear Regression
3.2.1 Estimating the Regression Coefficients
3.2.2 Some Important Questions
One: Is There a Relationship Between the Response and Predictors?
Two: Deciding on Important Variables
Three: Model Fit
Four: Predictions
3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
Predictors with Only Two Levels
Qualitative Predictors with More than Two Levels
3.3.2 Extensions of the Linear Model
Removing the Additive Assumption
Non-linear Relationships
3.3.3 Potential Problems
1. Non-linearity of the Data
2. Correlation of Error Terms
3. Non-constant Variance of Error Terms
4. Outliers
5. High Leverage Points
6. Collinearity
3.4 The Marketing Plan
3.5 Comparison of Linear Regression with K-NearestNeighbors
3.6 Lab: Linear Regression
3.6.1 Libraries
3.6.2 Simple Linear Regression
3.6.3 Multiple Linear Regression
3.6.4 Interaction Terms
3.6.5 Non-linear Transformations of the Predictors
3.6.6 Qualitative Predictors
3.6.7 Writing Functions
3.7 Exercises
4 Classification
4.1 An Overview of Classification
4.2 Why Not Linear Regression?
4.3 Logistic Regression
4.3.1 The Logistic Model
4.3.2 Estimating the Regression Coefficients
4.3.3 Making Predictions
4.3.4 Multiple Logistic Regression
4.3.5 Logistic Regression for >2 Response Classes
4.4 Linear Discriminant Analysis
4.4.1 Using Bayes' Theorem for Classification
4.4.2 Linear Discriminant Analysis for p = 1
4.4.3 Linear Discriminant Analysis for p > 1
4.4.4 Quadratic Discriminant Analysis
4.5 A Comparison of Classification Methods
4.6 Lab: Logistic Regression, LDA, QDA, and KNN
4.6.1 The Stock Market Data
4.6.2 Logistic Regression
4.6.3 Linear Discriminant Analysis
4.6.4 Quadratic Discriminant Analysis
4.6.5 K-Nearest Neighbors
4.6.6 An Application to Caravan Insurance Data
4.7 Exercises
5 Resampling Methods
5.1 Cross-Validation
5.1.1 The Validation Set Approach
5.1.2 Leave-One-Out Cross-Validation
5.1.3 k-Fold Cross-Validation
5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
5.1.5 Cross-Validation on Classification Problems
5.2 The Bootstrap
5.3 Lab: Cross-Validation and the Bootstrap
5.3.1 The Validation Set Approach
5.3.2 Leave-One-Out Cross-Validation
5.3.3 k-Fold Cross-Validation
5.3.4 The Bootstrap
Estimating the Accuracy of a Statistic of Interest
Estimating the Accuracy of a Linear Regression Model
5.4 Exercises
6 Linear Model Selection and Regularization
6.1 Subset Selection
6.1.1 Best Subset Selection
6.1.2 Stepwise Selection
Forward Stepwise Selection
Backward Stepwise Selection
Hybrid Approaches
6.1.3 Choosing the Optimal Model
Cp, AIC, BIC, and Adjusted R²
Validation and Cross-Validation
6.2 Shrinkage Methods
6.2.1 Ridge Regression
An Application to the Credit Data
Why Does Ridge Regression Improve Over Least Squares?
6.2.2 The Lasso
Another Formulation for Ridge Regression and the Lasso
The Variable Selection Property of the Lasso
Comparing the Lasso and Ridge Regression
A Simple Special Case for Ridge Regression and the Lasso
Bayesian Interpretation for Ridge Regression and the Lasso
6.2.3 Selecting the Tuning Parameter
6.3 Dimension Reduction Methods
6.3.1 Principal Components Regression
An Overview of Principal Components Analysis
The Principal Components Regression Approach
6.3.2 Partial Least Squares
6.4 Considerations in High Dimensions
6.4.1 High-Dimensional Data
6.4.2 What Goes Wrong in High Dimensions?
6.4.3 Regression in High Dimensions
6.4.4 Interpreting Results in High Dimensions
6.5 Lab 1: Subset Selection Methods
6.5.1 Best Subset Selection
6.5.2 Forward and Backward Stepwise Selection
6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation
6.6 Lab 2: Ridge Regression and the Lasso
6.6.1 Ridge Regression
6.6.2 The Lasso
6.7 Lab 3: PCR and PLS Regression
6.7.1 Principal Components Regression
6.7.2 Partial Least Squares
6.8 Exercises
7 Moving Beyond Linearity
7.1 Polynomial Regression
7.2 Step Functions
7.3 Basis Functions
7.4 Regression Splines
7.4.1 Piecewise Polynomials
7.4.2 Constraints and Splines
7.4.3 The Spline Basis Representation
7.4.4 Choosing the Number and Locations of the Knots
7.4.5 Comparison to Polynomial Regression
7.5 Smoothing Splines
7.5.1 An Overview of Smoothing Splines
7.5.2 Choosing the Smoothing Parameter
7.6 Local Regression
7.7 Generalized Additive Models
7.7.1 GAMs for Regression Problems
Pros and Cons of GAMs
7.7.2 GAMs for Classification Problems
7.8 Lab: Non-linear Modeling
7.8.1 Polynomial Regression and Step Functions
7.8.2 Splines
7.8.3 GAMs
7.9 Exercises
8 Tree-Based Methods
8.1 The Basics of Decision Trees
8.1.1 Regression Trees
Predicting Baseball Players' Salaries Using Regression Trees
Prediction via Stratification of the Feature Space
Tree Pruning
8.1.2 Classification Trees
8.1.3 Trees Versus Linear Models
8.1.4 Advantages and Disadvantages of Trees
8.2 Bagging, Random Forests, Boosting
8.2.1 Bagging
Out-of-Bag Error Estimation
Variable Importance Measures
8.2.2 Random Forests
8.2.3 Boosting
8.3 Lab: Decision Trees
8.3.1 Fitting Classification Trees
8.3.2 Fitting Regression Trees
8.3.3 Bagging and Random Forests
8.3.4 Boosting
8.4 Exercises
9 Support Vector Machines
9.1 Maximal Margin Classifier
9.1.1 What Is a Hyperplane?
9.1.2 Classification Using a Separating Hyperplane
9.1.3 The Maximal Margin Classifier
9.1.4 Construction of the Maximal Margin Classifier
9.1.5 The Non-separable Case
9.2 Support Vector Classifiers
9.2.1 Overview of the Support Vector Classifier
9.2.2 Details of the Support Vector Classifier
9.3 Support Vector Machines
9.3.1 Classification with Non-linear Decision Boundaries
9.3.2 The Support Vector Machine
9.3.3 An Application to the Heart Disease Data
9.4 SVMs with More than Two Classes
9.4.1 One-Versus-One Classification
9.4.2 One-Versus-All Classification
9.5 Relationship to Logistic Regression
9.6 Lab: Support Vector Machines
9.6.1 Support Vector Classifier
9.6.2 Support Vector Machine
9.6.3 ROC Curves
9.6.4 SVM with Multiple Classes
9.6.5 Application to Gene Expression Data
9.7 Exercises
10 Unsupervised Learning
10.1 The Challenge of Unsupervised Learning
10.2 Principal Components Analysis
10.2.1 What Are Principal Components?
10.2.2 Another Interpretation of Principal Components
10.2.3 More on PCA
Scaling the Variables
Uniqueness of the Principal Components
The Proportion of Variance Explained
Deciding How Many Principal Components to Use
10.2.4 Other Uses for Principal Components
10.3 Clustering Methods
10.3.1 K-Means Clustering
10.3.2 Hierarchical Clustering
Interpreting a Dendrogram
The Hierarchical Clustering Algorithm
Choice of Dissimilarity Measure
10.3.3 Practical Issues in Clustering
Small Decisions with Big Consequences
Validating the Clusters Obtained
Other Considerations in Clustering
A Tempered Approach to Interpreting the Results of Clustering
10.4 Lab 1: Principal Components Analysis
10.5 Lab 2: Clustering
10.5.1 K-Means Clustering
10.5.2 Hierarchical Clustering
10.6 Lab 3: NCI60 Data Example
10.6.1 PCA on the NCI60 Data
10.6.2 Clustering the Observations of the NCI60 Data
10.7 Exercises
Back Matter
Index
Springer Texts in Statistics, Volume 103
Series Editors: G. Casella, S. Fienberg, I. Olkin
For further volumes: http://www.springer.com/series/417
Gareth James • Daniela Witten • Trevor Hastie • Robert Tibshirani

An Introduction to Statistical Learning
with Applications in R

Springer
Gareth James, Department of Information and Operations Management, University of Southern California, Los Angeles, CA, USA
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA, USA
Daniela Witten, Department of Biostatistics, University of Washington, Seattle, WA, USA
Robert Tibshirani, Department of Statistics, Stanford University, Stanford, CA, USA

ISSN 1431-875X
ISBN 978-1-4614-7137-0
ISBN 978-1-4614-7138-7 (eBook)
DOI 10.1007/978-1-4614-7138-7
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013936251

© Springer Science+Business Media New York 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
To our parents:
Alison and Michael James
Chiara Nappi and Edward Witten
Valerie and Patrick Hastie
Vera and Sami Tibshirani

and to our families:
Michael, Daniel, and Catherine
Ari
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl
Preface

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

With the explosion of "Big Data" problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area—The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman)—was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.
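As a taste of that hands-on style, here is a minimal sketch of what a lab session looks like. It is an illustration rather than an excerpt from the book: it assumes R and the companion ISLR package (available on CRAN) are installed, and it fits the simple linear regression of Chapter 3 to the Auto data.

```r
# A rough sketch of the lab style described above, not a lab from the book.
# Assumes: install.packages("ISLR") has been run.
library(ISLR)                              # datasets used throughout the labs

fit <- lm(mpg ~ horsepower, data = Auto)   # simple linear regression (Ch. 3)
summary(fit)                               # coefficients, std. errors, R-squared

plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon")
abline(fit, col = "red")                   # overlay the fitted regression line
```

The chapter labs develop this pattern in much more depth, covering diagnostics, multiple predictors, and interaction terms.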
This book is appropriate for advanced undergraduates or master's students in statistics or related quantitative fields, or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It's tough to make predictions, especially about the future.
-Yogi Berra

Los Angeles, USA    Gareth James
Seattle, USA    Daniela Witten
Palo Alto, USA    Trevor Hastie
Palo Alto, USA    Robert Tibshirani