Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Cover
Copyright
Table of Contents
Foreword
Preface
What's in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book
2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
Structuring Data with Tuples and Case Classes
Aggregations
Creating Histograms
Summary Statistics for Continuous Variables
Creating Reusable Code for Computing Summary Statistics
Simple Variable Selection and Scoring
Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization in R
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
6. Understanding Wikipedia with Latent Semantic Analysis
The Term-Document Matrix
Getting the Data
Parsing and Preparing the Data
Lemmatization
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with the Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Term-Document Relevance
Multiple-Term Queries
Where to Go from Here
7. Analyzing Co-occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala's XML Library
Analyzing the MeSH Major Topics and Their Co-occurrences
Constructing a Co-occurrence Network with GraphX
Understanding the Structure of Networks
Connected Components
Degree Distribution
Filtering Out Noisy Edges
Processing EdgeTriplets
Analyzing the Filtered Graph
Small-World Networks
Cliques and Clustering Coefficients
Computing Average Path Length with Pregel
Where to Go from Here
8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
Getting the Data
Working with Temporal and Geospatial Data in Spark
Temporal Data with JodaTime and NScalaTime
Geospatial Data with the Esri Geometry API and Spray
Exploring the Esri Geometry API
Intro to GeoJSON
Preparing the New York City Taxi Trip Data
Handling Invalid Records at Scale
Geospatial Analysis
Sessionization in Spark
Building Sessions: Secondary Sorts in Spark
Where to Go from Here
9. Estimating Financial Risk through Monte Carlo Simulation
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preprocessing
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here
10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Parquet Format and Columnar Storage
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here
11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
PySpark Internals
Overview and Installation of the Thunder Library
Loading Data with Thunder
Thunder Core Data Types
Categorizing Neuron Types with Thunder
Where to Go from Here
Appendix A. Deeper into Spark
Serialization
Accumulators
Spark and the Data Scientist's Workflow
File Formats
Spark Subprojects
MLlib
Spark Streaming
Spark SQL
GraphX
Appendix B. Upcoming MLlib Pipelines API
Beyond Mere Modeling
The Pipelines API
Text Classification Example Walkthrough
Index
About the Authors
Advanced Analytics with Spark

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques (classification, collaborative filtering, and anomaly detection, among others) to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Patterns include:
■ Recommending music and the Audioscrobbler data set
■ Predicting forest cover with decision trees
■ Anomaly detection in network traffic with K-means clustering
■ Understanding Wikipedia with Latent Semantic Analysis
■ Analyzing co-occurrence networks with GraphX
■ Geospatial and temporal data analysis on the New York City Taxi Trips data
■ Estimating financial risk through Monte Carlo simulation
■ Analyzing genomics data and the BDG project
■ Analyzing neuroimaging data with PySpark and Thunder

Sandy Ryza is a Senior Data Scientist at Cloudera and active contributor to the Apache Spark project.
Uri Laserson is a Senior Data Scientist at Cloudera, where he focuses on Python in the Hadoop ecosystem.
Sean Owen is Director of Data Science for EMEA at Cloudera, and a committer for Apache Spark.
Josh Wills is Senior Director of Data Science at Cloudera and founder of the Apache Crunch project.
DATA/SPARK
US $49.99 CAN $57.99
ISBN: 978-1-491-91276-8
Twitter: @oreillymedia
facebook.com/oreilly

Advanced Analytics with Spark
PATTERNS FOR LEARNING FROM DATA AT SCALE
Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills
Advanced Analytics with Spark Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Kim Cofer
Proofreader: Rachel Monaghan
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

April 2015: First Edition

Revision History for the First Edition
2015-03-27: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491912768 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with Spark, the cover image of a peregrine falcon, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91276-8 [LSI]