Advanced Analytics with Spark: Patterns for Learning from Data at Scale (watermark-free PDF)

Published: 2022-06-14 | Posted by: admin | Category: Manuals | File size: 3.98M | Format: pdf | Pages: 276

Cover

Table of Contents

Foreword

Preface

What's in This Book

Using Code Examples

Safari® Books Online

How to Contact Us

Acknowledgments

1. Analyzing Big Data

The Challenges of Data Science

Introducing Apache Spark

About This Book

2. Introduction to Data Analysis with Scala and Spark

Scala for Data Scientists

The Spark Programming Model

Record Linkage

Getting Started: The Spark Shell and SparkContext

Bringing Data from the Cluster to the Client

Shipping Code from the Client to the Cluster

Structuring Data with Tuples and Case Classes

Aggregations

Creating Histograms

Summary Statistics for Continuous Variables

Creating Reusable Code for Computing Summary Statistics

Simple Variable Selection and Scoring

Where to Go from Here

3. Recommending Music and the Audioscrobbler Data Set

Data Set

The Alternating Least Squares Recommender Algorithm

Preparing the Data

Building a First Model

Spot Checking Recommendations

Evaluating Recommendation Quality

Computing AUC

Hyperparameter Selection

Making Recommendations

Where to Go from Here

4. Predicting Forest Cover with Decision Trees

Fast Forward to Regression

Vectors and Features

Training Examples

Decision Trees and Forests

Covtype Data Set

Preparing the Data

A First Decision Tree

Decision Tree Hyperparameters

Tuning Decision Trees

Categorical Features Revisited

Random Decision Forests

Making Predictions

Where to Go from Here

5. Anomaly Detection in Network Traffic with K-means Clustering

Anomaly Detection

K-means Clustering

Network Intrusion

KDD Cup 1999 Data Set

A First Take on Clustering

Choosing k

Visualization in R

Feature Normalization

Categorical Variables

Using Labels with Entropy

Clustering in Action

Where to Go from Here

6. Understanding Wikipedia with Latent Semantic Analysis

The Term-Document Matrix

Getting the Data

Parsing and Preparing the Data

Lemmatization

Computing the TF-IDFs

Singular Value Decomposition

Finding Important Concepts

Querying and Scoring with the Low-Dimensional Representation

Term-Term Relevance

Document-Document Relevance

Term-Document Relevance

Multiple-Term Queries

Where to Go from Here

7. Analyzing Co-occurrence Networks with GraphX

The MEDLINE Citation Index: A Network Analysis

Getting the Data

Parsing XML Documents with Scala's XML Library

Analyzing the MeSH Major Topics and Their Co-occurrences

Constructing a Co-occurrence Network with GraphX

Understanding the Structure of Networks

Connected Components

Degree Distribution

Filtering Out Noisy Edges

Processing EdgeTriplets

Analyzing the Filtered Graph

Small-World Networks

Cliques and Clustering Coefficients

Computing Average Path Length with Pregel

Where to Go from Here

8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data

Getting the Data

Working with Temporal and Geospatial Data in Spark

Temporal Data with JodaTime and NScalaTime

Geospatial Data with the Esri Geometry API and Spray

Exploring the Esri Geometry API

Intro to GeoJSON

Preparing the New York City Taxi Trip Data

Handling Invalid Records at Scale

Geospatial Analysis

Sessionization in Spark

Building Sessions: Secondary Sorts in Spark

Where to Go from Here

9. Estimating Financial Risk through Monte Carlo Simulation

Terminology

Methods for Calculating VaR

Variance-Covariance

Historical Simulation

Monte Carlo Simulation

Our Model

Getting the Data

Preprocessing

Determining the Factor Weights

Sampling

The Multivariate Normal Distribution

Running the Trials

Visualizing the Distribution of Returns

Evaluating Our Results

Where to Go from Here

10. Analyzing Genomics Data and the BDG Project

Decoupling Storage from Modeling

Ingesting Genomics Data with the ADAM CLI

Parquet Format and Columnar Storage

Predicting Transcription Factor Binding Sites from ENCODE Data

Querying Genotypes from the 1000 Genomes Project

Where to Go from Here

11. Analyzing Neuroimaging Data with PySpark and Thunder

Overview of PySpark

PySpark Internals

Overview and Installation of the Thunder Library

Loading Data with Thunder

Thunder Core Data Types

Categorizing Neuron Types with Thunder

Where to Go from Here

Appendix A. Deeper into Spark

Serialization

Accumulators

Spark and the Data Scientist's Workflow

File Formats

Spark Subprojects

MLlib

Spark Streaming

Spark SQL

GraphX

Appendix B. Upcoming MLlib Pipelines API

Beyond Mere Modeling

The Pipelines API

Text Classification Example Walkthrough

Index

About the Authors

Advanced Analytics with Spark

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection, among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Patterns include:

■ Recommending music and the Audioscrobbler data set
■ Predicting forest cover with decision trees
■ Anomaly detection in network traffic with K-means clustering
■ Understanding Wikipedia with Latent Semantic Analysis
■ Analyzing co-occurrence networks with GraphX
■ Geospatial and temporal data analysis on the New York City Taxi Trips data
■ Estimating financial risk through Monte Carlo simulation
■ Analyzing genomics data and the BDG project
■ Analyzing neuroimaging data with PySpark and Thunder

Sandy Ryza is a Senior Data Scientist at Cloudera and active contributor to the Apache Spark project.

Uri Laserson is a Senior Data Scientist at Cloudera, where he focuses on Python in the Hadoop ecosystem.

Sean Owen is Director of Data Science for EMEA at Cloudera, and a committer for Apache Spark.

Josh Wills is Senior Director of Data Science at Cloudera and founder of the Apache Crunch project.
DATA/SPARK
US $49.99 CAN $57.99
ISBN: 978-1-491-91276-8
Twitter: @oreillymedia
facebook.com/oreilly

Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills


Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Kim Cofer
Proofreader: Rachel Monaghan
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

April 2015: First Edition

Revision History for the First Edition
2015-03-27: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491912768 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with Spark, the cover image of a peregrine falcon, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91276-8
[LSI]

