Learning PySpark

Cover
Copyright
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Understanding Spark
What is Apache Spark?
Spark Jobs and APIs
Execution process
Resilient Distributed Dataset
DataFrames
Datasets
Catalyst Optimizer
Project Tungsten
Spark 2.0 architecture
Unifying Datasets and DataFrames
Introducing SparkSession
Tungsten phase 2
Structured streaming
Continuous applications
Summary
Chapter 2: Resilient Distributed Datasets
Internal workings of an RDD
Creating RDDs
Schema
Reading from files
Lambda expressions
Global versus local scope
Transformations
The .map(...) transformation
The .filter(...) transformation
The .flatMap(...) transformation
The .distinct(...) transformation
The .sample(...) transformation
The .leftOuterJoin(...) transformation
The .repartition(...) transformation
Actions
The .take(...) method
The .collect(...) method
The .reduce(...) method
The .count(...) method
The .saveAsTextFile(...) method
The .foreach(...) method
Summary
Chapter 3: DataFrames
Python to RDD communications
Catalyst Optimizer refresh
Speeding up PySpark with DataFrames
Creating DataFrames
Generating our own JSON data
Creating a DataFrame
Creating a temporary table
Simple DataFrame queries
DataFrame API query
SQL query
Interoperating with RDDs
Inferring the schema using reflection
Programmatically specifying the schema
Querying with the DataFrame API
Number of rows
Running filter statements
Querying with SQL
Number of rows
Running filter statements using the where clauses
DataFrame scenario – on-time flight performance
Preparing the source datasets
Joining flight performance and airports
Visualizing our flight-performance data
Spark Dataset API
Summary
Chapter 4: Prepare Data for Modeling
Checking for duplicates, missing observations, and outliers
Duplicates
Missing observations
Outliers
Getting familiar with your data
Descriptive statistics
Correlations
Visualization
Histograms
Interactions between features
Summary
Chapter 5: Introducing MLlib
Overview of the package
Loading and transforming the data
Getting to know your data
Descriptive statistics
Correlations
Statistical testing
Creating the final dataset
Creating an RDD of LabeledPoints
Splitting into training and testing
Predicting infant survival
Logistic regression in MLlib
Selecting only the most predictable features
Random forest in MLlib
Summary
Chapter 6: Introducing the ML Package
Overview of the package
Transformer
Estimators
Classification
Regression
Clustering
Pipeline
Predicting the chances of infant survival with ML
Loading the data
Creating transformers
Creating an estimator
Creating a pipeline
Fitting the model
Evaluating the performance of the model
Saving the model
Parameter hyper-tuning
Grid search
Train-validation splitting
Other features of PySpark ML in action
Feature extraction
NLP-related feature extractors
Discretizing continuous variables
Standardizing continuous variables
Classification
Clustering
Finding clusters in the births dataset
Topic mining
Regression
Summary
Chapter 7: GraphFrames
Introducing GraphFrames
Installing GraphFrames
Creating a library
Preparing your flights dataset
Building the graph
Executing simple queries
Determining the number of airports and trips
Determining the longest delay in this dataset
Determining the number of delayed versus on-time/early flights
What flights departing Seattle are most likely to have significant delays?
What states tend to have significant delays departing from Seattle?
Understanding vertex degrees
Determining the top transfer airports
Understanding motifs
Determining airport ranking using PageRank
Determining the most popular non-stop flights
Using Breadth-First Search
Visualizing flights using D3
Summary
Chapter 8: TensorFrames
What is Deep Learning?
The need for neural networks and Deep Learning
What is feature engineering?
Bridging the data and algorithm
What is TensorFlow?
Installing Pip
Installing TensorFlow
Matrix multiplication using constants
Matrix multiplication using placeholders
Running the model
Running another model
Discussion
Introducing TensorFrames
TensorFrames – quick start
Configuration and setup
Launching a Spark cluster
Creating a TensorFrames library
Installing TensorFlow on your cluster
Using TensorFlow to add a constant to an existing column
Executing the Tensor graph
Blockwise reducing operations example
Building a DataFrame of vectors
Analysing the DataFrame
Computing elementwise sum and min of all vectors
Summary
Chapter 9: Polyglot Persistence with Blaze
Installing Blaze
Polyglot persistence
Abstracting data
Working with NumPy arrays
Working with pandas' DataFrame
Working with files
Working with databases
Interacting with relational databases
Interacting with the MongoDB database
Data operations
Accessing columns
Symbolic transformations
Operations on columns
Reducing data
Joins
Summary
Chapter 10: Structured Streaming
What is Spark Streaming?
Why do we need Spark Streaming?
What is the Spark Streaming application data flow?
Simple streaming application using DStreams
A quick primer on global aggregations
Introducing Structured Streaming
Summary
Chapter 11: Packaging Spark Applications
The spark-submit command
Command line parameters
Deploying the app programmatically
Configuring your SparkSession
Creating SparkSession
Modularizing code
Structure of the module
Calculating the distance between two points
Converting distance units
Building an egg
User defined functions in Spark
Submitting a job
Monitoring execution
Databricks Jobs
Summary
Index
Learning PySpark

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

Tomasz Drabas
Denny Lee

BIRMINGHAM - MUMBAI
Learning PySpark

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2017
Production reference: 1220217

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78646-370-8

www.packtpub.com
Credits

Authors: Tomasz Drabas, Denny Lee
Reviewer: Holden Karau
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Prachi Bisht
Content Development Editor: Amrita Noronha
Technical Editor: Akash Patel
Copy Editor: Safis Editing
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Disha Haria
Production Coordinator: Aparna Bhagat
Cover Work: Aparna Bhagat
Foreword

Thank you for choosing this book to start your PySpark adventures; I hope you are as excited as I am. When Denny Lee first told me about this new book, I was delighted: one of the most important things that makes Apache Spark such a wonderful platform is its support for both the Java/Scala/JVM world and the Python (and, more recently, R) world. Many of the previous books on Spark have either covered all of the core languages or focused primarily on the JVM languages, so it's great to see PySpark get its chance to shine with a dedicated book from such experienced Spark educators. By supporting both of these different worlds, we are able to work together more effectively as data scientists and data engineers, while stealing the best ideas from each other's communities.

It has been a privilege to have the opportunity to review early versions of this book, which has only increased my excitement for the project. I've had the privilege of being at some of the same conferences and meetups, watching the authors introduce new concepts in the world of Spark to a variety of audiences (from first-timers to old hands), and they've done a great job distilling their experience for this book. The experience of the authors shines through in everything from their explanations to the topics covered. Beyond simply introducing PySpark, they have also taken the time to look at up-and-coming packages from the community, such as GraphFrames and TensorFrames.

I think the community is one of those often-overlooked factors when deciding what tools to use, and Python has a great community; I'm looking forward to you joining the Python Spark community. So, enjoy your adventure; I know you are in good hands with Denny Lee and Tomek Drabas. I truly believe that by having a diverse community of Spark users we will be able to make better tools that are useful for everyone, so I hope to see you around at one of the conferences, meetups, or mailing lists soon :)

Holden Karau

P.S. I owe Denny a beer; if you want to buy him a Bud Light lime (or lime-a-rita) for me, I'd be much obliged (although he might not be quite as amused as I am).
About the Authors

Tomasz Drabas is a Data Scientist working for Microsoft, currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science across numerous fields (advanced technology, airlines, telecommunications, finance, and consulting), gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research, with a focus on choice modeling and revenue management applications in the airline industry. At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark. Tomasz also authored the Practical Data Analysis Cookbook, published by Packt Publishing in 2016.

I would like to thank my family: Rachel, Skye, and Albert. You are the love of my life and I cherish every day I spend with you! Thank you for always standing by me and for encouraging me to push my career goals further and further. Also, thanks to my family and my in-laws for putting up with me (in general). So many more people have influenced me over the years that I would have to write another book to thank them all. You know who you are, and I want to thank you from the bottom of my heart! However, I would not have gotten through my PhD if it were not for Czesia Wieruszewska. Czesiu, thank you for your help, without which I would not have begun my journey to the Antipodes. Along with Krzys Krzysztoszek, you have always believed in me. Thank you!
Denny Lee is a Principal Program Manager at Microsoft on the Azure DocumentDB team, Microsoft's blazing-fast, planet-scale managed document store service. He is a hands-on distributed systems and data science engineer with more than 18 years of experience developing Internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He has extensive experience building greenfield teams as well as acting as a turnaround/change catalyst.

Prior to joining the Azure DocumentDB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since version 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft's Hadoop on Windows and Azure service (currently known as HDInsight). Denny also holds a Masters in Biomedical Informatics from Oregon Health and Science University and has architected and implemented powerful data solutions for enterprise healthcare customers over the last 15 years.

I would like to thank my wonderful spouse, Hua-Ping, and my awesome daughters, Isabella and Samantha. You are the ones who keep me grounded and help me reach for the stars!