Mastering Machine Learning with Spark 2.X

Cover

About the Author

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1 Introduction to Large-Scale Machine Learning and Spark

Data science

The sexiest role of the 21st century – data scientist?

A day in the life of a data scientist

Working with big data

The machine learning algorithm using a distributed environment

Splitting of data into multiple machines

From Hadoop MapReduce to Spark

What is Databricks?

Inside the box

Introducing H2O.ai

Design of Sparkling Water

What's the difference between H2O and Spark's MLlib?

Data munging

Data science - an iterative process

Summary

2 Detecting Dark Matter - The Higgs-Boson Particle

Type I versus type II error

Finding the Higgs-Boson particle

The LHC and data creation

The theory behind the Higgs-Boson

Measuring for the Higgs-Boson

The dataset

Spark start and data load

Labeled point vector

Data caching

Creating a training and testing set

What about cross-validation?

Our first model – decision tree

Gini versus Entropy

Next model – tree ensembles

Random forest model

Grid search

Gradient boosting machine

Last model - H2O deep learning

Build a 3-layer DNN

Adding more layers

Building models and inspecting results

Summary

3 Ensemble Methods for Multi-Class Classification

Data

Modeling goal

Challenges

Machine learning workflow

Starting Spark shell

Exploring data

Missing data

Summary of missing value analysis

Data unification

Missing values

Categorical values

Final transformation

Modelling data with Random Forest

Building a classification model using Spark RandomForest

Classification model evaluation

Spark model metrics

Building a classification model using H2O RandomForest

Summary

4 Predicting Movie Reviews Using NLP and Spark Streaming

NLP - a brief primer

The dataset

Dataset preparation

Feature extraction

Feature extraction method – bag-of-words model

Text tokenization

Declaring our stopwords list

Stemming and lemmatization

Featurization - feature hashing

Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme

Let's do some (model) training!

Spark decision tree model

Spark Naive Bayes model

Spark random forest model

Spark GBM model

Super-learner model

Super learner

Composing all transformations together

Using the super-learner model

Summary

5 Word2vec for Prediction and Clustering

Motivation of word vectors

Word2vec explained

What is a word vector?

The CBOW model

The skip-gram model

Fun with word vectors

Cosine similarity

Doc2vec explained

The distributed-memory model

The distributed bag-of-words model

Applying word2vec and exploring our data with vectors

Creating document vectors

Supervised learning task

Summary

6 Extracting Patterns from Clickstream Data

Frequent pattern mining

Pattern mining terminology

Frequent pattern mining problem

The association rule mining problem

The sequential pattern mining problem

Pattern mining with Spark MLlib

Frequent pattern mining with FP-growth

Association rule mining

Sequential pattern mining with prefix span

Pattern mining on MSNBC clickstream data

Deploying a pattern mining application

The Spark Streaming module

Summary

7 Graph Analytics with GraphX

Basic graph theory

Graphs

Directed and undirected graphs

Order and degree

Directed acyclic graphs

Connected components

Trees

Multigraphs

Property graphs

GraphX distributed graph processing engine

Graph representation in GraphX

Graph properties and operations

Building and loading graphs

Visualizing graphs with Gephi

Gephi

Creating GEXF files from GraphX graphs

Advanced graph processing

Aggregating messages

Pregel

GraphFrames

Graph algorithms and applications

Clustering

Vertex importance

GraphX in context

Summary

8 Lending Club Loan Prediction

Motivation

Goal

Data

Data dictionary

Preparation of the environment

Data load

Exploration – data analysis

Basic clean up

Useless columns

String columns

Loan progress columns

Categorical columns

Text columns

Missing data

Prediction targets

Loan status model

Base model

The emp_title column transformation

The desc column transformation

Interest rate model

Using models for scoring

Model deployment

Stream creation

Stream transformation

Stream output

Summary

Mastering Machine Learning with Spark 2.x

Create scalable machine learning applications to power a modern data-driven business using Spark

Alex Tellez, Max Pumperla, Michal Malohlava

BIRMINGHAM - MUMBAI

Mastering Machine Learning with Spark 2.x

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2017

Production reference: 1290817

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78528-345-1

www.packtpub.com

Credits

Authors: Alex Tellez, Max Pumperla, Michal Malohlava
Reviewer: Dipanjan Deb
Commissioning Editor: Veena Pagare
Acquisition Editor: Larissa Pinto
Content Development Editor: Nikhil Borkar
Technical Editor: Diwakar Shukla
Copy Editor: Muktikant Garimella
Project Coordinator: Ulhas Kambali
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jason Monteiro
Production Coordinator: Melwyn Dsa

About the Authors

Alex Tellez is a life-long data hacker/enthusiast with a passion for data science and its application to business problems. He has a wealth of experience working across multiple industries, including banking, health care, online dating, human resources, and online gaming. Alex has also given multiple talks at various AI/machine learning conferences, in addition to lectures at universities about neural networks. When he’s not neck-deep in a textbook, Alex enjoys spending time with family, riding bikes, and utilizing machine learning to feed his French wine curiosity!

First and foremost, I’d like to thank my co-author, Michal, for helping me write this book. As fellow ML enthusiasts, cyclists, runners, and fathers, we both developed a deeper understanding of each other through this endeavor, which has taken well over one year to create. Simply put, this book would not have been possible without Michal’s support and encouragement.

Next, I’d like to thank my mom, dad, and elder brother, Andres, who have been there every step of the way from day 1 until now. Without question, my elder brother continues to be my hero and is someone that I will forever look up to as being a guiding light. Of course, no acknowledgements would be finished without giving thanks to my beautiful wife, Denise, and daughter, Miya, who have provided the love and support to continue the writing of this book during nights and weekends. I cannot emphasize enough how much you both mean to me and how you guys are the inspiration and motivation that keeps this engine running. To my daughter, Miya, my hope is that you can pick this book up and one day realize that your old man isn’t quite as silly as I appear to let on.

Last but not least, I’d also like to give thanks to you, the reader, for your interest in this exciting field using this incredible technology. Whether you are a seasoned ML expert, or a newcomer to the field looking to gain a foothold, you have come to the right book and my hope is that you get as much out of this as Michal and I did in writing this work.

Max Pumperla is a data scientist and engineer specializing in deep learning and its applications. He currently works as a deep learning engineer at Skymind and is a co-founder of aetros.com. Max is the author and maintainer of several Python packages, including elephas, a distributed deep learning library using Spark. His open source
