Mastering Machine Learning with Spark 2.x
Cover
Copyright
About the Author
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1 Introduction to Large-Scale Machine Learning and Spark
Data science
The sexiest role of the 21st century – data scientist?
A day in the life of a data scientist
Working with big data
The machine learning algorithm using a distributed environment
Splitting of data into multiple machines
From Hadoop MapReduce to Spark
What is Databricks?
Inside the box
Introducing H2O.ai
Design of Sparkling Water
What's the difference between H2O and Spark's MLlib?
Data munging
Data science - an iterative process
Summary
2 Detecting Dark Matter - The Higgs-Boson Particle
Type I versus type II error
Finding the Higgs-Boson particle
The LHC and data creation
The theory behind the Higgs-Boson
Measuring for the Higgs-Boson
The dataset
Spark start and data load
Labeled point vector
Data caching
Creating a training and testing set
What about cross-validation?
Our first model – decision tree
Gini versus Entropy
Next model – tree ensembles
Random forest model
Grid search
Gradient boosting machine
Last model - H2O deep learning
Build a 3-layer DNN
Adding more layers
Building models and inspecting results
Summary
3 Ensemble Methods for Multi-Class Classification
Data
Modeling goal
Challenges
Machine learning workflow
Starting Spark shell
Exploring data
Missing data
Summary of missing value analysis
Data unification
Missing values
Categorical values
Final transformation
Modelling data with Random Forest
Building a classification model using Spark RandomForest
Classification model evaluation
Spark model metrics
Building a classification model using H2O RandomForest
Summary
4 Predicting Movie Reviews Using NLP and Spark Streaming
NLP - a brief primer
The dataset
Dataset preparation
Feature extraction
Feature extraction method – bag-of-words model
Text tokenization
Declaring our stopwords list
Stemming and lemmatization
Featurization - feature hashing
Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme
Let's do some (model) training!
Spark decision tree model
Spark Naive Bayes model
Spark random forest model
Spark GBM model
Super-learner model
Super learner
Composing all transformations together
Using the super-learner model
Summary
5 Word2vec for Prediction and Clustering
Motivation of word vectors
Word2vec explained
What is a word vector?
The CBOW model
The skip-gram model
Fun with word vectors
Cosine similarity
Doc2vec explained
The distributed-memory model
The distributed bag-of-words model
Applying word2vec and exploring our data with vectors
Creating document vectors
Supervised learning task
Summary
6 Extracting Patterns from Clickstream Data
Frequent pattern mining
Pattern mining terminology
Frequent pattern mining problem
The association rule mining problem
The sequential pattern mining problem
Pattern mining with Spark MLlib
Frequent pattern mining with FP-growth
Association rule mining
Sequential pattern mining with prefix span
Pattern mining on MSNBC clickstream data
Deploying a pattern mining application
The Spark Streaming module
Summary
7 Graph Analytics with GraphX
Basic graph theory
Graphs
Directed and undirected graphs
Order and degree
Directed acyclic graphs
Connected components
Trees
Multigraphs
Property graphs
GraphX distributed graph processing engine
Graph representation in GraphX
Graph properties and operations
Building and loading graphs
Visualizing graphs with Gephi
Gephi
Creating GEXF files from GraphX graphs
Advanced graph processing
Aggregating messages
Pregel
GraphFrames
Graph algorithms and applications
Clustering
Vertex importance
GraphX in context
Summary
8 Lending Club Loan Prediction
Motivation
Goal
Data
Data dictionary
Preparation of the environment
Data load
Exploration – data analysis
Basic clean up
Useless columns
String columns
Loan progress columns
Categorical columns
Text columns
Missing data
Prediction targets
Loan status model
Base model
The emp_title column transformation
The desc column transformation
Interest rate model
Using models for scoring
Model deployment
Stream creation
Stream transformation
Stream output
Summary
Mastering Machine Learning with Spark 2.x

Create scalable machine learning applications to power a modern data-driven business using Spark

Alex Tellez
Max Pumperla
Michal Malohlava

BIRMINGHAM - MUMBAI
Mastering Machine Learning with Spark 2.x

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2017

Production reference: 1290817

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78528-345-1

www.packtpub.com
Credits

Authors: Alex Tellez, Max Pumperla, Michal Malohlava
Reviewer: Dipanjan Deb
Commissioning Editor: Veena Pagare
Acquisition Editor: Larissa Pinto
Content Development Editor: Nikhil Borkar
Technical Editor: Diwakar Shukla
Copy Editor: Muktikant Garimella
Project Coordinator: Ulhas Kambali
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jason Monteiro
Production Coordinator: Melwyn Dsa
About the Authors

Alex Tellez is a life-long data hacker/enthusiast with a passion for data science and its application to business problems. He has a wealth of experience working across multiple industries, including banking, health care, online dating, human resources, and online gaming. Alex has also given multiple talks at various AI/machine learning conferences, in addition to lectures at universities about neural networks. When he’s not neck-deep in a textbook, Alex enjoys spending time with family, riding bikes, and utilizing machine learning to feed his French wine curiosity!

First and foremost, I’d like to thank my co-author, Michal, for helping me write this book. As fellow ML enthusiasts, cyclists, runners, and fathers, we both developed a deeper understanding of each other through this endeavor, which has taken well over one year to create. Simply put, this book would not have been possible without Michal’s support and encouragement.

Next, I’d like to thank my mom, dad, and elder brother, Andres, who have been there every step of the way from day 1 until now. Without question, my elder brother continues to be my hero and is someone that I will forever look up to as being a guiding light. Of course, no acknowledgements would be finished without giving thanks to my beautiful wife, Denise, and daughter, Miya, who have provided the love and support to continue the writing of this book during nights and weekends. I cannot emphasize enough how much you both mean to me and how you guys are the inspiration and motivation that keeps this engine running. To my daughter, Miya, my hope is that you can pick this book up and one day realize that your old man isn’t quite as silly as I appear to let on.

Last but not least, I’d also like to give thanks to you, the reader, for your interest in this exciting field using this incredible technology. Whether you are a seasoned ML expert, or a newcomer to the field looking to gain a foothold, you have come to the right book and my hope is that you get as much out of this as Michal and I did in writing this work.

Max Pumperla is a data scientist and engineer specializing in deep learning and its applications. He currently works as a deep learning engineer at Skymind and is a co-founder of aetros.com. Max is the author and maintainer of several Python packages, including elephas, a distributed deep learning library using Spark. His open source