About This E-Book
Title Page
Copyright Page
Contents at a Glance
Table of Contents
Preface
Why Should I Learn Spark?
How This Book Is Organized
Data Used in the Exercises
Conventions Used in This Book
About the Author
Dedication
Acknowledgments
We Want to Hear from You
Reader Services
Part I: Getting Started with Apache Spark
Hour 1. Introducing Apache Spark
What Is Spark?
Spark and Hadoop
Spark as an Abstraction
Spark Is Fast, Efficient, and Scalable
What Sort of Applications Use Spark?
Programming Interfaces to Spark
Ways to Use Spark
Interactive Use
Non-interactive Use
Input/Output Types
Summary
Q&A
Workshop
Quiz
Answers
Hour 2. Understanding Hadoop
Hadoop and a Brief History of Big Data
Hadoop Explained
Introducing HDFS
HDFS Overview
HDFS Architecture
Introducing YARN
What Is YARN?
Running an Application on YARN
Other Resource Managers
Anatomy of a Hadoop Cluster
How Spark Works with Hadoop
HDFS as a Data Source for Spark
YARN as a Resource Scheduler for Spark
Summary
Q&A
Workshop
Quiz
Answers
Hour 3. Installing Spark
Spark Deployment Modes
Preparing to Install Spark
Installing Spark in Standalone Mode
Getting Spark
Installing a Multi-node Spark Standalone Cluster
Exploring the Spark Install
Deploying Spark on Hadoop
Using a Management Console or Interface
Installing Manually
Summary
Q&A
Workshop
Quiz
Answers
Exercises
Hour 4. Understanding the Spark Application Architecture
Anatomy of a Spark Application
Spark Driver
The Spark Context
Application Planning
Application Scheduling
Other Driver Functions
Spark Executors and Workers
Spark Master and Cluster Manager
Spark Master
Cluster Manager
Spark Applications Running on YARN
ResourceManager as the Cluster Manager
ApplicationMaster as the Spark Master
yarn-cluster Mode
yarn-client Mode
Log File Management with Spark on YARN
Local Mode
Summary
Q&A
Workshop
Quiz
Answers
Hour 5. Deploying Spark in the Cloud
Amazon Web Services Primer
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
AWS Pricing and Getting Started
Spark on EC2
Spark on EMR
Hosted Spark with Databricks
Summary
Q&A
Workshop
Quiz
Answers
Part II: Programming with Apache Spark
Hour 6. Learning the Basics of Spark Programming with RDDs
Introduction to RDDs
Loading Data into RDDs
Creating an RDD from a File or Files
Creating an RDD from a Data Source
Creating an RDD Programmatically
Operations on RDDs
Coarse-Grained versus Fine-Grained Transformations
Transformations, Actions, and Lazy Evaluation
RDD Persistence and Re-use
RDD Lineage
Fault Tolerance with RDDs
Types of RDDs
Summary
Q&A
Workshop
Quiz
Answers
Hour 7. Understanding MapReduce Concepts
MapReduce History and Background
The Motivation for MapReduce
The Design Goals for MapReduce
Records and Key Value Pairs
Key Value Pairs and Records
MapReduce Explained
Map Phase
Partitioning Function
Shuffle
Reduce Phase
Fault Tolerance
Combiner Functions
Asymmetry and Speculative Execution
Map-only MapReduce Applications
An Election Analogy for MapReduce
Word Count: The “Hello, World” of MapReduce
Why Count Words?
How It Works
Map and Reduce Functions in Spark
Summary
Q&A
Workshop
Quiz
Answers
Hour 8. Getting Started with Scala
Scala History and Background
Scala Beginnings
Scala Basics
Scala’s Compile Time and Run Time Architecture
Variables and Primitives in Scala
Data Structures in Scala
Control Structures in Scala
Object-Oriented Programming in Scala
Classes and Inheritance
Mixin Composition
Singleton Objects
Polymorphism
Functional Programming in Scala
First-class Functions
Anonymous Functions
Higher-order Functions
Closures
Currying
Lazy Evaluation
Immutable Data Structures
Spark Programming in Scala
Summary
Q&A
Workshop
Quiz
Answers
Hour 9. Functional Programming with Python
Python Overview
Python Background
Python Runtime Architecture
Data Structures and Serialization in Python
Lists
Sets
Tuples
Dictionaries
Python Object Serialization
Python Functional Programming Basics
Anonymous Functions and lambda
Higher-order Functions
Tail Calls
Short-circuiting
Parallelization
Closures in Python
Interactive Programming Using IPython
IPython History and Background
Using IPython with Spark
Jupyter, the IPython Notebook
Summary
Q&A
Workshop
Quiz
Answers
Hour 10. Working with the Spark API (Transformations and Actions)
RDDs and Data Sampling
RDD Refresher
Data Sampling with Spark
Spark Transformations
Functional Transformations
Grouping, Sorting, and Distinct Functions
Set Operations
Spark Actions
The count Action
The collect, take, top, and first Actions
The reduce and fold Actions
The foreach Action
Key Value Pair Operations
Key Value Pair RDD Dictionary Functions
Functional Key Value Pair RDD Transformations
Grouping, Aggregation, Sorting, and Set Operations
Join Functions
Join Types
Join Transformations
Numerical RDD Operations
min()
max()
mean()
sum()
stdev()
variance()
stats()
Summary
Q&A
Workshop
Quiz
Answers
Hour 11. Using RDDs: Caching, Persistence, and Output
RDD Storage Levels
RDD Lineage Revisited
RDD Storage Levels
Caching, Persistence, and Checkpointing
Caching RDDs
Persisting RDDs
Choosing When to Persist or Cache RDDs
Checkpointing RDDs
Saving RDD Output
External Storage Systems
Storage Formats
Introduction to Alluxio (Tachyon)
Alluxio Background
Alluxio Architecture
Alluxio as a Filesystem
Alluxio for Off Heap RDD Persistence
Other Alluxio Features and Usages
Summary
Q&A
Workshop
Quiz
Answers
Hour 12. Advanced Spark Programming
Broadcast Variables
Broadcast Variable Creation and Usage
Advantages of Broadcast Variables
Accumulators
Using Accumulators
Custom Accumulators
Uses for Accumulators
Partitioning and Repartitioning
Partitioning Overview
Controlling Partitions
Repartitioning Functions
Partition-specific API Methods
Processing RDDs with External Programs
pipe()
Summary
Q&A
Workshop
Quiz
Answers
Part III: Extensions to Spark
Hour 13. Using SQL with Spark
Introduction to Spark SQL
Background
Hive Overview
SQL on Hadoop
Spark SQL Architecture
HiveContext and SQLContext
Getting Started with Spark SQL DataFrames
Creating a DataFrame from an Existing RDD
Creating a DataFrame from a Hive Table
Creating a DataFrame from JSON Objects
Creating DataFrames from Files Using the DataFrameReader
Converting DataFrames to RDDs
DataFrame Data Model
DataFrame Schemas
Using Spark SQL DataFrames
DataFrame Metadata Operations
Basic DataFrame Operations
DataFrame Built-in Functions and UDFs
DataFrame Set Operations
Caching, Persisting, and Repartitioning DataFrames
Saving DataFrame Output Using the DataFrameWriter
Accessing Spark SQL
Accessing Spark SQL Using the spark-sql Shell
Running the Thrift JDBC/ODBC Server
Summary
Q&A
Workshop
Quiz
Answers
Hour 14. Stream Processing with Spark
Introduction to Spark Streaming
Streaming, Spark Style
Spark Streaming Architecture
The StreamingContext
Using DStreams
DStream Sources
DStream Transformations
DStream Output Operations
State Operations
updateStateByKey()
Sliding Window Operations
window()
reduceByKeyAndWindow()
Summary
Q&A
Workshop
Quiz
Answers
Hour 15. Getting Started with Spark and R
Introduction to R
Getting Started with the R Language
Introducing SparkR
The SparkR Shell
Creating Data Frames in SparkR
Using SparkR
Building Predictive Models with SparkR
Using SparkR with RStudio
Summary
Q&A
Workshop
Quiz
Answers
Hour 16. Machine Learning with Spark
Introduction to Machine Learning and MLlib
Machine Learning Primer
Machine Learning with Spark
Classification Using Spark MLlib
Decision Trees
Naive Bayes
Collaborative Filtering Using Spark MLlib
Clustering Using Spark MLlib
k-means Clustering
Summary
Q&A
Workshop
Quiz
Answers
Hour 17. Introducing Sparkling Water (H2O and Spark)
Introduction to H2O
H2O Deep Learning
H2O Flow
H2O Architecture
Running H2O on Hadoop
Sparkling Water—H2O on Spark
Sparkling Water Architecture
Summary
Q&A
Workshop
Quiz
Answers
Hour 18. Graph Processing with Spark
Introduction to Graphs
Graph Processing in Spark
Google, Pregel, and PageRank
GraphX: Spark’s Graph Processing System
Introduction to GraphFrames
Accessing the GraphFrames Library
Creating a GraphFrame
GraphFrame Operations
Using Graphing Algorithms with GraphFrames
Summary
Q&A
Workshop
Quiz
Answers
Hour 19. Using Spark with NoSQL Systems
Introduction to NoSQL
Bigtable: The Beginnings of the NoSQL Movement
NoSQL System Characteristics
Types of NoSQL Systems
Using Spark with HBase
HBase Data Model and Shell
Data Distribution in HBase
HBase and Spark
Using Spark with Cassandra
Cassandra Data Model
Cassandra Query Language (CQL)
Accessing Cassandra Using Spark
Using Spark with DynamoDB and More
Amazon DynamoDB
Other NoSQL Implementations
The Future for NoSQL
Summary
Q&A
Workshop
Quiz
Answers
Hour 20. Using Spark with Messaging Systems
Overview of Messaging Systems
Pub-Sub Messaging Exchange Pattern
Using Spark with Apache Kafka
Kafka Overview
Spark and Kafka
Spark, MQTT, and the Internet of Things
MQTT Overview
Using Spark with MQTT
Using Spark with Amazon Kinesis
Kinesis Streams
Using Spark with Kinesis
Summary
Q&A
Workshop
Quiz
Answers
Part IV: Managing Spark
Hour 21. Administering Spark
Spark Configuration
Spark Environment Variables
Spark Configuration
Administering Spark Standalone
Spark Standalone Revisited
Deploying Spark Standalone Clusters
Scheduling with Spark Standalone
Administering Spark on YARN
Spark on YARN Revisited
Deploying Spark on YARN
Managing Spark Applications Running on YARN
YARN Scheduling
Summary
Q&A
Workshop
Quiz
Answers
Hour 22. Monitoring Spark
Exploring the Spark Application UI
Jobs
Stages
Storage
Environment
Executors
Viewing the Status of All Running Applications
Spark History Server
Deploying the Spark History Server
Exploring the Spark History Server UI
Spark History Server API Access
Spark Metrics
Logging in Spark
Log4j
Summary
Q&A
Workshop
Quiz
Answers
Hour 23. Extending and Securing Spark
Isolating Spark
Perimeter Security
Gateway Services
Authentication and Authorization
Securing Spark Communication
Spark Authentication Using a Shared Secret
Encrypting Spark Communication
Securing the Spark Web UI
Securing Spark with Kerberos
Kerberos Overview
Kerberos with Hadoop
Kerberos Configuration with Spark
Summary
Q&A
Workshop
Quiz
Answers
Hour 24. Improving Spark Performance
Benchmarking Spark
Benchmarks
Canary Queries
Performance Monitoring Solutions
Application Development Best Practices
Application Development Optimizations
System, Configuration, or Job Submission Optimizations
Optimizing Partitions
Inefficient Partitioning
Diagnosing Application Performance Issues
Using the Application UI to Diagnose Performance Issues
Using the Spark History UI to Diagnose Performance Issues
Summary
Q&A
Workshop
Quiz
Answers
Index
Code Snippets