Apache Spark in 24 Hours,.pdf

About This E-Book
Title Page
Copyright Page
Contents at a Glance
Table of Contents
Preface
Why Should I Learn Spark?
How This Book Is Organized
Data Used in the Exercises
Conventions Used in This Book
About the Author
Dedication
Acknowledgments
We Want to Hear from You
Reader Services
Part I: Getting Started with Apache Spark
Hour 1. Introducing Apache Spark
What Is Spark?
Spark and Hadoop
Spark as an Abstraction
Spark Is Fast, Efficient, and Scalable
What Sort of Applications Use Spark?
Programming Interfaces to Spark
Ways to Use Spark
Interactive Use
Non-interactive Use
Input/Output Types
Summary
Q&A
Workshop
Quiz
Answers
Hour 2. Understanding Hadoop
Hadoop and a Brief History of Big Data
Hadoop Explained
Introducing HDFS
HDFS Overview
HDFS Architecture
Introducing YARN
What Is YARN?
Running an Application on YARN
Other Resource Managers
Anatomy of a Hadoop Cluster
How Spark Works with Hadoop
HDFS as a Data Source for Spark
YARN as a Resource Scheduler for Spark
Summary
Q&A
Workshop
Quiz
Answers
Hour 3. Installing Spark
Spark Deployment Modes
Preparing to Install Spark
Installing Spark in Standalone Mode
Getting Spark
Installing a Multi-node Spark Standalone Cluster
Exploring the Spark Install
Deploying Spark on Hadoop
Using a Management Console or Interface
Installing Manually
Summary
Q&A
Workshop
Quiz
Answers
Exercises
Hour 4. Understanding the Spark Application Architecture
Anatomy of a Spark Application
Spark Driver
The Spark Context
Application Planning
Application Scheduling
Other Driver Functions
Spark Executors and Workers
Spark Master and Cluster Manager
Spark Master
Cluster Manager
Spark Applications Running on YARN
ResourceManager as the Cluster Manager
ApplicationMaster as the Spark Master
yarn-cluster Mode
yarn-client Mode
Log File Management with Spark on YARN
Local Mode
Summary
Q&A
Workshop
Quiz
Answers
Hour 5. Deploying Spark in the Cloud
Amazon Web Services Primer
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
AWS Pricing and Getting Started
Spark on EC2
Spark on EMR
Hosted Spark with Databricks
Summary
Q&A
Workshop
Quiz
Answers
Part II: Programming with Apache Spark
Hour 6. Learning the Basics of Spark Programming with RDDs
Introduction to RDDs
Loading Data into RDDs
Creating an RDD from a File or Files
Creating an RDD from a Datasource
Creating an RDD Programmatically
Operations on RDDs
Coarse-Grained versus Fine-Grained Transformations
Transformations, Actions, and Lazy Evaluation
RDD Persistence and Re-use
RDD Lineage
Fault Tolerance with RDDs
Types of RDDs
Summary
Q&A
Workshop
Quiz
Answers
Hour 7. Understanding MapReduce Concepts
MapReduce History and Background
The Motivation for MapReduce
The Design Goals for MapReduce
Records and Key Value Pairs
Key Value Pairs and Records
MapReduce Explained
Map Phase
Partitioning Function
Shuffle
Reduce Phase
Fault Tolerance
Combiner Functions
Asymmetry and Speculative Execution
Map-only MapReduce Applications
An Election Analogy for MapReduce
Word Count: The “Hello, World” of MapReduce
Why Count Words?
How It Works
Map and Reduce Functions in Spark
Summary
Q&A
Workshop
Quiz
Answers
Hour 8. Getting Started with Scala
Scala History and Background
Scala Beginnings
Scala Basics
Scala’s Compile Time and Run Time Architecture
Variables and Primitives in Scala
Data Structures in Scala
Control Structures in Scala
Object-Oriented Programming in Scala
Classes and Inheritance
Mixin Composition
Singleton Objects
Polymorphism
Functional Programming in Scala
First-class Functions
Anonymous Functions
Higher-order Functions
Closures
Currying
Lazy Evaluation
Immutable Data Structures
Spark Programming in Scala
Summary
Q&A
Workshop
Quiz
Answers
Hour 9. Functional Programming with Python
Python Overview
Python Background
Python Runtime Architecture
Data Structures and Serialization in Python
Lists
Sets
Tuples
Dictionaries
Python Object Serialization
Python Functional Programming Basics
Anonymous Functions and lambda
Higher-order Functions
Tail Calls
Short-circuiting
Parallelization
Closures in Python
Interactive Programming Using IPython
IPython History and Background
Using IPython with Spark
Jupyter, the IPython Notebook
Summary
Q&A
Workshop
Quiz
Answers
Hour 10. Working with the Spark API (Transformations and Actions)
RDDs and Data Sampling
RDD Refresher
Data Sampling with Spark
Spark Transformations
Functional Transformations
Grouping, Sorting, and Distinct Functions
Set Operations
Spark Actions
The count Action
The collect, take, top, and first Actions
The reduce and fold Actions
The foreach Action
Key Value Pair Operations
Key Value Pair RDD Dictionary Functions
Functional Key Value Pair RDD Transformations
Grouping, Aggregation, Sorting, and Set Operations
Join Functions
Join Types
Join Transformations
Numerical RDD Operations
min()
max()
mean()
sum()
stdev()
variance()
stats()
Summary
Q&A
Workshop
Quiz
Answers
Hour 11. Using RDDs: Caching, Persistence, and Output
RDD Storage Levels
RDD Lineage Revisited
RDD Storage Levels
Caching, Persistence, and Checkpointing
Caching RDDs
Persisting RDDs
Choosing When to Persist or Cache RDDs
Checkpointing RDDs
Saving RDD Output
External Storage Systems
Storage Formats
Introduction to Alluxio (Tachyon)
Alluxio Background
Alluxio Architecture
Alluxio as a Filesystem
Alluxio for Off Heap RDD Persistence
Other Alluxio Features and Usages
Summary
Q&A
Workshop
Quiz
Answers
Hour 12. Advanced Spark Programming
Broadcast Variables
Broadcast Variable Creation and Usage
Advantages of Broadcast Variables
Accumulators
Using Accumulators
Custom Accumulators
Uses for Accumulators
Partitioning and Repartitioning
Partitioning Overview
Controlling Partitions
Repartitioning Functions
Partition-specific API Methods
Processing RDDs with External Programs
pipe()
Summary
Q&A
Workshop
Quiz
Answers
Part III: Extensions to Spark
Hour 13. Using SQL with Spark
Introduction to Spark SQL
Background
Hive Overview
SQL on Hadoop
Spark SQL Architecture
HiveContext and SQLContext
Getting Started with Spark SQL DataFrames
Creating a DataFrame from an Existing RDD
Creating a DataFrame from a Hive Table
Creating a DataFrame from JSON Objects
Creating DataFrames from Files Using the DataFrameReader
Converting DataFrames to RDDs
DataFrame Data Model
DataFrame Schemas
Using Spark SQL DataFrames
DataFrame Metadata Operations
Basic DataFrame Operations
DataFrame Built-in Functions and UDFs
DataFrame Set Operations
Caching, Persisting, and Repartitioning DataFrames
Saving DataFrame Output Using the DataFrameWriter
Accessing Spark SQL
Accessing Spark SQL Using the spark-sql Shell
Running the Thrift JDBC/ODBC server
Summary
Q&A
Workshop
Quiz
Answers
Hour 14. Stream Processing with Spark
Introduction to Spark Streaming
Streaming, Spark Style
Spark Streaming Architecture
The StreamingContext
Using DStreams
DStream Sources
DStream Transformations
DStream Output Operations
State Operations
updateStateByKey()
Sliding Window Operations
window()
reduceByKeyAndWindow()
Summary
Q&A
Workshop
Quiz
Answers
Hour 15. Getting Started with Spark and R
Introduction to R
Getting Started with the R Language
Introducing SparkR
The SparkR Shell
Creating Data Frames in SparkR
Using SparkR
Building Predictive Models with SparkR
Using SparkR with RStudio
Summary
Q&A
Workshop
Quiz
Answers
Hour 16. Machine Learning with Spark
Introduction to Machine Learning and MLlib
Machine Learning Primer
Machine Learning with Spark
Classification Using Spark MLlib
Decision Trees
Naive Bayes
Collaborative Filtering Using Spark MLlib
Clustering Using Spark MLlib
k-means Clustering
Summary
Q&A
Workshop
Quiz
Answers
Hour 17. Introducing Sparkling Water (H2O and Spark)
Introduction to H2O
H2O Deep Learning
H2O Flow
H2O Architecture
Running H2O on Hadoop
Sparkling Water—H2O on Spark
Sparkling Water Architecture
Summary
Q&A
Workshop
Quiz
Answers
Hour 18. Graph Processing with Spark
Introduction to Graphs
Graph Processing in Spark
Google, Pregel, and PageRank
GraphX: Spark’s Graph Processing System
Introduction to GraphFrames
Accessing the GraphFrames Library
Creating a GraphFrame
GraphFrame Operations
Using Graphing Algorithms with GraphFrames
Summary
Q&A
Workshop
Quiz
Answers
Hour 19. Using Spark with NoSQL Systems
Introduction to NoSQL
Bigtable: The Beginnings of the NoSQL Movement
NoSQL System Characteristics
Types of NoSQL Systems
Using Spark with HBase
HBase Data Model and Shell
Data Distribution in HBase
HBase and Spark
Using Spark with Cassandra
Cassandra Data Model
Cassandra Query Language (CQL)
Accessing Cassandra Using Spark
Using Spark with DynamoDB and More
Amazon DynamoDB
Other NoSQL Implementations
The Future for NoSQL
Summary
Q&A
Workshop
Quiz
Answers
Hour 20. Using Spark with Messaging Systems
Overview of Messaging Systems
Pub-Sub Messaging Exchange Pattern
Using Spark with Apache Kafka
Kafka Overview
Spark and Kafka
Spark, MQTT, and the Internet of Things
MQTT Overview
Using Spark with MQTT
Using Spark with Amazon Kinesis
Kinesis Streams
Using Spark with Kinesis
Summary
Q&A
Workshop
Quiz
Answers
Part IV: Managing Spark
Hour 21. Administering Spark
Spark Configuration
Spark Environment Variables
Spark Configuration
Administering Spark Standalone
Spark Standalone Revisited
Deploying Spark Standalone Clusters
Scheduling with Spark Standalone
Administering Spark on YARN
Spark on YARN Revisited
Deploying Spark on YARN
Managing Spark Applications Running on YARN
YARN Scheduling
Summary
Q&A
Workshop
Quiz
Answers
Hour 22. Monitoring Spark
Exploring the Spark Application UI
Jobs
Stages
Storage
Environment
Executors
Viewing the Status of All Running Applications
Spark History Server
Deploying the Spark History Server
Exploring the Spark History Server UI
Spark History Server API Access
Spark Metrics
Logging in Spark
Log4j
Summary
Q&A
Workshop
Quiz
Answers
Hour 23. Extending and Securing Spark
Isolating Spark
Perimeter Security
Gateway Services
Authentication and Authorization
Securing Spark Communication
Spark Authentication Using a Shared Secret
Encrypting Spark Communication
Securing the Spark Web UI
Securing Spark with Kerberos
Kerberos Overview
Kerberos with Hadoop
Kerberos Configuration with Spark
Summary
Q&A
Workshop
Quiz
Answers
Hour 24. Improving Spark Performance
Benchmarking Spark
Benchmarks
Canary Queries
Performance Monitoring Solutions
Application Development Best Practices
Application Development Optimizations
System, Configuration, or Job Submission Optimizations
Optimizing Partitions
Inefficient Partitioning
Diagnosing Application Performance Issues
Using the Application UI to Diagnose Performance Issues
Using the Spark History UI to Diagnose Performance Issues
Summary
Q&A
Workshop
Quiz
Answers
Index
Code Snippets
About This E-Book

EPUB is an open, industry-standard format for e-books. However, support for EPUB and its many features varies across reading devices and applications. Use your device or app settings to customize the presentation to your liking. Settings that you can customize often include font, font size, single or double column, landscape or portrait mode, and figures that you can click or tap to enlarge. For additional information about the settings and features on your reading device or app, visit the device manufacturer’s Web site.

Many titles include programming code or configuration examples. To optimize the presentation of these elements, view the e-book in single-column, landscape mode and adjust the font size to the smallest setting. In addition to presenting code and configurations in the reflowable text format, we have included images of the code that mimic the presentation found in the print book; therefore, where the reflowable format may compromise the presentation of the code listing, you will see a “Click here to view code image” link. Click the link to view the print-fidelity code image. To return to the previous page viewed, click the Back button on your device or app.
Sams Teach Yourself Apache Spark™ in 24 Hours
Jeffrey Aven
800 East 96th Street, Indianapolis, Indiana, 46240 USA
Sams Teach Yourself Apache Spark™ in 24 Hours
Copyright © 2017 by Pearson Education, Inc.

All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher. No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions. Nor is any liability assumed for damages resulting from the use of the information contained herein.

ISBN-13: 978-0-672-33851-9
ISBN-10: 0-672-33851-3
Library of Congress Control Number: 2016946659
Printed in the United States of America
First Printing: August 2016

Editor in Chief: Greg Wiegand
Acquisitions Editor: Trina McDonald
Development Editor: Chris Zahn
Technical Editor: Cody Koeninger
Managing Editor: Sandra Schroeder
Project Editor: Lori Lyons
Project Manager: Ellora Sengupta
Copy Editor: Linda Morris
Indexer: Cheryl Lenser
Proofreader: Sudhakaran
Editorial Assistant: Olivia Basegio
Cover Designer: Chuti Prasertsith
Compositor: codeMantra

Trademarks
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Sams Publishing cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

Warning and Disclaimer
Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information provided is on an “as is” basis. The author and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.
Special Sales
For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419. For government sales inquiries, please contact governmentsales@pearsoned.com. For questions about sales outside the U.S., please contact intlcs@pearsoned.com.
Contents at a Glance
Preface
About the Author
Part I: Getting Started with Apache Spark
HOUR 1 Introducing Apache Spark
HOUR 2 Understanding Hadoop
HOUR 3 Installing Spark
HOUR 4 Understanding the Spark Application Architecture
HOUR 5 Deploying Spark in the Cloud
Part II: Programming with Apache Spark
HOUR 6 Learning the Basics of Spark Programming with RDDs
HOUR 7 Understanding MapReduce Concepts
HOUR 8 Getting Started with Scala
HOUR 9 Functional Programming with Python
HOUR 10 Working with the Spark API (Transformations and Actions)
HOUR 11 Using RDDs: Caching, Persistence, and Output
HOUR 12 Advanced Spark Programming
Part III: Extensions to Spark
HOUR 13 Using SQL with Spark
HOUR 14 Stream Processing with Spark
HOUR 15 Getting Started with Spark and R
HOUR 16 Machine Learning with Spark
HOUR 17 Introducing Sparkling Water (H2O and Spark)
HOUR 18 Graph Processing with Spark
HOUR 19 Using Spark with NoSQL Systems
HOUR 20 Using Spark with Messaging Systems
Part IV: Managing Spark
HOUR 21 Administering Spark
HOUR 22 Monitoring Spark
HOUR 23 Extending and Securing Spark
HOUR 24 Improving Spark Performance
Index
Table of Contents
Preface
About the Author
Part I: Getting Started with Apache Spark
HOUR 1: Introducing Apache Spark
What Is Spark?
What Sort of Applications Use Spark?
Programming Interfaces to Spark
Ways to Use Spark
Summary
Q&A
Workshop
HOUR 2: Understanding Hadoop
Hadoop and a Brief History of Big Data
Hadoop Explained
Introducing HDFS
Introducing YARN
Anatomy of a Hadoop Cluster
How Spark Works with Hadoop
Summary
Q&A
Workshop
HOUR 3: Installing Spark
Spark Deployment Modes
Preparing to Install Spark
Installing Spark in Standalone Mode
Exploring the Spark Install
Deploying Spark on Hadoop
Summary
Q&A
Workshop
Exercises
HOUR 4: Understanding the Spark Application Architecture
Anatomy of a Spark Application
Spark Driver
Spark Executors and Workers
Spark Master and Cluster Manager
Spark Applications Running on YARN
Local Mode
Summary
Q&A
Workshop
HOUR 5: Deploying Spark in the Cloud
Amazon Web Services Primer
Spark on EC2
Spark on EMR
Hosted Spark with Databricks
Summary
Q&A
Workshop
Part II: Programming with Apache Spark
HOUR 6: Learning the Basics of Spark Programming with RDDs
Introduction to RDDs
Loading Data into RDDs
Operations on RDDs
Types of RDDs
Summary
Q&A
Workshop
HOUR 7: Understanding MapReduce Concepts
MapReduce History and Background
Records and Key Value Pairs
MapReduce Explained
Word Count: The “Hello, World” of MapReduce
Summary
Q&A
Workshop
HOUR 8: Getting Started with Scala
Scala History and Background