Apache Spark in 24 Hours,.pdf

Published: 2022-05-29 | Uploaded by: admin | Category: Manuals | File size: 36.59M | Format: pdf

About This E-Book

Title Page

Contents at a Glance

Table of Contents

Preface

Why Should I Learn Spark?

How This Book Is Organized

Data Used in the Exercises

Conventions Used in This Book

About the Author

Dedication

Acknowledgments

We Want to Hear from You

Reader Services

Part I: Getting Started with Apache Spark

Hour 1. Introducing Apache Spark

What Is Spark?

Spark and Hadoop

Spark as an Abstraction

Spark Is Fast, Efficient, and Scalable

What Sort of Applications Use Spark?

Programming Interfaces to Spark

Ways to Use Spark

Interactive Use

Non-interactive Use

Input/Output Types

Summary

Q&A

Workshop

Quiz

Answers

Hour 2. Understanding Hadoop

Hadoop and a Brief History of Big Data

Hadoop Explained

Introducing HDFS

HDFS Overview

HDFS Architecture

Introducing YARN

What Is YARN?

Running an Application on YARN

Other Resource Managers

Anatomy of a Hadoop Cluster

How Spark Works with Hadoop

HDFS as a Data Source for Spark

YARN as a Resource Scheduler for Spark

Summary

Q&A

Workshop

Quiz

Answers

Hour 3. Installing Spark

Spark Deployment Modes

Preparing to Install Spark

Installing Spark in Standalone Mode

Getting Spark

Installing a Multi-node Spark Standalone Cluster

Exploring the Spark Install

Deploying Spark on Hadoop

Using a Management Console or Interface

Installing Manually

Summary

Q&A

Workshop

Quiz

Answers

Exercises

Hour 4. Understanding the Spark Application Architecture

Anatomy of a Spark Application

Spark Driver

The Spark Context

Application Planning

Application Scheduling

Other Driver Functions

Spark Executors and Workers

Spark Master and Cluster Manager

Spark Master

Cluster Manager

Spark Applications Running on YARN

ResourceManager as the Cluster Manager

ApplicationMaster as the Spark Master

yarn-cluster Mode

yarn-client Mode

Log File Management with Spark on YARN

Local Mode

Summary

Q&A

Workshop

Quiz

Answers

Hour 5. Deploying Spark in the Cloud

Amazon Web Services Primer

Elastic Compute Cloud (EC2)

Simple Storage Service (S3)

Elastic MapReduce (EMR)

AWS Pricing and Getting Started

Spark on EC2

Spark on EMR

Hosted Spark with Databricks

Summary

Q&A

Workshop

Quiz

Answers

Part II: Programming with Apache Spark

Hour 6. Learning the Basics of Spark Programming with RDDs

Introduction to RDDs

Loading Data into RDDs

Creating an RDD from a File or Files

Creating an RDD from a Datasource

Creating an RDD Programmatically

Operations on RDDs

Coarse-Grained versus Fine-Grained Transformations

Transformations, Actions, and Lazy Evaluation

RDD Persistence and Re-use

RDD Lineage

Fault Tolerance with RDDs

Types of RDDs

Summary

Q&A

Workshop

Quiz

Answers

Hour 7. Understanding MapReduce Concepts

MapReduce History and Background

The Motivation for MapReduce

The Design Goals for MapReduce

Records and Key Value Pairs

Key Value Pairs and Records

MapReduce Explained

Map Phase

Partitioning Function

Shuffle

Reduce Phase

Fault Tolerance

Combiner Functions

Asymmetry and Speculative Execution

Map-only MapReduce Applications

An Election Analogy for MapReduce

Word Count: The “Hello, World” of MapReduce

Why Count Words?

How It Works

Map and Reduce Functions in Spark

Summary

Q&A

Workshop

Quiz

Answers

Hour 8. Getting Started with Scala

Scala History and Background

Scala Beginnings

Scala Basics

Scala’s Compile Time and Run Time Architecture

Variables and Primitives in Scala

Data Structures in Scala

Control Structures in Scala

Object-Oriented Programming in Scala

Classes and Inheritance

Mixin Composition

Singleton Objects

Polymorphism

Functional Programming in Scala

First-class Functions

Anonymous Functions

Higher-order Functions

Closures

Currying

Lazy Evaluation

Immutable Data Structures

Spark Programming in Scala

Summary

Q&A

Workshop

Quiz

Answers

Hour 9. Functional Programming with Python

Python Overview

Python Background

Python Runtime Architecture

Data Structures and Serialization in Python

Lists

Sets

Tuples

Dictionaries

Python Object Serialization

Python Functional Programming Basics

Anonymous Functions and lambda

Higher-order Functions

Tail Calls

Short-circuiting

Parallelization

Closures in Python

Interactive Programming Using IPython

IPython History and Background

Using IPython with Spark

Jupyter, the IPython Notebook

Summary

Q&A

Workshop

Quiz

Answers

Hour 10. Working with the Spark API (Transformations and Actions)

RDDs and Data Sampling

RDD Refresher

Data Sampling with Spark

Spark Transformations

Functional Transformations

Grouping, Sorting, and Distinct Functions

Set Operations

Spark Actions

The count Action

The collect, take, top, and first Actions

The reduce and fold Actions

The foreach Action

Key Value Pair Operations

Key Value Pair RDD Dictionary Functions

Functional Key Value Pair RDD Transformations

Grouping, Aggregation, Sorting, and Set Operations

Join Functions

Join Types

Join Transformations

Numerical RDD Operations

min()

max()

mean()

sum()

stdev()

variance()

stats()

Summary

Q&A

Workshop

Quiz

Answers

Hour 11. Using RDDs: Caching, Persistence, and Output

RDD Storage Levels

RDD Lineage Revisited

RDD Storage Levels

Caching, Persistence, and Checkpointing

Caching RDDs

Persisting RDDs

Choosing When to Persist or Cache RDDs

Checkpointing RDDs

Saving RDD Output

External Storage Systems

Storage Formats

Introduction to Alluxio (Tachyon)

Alluxio Background

Alluxio Architecture

Alluxio as a Filesystem

Alluxio for Off Heap RDD Persistence

Other Alluxio Features and Usages

Summary

Q&A

Workshop

Quiz

Answers

Hour 12. Advanced Spark Programming

Broadcast Variables

Broadcast Variable Creation and Usage

Advantages of Broadcast Variables

Accumulators

Using Accumulators

Custom Accumulators

Uses for Accumulators

Partitioning and Repartitioning

Partitioning Overview

Controlling Partitions

Repartitioning Functions

Partition-specific API Methods

Processing RDDs with External Programs

pipe()

Summary

Q&A

Workshop

Quiz

Answers

Part III: Extensions to Spark

Hour 13. Using SQL with Spark

Introduction to Spark SQL

Background

Hive Overview

SQL on Hadoop

Spark SQL Architecture

HiveContext and SQLContext

Getting Started with Spark SQL DataFrames

Creating a DataFrame from an Existing RDD

Creating a DataFrame from a Hive Table

Creating a DataFrame from JSON Objects

Creating DataFrames from Files Using the DataFrameReader

Converting DataFrames to RDDs

DataFrame Data Model

DataFrame Schemas

Using Spark SQL DataFrames

DataFrame Metadata Operations

Basic DataFrame Operations

DataFrame Built-in Functions and UDFs

DataFrame Set Operations

Caching, Persisting, and Repartitioning DataFrames

Saving DataFrame Output Using the DataFrameWriter

Accessing Spark SQL

Accessing Spark SQL Using the spark-sql Shell

Running the Thrift JDBC/ODBC server

Summary

Q&A

Workshop

Quiz

Answers

Hour 14. Stream Processing with Spark

Introduction to Spark Streaming

Streaming, Spark Style

Spark Streaming Architecture

The StreamingContext

Using DStreams

DStream Sources

DStream Transformations

DStream Output Operations

State Operations

updateStateByKey()

Sliding Window Operations

window()

reduceByKeyAndWindow()

Summary

Q&A

Workshop

Quiz

Answers

Hour 15. Getting Started with Spark and R

Introduction to R

Getting Started with the R Language

Introducing SparkR

The SparkR Shell

Creating Data Frames in SparkR

Using SparkR

Building Predictive Models with SparkR

Using SparkR with RStudio

Summary

Q&A

Workshop

Quiz

Answers

Hour 16. Machine Learning with Spark

Introduction to Machine Learning and MLlib

Machine Learning Primer

Machine Learning with Spark

Classification Using Spark MLlib

Decision Trees

Naive Bayes

Collaborative Filtering Using Spark MLlib

Clustering Using Spark MLlib

k-means Clustering

Summary

Q&A

Workshop

Quiz

Answers

Hour 17. Introducing Sparkling Water (H2O and Spark)

Introduction to H2O

H2O Deep Learning

H2O Flow

H2O Architecture

Running H2O on Hadoop

Sparkling Water—H2O on Spark

Sparkling Water Architecture

Summary

Q&A

Workshop

Quiz

Answers

Hour 18. Graph Processing with Spark

Introduction to Graphs

Graph Processing in Spark

Google, Pregel, and PageRank

GraphX: Spark’s Graph Processing System

Introduction to GraphFrames

Accessing the GraphFrames Library

Creating a GraphFrame

GraphFrame Operations

Using Graphing Algorithms with GraphFrames

Summary

Q&A

Workshop

Quiz

Answers

Hour 19. Using Spark with NoSQL Systems

Introduction to NoSQL

Bigtable: The Beginnings of the NoSQL Movement

NoSQL System Characteristics

Types of NoSQL Systems

Using Spark with HBase

HBase Data Model and Shell

Data Distribution in HBase

HBase and Spark

Using Spark with Cassandra

Cassandra Data Model

Cassandra Query Language (CQL)

Accessing Cassandra Using Spark

Using Spark with DynamoDB and More

Amazon DynamoDB

Other NoSQL Implementations

The Future for NoSQL

Summary

Q&A

Workshop

Quiz

Answers

Hour 20. Using Spark with Messaging Systems

Overview of Messaging Systems

Pub-Sub Messaging Exchange Pattern

Using Spark with Apache Kafka

Kafka Overview

Spark and Kafka

Spark, MQTT, and the Internet of Things

MQTT Overview

Using Spark with MQTT

Using Spark with Amazon Kinesis

Kinesis Streams

Using Spark with Kinesis

Summary

Q&A

Workshop

Quiz

Answers

Part IV: Managing Spark

Hour 21. Administering Spark

Spark Configuration

Spark Environment Variables

Spark Configuration

Administering Spark Standalone

Spark Standalone Revisited

Deploying Spark Standalone Clusters

Scheduling with Spark Standalone

Administering Spark on YARN

Spark on YARN Revisited

Deploying Spark on YARN

Managing Spark Applications Running on YARN

YARN Scheduling

Summary

Q&A

Workshop

Quiz

Answers

Hour 22. Monitoring Spark

Exploring the Spark Application UI

Jobs

Stages

Storage

Environment

Executors

Viewing the Status of All Running Applications

Spark History Server

Deploying the Spark History Server

Exploring the Spark History Server UI

Spark History Server API Access

Spark Metrics

Logging in Spark

Log4j

Summary

Q&A

Workshop

Quiz

Answers

Hour 23. Extending and Securing Spark

Isolating Spark

Perimeter Security

Gateway Services

Authentication and Authorization

Securing Spark Communication

Spark Authentication Using a Shared Secret

Encrypting Spark Communication

Securing the Spark Web UI

Securing Spark with Kerberos

Kerberos Overview

Kerberos with Hadoop

Kerberos Configuration with Spark

Summary

Q&A

Workshop

Quiz

Answers

Hour 24. Improving Spark Performance

Benchmarking Spark

Benchmarks

Canary Queries

Performance Monitoring Solutions

Application Development Best Practices

Application Development Optimizations

System, Configuration, or Job Submission Optimizations

Optimizing Partitions

Inefficient Partitioning

Diagnosing Application Performance Issues

Using the Application UI to Diagnose Performance Issues

Using the Spark History UI to Diagnose Performance Issues

Summary

Q&A

Workshop

Quiz

Answers

Index

Code Snippets

About This E-Book

EPUB is an open, industry-standard format for e-books. However, support for EPUB and its many features varies across reading devices and applications. Use your device or app settings to customize the presentation to your liking. Settings that you can customize often include font, font size, single or double column, landscape or portrait mode, and figures that you can click or tap to enlarge. For additional information about the settings and features on your reading device or app, visit the device manufacturer’s Web site.

Many titles include programming code or configuration examples. To optimize the presentation of these elements, view the e-book in single-column, landscape mode and adjust the font size to the smallest setting. In addition to presenting code and configurations in the reflowable text format, we have included images of the code that mimic the presentation found in the print book; therefore, where the reflowable format may compromise the presentation of the code listing, you will see a “Click here to view code image” link. Click the link to view the print-fidelity code image. To return to the previous page viewed, click the Back button on your device or app.

Sams Teach Yourself Apache Spark™ in 24 Hours

Jeffrey Aven

800 East 96th Street, Indianapolis, Indiana, 46240 USA

Sams Teach Yourself Apache Spark™ in 24 Hours

Copyright © 2017 by Pearson Education, Inc.

All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher. No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions. Nor is any liability assumed for damages resulting from the use of the information contained herein.

ISBN-13: 978-0-672-33851-9
ISBN-10: 0-672-33851-3
Library of Congress Control Number: 2016946659
Printed in the United States of America
First Printing: August 2016

Editor in Chief: Greg Wiegand
Acquisitions Editor: Trina McDonald
Development Editor: Chris Zahn
Technical Editor: Cody Koeninger
Managing Editor: Sandra Schroeder
Project Editor: Lori Lyons
Project Manager: Ellora Sengupta
Copy Editor: Linda Morris
Indexer: Cheryl Lenser
Proofreader: Sudhakaran
Editorial Assistant: Olivia Basegio
Cover Designer: Chuti Prasertsith
Compositor: codeMantra

Trademarks

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Sams Publishing cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

Warning and Disclaimer

Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information provided is on an “as is” basis. The author and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.

Special Sales

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact intlcs@pearsoned.com.

Contents at a Glance

Preface
About the Author

Part I: Getting Started with Apache Spark
HOUR 1 Introducing Apache Spark
HOUR 2 Understanding Hadoop
HOUR 3 Installing Spark
HOUR 4 Understanding the Spark Application Architecture
HOUR 5 Deploying Spark in the Cloud

Part II: Programming with Apache Spark
HOUR 6 Learning the Basics of Spark Programming with RDDs
HOUR 7 Understanding MapReduce Concepts
HOUR 8 Getting Started with Scala
HOUR 9 Functional Programming with Python
HOUR 10 Working with the Spark API (Transformations and Actions)
HOUR 11 Using RDDs: Caching, Persistence, and Output
HOUR 12 Advanced Spark Programming

Part III: Extensions to Spark
HOUR 13 Using SQL with Spark
HOUR 14 Stream Processing with Spark
HOUR 15 Getting Started with Spark and R
HOUR 16 Machine Learning with Spark
HOUR 17 Introducing Sparkling Water (H2O and Spark)
HOUR 18 Graph Processing with Spark
HOUR 19 Using Spark with NoSQL Systems
HOUR 20 Using Spark with Messaging Systems

Part IV: Managing Spark
HOUR 21 Administering Spark
HOUR 22 Monitoring Spark
HOUR 23 Extending and Securing Spark
HOUR 24 Improving Spark Performance

Index

Table of Contents

Preface
About the Author

Part I: Getting Started with Apache Spark

HOUR 1: Introducing Apache Spark
    What Is Spark?
    What Sort of Applications Use Spark?
    Programming Interfaces to Spark
    Ways to Use Spark
    Summary
    Q&A
    Workshop

HOUR 2: Understanding Hadoop
    Hadoop and a Brief History of Big Data
    Hadoop Explained
    Introducing HDFS
    Introducing YARN
    Anatomy of a Hadoop Cluster
    How Spark Works with Hadoop
    Summary
    Q&A
    Workshop

HOUR 3: Installing Spark
    Spark Deployment Modes
    Preparing to Install Spark
    Installing Spark in Standalone Mode
    Exploring the Spark Install
    Deploying Spark on Hadoop
    Summary
    Q&A
    Workshop
    Exercises

HOUR 4: Understanding the Spark Application Architecture
    Anatomy of a Spark Application
    Spark Driver
    Spark Executors and Workers
    Spark Master and Cluster Manager
    Spark Applications Running on YARN
    Local Mode
    Summary
    Q&A
    Workshop

HOUR 5: Deploying Spark in the Cloud
    Amazon Web Services Primer
    Spark on EC2
    Spark on EMR
    Hosted Spark with Databricks
    Summary
    Q&A
    Workshop

Part II: Programming with Apache Spark

HOUR 6: Learning the Basics of Spark Programming with RDDs
    Introduction to RDDs
    Loading Data into RDDs
    Operations on RDDs
    Types of RDDs
    Summary
    Q&A
    Workshop

HOUR 7: Understanding MapReduce Concepts
    MapReduce History and Background
    Records and Key Value Pairs
    MapReduce Explained
    Word Count: The “Hello, World” of MapReduce
    Summary
    Q&A
    Workshop

HOUR 8: Getting Started with Scala
    Scala History and Background
