
Learning Spark — 2015 O’Reilly English original edition (complete), 274 pages (PDF)

Table of Contents
Foreword
Preface
Audience
How This Book Is Organized
Supporting Books
Conventions Used in This Book
Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Chapter 1. Introduction to Data Analysis with Spark
What Is Apache Spark?
A Unified Stack
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Cluster Managers
Who Uses Spark, and for What?
Data Science Tasks
Data Processing Applications
A Brief History of Spark
Spark Versions and Releases
Storage Layers for Spark
Chapter 2. Downloading Spark and Getting Started
Downloading Spark
Introduction to Spark’s Python and Scala Shells
Introduction to Core Spark Concepts
Standalone Applications
Initializing a SparkContext
Building Standalone Applications
Conclusion
Chapter 3. Programming with RDDs
RDD Basics
Creating RDDs
RDD Operations
Transformations
Actions
Lazy Evaluation
Passing Functions to Spark
Python
Scala
Java
Common Transformations and Actions
Basic RDDs
Converting Between RDD Types
Persistence (Caching)
Conclusion
Chapter 4. Working with Key/Value Pairs
Motivation
Creating Pair RDDs
Transformations on Pair RDDs
Aggregations
Grouping Data
Joins
Sorting Data
Actions Available on Pair RDDs
Data Partitioning (Advanced)
Determining an RDD’s Partitioner
Operations That Benefit from Partitioning
Operations That Affect Partitioning
Example: PageRank
Custom Partitioners
Conclusion
Chapter 5. Loading and Saving Your Data
Motivation
File Formats
Text Files
JSON
Comma-Separated Values and Tab-Separated Values
SequenceFiles
Object Files
Hadoop Input and Output Formats
File Compression
Filesystems
Local/“Regular” FS
Amazon S3
HDFS
Structured Data with Spark SQL
Apache Hive
JSON
Databases
Java Database Connectivity
Cassandra
HBase
Elasticsearch
Conclusion
Chapter 6. Advanced Spark Programming
Introduction
Accumulators
Accumulators and Fault Tolerance
Custom Accumulators
Broadcast Variables
Optimizing Broadcasts
Working on a Per-Partition Basis
Piping to External Programs
Numeric RDD Operations
Conclusion
Chapter 7. Running on a Cluster
Introduction
Spark Runtime Architecture
The Driver
Executors
Cluster Manager
Launching a Program
Summary
Deploying Applications with spark-submit
Packaging Your Code and Dependencies
A Java Spark Application Built with Maven
A Scala Spark Application Built with sbt
Dependency Conflicts
Scheduling Within and Between Spark Applications
Cluster Managers
Standalone Cluster Manager
Hadoop YARN
Apache Mesos
Amazon EC2
Which Cluster Manager to Use?
Conclusion
Chapter 8. Tuning and Debugging Spark
Configuring Spark with SparkConf
Components of Execution: Jobs, Tasks, and Stages
Finding Information
Spark Web UI
Driver and Executor Logs
Key Performance Considerations
Level of Parallelism
Serialization Format
Memory Management
Hardware Provisioning
Conclusion
Chapter 9. Spark SQL
Linking with Spark SQL
Using Spark SQL in Applications
Initializing Spark SQL
Basic Query Example
SchemaRDDs
Caching
Loading and Saving Data
Apache Hive
Parquet
JSON
From RDDs
JDBC/ODBC Server
Working with Beeline
Long-Lived Tables and Queries
User-Defined Functions
Spark SQL UDFs
Hive UDFs
Spark SQL Performance
Performance Tuning Options
Conclusion
Chapter 10. Spark Streaming
A Simple Example
Architecture and Abstraction
Transformations
Stateless Transformations
Stateful Transformations
Output Operations
Input Sources
Core Sources
Additional Sources
Multiple Sources and Cluster Sizing
24/7 Operation
Checkpointing
Driver Fault Tolerance
Worker Fault Tolerance
Receiver Fault Tolerance
Processing Guarantees
Streaming UI
Performance Considerations
Batch and Window Sizes
Level of Parallelism
Garbage Collection and Memory Usage
Conclusion
Chapter 11. Machine Learning with MLlib
Overview
System Requirements
Machine Learning Basics
Example: Spam Classification
Data Types
Working with Vectors
Algorithms
Feature Extraction
Statistics
Classification and Regression
Clustering
Collaborative Filtering and Recommendation
Dimensionality Reduction
Model Evaluation
Tips and Performance Considerations
Preparing Features
Configuring Algorithms
Caching RDDs to Reuse
Recognizing Sparsity
Level of Parallelism
Pipeline API
Conclusion
Index
About the Authors
Learning Spark

Data in all domains is getting bigger. How can you work with it efficiently? This book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.

“Learning Spark is at the top of my list for anyone needing a gentle guide to the most popular framework for building big data applications.”
—Ben Lorica, Chief Data Scientist, O’Reilly Media

■ Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
■ Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
■ Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
■ Learn how to deploy interactive, batch, and streaming applications
■ Connect to data sources including HDFS, Hive, JSON, and S3
■ Master advanced topics like data partitioning and shared variables

Holden Karau, a software development engineer at Databricks, is active in open source and the author of Fast Data Processing with Spark (Packt Publishing). Andy Konwinski, co-founder of Databricks, is a committer on Apache Spark and co-creator of the Apache Mesos project. Patrick Wendell is a co-founder of Databricks and a committer on Apache Spark. He also maintains several subsystems of Spark’s core engine. Matei Zaharia, CTO at Databricks, is the creator of Apache Spark and serves as its Vice President at Apache.
Learning Spark: Lightning-Fast Data Analysis
Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia
PROGRAMMING LANGUAGES/SPARK
US $39.99 CAN $45.99
ISBN: 978-1-449-35862-4
Twitter: @oreillymedia facebook.com/oreilly
Learning Spark Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
Learning Spark
by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia

Copyright © 2015 Databricks. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Ann Spencer and Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Rachel Monaghan
Proofreader: Charles Roumeliotis
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

February 2015: First Edition

Revision History for the First Edition
2015-01-26: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449358624 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Learning Spark, the cover image of a small-spotted catshark, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-35862-4
[LSI]
Table of Contents

Foreword ix
Preface xi

1. Introduction to Data Analysis with Spark 1
    What Is Apache Spark? 1
    A Unified Stack 2
        Spark Core 3
        Spark SQL 3
        Spark Streaming 3
        MLlib 4
        GraphX 4
        Cluster Managers 4
    Who Uses Spark, and for What? 4
        Data Science Tasks 5
        Data Processing Applications 6
    A Brief History of Spark 6
    Spark Versions and Releases 7
    Storage Layers for Spark 7

2. Downloading Spark and Getting Started 9
    Downloading Spark 9
    Introduction to Spark’s Python and Scala Shells 11
    Introduction to Core Spark Concepts 14
    Standalone Applications 17
        Initializing a SparkContext 17
        Building Standalone Applications 18
    Conclusion 21

3. Programming with RDDs 23
    RDD Basics 23
    Creating RDDs 25
    RDD Operations 26
        Transformations 27
        Actions 28
        Lazy Evaluation 29
    Passing Functions to Spark 30
        Python 30
        Scala 31
        Java 32
    Common Transformations and Actions 34
        Basic RDDs 34
        Converting Between RDD Types 42
    Persistence (Caching) 44
    Conclusion 46

4. Working with Key/Value Pairs 47
    Motivation 47
    Creating Pair RDDs 48
    Transformations on Pair RDDs 49
        Aggregations 51
        Grouping Data 57
        Joins 58
        Sorting Data 59
    Actions Available on Pair RDDs 60
    Data Partitioning (Advanced) 61
        Determining an RDD’s Partitioner 64
        Operations That Benefit from Partitioning 65
        Operations That Affect Partitioning 65
        Example: PageRank 66
        Custom Partitioners 68
    Conclusion 70

5. Loading and Saving Your Data 71
    Motivation 71
    File Formats 72
        Text Files 73
        JSON 74
        Comma-Separated Values and Tab-Separated Values 77
        SequenceFiles 80
        Object Files 83
        Hadoop Input and Output Formats 84
        File Compression 87
    Filesystems 89
        Local/“Regular” FS 89
        Amazon S3 90
        HDFS 90
    Structured Data with Spark SQL 91
        Apache Hive 91
        JSON 92
    Databases 93
        Java Database Connectivity 93
        Cassandra 94
        HBase 96
        Elasticsearch 97
    Conclusion 98

6. Advanced Spark Programming 99
    Introduction 99
    Accumulators 100
        Accumulators and Fault Tolerance 103
        Custom Accumulators 103
    Broadcast Variables 104
        Optimizing Broadcasts 106
    Working on a Per-Partition Basis 107
    Piping to External Programs 109
    Numeric RDD Operations 113
    Conclusion 115

7. Running on a Cluster 117
    Introduction 117
    Spark Runtime Architecture 117
        The Driver 118
        Executors 119
        Cluster Manager 119
        Launching a Program 120
        Summary 120
    Deploying Applications with spark-submit 121
    Packaging Your Code and Dependencies 123
        A Java Spark Application Built with Maven 124
        A Scala Spark Application Built with sbt 126
        Dependency Conflicts 128
    Scheduling Within and Between Spark Applications 128
    Cluster Managers 129
        Standalone Cluster Manager 129
        Hadoop YARN 133
        Apache Mesos 134
        Amazon EC2 135
    Which Cluster Manager to Use? 138
    Conclusion 139

8. Tuning and Debugging Spark 141
    Configuring Spark with SparkConf 141
    Components of Execution: Jobs, Tasks, and Stages 145
    Finding Information 150
        Spark Web UI 150
        Driver and Executor Logs 154
    Key Performance Considerations 155
        Level of Parallelism 155
        Serialization Format 156
        Memory Management 157
        Hardware Provisioning 158
    Conclusion 160

9. Spark SQL 161
    Linking with Spark SQL 162
    Using Spark SQL in Applications 164
        Initializing Spark SQL 164
        Basic Query Example 165
        SchemaRDDs 166
        Caching 169
    Loading and Saving Data 170
        Apache Hive 170
        Parquet 171
        JSON 172
        From RDDs 174
    JDBC/ODBC Server 175
        Working with Beeline 177
        Long-Lived Tables and Queries 178
    User-Defined Functions 178
        Spark SQL UDFs 178
        Hive UDFs 179
    Spark SQL Performance 180
        Performance Tuning Options 180
    Conclusion 182