
Learning Spark — 2015 O’Reilly English original edition (complete), 274 pages (PDF)

Table of Contents
Foreword
Preface
Audience
How This Book Is Organized
Supporting Books
Conventions Used in This Book
Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Chapter 1. Introduction to Data Analysis with Spark
What Is Apache Spark?
A Unified Stack
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Cluster Managers
Who Uses Spark, and for What?
Data Science Tasks
Data Processing Applications
A Brief History of Spark
Spark Versions and Releases
Storage Layers for Spark
Chapter 2. Downloading Spark and Getting Started
Downloading Spark
Introduction to Spark’s Python and Scala Shells
Introduction to Core Spark Concepts
Standalone Applications
Initializing a SparkContext
Building Standalone Applications
Conclusion
Chapter 3. Programming with RDDs
RDD Basics
Creating RDDs
RDD Operations
Transformations
Actions
Lazy Evaluation
Passing Functions to Spark
Python
Scala
Java
Common Transformations and Actions
Basic RDDs
Converting Between RDD Types
Persistence (Caching)
Conclusion
Chapter 4. Working with Key/Value Pairs
Motivation
Creating Pair RDDs
Transformations on Pair RDDs
Aggregations
Grouping Data
Joins
Sorting Data
Actions Available on Pair RDDs
Data Partitioning (Advanced)
Determining an RDD’s Partitioner
Operations That Benefit from Partitioning
Operations That Affect Partitioning
Example: PageRank
Custom Partitioners
Conclusion
Chapter 5. Loading and Saving Your Data
Motivation
File Formats
Text Files
JSON
Comma-Separated Values and Tab-Separated Values
SequenceFiles
Object Files
Hadoop Input and Output Formats
File Compression
Filesystems
Local/“Regular” FS
Amazon S3
HDFS
Structured Data with Spark SQL
Apache Hive
JSON
Databases
Java Database Connectivity
Cassandra
HBase
Elasticsearch
Conclusion
Chapter 6. Advanced Spark Programming
Introduction
Accumulators
Accumulators and Fault Tolerance
Custom Accumulators
Broadcast Variables
Optimizing Broadcasts
Working on a Per-Partition Basis
Piping to External Programs
Numeric RDD Operations
Conclusion
Chapter 7. Running on a Cluster
Introduction
Spark Runtime Architecture
The Driver
Executors
Cluster Manager
Launching a Program
Summary
Deploying Applications with spark-submit
Packaging Your Code and Dependencies
A Java Spark Application Built with Maven
A Scala Spark Application Built with sbt
Dependency Conflicts
Scheduling Within and Between Spark Applications
Cluster Managers
Standalone Cluster Manager
Hadoop YARN
Apache Mesos
Amazon EC2
Which Cluster Manager to Use?
Conclusion
Chapter 8. Tuning and Debugging Spark
Configuring Spark with SparkConf
Components of Execution: Jobs, Tasks, and Stages
Finding Information
Spark Web UI
Driver and Executor Logs
Key Performance Considerations
Level of Parallelism
Serialization Format
Memory Management
Hardware Provisioning
Conclusion
Chapter 9. Spark SQL
Linking with Spark SQL
Using Spark SQL in Applications
Initializing Spark SQL
Basic Query Example
SchemaRDDs
Caching
Loading and Saving Data
Apache Hive
Parquet
JSON
From RDDs
JDBC/ODBC Server
Working with Beeline
Long-Lived Tables and Queries
User-Defined Functions
Spark SQL UDFs
Hive UDFs
Spark SQL Performance
Performance Tuning Options
Conclusion
Chapter 10. Spark Streaming
A Simple Example
Architecture and Abstraction
Transformations
Stateless Transformations
Stateful Transformations
Output Operations
Input Sources
Core Sources
Additional Sources
Multiple Sources and Cluster Sizing
24/7 Operation
Checkpointing
Driver Fault Tolerance
Worker Fault Tolerance
Receiver Fault Tolerance
Processing Guarantees
Streaming UI
Performance Considerations
Batch and Window Sizes
Level of Parallelism
Garbage Collection and Memory Usage
Conclusion
Chapter 11. Machine Learning with MLlib
Overview
System Requirements
Machine Learning Basics
Example: Spam Classification
Data Types
Working with Vectors
Algorithms
Feature Extraction
Statistics
Classification and Regression
Clustering
Collaborative Filtering and Recommendation
Dimensionality Reduction
Model Evaluation
Tips and Performance Considerations
Preparing Features
Configuring Algorithms
Caching RDDs to Reuse
Recognizing Sparsity
Level of Parallelism
Pipeline API
Conclusion
Index
About the Authors
Learning Spark

Data in all domains is getting bigger. How can you work with it efficiently? This book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.

“Learning Spark is at the top of my list for anyone needing a gentle guide to the most popular framework for building big data applications.”
—Ben Lorica, Chief Data Scientist, O’Reilly Media

■ Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
■ Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
■ Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
■ Learn how to deploy interactive, batch, and streaming applications
■ Connect to data sources including HDFS, Hive, JSON, and S3
■ Master advanced topics like data partitioning and shared variables

Holden Karau, a software development engineer at Databricks, is active in open source and the author of Fast Data Processing with Spark (Packt Publishing). Andy Konwinski, co-founder of Databricks, is a committer on Apache Spark and co-creator of the Apache Mesos project. Patrick Wendell is a co-founder of Databricks and a committer on Apache Spark. He also maintains several subsystems of Spark’s core engine. Matei Zaharia, CTO at Databricks, is the creator of Apache Spark and serves as its Vice President at Apache.
Learning Spark: Lightning-Fast Data Analysis
Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia
PROGRAMMING LANGUAGES/SPARK
US $39.99 CAN $45.99
ISBN: 978-1-449-35862-4
Twitter: @oreillymedia facebook.com/oreilly
Learning Spark Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
Learning Spark
by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia

Copyright © 2015 Databricks. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Ann Spencer and Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Rachel Monaghan
Proofreader: Charles Roumeliotis
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

February 2015: First Edition

Revision History for the First Edition
2015-01-26: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449358624 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Learning Spark, the cover image of a small-spotted catshark, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-35862-4
[LSI]
Table of Contents

Foreword ix
Preface xi

1. Introduction to Data Analysis with Spark 1
    What Is Apache Spark? 1
    A Unified Stack 2
        Spark Core 3
        Spark SQL 3
        Spark Streaming 3
        MLlib 4
        GraphX 4
        Cluster Managers 4
    Who Uses Spark, and for What? 4
        Data Science Tasks 5
        Data Processing Applications 6
    A Brief History of Spark 6
    Spark Versions and Releases 7
    Storage Layers for Spark 7

2. Downloading Spark and Getting Started 9
    Downloading Spark 9
    Introduction to Spark’s Python and Scala Shells 11
    Introduction to Core Spark Concepts 14
    Standalone Applications 17
        Initializing a SparkContext 17
        Building Standalone Applications 18
    Conclusion 21

3. Programming with RDDs 23
    RDD Basics 23
    Creating RDDs 25
    RDD Operations 26
        Transformations 27
        Actions 28
        Lazy Evaluation 29
    Passing Functions to Spark 30
        Python 30
        Scala 31
        Java 32
    Common Transformations and Actions 34
        Basic RDDs 34
        Converting Between RDD Types 42
    Persistence (Caching) 44
    Conclusion 46

4. Working with Key/Value Pairs 47
    Motivation 47
    Creating Pair RDDs 48
    Transformations on Pair RDDs 49
        Aggregations 51
        Grouping Data 57
        Joins 58
        Sorting Data 59
    Actions Available on Pair RDDs 60
    Data Partitioning (Advanced) 61
        Determining an RDD’s Partitioner 64
        Operations That Benefit from Partitioning 65
        Operations That Affect Partitioning 65
        Example: PageRank 66
        Custom Partitioners 68
    Conclusion 70

5. Loading and Saving Your Data 71
    Motivation 71
    File Formats 72
        Text Files 73
        JSON 74
        Comma-Separated Values and Tab-Separated Values 77
        SequenceFiles 80
        Object Files 83
        Hadoop Input and Output Formats 84
        File Compression 87
    Filesystems 89
        Local/“Regular” FS 89
        Amazon S3 90
        HDFS 90
    Structured Data with Spark SQL 91
        Apache Hive 91
        JSON 92
    Databases 93
        Java Database Connectivity 93
        Cassandra 94
        HBase 96
        Elasticsearch 97
    Conclusion 98

6. Advanced Spark Programming 99
    Introduction 99
    Accumulators 100
        Accumulators and Fault Tolerance 103
        Custom Accumulators 103
    Broadcast Variables 104
        Optimizing Broadcasts 106
    Working on a Per-Partition Basis 107
    Piping to External Programs 109
    Numeric RDD Operations 113
    Conclusion 115

7. Running on a Cluster 117
    Introduction 117
    Spark Runtime Architecture 117
        The Driver 118
        Executors 119
        Cluster Manager 119
        Launching a Program 120
        Summary 120
    Deploying Applications with spark-submit 121
    Packaging Your Code and Dependencies 123
        A Java Spark Application Built with Maven 124
        A Scala Spark Application Built with sbt 126
        Dependency Conflicts 128
    Scheduling Within and Between Spark Applications 128
    Cluster Managers 129
        Standalone Cluster Manager 129
        Hadoop YARN 133
        Apache Mesos 134
        Amazon EC2 135
    Which Cluster Manager to Use? 138
    Conclusion 139

8. Tuning and Debugging Spark 141
    Configuring Spark with SparkConf 141
    Components of Execution: Jobs, Tasks, and Stages 145
    Finding Information 150
        Spark Web UI 150
        Driver and Executor Logs 154
    Key Performance Considerations 155
        Level of Parallelism 155
        Serialization Format 156
        Memory Management 157
        Hardware Provisioning 158
    Conclusion 160

9. Spark SQL 161
    Linking with Spark SQL 162
    Using Spark SQL in Applications 164
        Initializing Spark SQL 164
        Basic Query Example 165
        SchemaRDDs 166
        Caching 169
    Loading and Saving Data 170
        Apache Hive 170
        Parquet 171
        JSON 172
        From RDDs 174
    JDBC/ODBC Server 175
        Working with Beeline 177
        Long-Lived Tables and Queries 178
    User-Defined Functions 178
        Spark SQL UDFs 178
        Hive UDFs 179
    Spark SQL Performance 180
        Performance Tuning Options 180
    Conclusion 182