Chapter 1. Introducing Spark Streaming

Large-scale data analytics and Apache Spark

This book focuses on Spark Streaming, the streaming module of Apache Spark, a computation engine for distributed data analytics. Spark's streaming capabilities grew out of the PhD thesis of its first and main developer, Tathagata Das. The concepts that Spark Streaming embodies were not all new at the time of its implementation; it carries a rich legacy of lessons on how to expose an easy way to do distributed programming at massive scale. Chief among these heirlooms is MapReduce, the programming model born at Google that gave rise to Hadoop, and whose concepts we will sketch in the next few pages. Readers familiar with MapReduce may want to skip to "More than MapReduce: how the model came about and how Spark extends it". For others, we will introduce here the main tenets of Spark Streaming: an expressive, MapReduce-inspired programming API (Spark) and its resilient …
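MapReduce's two higher-order functions, and the shuffle that sits between them, can be sketched in a few lines of plain Scala. This is an illustration of the programming model only; the helper names mapPhase, shuffle, and reducePhase are invented for this sketch and are not Hadoop or Spark APIs.

```scala
// Map phase: each input record is turned into zero or more (key, value) pairs.
def mapPhase[A, K, V](records: Seq[A])(f: A => Seq[(K, V)]): Seq[(K, V)] =
  records.flatMap(f)

// Shuffle: group all values by key, as the framework does between the phases.
def shuffle[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

// Reduce phase: combine all values observed for one key into a single result.
def reducePhase[K, V, R](grouped: Map[K, Seq[V]])(f: (K, Seq[V]) => R): Map[K, R] =
  grouped.map { case (k, vs) => (k, f(k, vs)) }

// The canonical word count, expressed with these three steps:
val lines = Seq("spark streaming", "spark core")
val counts =
  reducePhase(
    shuffle(
      mapPhase(lines)(line => line.split(" ").toSeq.map(w => (w, 1)))
    )
  )((_, ones) => ones.sum)
// counts: Map("spark" -> 2, "streaming" -> 1, "core" -> 1)
```

Running the three phases sequentially on one machine, as here, already shows why the model distributes well: the map and reduce functions are pure, so the framework is free to run them on any node that holds a slice of the data.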
Chapter 2. Core Spark Streaming concepts

Apache Spark RDDs

A Spark Hello World

Readers unfamiliar with how to test a Spark application may refer to ???. However, let us give a quick rundown of how to launch a Spark shell for testing. Most distributions of Spark ship with the spark-shell executable. Launching it creates an instance of the Scala REPL with a Spark context already started for you: an instance of a SparkSession object, named spark by default, and an instance of the legacy SparkContext, under the alias sc. Since Spark 2.0, the SparkSession has been the recommended way to interact with Spark, except for Spark Streaming, which relies on the StreamingContext, as we will see later on. For the sake of this short introduction, we will use the SparkSession, spark, to interact with the Spark API. The SparkSession needs a cluster to work. The easiest way to get started is with the default local cluster, which is automatically initialized at the start of the shell whe…
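A defining trait of the RDDs you will manipulate in that shell is that transformations are lazy and only actions trigger computation. As a rough analogy that needs no Spark installation, Scala's collection views behave the same way; the sketch below uses a view to stand in for an RDD, which is a mental model only, not Spark's implementation.

```scala
// A counter so we can observe exactly when the computation runs.
var evaluations = 0

// "Transformation": declaring the map evaluates nothing yet,
// just as rdd.map(...) only records the operation in a lineage.
val transformed = (1 to 5).view.map { n => evaluations += 1; n * 2 }
assert(evaluations == 0) // still lazy

// "Action": forcing the view triggers the work,
// much like rdd.collect() or rdd.count() would.
val result = transformed.toList
assert(evaluations == 5)
// result: List(2, 4, 6, 8, 10)
```

The analogy breaks down on distribution and fault tolerance, of course, but it captures why nothing seems to happen in the shell until you call an action.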
Chapter 3. Streaming application design

Starting with an example: Twitter analysis

Until now, we have mostly covered generalities about how a DStream works, along with snippets of Scala code for using streams. The time has come to put them together in an application. As we progress, we will study how to run that application most efficiently.

The Spark Notebook

Until now, we have created simple examples in the Spark shell. Beyond interactive shells, there is another way of approaching the development of Spark scripts, and that is interactive notebooks. So-called notebooks are web applications tied to a REPL (Read-Eval-Print Loop), otherwise known as an interpreter. They offer the ability to author code in an interactive web-based editor; the code can be executed immediately, and the results are displayed back on the page. In contrast with the spark-shell, previously executed commands become part of a single page, which can be read as a document or executed as a…
Chapter 4. Creating robust deployments

Using spark-submit

In the last chapter, we studied how to create an interactive application using the Spark Notebook. To get a clearer idea of how streaming applications are really used in production, we are now going to focus on doing this with Spark's spark-submit script. Note that the steps described here are important notions that will be reused later in the application. Let's consider again our Twitter streaming application from [Link to Come], which counted the most frequent hashtags in tweets.

package learning.streaming.demo

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkContext._
import StreamingContext._
import org.apache.spark.streaming.twitter._

/**
 * Collect at least the specified number of tweets into json text files.
 */
object BestHashTags {

  def configureTwitterCredentials(apiKey: String,
                                  apiSecret: String,
                                  accessToken: String,
                                  accessTokenSecret: String)…
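The heart of that job, extracting hashtags from tweet text and ranking them by frequency, can be sketched in plain Scala, independent of Spark. In the real application the same logic runs as DStream transformations; the topHashTags helper and the sample tweets below are invented for this illustration.

```scala
// Rank the n most frequent hashtags across a batch of tweet texts.
def topHashTags(tweets: Seq[String], n: Int): Seq[(String, Int)] =
  tweets
    .flatMap(_.split("\\s+"))                        // tokenize each tweet
    .filter(_.startsWith("#"))                       // keep only hashtags
    .groupBy(identity)                               // same shape as a reduceByKey
    .map { case (tag, occurrences) => (tag, occurrences.size) }
    .toSeq
    .sortBy { case (_, count) => -count }            // most frequent first
    .take(n)

val tweets = Seq(
  "loving #spark and #streaming",
  "#spark 2.0 is out",
  "batch vs #streaming round 2 #spark")

val top = topHashTags(tweets, 2)
// top: Seq(("#spark", 3), ("#streaming", 2))
```

Once wrapped in DStream transformations and submitted with spark-submit, the same per-batch computation runs continuously over the live tweet stream.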
Chapter 5. Streaming Programming API

In this chapter we take a detailed look at the Spark Streaming API. After a detailed review of the operations that constitute the DStream API, we will learn how to interact with Spark SQL and gain insight into the measuring and monitoring capabilities that help us understand the performance characteristics of our Spark Streaming applications.

In the Spark Streaming programming model, we can observe two broad levels of interaction:

- operations that apply to a single element of the stream, and
- operations that apply to the underlying RDD of each micro-batch.

As we will learn throughout this chapter, these two levels correspond to the split of responsibilities in the interaction between the Spark core engine and Spark Streaming. We have seen how DStreams, or Discretized Streams, are a streaming abstraction in which the elements of the stream are grouped into micro-batches. In turn, each micro-batch is represented by an RDD. At the execution level,…
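The two levels above can be modeled in plain Scala: if a stream is a sequence of micro-batches, each micro-batch standing in for one RDD, then an element-centric operation maps over elements across all batches while an RDD-centric operation rewrites each batch as a whole. This is an illustrative sketch of the distinction, not Spark's implementation; the names mapElements and transformBatches are invented stand-ins for DStream.map and DStream.transform.

```scala
type MicroBatch[A] = Seq[A]           // stands in for one RDD
type DStream[A]    = Seq[MicroBatch[A]]

// Element-centric operation: applied to every element, across all batches,
// in the spirit of DStream.map.
def mapElements[A, B](stream: DStream[A])(f: A => B): DStream[B] =
  stream.map(_.map(f))

// RDD-centric operation: applied to each underlying micro-batch as a whole,
// in the spirit of DStream.transform.
def transformBatches[A, B](stream: DStream[A])(f: MicroBatch[A] => MicroBatch[B]): DStream[B] =
  stream.map(f)

val stream: DStream[Int] = Seq(Seq(1, 2), Seq(3, 4, 5))

val doubled = mapElements(stream)(_ * 2)                 // Seq(Seq(2, 4), Seq(6, 8, 10))
val sums    = transformBatches(stream)(b => Seq(b.sum))  // Seq(Seq(3), Seq(12))
```

The batch-level operation can see an entire micro-batch at once, which is exactly what makes it the hook through which Spark Streaming hands work to the Spark core engine.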
1. Introducing Spark Streaming
   1. Large-scale data analytics and Apache Spark
   2. More than MapReduce: how the model came about and how Spark extends it
      1. A fault-tolerant MapReduce cluster
      2. A distributed file system
      3. Two higher-order functions
   3. Optimizations in a reduce operation
      1. Associativity: a necessary condition
      2. Shuffling
      3. Map-side combiner
   4. To learn more about MapReduce
      1. The Spark ecosystem, approach and polyglot APIs
      2. Multiple frameworks, and a framework scheduler
      3. A data processing engine
      4. A polyglot API
      5. A MapReduce extension
      6. A SQL interface, expanding into a DataFrame interface
      7. A real-time processing engine
      8. In-memory computing, with impact on processing speed and latency
      9. MapReduce and memory legacy
      10. Spark's memory usage
      11. A customizable cache
      12. Operation latency
   5. How Spark Streaming fits in the big picture
      1. Micro-batching
      2. A strong streaming characteristic
      3. A minimal delay
      4. Throughput-oriented tasks
   6. Why you would want to use Spark Streaming
      1. Building a pipeline
      2. Productive deployment of pipelines
      3. Productive implementation of data analysis
   7. To learn more about Spark
   8. Conclusion
   9. Bibliography
2. Core Spark Streaming concepts
   1. Apache Spark RDDs
      1. Resilient Distributed Datasets
      2. Transformations and Actions
      3. The Shuffle
      4. Partitions
      5. Debugging RDDs
      6. Witnessing caching
   2. Spark Streaming Clusters
      1. The Standalone Spark cluster
      2. Yet Another Resource Negotiator (YARN)
      3. Apache Mesos
      4. Spark Streaming: a delicate deployment
   3. To learn more about running Spark on a cluster
   4. Fundamentals of a DStream
      1. A bulk-synchronous model
      2. The Spark Streaming Context
      3. Representing regular updates to a fixed window of data
      4. The Receiver model
      5. Receiver parallelism
   5. Conclusion
   6. Bibliography
3. Streaming application design
   1. Starting with an example: Twitter analysis
      1. The Spark Notebook
      2. Creating a Streaming Application
      3. Creating a Stream
      4. Transformations
      5. Actions and Dataflow
      6. Expressing a Dataflow
      7. Starting the Spark Streaming Context
      8. Summary
   2. Windowed Streams
      1. Windowed Streams
      2. A word on changing the batch interval
      3. Slicing your Stream
   3. Other Data Sources and Connectors
      1. Apache Kafka
      2. Apache Flume
      3. Kinesis
      4. Apache Bahir
      5. How to write a quick stream generator for testing: SocketStream, FileStream, QueueStream
   4. The Lambda Architecture
      1. The evolution of ideas, rather than products
      2. A classical but difficult example
      3. Batch processing and a program's life time
      4. A streaming improvement
      5. A fundamental difficulty: back to the Lambda architecture?
   5. Saving Streams
      1. Stream output and other operations
      2. A word on content selection
      3. Reasons for saving a stream and scaling into real-time
      4. How to save streams with DataFrames
   6. Bibliography
4. Creating robust deployments
   1. Using spark-submit
   2. Thinking about reliability in Spark Streaming: closures and function-passing style
   3. Spark's reliability primitives
   4. Spark's fault tolerance guarantees
      1. The external shuffle service
      2. Cluster-mode deployment
      3. Checkpointing
      4. A hot-swappable master through ZooKeeper
   5. Fault tolerance in Spark Streaming: the context of the Receiver model
   6. Spark Streaming's zero data loss guarantees
   7. Cluster managers and driver restart
   8. Comparing cluster managers
   9. Job stability: a time budget question
      1. Batch interval and processing delay
      2. Going deeper: scheduling delay and processing delay
      3. Fixed-rate throttling
   10. Backpressure
      1. Why backpressure
      2. Dynamic throttling
      3. Tuning the backpressure PID
   11. Fault tolerance in Spark Streaming
      1. Planning for side-effect stutter in transformations
      2. Idempotent side effects for exactly-once processing
      3. Checkpointing and its importance
   12. The Reliable Receiver and the Write-Ahead Log
   13. Apache Kafka and the DirectKafkaReceiver
      1. The Kafka model and its Receiver
   14. Parallel consumers
      1. The Receiver model vs. reliable receivers
   15. Bibliography
5. Streaming Programming API
   1. Basic stream transformations
      1. Element-centric DStream operations
      2. RDD-centric DStream operations
      3. Counting
   2. Output operations
      1. foreachRDD
      2. Third-party output operations
   3. Spark SQL and Spark Streaming
   4. Spark SQL
      1. Accessing Spark SQL functions from Spark Streaming
      2. Dealing with data at rest
      3. Join optimizations
      4. Updating reference data
   5. Stateful streaming computation
      1. updateStateByKey
      2. Statefulness at the scale of a stream
      3. updateStateByKey and its limitations
      4. mapWithState
      5. Using mapWithState
      6. Event-time stream computation with mapWithState
   6. Dynamic windows
      1. reduceByWindow
      2. Invertible aggregations
   7. Caching
   8. Measuring and monitoring
      1. The Streaming UI
      2. The Monitoring API
      3. Conclusion
   9. Bibliography
Learning Spark Streaming
First Edition
Francois Garillot and Gerard Maas
Learning Spark Streaming

by Francois Garillot and Gerard Maas

Copyright © 2017 Francois Garillot. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

December 2017: First Edition
Revision History for the First Edition

2017-06-19: First Early Release
2017-08-28: Second Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491944240 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Spark Streaming, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-94424-0

[FILL IN]