Streaming Systems 流处理系统.pdf

发布时间：2022-06-20 发布人：admin 分类：说明书资料大小：7.56M 资料格式：pdf 举报版权申诉

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第1页.png

第1页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第2页.png

第2页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第3页.png

第3页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第4页.png

第4页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第5页.png

第5页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第6页.png

第6页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第7页.png

第7页 / 共230页

15e0548b-8c7e-482a-b721-144100a605ee.pdf-第8页.png

第8页 / 共230页

Preface Or: What Are You Getting Yourself Into Here?

Navigating This Book

Takeaways

Conventions Used in This Book

Online Resources

Figures

Code Snippets

O’Reilly Safari

How to Contact Us

Acknowledgments

I. The Beam Model

1. Streaming 101

Terminology: What Is Streaming?

On the Greatly Exaggerated Limitations of Streaming

Event Time Versus Processing Time

Data Processing Patterns

Bounded Data

Unbounded Data: Batch

Unbounded Data: Streaming

Summary

2. The What, Where, When, and How of Data Processing

Roadmap

Batch Foundations: What and Where

What: Transformations

Where: Windowing

Going Streaming: When and How

When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!

When: Watermarks

When: Early/On-Time/Late Triggers FTW!

When: Allowed Lateness (i.e., Garbage Collection)

How: Accumulation

Summary

3. Watermarks

Definition

Source Watermark Creation

Perfect Watermark Creation

Heuristic Watermark Creation

Watermark Propagation

Understanding Watermark Propagation

Watermark Propagation and Output Timestamps

The Tricky Case of Overlapping Windows

Percentile Watermarks

Processing-Time Watermarks

Case Studies

Case Study: Watermarks in Google Cloud Dataflow

Case Study: Watermarks in Apache Flink

Case Study: Source Watermarks for Google Cloud Pub/Sub

Summary

4. Advanced Windowing

When/Where: Processing-Time Windows

Event-Time Windowing

Processing-Time Windowing via Triggers

Processing-Time Windowing via Ingress Time

Where: Session Windows

Where: Custom Windowing

Variations on Fixed Windows

Variations on Session Windows

One Size Does Not Fit All

Summary

5. Exactly-Once and Side Effects

Why Exactly Once Matters

Accuracy Versus Completeness

Side Effects

Problem Definition

Ensuring Exactly Once in Shuffle

Addressing Determinism

Performance

Graph Optimization

Bloom Filters

Garbage Collection

Exactly Once in Sources

Exactly Once in Sinks

Use Cases

Example Source: Cloud Pub/Sub

Example Sink: Files

Example Sink: Google BigQuery

Other Systems

Apache Spark Streaming

Apache Flink

Summary

II. Streams and Tables

6. Streams and Tables

Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity

Toward a General Theory of Stream and Table Relativity

Batch Processing Versus Streams and Tables

A Streams and Tables Analysis of MapReduce

Reconciling with Batch Processing

What, Where, When, and How in a Streams and Tables World

What: Transformations

Where: Windowing

When: Triggers

How: Accumulation

A Holistic View of Streams and Tables in the Beam Model

A General Theory of Stream and Table Relativity

Summary

7. The Practicalities of Persistent State

Motivation

The Inevitability of Failure

Correctness and Efficiency

Implicit State

Raw Grouping

Incremental Combining

Generalized State

Case Study: Conversion Attribution

Conversion Attribution with Apache Beam

Summary

8. Streaming SQL

What Is Streaming SQL?

Relational Algebra

Time-Varying Relations

Streams and Tables

Looking Backward: Stream and Table Biases

The Beam Model: A Stream-Biased Approach

The SQL Model: A Table-Biased Approach

Looking Forward: Toward Robust Streaming SQL

Stream and Table Selection

Temporal Operators

Summary

9. Streaming Joins

All Your Joins Are Belong to Streaming

Unwindowed Joins

FULL OUTER

LEFT OUTER

RIGHT OUTER

INNER

ANTI

SEMI

Windowed Joins

Fixed Windows

Temporal Validity

Summary

10. The Evolution of Large-Scale Data Processing

MapReduce

Hadoop

Flume

Storm

Spark

MillWheel

Kafka

Cloud Dataflow

Flink

Beam

Summary

Index

Streaming Systems The What, Where, When, and How of Large-Scale Data Processing Tyler Akidau, Slava Chernyak, and Reuven Lax

Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax Copyright © 2018 Tyler Akidau, Slava Chernyak, and Reuven Lax. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Rachel Roumeliotis and Jeff Bleiel Production Editor: Nicholas Adams Copyeditor: Octal Publishing, Inc. Proofreader: Kim Cofer Indexer: Ellen Troutman-Zaig Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest August 2018: First Edition Revision History for the First Edition 2018-07-12: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781491983874 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-98387-4 [LSI]

Preface Or: What Are You Getting Yourself Into Here? Hello adventurous reader, welcome to our book! At this point, I assume that you’re either interested in learning more about the wonders of stream processing or hoping to spend a few hours reading about the glory of the majestic brown trout. Either way, I salute you! That said, those of you in the latter bucket who don’t also have an advanced understanding of computer science should consider how prepared you are to deal with disappointment before forging ahead; caveat piscator, and all that. To set the tone for this book from the get go, I wanted to give you a heads up about a couple of things. First, this book is a little strange in that we have multiple authors, but we’re not pretending that we somehow all speak and write in the same voice like we’re weird identical triplets who happened to be born to different sets of parents. Because as interesting as that sounds, the end result would actually be less enjoyable to read. Instead, we’ve opted to each write in our own voices, and we’ve granted the book just enough self-awareness to be able to make reference to each of us where appropriate, but not so much self-awareness that it resents us for making it only into a book and not something cooler like a robot dinosaur with a Scottish accent. 1 As far as voices go, there are three you’ll come across: Tyler That would be me. If you haven’t explicitly been told someone else is speaking, you can assume that it’s me, because we added the other authors somewhat late in the game, and I was basically like, “hells no” when I thought about going back and updating everything I’d already written. I’m the technical lead for the Data Processing Languages ands Systems group at Google, responsible for Google Cloud Dataflow, Google’s Apache Beam efforts, as well as Google-internal data processing systems such as Flume, MillWheel, and MapReduce. I’m also a founding Apache Beam PMC member. 2

Figure P-1. The cover that could have been... Slava Slava was a long-time member of the MillWheel team at Google, and later an original member of the Windmill team that built MillWheel’s successor, the heretofore unnamed system that powers the Streaming Engine in Google Cloud Dataflow. Slava is the foremost expert on watermarks and time semantics in stream processing systems the world over, period. You might find it unsurprising

then that he’s the author of Chapter 3, Watermarks. Reuven Reuven is at the bottom of this list because he has more experience with stream processing than both Slava and me combined and would thus crush us if he were placed any higher. Reuven has created or led the creation of nearly all of the interesting systems-level magic in Google’s general-purpose stream processing engines, including applying an untold amount of attention to detail in providing high- throughput, low-latency, exactly-once semantics in a system that nevertheless utilizes fine-grained checkpointing. You might find it unsurprising that he’s the author of Chapter 5, Exactly-Once and Side Effects. He also happens to be an Apache Beam PMC member. Navigating This Book Now that you know who you’ll be hearing from, the next logical step would be to find out what you’ll be hearing about, which brings us to the second thing I wanted to mention. There are conceptually two major parts to this book, each with four chapters, and each followed up by a chapter that stands relatively independently on its own. The fun begins with Part I, The Beam Model (Chapters 1–4), which focuses on the high-level batch plus streaming data processing model originally developed for Google Cloud Dataflow, later donated to the Apache Software Foundation as Apache Beam, and also now seen in whole or in part across most other systems in the industry. It’s composed of four chapters: Chapter 1, Streaming 101, which covers the basics of stream processing, establishing some terminology, discussing the capabilities of streaming systems, distinguishing between two important domains of time (processing time and event time), and finally looking at some common data processing patterns. Chapter 2, The What, Where, When, and How of Data Processing, which covers in detail the core concepts of robust stream processing over out-of-order data, each analyzed within the context of a concrete running example and with animated diagrams to highlight the dimension of time. Chapter 3, Watermarks (written by Slava), which provides a deep survey of temporal progress metrics, how they are created, and how they propagate through pipelines. It ends by examining the details of two real-world watermark implementations. Chapter 4, Advanced Windowing, which picks up where Chapter 2 left off, diving into some advanced windowing and triggering concepts like processing-time windows, sessions, and continuation triggers. Between Parts I and II, providing an interlude as timely as the details contained therein are important, stands Chapter 5, Exactly-Once and Side Effects (written by Reuven). In it, he enumerates the challenges of providing end-to-end exactly-once (or effectively-once) processing semantics and walks through the implementation details of three different approaches to exactly-once processing: Apache Flink, Apache Spark, and Google Cloud Dataflow. Next begins Part II, Streams and Tables (Chapters 6–9), which dives deeper into the conceptual and investigates the lower-level “streams and tables” way of thinking about stream processing, recently popularized by some upstanding citizens in the Apache Kafka community but, of course, invented decades ago by folks in the database community, because wasn’t everything? It too is composed of four chapters: Chapter 6, Streams and Tables, which introduces the basic idea of streams and tables, analyzes the classic MapReduce approach through a streams-and-tables lens, and then constructs a theory of streams and tables sufficiently general to encompass the full breadth of the Beam Model (and beyond). Chapter 7, The Practicalities of Persistent State, which considers the motivations for persistent state in streaming pipelines, looks at two common types of implicit state, and then analyzes a practical use case (advertising attribution) to inform the necessary characteristics of a general state management mechanism. Chapter 8, Streaming SQL, which investigates the meaning of streaming within the context of relational algebra and SQL, contrasts the inherent stream and table biases within the Beam Model and classic SQL as they exist today, and proposes a set of possible paths forward toward incorporating robust streaming semantics in SQL. Chapter 9, Streaming Joins, which surveys a variety of different join types, analyzes their behavior within the context of streaming, and finally looks in detail at a useful but ill-supported streaming join use case: temporal validity windows. Finally, closing out the book is Chapter 10, The Evolution of Large-Scale Data Processing, which strolls through a focused history of the MapReduce lineage of data processing systems, examining some of the important contributions that have evolved streaming systems into what they are today. Takeaways

As a final bit of guidance, if you were to ask me to describe the things I most want readers to take away from this book, I would say this: The single most important thing you can learn from this book is the theory of streams and tables and how they relate to one another. Everything else builds on top of that. No, we won’t get to this topic until Chapter 6. That’s okay; it’s worth the wait, and you’ll be better prepared to appreciate its awesomeness by then. Time-varying relations are a revelation. They are stream processing incarnate: an embodiment of everything streaming systems are built to achieve and a powerful connection to the familiar tools we all know and love from the world of batch. We won’t learn about them until Chapter 8, but again, the journey there will help you appreciate them all the more. A well-written distributed streaming engine is a magical thing. This arguably goes for distributed systems in general, but as you learn more about how these systems are built to provide the semantics they do (in particular, the case studies from Chapters 3 and 5), it becomes all the more apparent just how much heavy lifting they’re doing for you. LaTeX/Tikz is an amazing tool for making diagrams, animated or otherwise. A horrible, crusty tool with sharp edges and tetanus, but an incredible tool nonetheless. I hope the clarity the animated diagrams in this book bring to the complex topics we discuss will inspire more people to give LaTeX/Tikz a try (in “Figures”, we provide for a link to the full source for the animations from this book). Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values determined by context. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. TIP NOTE WARNING

Online Resources There are a handful of associated online resources to aid in your enjoyment of this book. Figures All the of the figures in this book are available in digital form on the book’s website. This is particularly useful for the animated figures, only a few frames of which appear (comic-book style) in the non-Safari formats of the book: Online index: http://www.streamingbook.net/figures Specific figures may be referenced at URLs of the form: http://www.streamingbook.net/fig/ For example, for Figure 2-5: http://www.streamingbook.net/fig/2-5 The animated figures themselves are LaTeX/Tikz drawings, rendered first to PDF, then converted to animated GIFs via ImageMagick. For the more intrepid among you, full source code and instructions for rendering the animations (from this book, the “Streaming 101” and “Streaming 102” blog posts, and the original Dataflow Model paper) are available on GitHub at http://github.com/takidau/animations. Be warned that this is roughly 14,000 lines of LaTeX/Tikz code that grew very organically, with no intent of ever being read and used by others. In other words, it’s a messy, intertwined web of archaic incantations; turn back now or abandon all hope ye who enter here, for there be dragons. Code Snippets Although this book is largely conceptual, there are are number of code and psuedo-code snippets used throughout to help illustrate points. Code for the more functional core Beam Model concepts from Chapters 2 and 4, as well as the more imperative state and timers concepts in Chapter 7, is available online at http://github.com/takidau/streamingbook. Since understanding semantics is the main goal, the code is provided primarily as Beam PTransform/DoFn implementations and accompanying unit tests. There is also a single standalone pipeline implementation to illustrate the delta between a unit test and a real pipeline. The code layout is as follows: src/main/java/net/streamingbook/BeamModel.java Beam PTransform implementations of Examples 2-1 through 2-9 and Example 4-3, each with an additional method returning the expected output when executed over the example datasets from those chapters. src/test/java/net/streamingbook/BeamModelTest.java Unit tests verifying the example PTransforms in BeamModel.java via generated datasets matching those in the book. src/main/java/net/streamingbook/Example2_1.java Standalone version of the Example 2-1 pipeline that can be run locally or using a distributed Beam runner. src/main/java/net/streamingbook/inputs.csv Sample input file for Example2_1.java containing the dataset from the book. src/main/java/net/streamingbook/StateAndTimers.java Beam code implementing the conversion attribution example from Chapter 7 using Beam’s state and timers primitives. src/test/java/net/streamingbook/StateAndTimersTest.java Unit test verifying the conversion attribution DoFns from StateAndTimers.java. src/main/java/net/streamingbook/ValidityWindows.java Temporal validity windows implementation. src/main/java/net/streamingbook/Utils.java Shared utility methods. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and

分享到：

赞收藏

资料库

Streaming Systems 流处理系统.pdf

相关推荐

大数据

热门标签

最新资料