logo资料库

《Streaming Systems》 - Tyler Akidau.pdf

第1页 / 共456页
第2页 / 共456页
第3页 / 共456页
第4页 / 共456页
第5页 / 共456页
第6页 / 共456页
第7页 / 共456页
第8页 / 共456页
资料共456页,剩余部分请下载后查看
Preface Or: What Are You Getting Yourself Into Here?
Navigating This Book
Takeaways
Conventions Used in This Book
Online Resources
Figures
Code Snippets
O’Reilly Safari
How to Contact Us
Acknowledgments
I. The Beam Model
1. Streaming 101
Terminology: What Is Streaming?
On the Greatly Exaggerated Limitations of Streaming
Event Time Versus Processing Time
Data Processing Patterns
Bounded Data
Unbounded Data: Batch
Unbounded Data: Streaming
Summary
2. The What, Where, When, and How of Data Processing
Roadmap
Batch Foundations: What and Where
What: Transformations
Where: Windowing
Going Streaming: When and How
When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!
When: Watermarks
When: Early/On-Time/Late Triggers FTW!
When: Allowed Lateness (i.e., Garbage Collection)
How: Accumulation
Summary
3. Watermarks
Definition
Source Watermark Creation
Perfect Watermark Creation
Heuristic Watermark Creation
Watermark Propagation
Understanding Watermark Propagation
Watermark Propagation and Output Timestamps
The Tricky Case of Overlapping Windows
Percentile Watermarks
Processing-Time Watermarks
Case Studies
Case Study: Watermarks in Google Cloud Dataflow
Case Study: Watermarks in Apache Flink
Case Study: Source Watermarks for Google Cloud Pub/Sub
Summary
4. Advanced Windowing
When/Where: Processing-Time Windows
Event-Time Windowing
Processing-Time Windowing via Triggers
Processing-Time Windowing via Ingress Time
Where: Session Windows
Where: Custom Windowing
Variations on Fixed Windows
Variations on Session Windows
One Size Does Not Fit All
Summary
5. Exactly-Once and Side Effects
Why Exactly Once Matters
Accuracy Versus Completeness
Side Effects
Problem Definition
Ensuring Exactly Once in Shuffle
Addressing Determinism
Performance
Graph Optimization
Bloom Filters
Garbage Collection
Exactly Once in Sources
Exactly Once in Sinks
Use Cases
Example Source: Cloud Pub/Sub
Example Sink: Files
Example Sink: Google BigQuery
Other Systems
Apache Spark Streaming
Apache Flink
Summary
II. Streams and Tables
6. Streams and Tables
Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity
Toward a General Theory of Stream and Table Relativity
Batch Processing Versus Streams and Tables
A Streams and Tables Analysis of MapReduce
Reconciling with Batch Processing
What, Where, When, and How in a Streams and Tables World
What: Transformations
Where: Windowing
When: Triggers
How: Accumulation
A Holistic View of Streams and Tables in the Beam Model
A General Theory of Stream and Table Relativity
Summary
7. The Practicalities of Persistent State
Motivation
The Inevitability of Failure
Correctness and Efficiency
Implicit State
Raw Grouping
Incremental Combining
Generalized State
Case Study: Conversion Attribution
Conversion Attribution with Apache Beam
Summary
8. Streaming SQL
What Is Streaming SQL?
Relational Algebra
Time-Varying Relations
Streams and Tables
Looking Backward: Stream and Table Biases
The Beam Model: A Stream-Biased Approach
The SQL Model: A Table-Biased Approach
Looking Forward: Toward Robust Streaming SQL
Stream and Table Selection
Temporal Operators
Summary
9. Streaming Joins
All Your Joins Are Belong to Streaming
Unwindowed Joins
FULL OUTER
LEFT OUTER
RIGHT OUTER
INNER
ANTI
SEMI
Windowed Joins
Fixed Windows
Temporal Validity
Summary
10. The Evolution of Large-Scale Data Processing
MapReduce
Hadoop
Flume
Storm
Spark
MillWheel
Kafka
Cloud Dataflow
Flink
Beam
Summary
Index
Streaming Systems The What, Where, When, and How of Large-Scale Data Processing Tyler Akidau, Slava Chernyak, and Reuven Lax
Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax Copyright © 2018 Tyler Akidau, Slava Chernyak, and Reuven Lax. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Rachel Roumeliotis and Jeff Bleiel Production Editor: Nicholas Adams Copyeditor: Octal Publishing, Inc. Proofreader: Kim Cofer Indexer: Ellen Troutman-Zaig Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest August 2018: First Edition Revision History for the First Edition 2018-07-12: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491983874 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-98387-4 [LSI]
Preface Or: What Are You Getting Yourself Into Here? Hello adventurous reader, welcome to our book! At this point, I assume that you’re either interested in learning more about the wonders of stream processing or hoping to spend a few hours reading about the glory of the majestic brown trout. Either way, I salute you! That said, those of you in the latter bucket who don’t also have an advanced understanding of computer science should consider how prepared you are to deal with disappointment before forging ahead; caveat piscator, and all that. To set the tone for this book from the get go, I wanted to give you a heads up about a couple of things. First, this book is a little strange in that we have multiple authors, but we’re not pretending that we somehow all speak and write in the same voice like we’re weird identical triplets who happened to be born to different sets of parents. Because as interesting as that sounds, the end result would actually be less enjoyable to read. Instead, we’ve opted to each write in our own voices, and we’ve granted the book just enough self- awareness to be able to make reference to each of us where appropriate, but not so much self-awareness that it resents us for making it only into a book 1 and not something cooler like a robot dinosaur with a Scottish accent. As far as voices go, there are three you’ll come across: Tyler That would be me. If you haven’t explicitly been told someone else is speaking, you can assume that it’s me, because we added the other authors somewhat late in the game, and I was basically like, “hells no” when I thought about going back and updating everything I’d already written. I’m the technical lead for the Data Processing Languages ands Systems group at Google, responsible for Google Cloud Dataflow, Google’s Apache Beam efforts, as well as Google-internal data 2
processing systems such as Flume, MillWheel, and MapReduce. I’m also a founding Apache Beam PMC member. Figure P-1. The cover that could have been...
Slava Slava was a long-time member of the MillWheel team at Google, and later an original member of the Windmill team that built MillWheel’s successor, the heretofore unnamed system that powers the Streaming Engine in Google Cloud Dataflow. Slava is the foremost expert on watermarks and time semantics in stream processing systems the world over, period. You might find it unsurprising then that he’s the author of Chapter 3, Watermarks. Reuven Reuven is at the bottom of this list because he has more experience with stream processing than both Slava and me combined and would thus crush us if he were placed any higher. Reuven has created or led the creation of nearly all of the interesting systems-level magic in Google’s general-purpose stream processing engines, including applying an untold amount of attention to detail in providing high-throughput, low-latency, exactly-once semantics in a system that nevertheless utilizes fine-grained checkpointing. You might find it unsurprising that he’s the author of Chapter 5, Exactly-Once and Side Effects. He also happens to be an Apache Beam PMC member. Navigating This Book Now that you know who you’ll be hearing from, the next logical step would be to find out what you’ll be hearing about, which brings us to the second thing I wanted to mention. There are conceptually two major parts to this book, each with four chapters, and each followed up by a chapter that stands relatively independently on its own. The fun begins with Part I, The Beam Model (Chapters 1–4), which focuses on the high-level batch plus streaming data processing model originally developed for Google Cloud Dataflow, later donated to the Apache Software Foundation as Apache Beam, and also now seen in whole or in part across most other systems in the industry. It’s composed of four chapters:
Chapter 1, Streaming 101, which covers the basics of stream processing, establishing some terminology, discussing the capabilities of streaming systems, distinguishing between two important domains of time (processing time and event time), and finally looking at some common data processing patterns. Chapter 2, The What, Where, When, and How of Data Processing, which covers in detail the core concepts of robust stream processing over out-of-order data, each analyzed within the context of a concrete running example and with animated diagrams to highlight the dimension of time. Chapter 3, Watermarks (written by Slava), which provides a deep survey of temporal progress metrics, how they are created, and how they propagate through pipelines. It ends by examining the details of two real-world watermark implementations. Chapter 4, Advanced Windowing, which picks up where Chapter 2 left off, diving into some advanced windowing and triggering concepts like processing-time windows, sessions, and continuation triggers. Between Parts I and II, providing an interlude as timely as the details contained therein are important, stands Chapter 5, Exactly-Once and Side Effects (written by Reuven). In it, he enumerates the challenges of providing end-to-end exactly-once (or effectively-once) processing semantics and walks through the implementation details of three different approaches to exactly- once processing: Apache Flink, Apache Spark, and Google Cloud Dataflow. Next begins Part II, Streams and Tables (Chapters 6–9), which dives deeper into the conceptual and investigates the lower-level “streams and tables” way of thinking about stream processing, recently popularized by some upstanding citizens in the Apache Kafka community but, of course, invented decades ago by folks in the database community, because wasn’t everything? It too is composed of four chapters: Chapter 6, Streams and Tables, which introduces the basic idea of
分享到:
收藏