Streaming Systems
The What, Where, When, and How of Large-Scale
Data Processing
Tyler Akidau, Slava Chernyak, and Reuven Lax
Copyright © 2018 Tyler Akidau, Slava Chernyak, and Reuven Lax. All rights
reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Editors: Rachel Roumeliotis and Jeff Bleiel
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing, Inc.
Proofreader: Kim Cofer
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
August 2018: First Edition
Revision History for the First Edition
2018-07-12: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491983874 for release
details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Streaming Systems, the cover image, and related trade dress are trademarks of
O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that
the information and instructions contained in this work are accurate, the
publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-98387-4
[LSI]
Preface Or: What Are You Getting Yourself Into Here?
Hello adventurous reader, welcome to our book! At this point, I assume that
you’re either interested in learning more about the wonders of stream
processing or hoping to spend a few hours reading about the glory of the
majestic brown trout. Either way, I salute you! That said, those of you in the
latter bucket who don’t also have an advanced understanding of computer
science should consider how prepared you are to deal with disappointment
before forging ahead; caveat piscator, and all that.
To set the tone for this book from the get go, I wanted to give you a heads up
about a couple of things. First, this book is a little strange in that we have
multiple authors, but we’re not pretending that we somehow all speak and
write in the same voice like we’re weird identical triplets who happened to be
born to different sets of parents. Because as interesting as that sounds, the end
result would actually be less enjoyable to read. Instead, we’ve opted to each
write in our own voices, and we’ve granted the book just enough self-
awareness to be able to make reference to each of us where appropriate, but
not so much self-awareness that it resents us for making it only into a book
and not something cooler like a robot dinosaur with a Scottish accent.
As far as voices go, there are three you’ll come across:
Tyler
That would be me. If you haven’t explicitly been told someone else is
speaking, you can assume that it’s me, because we added the other authors
somewhat late in the game, and I was basically like, “hells no” when I
thought about going back and updating everything I’d already written. I’m
the technical lead for the Data Processing Languages and Systems group
at Google, responsible for Google Cloud Dataflow, Google’s Apache
Beam efforts, as well as Google-internal data processing systems such as
Flume, MillWheel, and MapReduce. I’m also a founding Apache Beam
PMC member.
Figure P-1. The cover that could have been...
Slava
Slava was a long-time member of the MillWheel team at Google, and later
an original member of the Windmill team that built MillWheel’s
successor, the heretofore unnamed system that powers the Streaming
Engine in Google Cloud Dataflow. Slava is the foremost expert on
watermarks and time semantics in stream processing systems the world
over, period. You might find it unsurprising then that he’s the author of
Chapter 3, Watermarks.
Reuven
Reuven is at the bottom of this list because he has more experience with
stream processing than both Slava and me combined and would thus crush
us if he were placed any higher. Reuven has created or led the creation of
nearly all of the interesting systems-level magic in Google’s general-
purpose stream processing engines, including applying an untold amount
of attention to detail in providing high-throughput, low-latency, exactly-
once semantics in a system that nevertheless utilizes fine-grained
checkpointing. You might find it unsurprising that he’s the author of
Chapter 5, Exactly-Once and Side Effects. He also happens to be an
Apache Beam PMC member.
Navigating This Book
Now that you know who you’ll be hearing from, the next logical step would
be to find out what you’ll be hearing about, which brings us to the second
thing I wanted to mention. There are conceptually two major parts to this
book, each with four chapters, and each followed up by a chapter that stands
relatively independently on its own.
The fun begins with Part I, The Beam Model (Chapters 1–4), which focuses
on the high-level batch plus streaming data processing model originally
developed for Google Cloud Dataflow, later donated to the Apache Software
Foundation as Apache Beam, and also now seen in whole or in part across
most other systems in the industry. It’s composed of four chapters:
Chapter 1, Streaming 101, which covers the basics of stream
processing, establishing some terminology, discussing the
capabilities of streaming systems, distinguishing between two
important domains of time (processing time and event time), and
finally looking at some common data processing patterns.
Chapter 2, The What, Where, When, and How of Data Processing,
which covers in detail the core concepts of robust stream processing
over out-of-order data, each analyzed within the context of a concrete
running example and with animated diagrams to highlight the
dimension of time.
Chapter 3, Watermarks (written by Slava), which provides a deep
survey of temporal progress metrics, how they are created, and how
they propagate through pipelines. It ends by examining the details of
two real-world watermark implementations.
Chapter 4, Advanced Windowing, which picks up where Chapter 2
left off, diving into some advanced windowing and triggering
concepts like processing-time windows, sessions, and continuation
triggers.
Between Parts I and II, providing an interlude as timely as the details
contained therein are important, stands Chapter 5, Exactly-Once and Side
Effects (written by Reuven). In it, he enumerates the challenges of providing
end-to-end exactly-once (or effectively-once) processing semantics and walks
through the implementation details of three different approaches to exactly-
once processing: Apache Flink, Apache Spark, and Google Cloud Dataflow.
Next begins Part II, Streams and Tables (Chapters 6–9), which dives deeper
into the conceptual and investigates the lower-level “streams and tables” way
of thinking about stream processing, recently popularized by some upstanding
citizens in the Apache Kafka community but, of course, invented decades ago
by folks in the database community, because wasn’t everything? It too is
composed of four chapters:
Chapter 6, Streams and Tables, which introduces the basic idea of
streams and tables, analyzes the classic MapReduce approach
through a streams-and-tables lens, and then constructs a theory of
streams and tables sufficiently general to encompass the full breadth
of the Beam Model (and beyond).
Chapter 7, The Practicalities of Persistent State, which considers the
motivations for persistent state in streaming pipelines, looks at two
common types of implicit state, and then analyzes a practical use
case (advertising attribution) to inform the necessary characteristics
of a general state management mechanism.
Chapter 8, Streaming SQL, which investigates the meaning of
streaming within the context of relational algebra and SQL, contrasts
the inherent stream and table biases within the Beam Model and
classic SQL as they exist today, and proposes a set of possible paths
forward toward incorporating robust streaming semantics in SQL.
Chapter 9, Streaming Joins, which surveys a variety of different join
types, analyzes their behavior within the context of streaming, and
finally looks in detail at a useful but ill-supported streaming join use