In this podcast we are talking to Tyler Akidau, a senior engineer at Google who leads the technical infrastructure and data processing teams in Seattle, a founding member of the Apache Beam PMC, and a passionate voice in the streaming space. The podcast covers data streaming and the 2015 Dataflow Model paper, including why dealing with out-of-order data is important, event time versus processing time, and windowing approaches, and finishes with a preview of the track he is hosting at QCon SF next week.
Key Takeaways
- Batch processing and streaming aren’t two incompatible things; they are a function of different windowing options.
- Event time and processing time are two different concepts, and may be out of step with each other.
- Completeness is knowing that you have processed all the events for a particular window.
- Windowing choices can be reasoned about by answering the what, where, when, and how questions.
- Unbounded versus bounded data is a better dimension than stream or batch processing.
When did Apache Beam graduate, and what does it mean?
- 1:38 It graduated in December 2016, and is a full-fledged top-level project at Apache.
- 2:23 Graduating means that we have built a community, it’s not just a single-company project, and it’s done following the Apache way.
How does Apache measure the community?
- 2:38 They look at the distribution of committers and commits, to make sure they aren’t all coming from a single company.
- 2:58 They look at user and developer list interactions.
What were the big challenges that Spark solved over batch processing with Hadoop?
- 3:48 Spark solved a bunch of issues; both streaming and batch processing.
- 3:58 For streaming - and Spark wasn’t the first on the scene here - it was the ability to create pipelines that could continuously process data.
- 4:08 People have been creating pipelines using batch processing systems since the MapReduce paper was published in 2004, by creating finite data sets which are subsequently processed.
- 4:28 The semantic shift to continuous processing of data is the big change that these streaming systems brought about.
Why was there a lack of trust with streaming systems initially?
- 4:48 The early streaming systems sacrificed correctness at some level to provide lower latency.
- 5:03 You ended up with a set of streaming systems in the industry that couldn’t be relied upon to give you absolutely correct or repeatable answers.
- 5:13 This meant that you couldn’t use them for billing systems, for example.
- 5:18 You aren’t going to charge people money when you can’t be sure you’re adding the numbers correctly, for example.
- 5:23 This led to the Lambda architecture where you run a weakly consistent but low latency streaming system alongside a strongly consistent but high latency batch system, and get the best of both worlds.
- 5:38 This was the big benefit that Spark Streaming originally brought to the table: a streaming system that also had strong consistency semantics.
What was the big contribution of the Dataflow Model paper to the community?
- 5:58 There were two things; the biggest overall contribution was that it gave a compelling and coherent picture of why dealing with out-of-order data is important, and laid out the fundamentals of how you could build these systems.
- 6:23 Another aspect of the early streaming systems was that they didn’t have a strong notion of event time correctness.
- 6:43 This paper really called this out in a clear way and cited examples of where this is important.
- 6:58 It was, in some sense, a call-to-action.
- 7:08 There are arguments to be made that there wasn’t anything academically interesting in the paper; the main interesting academic contribution was the support for dynamically merging windows.
- 7:23 A lot of the impact came from the clear message that it presented.
What do you mean by event time and processing time?
- 7:53 When you’re analysing data from users, you want to record when events occur on the user’s device.
- 8:18 You later want to analyse those events in the context of when the interactions happened with the system, and between users of that system.
- 8:33 To do that, you need to utilise the time when the event happened, like when the user tapped the screen.
- 8:43 The important thing to realise is that the time when the event happens may be very different from the time that you see that event.
- 8:58 It’s particularly exacerbated with mobile devices, because the user can be using the phone in airplane mode for example.
- 9:13 Your system won’t get those events until they land and turn off airplane mode.
- 9:18 When the system actually sees the events is processing time.
- 9:28 Historically systems didn’t really provide primitives to allow you to reason about these two times.
- 9:38 Developers would try to group records by when they were actually seen - and as long as they were roughly in order, you can kind of get away with this.
- 9:53 When you bring mobile devices into the equation, or introduce network partitions in a data centre, you can start getting grossly inaccurate results if you rely solely on processing time.
- 10:13 What you really want to use is event time, and put events into context using that.
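To make the distinction concrete, here is a minimal sketch using the Beam Java SDK (the `TapEvent` type is a hypothetical stand-in, not something from the podcast): each element carries the time the tap happened on the device, and `WithTimestamps` attaches that event time to the element, so downstream windowing reasons about when the event occurred rather than when the pipeline happened to see it.

```java
import java.io.Serializable;

import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

/** Hypothetical event recorded on the user's device. */
class TapEvent implements Serializable {
  final String userId;
  final long eventTimeMillis; // when the user tapped the screen (event time)

  TapEvent(String userId, long eventTimeMillis) {
    this.userId = userId;
    this.eventTimeMillis = eventTimeMillis;
  }
}

class EventTimeExample {
  /**
   * Stamps each element with the event time embedded in the record. Processing
   * time - the wall-clock moment the pipeline actually sees the element - plays
   * no part in how these elements are later windowed.
   */
  static PCollection<TapEvent> withEventTimestamps(PCollection<TapEvent> taps) {
    // Note: if the source already attached later (arrival-time) timestamps,
    // moving timestamps backwards also needs an allowed-timestamp-skew setting.
    return taps.apply(
        WithTimestamps.of((TapEvent tap) -> new Instant(tap.eventTimeMillis)));
  }
}
```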
What is completeness?
- 10:28 Completeness is an additional step beyond event time: event time tells you exactly when an event occurred, while completeness gives you a sense of knowing that you have seen all the events up to a particular time.
- 10:58 The idea of watermarks - discussed in the paper - is that if events can arrive out-of-order, then you need some other way of knowing when you have seen all of the events.
- 11:28 For example, if you are processing hourly windows, and you want to know if you have processed all events between 11 and 12, then how do you know you have all of the events before 12?
- 11:38 Once I know that I have seen all of the inputs up to a specific time, I can declare the results of the processing correct and complete.
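As a rough illustration of that hourly-window example in the Beam Java SDK (the element type and per-key count are arbitrary choices here, not from the podcast): the watermark trigger fires once per window, when the system believes it has seen all the input for that hour - e.g. the 11:00-12:00 result is emitted once the watermark passes 12:00.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class HourlyCompleteness {
  /** Counts events per key per hour, emitting each hour's result exactly once,
      when the watermark says that hour's input is believed to be complete. */
  static PCollection<KV<String, Long>> hourlyCounts(PCollection<KV<String, Integer>> events) {
    return events
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
            // Fire a single "complete" result when the watermark passes the end
            // of the window (e.g. passes 12:00 for the 11:00-12:00 window).
            .triggering(AfterWatermark.pastEndOfWindow())
            // Anything arriving after the watermark is simply dropped here.
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(Count.perKey());
  }
}
```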
How can you guarantee that someone isn’t still in airplane mode?
- 11:58 The short answer is: you can’t. The mobile case is the tricky case.
- 12:08 Watermarks are data-source specific; you build them specifically on whatever system you’re using to give you the data - Kafka, Pub/Sub or HDFS - and each gives different mechanisms to allow you to reason about completeness.
- 12:28 That’s combined with the nature of the input source itself; for example, there’s a vast difference between mobile devices and web frontends.
- 12:48 If you’re logging monotonically increasing time from a fleet of web frontend servers, then you can create an accurate watermark.
- 13:28 With mobile events, if you’re logging events into Pub/Sub - where you don’t have strong ordering guarantees - you need to build a heuristic based on knowing what your input sources are like and how the underlying transport works, and make a best guess.
- 13:48 For example, we try to estimate the events queued up in Pub/Sub, and from that build a histogram of what that distribution looks like - and if you take a minimum cut-off from it, you can create a watermark based on that.
- 14:23 It won’t catch grossly late records - like the airplane mode example previously - so you can’t use it as an absolute guarantee that you’ve seen everything; but you can use it as a confidence interval that you have seen almost all of the events up to that time.
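This is not Dataflow's actual Pub/Sub watermark estimator, just a toy Java sketch of the heuristic idea described above: sample the lag (processing time minus event time) of recently delivered messages, and let the watermark trail the current time by a high percentile of that lag distribution, accepting that truly late data - the airplane-mode case - can still slip past it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Toy heuristic watermark: "we have probably seen everything older than this". */
class HeuristicWatermark {
  private static final int SAMPLE_SIZE = 10_000;
  private final Deque<Long> recentLagsMillis = new ArrayDeque<>();

  /** Record one message: its embedded event time and the time we saw it. */
  void observe(long eventTimeMillis, long processingTimeMillis) {
    recentLagsMillis.addLast(Math.max(0, processingTimeMillis - eventTimeMillis));
    if (recentLagsMillis.size() > SAMPLE_SIZE) {
      recentLagsMillis.removeFirst(); // keep a bounded, recent sample of lags
    }
  }

  /** Estimated watermark: now minus the 99th-percentile observed lag.
      Later records remain possible; they are just judged unlikely. */
  long estimateMillis(long nowMillis) {
    if (recentLagsMillis.isEmpty()) {
      return Long.MIN_VALUE; // no data yet, so claim nothing is complete
    }
    List<Long> sortedLags = new ArrayList<>(recentLagsMillis);
    Collections.sort(sortedLags);
    long p99LagMillis = sortedLags.get((int) (0.99 * (sortedLags.size() - 1)));
    return nowMillis - p99LagMillis;
  }
}
```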
What are the different windowing mechanisms that you can use?
- 15:08 The idea of windowing is to group records by event time - trying to co-locate events that occurred together within some time frame.
- 15:28 Effectively the different types of windowing are: fixed windows (e.g. every hour, every day - constant-sized windows); sliding windows (fixed size, but moving); and sessions (all three are sketched in code below).
- 16:28 The idea with sessions is that you’re trying to identify a stream of user activity.
- 16:38 People play a mobile game for a period of time and then stop.
- 16:43 When you’re trying to understand user engagement, it’s useful to see what they do in these bursts of activity.
- 16:58 You can take these sets of otherwise independent events and put them together in a group that ties the events from each burst together.
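A sketch of the three shapes in the Beam Java SDK; the keyed element type and the specific durations are arbitrary, and only the `Window.into(...)` lines matter. Sessions are explored further in the next question.

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class WindowingShapes {
  static PCollection<KV<String, Integer>> fixedHourly(PCollection<KV<String, Integer>> in) {
    // Constant-sized, non-overlapping windows: [11:00, 12:00), [12:00, 13:00), ...
    return in.apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardHours(1))));
  }

  static PCollection<KV<String, Integer>> slidingHourly(PCollection<KV<String, Integer>> in) {
    // Fixed one-hour size, but a new window starts every ten minutes, so windows overlap.
    return in.apply(Window.<KV<String, Integer>>into(
        SlidingWindows.of(Duration.standardHours(1)).every(Duration.standardMinutes(10))));
  }

  static PCollection<KV<String, Integer>> sessions(PCollection<KV<String, Integer>> in) {
    // Data-driven windows: a session ends once a key sees no events for one minute.
    return in.apply(Window.<KV<String, Integer>>into(
        Sessions.withGapDuration(Duration.standardMinutes(1))));
  }
}
```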
So is there some kind of session id used to do this?
- 17:23 You can take some kind of group identifier - sessions over a team id, for example - and use that to form a group.
- 17:43 You look at the times of those events - and you can associate events that happened within a minute of each other.
- 17:53 Once you pass that cut-off after the last event, no later events can be considered to be a part of that session.
- 18:28 You have an explicit event cut-off delta, where any events that don’t happen within that time are considered to be part of a subsequent session.
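A minimal sketch of that in the Beam Java SDK, assuming the events have already been keyed by a user or team identifier and stamped with event-time timestamps: events for the same key that fall within a one-minute gap of each other are merged into one session, and anything beyond the cut-off starts a new session.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class UserSessions {
  /** Counts how many events each user produced per burst of activity (session). */
  static PCollection<KV<String, Long>> eventsPerSession(
      PCollection<KV<String, Integer>> keyedEvents) {
    return keyedEvents
        // Events for the same key within one minute of each other merge into one
        // session window; a one-minute silence closes the session.
        .apply(Window.<KV<String, Integer>>into(
            Sessions.withGapDuration(Duration.standardMinutes(1))))
        .apply(Count.perKey());
  }
}
```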
What made you think of combining streaming and batch processing?
- 19:08 Within Google we had independent batch and streaming systems, and having worked on them for years it was clear that they weren’t really that different.
- 19:28 What we were aiming to do is to converge the APIs of the two different systems so that developers could move fluidly between them.
What does the capability matrix look like from batch to micro-batch to streaming?
- 19:58 The reason it works so well from an API perspective is that batch is effectively a subset of streaming.
- 20:08 If you have a cleanly defined API for describing a streaming computation, it applies to a batch process as well (see the sketch at the end of this section).
- 20:23 People have done windowing in batch systems in two ways; one is to pre-window the input data (such as computing daily stats), and the other is to break down subsets of batches.
- 20:53 The one thing that you tend not to utilise as much in batch is triggers, which in streaming allow you to define when you want the outputs to be materialised.
- 21:18 When you have an unbounded stream, you have to have some way of knowing when to produce results.
- 21:28 Beam as it currently exists gives you a large number of ways to slice and dice data, which is probably more than you actually need.
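To illustrate the "batch is a subset of streaming" point in the Beam Java SDK: the same windowed aggregation can be applied to a bounded PCollection read from files or to an unbounded one read from a message bus; only the source differs. The file path here is a placeholder, and the unbounded variant is only sketched in a comment.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class SameTransformBothWorlds {
  /** One windowed aggregation, regardless of whether `lines` is bounded or unbounded.
      (Real inputs would first be stamped with event-time timestamps.) */
  static PCollection<KV<String, Long>> hourlyCounts(PCollection<String> lines) {
    return lines
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
        .apply(Count.perElement());
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded input: a finite set of files (placeholder path).
    PCollection<String> bounded = p.apply(TextIO.read().from("/path/to/events-*.txt"));
    hourlyCounts(bounded);

    // An unbounded input - e.g. read from Kafka or Pub/Sub via the corresponding
    // IO connector - would be passed through the very same hourlyCounts(...).

    p.run();
  }
}
```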
How do you handle the choice of windows?
- 22:08 There are four questions to think about - what, where, when, how.
- 22:18 What are you calculating? Where in event time are you calculating it? When in processing time do you want to see results? How do you want refinements of those results to relate?
- 22:43 So the first question, what: are you calculating sums, or building machine learning models? This is what you think of when you are doing traditional batch processing.
- 23:03 The rest of the questions allow you to deal with time, which is important when you’re dealing with an unbounded sequence of events.
- 23:08 The importance of event time is that it reflects when the event actually happened.
- 23:38 The when question is about when the pipeline emits results; in batch processing the answer was traditionally at the end of processing the batch input.
- 23:53 When you’re processing an unbounded sequence of events you need to know when you are going to kick out a result.
- 23:58 There are a few major ways of handling this.
- 24:08 One way is to have a repeated cadence; for example, every minute or every hour.
- 24:13 For example, if you’re calculating daily statistics, you’d group records by day and then produce updated results every minute.
- 24:43 This allows you to build up a view of what’s happening in that window during the day without needing to see the whole day’s results to produce an interim result.
- 25:03 It’s like an SQL standing query, except that it is updated on every record insert instead of periodically.
- 25:18 Another type of trigger in Beam is a watermark trigger, used where you want a single result for a given window, or a way of annotating results at the end of the window.
- 25:33 You can set up a trigger when the watermark goes past the end of a window; the system has some confidence that you’ve seen all the events that are covered by that window, and so you can do some processing that can be complete.
- 26:18 If you want a system that can perform anomaly detection on search query traffic, looking for spikes or dips in particular search terms, you don’t want to continually report dips that are really just incomplete data.
- 27:28 Completeness (via watermarks) and regular updates are the two main uses.
- 27:38 The last question is how refinements of results relate to each other.
- 27:48 This gives a way to fine-tune the outputs.
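Putting the four questions together in a Beam Java SDK sketch - the input is assumed to be event-time-stamped per-key integer scores, and the daily window, one-minute cadence and per-key sum are illustrative choices rather than anything from the podcast:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class WhatWhereWhenHow {
  static PCollection<KV<String, Integer>> dailySums(PCollection<KV<String, Integer>> scores) {
    return scores
        // WHERE in event time: group records into daily windows.
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardDays(1)))
            // WHEN in processing time: speculative updates roughly every minute while
            // data is flowing, plus a final result once the watermark passes the end
            // of the day.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)
            // HOW refinements relate: each pane includes everything seen so far, so
            // later results supersede earlier ones rather than being deltas.
            .accumulatingFiredPanes())
        // WHAT is being computed: a per-key sum.
        .apply(Sum.integersPerKey());
  }
}
```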
What happens to late arrivals that are received outside that window?
- 28:08 There are a few common options.
- 28:13 One is to just throw the event away; in some cases, you just don’t care.
- 28:18 In others, you want to include that event in some way with the prior results.
- 28:23 If you’re doing periodic materialised view updates, then you can add them into the next update of results.
- 28:33 In other cases you might want to have an explicit way of retracting the old answer before restating the new answer, in case the difference triggers some kind of threshold.
- 28:53 The other mechanism is to partition the answers into the previous answer and some kind of dead letter result.
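A sketch of the "fold late data into the next update" option in the Beam Java SDK (the one-hour window, one-day lateness horizon, and per-key sum are illustrative): a late firing re-emits the window whenever a late record arrives within the allowed lateness, and accumulating panes make the update restate the full result. Setting the allowed lateness to zero would instead drop late records, and `discardingFiredPanes()` would emit only the delta.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class LateDataHandling {
  static PCollection<KV<String, Integer>> hourlySumsWithLateUpdates(
      PCollection<KV<String, Integer>> events) {
    return events
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // Whenever a late record arrives within the allowed lateness,
                // fire an updated pane that folds it into the earlier result.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Keep window state for one extra day; anything later than that is dropped.
            .withAllowedLateness(Duration.standardDays(1))
            // Updated panes restate the full result rather than emitting a delta.
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
  }
}
```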
How would you describe the shape of data in applications today?
- 29:43 The vast majority of data sets we use today in business are infinite data sets - the data keeps coming.
- 30:23 A lot of businesses care about what users are doing, and data keeps on coming.
- 30:33 Often they are wildly out of order; particularly for mobile applications, you never know what the delay is between the user performing the action and the back end receiving the event.
What is the difference between bounded and unbounded versus batch and streaming?
- 30:58 It’s a better characterisation of the data that you are processing.
- 31:23 The thing we care about is essentially: is the data finite or infinite?
- 31:28 You can process a bounded data set with a streaming system; you can also process unbounded data with a batch system.
- 31:33 Ever since MapReduce was created at Google, where it was used for search queries, it has been used as an ad-hoc, ahead-of-time way of grouping finite data.
- 31:58 There are just de-facto associations between each type of system and a particular type of data set.
What will your talk at QConSF include?
- 32:48 The first part of the talk will be talking about streams and tables, and the second how that relates to streaming results in SQL.
- 32:58 It’s basically covering two chapters of the book I’m writing - it tries to capture the essentials, and if people want to go further then they can look at those two chapters.
- 33:13 The first two-thirds of the talk are about streams and tables.
- 33:33 The idea with a database is that transactions come in, are applied to a transaction log, and then applied to the tables.
- 33:48 That log is essentially a stream; the application of a stream over time is what creates a table (sketched in code at the end of this section).
- 34:03 The flip side of how you go from a table to a stream is viewing the transaction log as a change log of the tables.
- 34:38 I relate this idea to the Beam model in two ways.
- 34:58 First it walks you through MapReduce as a standard batch processing mechanism.
- 35:03 It shows how stream and table theory relates to MapReduce.
- 35:23 It also lets you see this in the context of bounded data.
- 35:38 After we have a basic theory of how streams and tables relate, we compare it to the Beam model.
- 35:58 You can see that windowing is related to tables, and triggers are related to creating streams from tables.
- 36:23 In the last section, we talk about SQL and streaming.
- 36:53 The key thing to call out is that SQL is built on relational algebra.
- 37:13 The point I try to get across is that streaming is data changing over time; so what we are doing is capturing the evolution of that relation over time.
- 37:48 You don’t have to change anything in the relational algebra - it just works for streams and tables.
- 39:08 The last five minutes of the talk cover how to extend SQL for streaming - if you hammer home the point of integrating it cleanly and choose good defaults then you mostly don’t need extensions - and when you do, you can limit them to a specific set of things.
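A toy plain-Java sketch (not Beam code, and only an illustration of the duality described above): applying a changelog stream to mutable state yields a table, and capturing each mutation of that table yields the changelog stream again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamTableDuality {

  /** One changelog record: "set key to value". */
  record Update(String key, long value) {}

  public static void main(String[] args) {
    // A "stream": an ordered log of updates (e.g. per-user scores).
    List<Update> changelog = List.of(
        new Update("alice", 3), new Update("bob", 5), new Update("alice", 7));

    Map<String, Long> table = new HashMap<>();      // the "table"
    List<Update> derivedStream = new ArrayList<>(); // changelog re-derived from the table

    for (Update update : changelog) {
      table.put(update.key(), update.value());  // stream -> table: apply updates over time
      derivedStream.add(update);                // table -> stream: record each mutation
    }

    System.out.println(table);         // the table at this point in time: alice=7, bob=5
    System.out.println(derivedStream); // the same changelog, reconstructed from the mutations
  }
}
```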
How do you join unbounded streams?
- 38:58 All of your joins are stream-to-stream joins.
- 39:13 There are different ways of creating those streams.
- 39:18 A stream-to-stream join is merging two groups together based on some sort of key.
- 39:28 Once you have the data you care about for a given key, you can emit a result (a basic windowed join is sketched at the end of this section).
- 39:33 The interesting thing comes with temporal joins when you try and join based on event time.
- 39:43 Julian Hyde, who is the lead engineer on the Apache Calcite project, proposed an example of currency conversions.
- 40:03 Imagine you’re joining a stream of currency conversion rates with a stream of currency conversion requests, both of those streams could be out-of-order.
- 41:08 You want to slice up the currency conversion rates into validity windows.
- 41:28 What’s interesting is that these windows can shrink; a subsequent currency conversion rate can come in that applies from a point in time inside an existing validity window.
- 41:48 We don’t support shrinking windows in Apache Beam; we only support growing them.
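Beam can’t express the shrinking validity windows just described, but as a baseline here is a sketch of a plain windowed stream-to-stream join in the Beam Java SDK using the currency example: both inputs are keyed by currency code, windowed identically, and then co-grouped. The element types (a double for the rate, a double for the amount to convert) and the hourly window are simplified stand-ins.

```java
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

class CurrencyJoin {
  static final TupleTag<Double> RATES = new TupleTag<>();    // conversion rate updates
  static final TupleTag<Double> REQUESTS = new TupleTag<>(); // amounts to convert

  /** Joins the two streams per currency code within matching hourly windows. */
  static PCollection<KV<String, CoGbkResult>> join(
      PCollection<KV<String, Double>> rates,
      PCollection<KV<String, Double>> requests) {

    Window<KV<String, Double>> hourly =
        Window.into(FixedWindows.of(Duration.standardHours(1)));

    return KeyedPCollectionTuple
        .of(RATES, rates.apply("WindowRates", hourly))
        .and(REQUESTS, requests.apply("WindowRequests", hourly))
        // For each currency code and hour, gather the rates and requests that fell
        // into that window; a downstream ParDo would pair them up and convert.
        .apply(CoGroupByKey.create());
  }
}
```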
You’re also a track host for QConSF: what else is happening?
- 42:33 There are two main themes: modern concepts in stream processing, and interesting large-scale use-cases.
- 42:38 For example, Matt Zimmer from Netflix is giving a talk on custom windowing with Flink.
- 43:28 The use-case talks include Facebook, and Strymon optimisations for data centres.