Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap.
Presentation transcript edited by Roopesh Shenoy
Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap.
Bulk of processing that happens at LinkedIn is RPC-style data processing, where one expects a very fast response. On the other end of their response latency spectrum, they have batch processing, for which they use Hadoop for quite a bit. Hadoop processing and batch processing typically happens after the fact, often hours later.
There's this gap between synchronous RPC processing, where the user is actively waiting for a response, and this Hadoop-style processing which despite efforts to shrink it still it takes a long time to run through.
That's where Samza fits in. This is where we can process stuff asynchronously, but we're also not waiting for hours. It typically operates in the order of milliseconds to minutes. The idea is to process stuff relatively quickly and get the data back to wherever it needs to be, whether that's a downstream system or some real-time service.
Chris mentions that right now, this stream processing is the worst-supported in terms of tooling and environment.
LinkedIn sees a lot of use cases for this type of processing –
- Newsfeed displays when people move to another company, when they like an article, when they join a group, et cetera.
News is latency-sensitive and if you use Hadoop to batch-compute it, you might be getting responses hours or maybe even a day later. It is important to get trending articles in News pretty quickly.
- Advertising – getting relevant advertisements, as well as tracking and monitoring ad display, clicks and other metrics
- Sophisticated monitoring that allows performing of complex querys like "the top five slowest pages for the last minute."
Existing Ecosystem at LinkedIn
The existing ecosystem at LinkedIn has had a huge influence in the motivation behind Samza as well as it’s architecture. Hence it is important to have at least a glimpse of what this looks like before diving into Samza.
Kafka is an open-source project that LinkedIn released a few years ago. It is a messaging system that fulfills two needs – message-queuing and log aggregation. All of LinkedIn’s user activity, all the metrics and monitoring data, and even database changes go into this.
LinkedIn also has a specialized system called Databus, which models all of their databases as a stream. It is like a database with the latest data for each key-value pair. But as this database mutates, you can actually model that set of mutations as a stream. Each individual change is a message in that stream.
Because LinkedIn has Kafka and because they’ve integrated with it for the past few years, a lot of data at LinkedIn, almost all of it, is available in a stream format as opposed to a data format or on Hadoop.
Motivation for Building Samza
Chris mentions that when they began doing stream processing, with Kafka and all this data in their system, they started with something like a web service that would start up, read messages from Kafka and do some processing, and then write the messages back out.
As they did this, they realized that there were a lot of problems that needed to be solved in order to make it really useful and scalable. Things like partitioning: how do you partition your stream? How do you partition your processor? How do you manage state, where state is defined essentially as something that you maintain in your processor between messages, or things like count if you're incrementing a counter every time a message comes in. How do you re-process?
With failure semantics, you get at least once, at most once, exactly once messaging. There is also non-determinism. If your stream processor is interacting with another system, whether it's a database or it's depending on time or the ordering of messages, how you deal with stuff that actually determines the output that you will end up sending?
Samza tries to address some of these problems.
Samza Architecture
The most basic element of Samza is a stream. The stream definition for Samza is much more rigid and heavyweight than you would expect from other stream processing systems. Other processing systems, such as Storm, tend to have very lightweight stream definitions to reduce latency, everything from, say, UDP to a straight-up TCP connection.
Samza goes the other direction. It wants its streams to be, for starters, partitions. It wants them to be ordered. If you read Message 3 and then Message 4, you are never going to get those inverted within a single partition. It also wants them to replayable, which means you should be able to go back to reread a message at a later date. It wants them to be fault-tolerant. If a host from Partition 1 disappears, it should still be readable on some other hosts. Also, the streams are usually infinite. Once you get to the end – say, Message 6 of Partition 0 – you would just try to reread the next message when it's available. It's not the case that you're finished.
This definition maps very well to Kafka, which LinkedIn uses as the streaming infrastructure for Samza.
There are many concepts to understand within Samza. In a gist, they are –
- Streams – Samza processes streams. A stream is composed of immutable messages of a similar type or category. The actual implementation can be provided via a messaging system such as Kafka (where each topic becomes a Samza Stream) or a database (table) or even Hadoop (a directory of files in HDFS)
Things like message ordering, batching are handled via streams.
- Jobs – a Samza job is code that performs logical transformation on a set of input streams to append messages to a set of output streams
- Partitions – For scalability, each stream is broken into one or more partitions. Each partition is a totally ordered sequence of messages
- Tasks – again for scalability, a job is distributed by breaking it into multiple tasks. The task consumes data from one partition for each of the job’s input streams
- Containers – whereas partitions and tasks are logical units of parallelism, containers are unit physical parallelism. Each container is a unix process (or linux cgroup) and runs one or more tasks.
- TaskRunner – Taskrunner is Samza’s stream processing container. It manages startup, execution and shutdown of one or more StreamTask instances.
- Checkpointing – Checkpointing is generally done to enable failure recovery. If a taskrunner goes down for some reason (hardware failure, for e.g.), when it comes back up, it should start consuming messages where it left off last – this is achieved via Checkpointing.
- State management – Data that needs to be passed between processing of different messages can be called state – this can be something as simple as a keeping count or something a lot more complex. Samza allows tasks to maintain persistent, mutable, queryable state that is physically co-located with each task. The state is highly available: in the event of a task failure it will be restored when the task fails over to another machine.
This datastore is pluggable, but Samza comes with a key-value store out-of-the-box.
- YARN (Yet Another Resource Manager) is Hadoop v2’s biggest improvement over v1 – it separates the Map-Reduce Job tracker from the resource management and enables Map-reduce alternatives to use the same resource manager. Samza utilizes YARN to do cluster management, tracking failures, etc.
Samza provides a YARN ApplicationMaster and a YARN job runner out of the box.
You can understand how the various components (YARN, Kafka and Samza API) interact by looking at the detailed architecture. Also read the overall documentation to understand each component in detail.
Possible Improvements
One of the advantages of using something like YARN with Samza is that it enables you to potentially run Samza on the same grid that you already run your draft tasks, test tasks, and MapReduce tasks. You could use the same infrastructure for all of that. However, LinkedIn currently does not run Samza in a multi-framework environment because the existing setup itself is quite experimental.
In order to get into a more multi-framework environment, Chris says that the process isolation would have to get a little better.
Conclusion
Samza is a relatively young project incubating at Apache so there's a lot of room to get involved. A good way to get started is with the hello-samza project, which is a little thing that will get you up and running in about five minutes. It will let you play with a real-time change log from the Wikipedia servers to let you figure out what's going on in and give you a stream of stuff to play with.
The other stream processing project built on top of Hadoop is STORM. You can see a comparison between Samza and STORM
About the Author
Chris Riccomini is a Staff Software Engineer at LinkedIn, where he's is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including, "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.