Kartik Paramasivam, who works in streams infrastructure at LinkedIn, wrote two articles over the summer on why and how his company attempts to avoid Lambda architecture using Apache Samza for data processing.
Stream processing technologies are useful when it comes to getting quick results out of streams of data, but they may fall short for demanding use cases with high consistency and robustness requirements.
Lambda architecture has been a popular solution that combines batch and stream processing. The fundamental idea is to have two separate data lanes: a speed layer to provide low-latency results using stream processing technologies and a batch layer to run jobs to provide accurate results on bulk data. Lambda architecture is complex to implement as it relies on multiple technologies and implies the merge of results from both data lanes.
At LinkedIn, Samza is used to perform the processing on data streamed from Apache Kafka. One of the problems described in the article is the late arrival of events. A RocksDB-based key value store was added to Samza to hold the input events for a longer time. On a late arrival, the framework will re-emit any result for the affected window-based computation, given there’s enough information locally to recompute it. A RocksDB based solution was found preferable since relying on an external store (e.g. NoSQL) would imply network and CPU overhead for communication and serialization.
Apache Flink, another stream processing framework, is capable of computing over time windows based on timestamps, that can be event specific or the ingestion timestamp, and provide consistent results for out-of-order streams. It holds the data in memory and triggers window computation on the receipt of a watermark event. Watermarks represent a clock tick and provide Flink a notion of time (that can be event specific).
Other problems such as the processing of duplicated messages due to at-least-once delivery guarantees have been addressed by most frameworks with internal checkpoint mechanisms.
The last set of stream processing problems is the ability to experiment with data as it flows through the system, in an interactive fashion. Agile experimentation is usually performed on batch systems using high-level languages such as SQL, available on commercial products such as Azure Stream Analytics.
Samza SQL is described by Julian Hyde as an effort that applies Apache Calcite to Samza stream processor. Samza SQL is not production-ready yet. Instead, LinkedIn uses Hadoop’s batch nature to perform iterative experimentations with offline datasets, some copied from the same live services databases that Samza-based stream processors query when handling a stream.
Flink is also working towards a robust stream SQL support. Flink’s 1.1, released in August 2016, supports filters and unions on streaming data. Flink also supports Complex Event Processing as a high-level description on how to react to sequences of events.