Netflix recently published a blog post detailing how it built a reliable device management platform using an MQTT-based event sourcing implementation. To scale its solution, Netflix utilizes Apache Kafka, Alpakka-Kafka and CockroachDB.
Netflix's Device Management Platform is the system that manages hardware devices used for automated testing of its applications. Netflix engineers Benson Ma and Alok Ahuja describe the journey the platform went through:
Kafka streams processing can be difficult to get right. (...) Fortunately, the primitives provided by Akka streams and Alpakka-Kafka empower us to achieve exactly this by allowing us to build streaming solutions that match the business workflows we have while scaling up developer productivity in building out and maintaining these solutions. With the Alpakka-Kafka-based processor in place (...), we have ensured fault tolerance in the consumer side of the control plane, which is key to enabling accurate and reliable device state aggregation within the Device Management Platform.
(...) The reliability of the platform and its control plane rests on significant work made in several areas, including the MQTT transport, authentication and authorization, and systems monitoring. (...) As a result of this work, we can expect the Device Management Platform to continue to scale to increasing workloads over time as we onboard ever more devices into our systems.
The following diagram depicts the architecture.
Source: https://netflixtechblog.com/towards-a-reliable-device-management-platform-4f86230ca623
A local Reference Automation Environment (RAE) embedded computer connects to several devices under test (DUT). The Local Registry service is responsible for detecting, onboarding, and maintaining information about all connected devices on the RAE. As device attributes and properties change over time, the Local Registry saves these changes and simultaneously publishes them upstream to a cloud-based control plane. In addition to attribute changes, the Local Registry publishes a complete snapshot of the device record at regular intervals. These checkpoint events enable faster state reconstruction by consumers of the data feed while guarding against missed updates.
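The value of those checkpoint events can be sketched in a few lines. The following is an illustrative event-sourcing example, not Netflix's code: the event shapes (`"checkpoint"`, `"delta"`) and field names are hypothetical, but it shows why a consumer that starts from the latest full snapshot recovers state without replaying the entire delta history.

```python
# Illustrative sketch (not Netflix's code): rebuilding device state from a
# feed that mixes partial attribute updates ("delta") with periodic full
# snapshots of the device record ("checkpoint").

def reconstruct(events):
    """Rebuild the latest device record from a mixed event feed.

    Starting from the most recent checkpoint guards against earlier missed
    updates and avoids replaying every delta since the session began.
    """
    start = 0
    state = {}
    # Find the most recent checkpoint; fall back to an empty record.
    for i, event in enumerate(events):
        if event["type"] == "checkpoint":
            start, state = i + 1, dict(event["record"])
    # Apply only the deltas that arrived after that checkpoint.
    for event in events[start:]:
        state.update(event["attrs"])
    return state

feed = [
    {"type": "delta", "attrs": {"status": "onboarding"}},
    {"type": "checkpoint", "record": {"status": "online", "fw": "1.2"}},
    {"type": "delta", "attrs": {"fw": "1.3"}},
]
print(reconstruct(feed))  # {'status': 'online', 'fw': '1.3'}
```

Note that the first delta is never replayed: the checkpoint already reflects it, which is exactly what makes reconstruction cheap for late-joining consumers.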
Updates are published to the cloud using MQTT. MQTT is an OASIS standard messaging protocol for the Internet of Things (IoT). It is a lightweight yet reliable publish/subscribe messaging transport ideal for connecting remote devices with a small code footprint and minimal network bandwidth. The MQTT broker is responsible for receiving all messages, filtering them, and sending them to the subscribed clients accordingly.
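The broker-side filtering mentioned above works by matching published topic names against subscription filters. The sketch below is a simplified illustration of the MQTT wildcard rules (`+` matches exactly one topic level, `#` matches any number of trailing levels); the topic layout is hypothetical and the function ignores some edge cases in the full specification.

```python
# Illustrative sketch: simplified MQTT topic-filter matching, as a broker
# would apply it when deciding which subscribers receive a message.
# "+" matches any single topic level; "#" matches all remaining levels.

def topic_matches(filter_: str, topic: str) -> bool:
    f_parts, t_parts = filter_.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":                      # multi-level wildcard: matches the rest
            return True
        if i >= len(t_parts):             # filter is longer than the topic
            return False
        if f != "+" and f != t_parts[i]:  # "+" matches any single level
            return False
    return len(f_parts) == len(t_parts)

# Hypothetical topic layout embedding a device session id:
topic = "devices/session-42/attributes"
print(topic_matches("devices/+/attributes", topic))  # True
print(topic_matches("devices/#", topic))             # True
print(topic_matches("devices/+/events", topic))      # False
```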
Netflix uses Apache Kafka throughout the organization. Consequently, a bridge converts MQTT messages to Kafka records, setting each record's key to the MQTT topic to which the message was assigned. Ma and Ahuja describe that "since device updates published on MQTT contain the device_session_id in the topic, all device information updates for a given device session will effectively appear on the same Kafka partition, thus giving us a well-defined message order for consumption."
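The per-session ordering guarantee follows from keyed partitioning: Kafka's default partitioner hashes the record key modulo the partition count, so equal keys always land on the same partition. The sketch below illustrates the idea with CRC-32 as a stand-in for Kafka's murmur2 hash; the topic string and partition count are hypothetical.

```python
# Illustrative sketch: keyed partitioning. Records with the same key (here,
# the MQTT topic embedding the device_session_id) always hash to the same
# partition, so their relative order is preserved for consumers.
import zlib

NUM_PARTITIONS = 12  # hypothetical partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2; CRC-32 stands in here.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

key = "devices/session-42/attributes"  # hypothetical MQTT topic used as the key
assert partition_for(key) == partition_for(key)  # deterministic: same key,
                                                 # same partition, ordered reads
```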
The Cloud Registry ingests the published messages, processes them, and pushes materialized data into a datastore backed by CockroachDB. CockroachDB is an implementation of a class of RDBMS called NewSQL. Ma and Ahuja explain Netflix's choice:
CockroachDB is chosen as the backing data store since it offered SQL capabilities, and our data model for the device records was normalized. In addition, unlike other SQL stores, CockroachDB is designed from the ground up to be horizontally scalable, which addresses our concerns about Cloud Registry's ability to scale up with the number of devices onboarded onto the Device Management Platform.
The following diagram shows the Kafka processing pipeline comprising the Cloud Registry.
Source: https://netflixtechblog.com/towards-a-reliable-device-management-platform-4f86230ca623
Netflix considered several frameworks for implementing the stream processing pipelines depicted above, including Kafka Streams, Spring KafkaListener, Project Reactor, and Flink. It eventually chose Alpakka-Kafka because it provides Spring Boot integration together with "fine-grained control over stream processing, including automatic back-pressure support and streams supervision." Furthermore, according to Ma and Ahuja, Akka and Alpakka-Kafka are more lightweight than the alternatives, and since they are more mature, the maintenance costs over time will be lower.
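The back-pressure behavior cited above is pull-based: demand flows upstream from the consumer, so a slow stage throttles ingestion instead of letting messages pile up. The following is a minimal illustration of that principle using a bounded buffer, not Alpakka or Akka Streams code; Alpakka-Kafka implements the same idea via the Reactive Streams protocol.

```python
# Illustrative sketch (not Alpakka code): pull-based back-pressure with a
# bounded buffer. The producer blocks once the buffer is full, so a slow
# consumer naturally throttles upstream demand instead of being overwhelmed.
import queue
import threading

buffer = queue.Queue(maxsize=4)  # bounded: this is the back-pressure point
consumed = []

def producer():
    for i in range(20):
        buffer.put(i)            # blocks while the buffer is full
    buffer.put(None)             # sentinel: end of stream

def consumer():
    while True:
        item = buffer.get()      # the consumer pulls at its own pace
        if item is None:
            break
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(consumed == list(range(20)))  # True: nothing dropped or duplicated
```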
The Alpakka-Kafka-based implementation replaced an earlier implementation based on Spring KafkaListener. Metrics measured on the new production implementation reveal that Alpakka-Kafka's native back-pressure support can dynamically scale its Kafka consumption: unlike KafkaListener, Alpakka-Kafka neither under-consumes nor over-consumes Kafka messages. Also, a drop in the maximum consumer lag values following the release revealed that Alpakka-Kafka and the streaming capabilities of Akka perform well at scale, even in the face of sudden message loads.