Last month, Netflix followed up on several previous posts about Mantis, its operational metrics processing platform. The platform enables fine-grained, device-level event publishing and capture for operational metrics. Netflix uses it to build highly granular, real-time insight applications that provide deep visibility into the interactions between Netflix end-user devices and AWS services. These applications include operational dashboards and alerting at the individual title level, measured against SPS (stream starts per second).
Netflix’s existing service-level monitoring systems aren’t suitable for understanding and diagnosing problems with device-level behavior in the context of a specific user, device and entertainment title. Each unique combination of the three forms an asset against which data capture, transformation, reporting and alerting can be executed. Mantis-driven anomaly detection lets engineers track rolling windows of unique events per asset, respond quickly to production issues, and know who is affected in a high-volume, high-cardinality environment that combines real-time streaming and batch data.
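To make the idea of a rolling window of unique events per asset concrete, here is a minimal Python sketch. It is not Mantis code: the class name, the (user, device, title) tuple used as the asset key, the event names and the 60-second window are all assumptions made for illustration, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class RollingUniqueCounter:
    """Counts distinct event IDs per asset over a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # asset -> deque of (timestamp, event_id), oldest first
        self._events = defaultdict(deque)

    def add(self, asset, event_id, timestamp):
        self._events[asset].append((timestamp, event_id))
        self._evict(asset, timestamp)

    def _evict(self, asset, now):
        # Drop events that fell out of the window ending at `now`.
        window = self._events[asset]
        while window and window[0][0] <= now - self.window_seconds:
            window.popleft()

    def unique_count(self, asset, now):
        self._evict(asset, now)
        return len({event_id for _, event_id in self._events[asset]})


# Hypothetical asset key: (user, device, title)
asset = ("user-1", "tv-42", "title-7")
counter = RollingUniqueCounter(window_seconds=60)
counter.add(asset, "play_error", timestamp=0)
counter.add(asset, "play_error", timestamp=10)
counter.add(asset, "buffer_underrun", timestamp=30)
```

An alerting rule could then compare `unique_count` for an asset against a threshold each time the window advances.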
Mantis integrates into the existing Netflix infrastructure by allowing developers and their applications to submit jobs for producing, processing, and querying events from about 20 different data sources including services like Zuul, API, personalization and playback services, and device logging data. The decoupling between producer and consumer allows isolation and flexibility for detecting and fixing production issues.
The architecture is based on Apache Mesos and provides an abstraction layer between an application developer and a cluster of EC2 servers, presenting the cluster as a shared pool of compute resources for stream processing jobs. The application developer can configure a job through a set of APIs or a GUI, and subsequently edit job configurations and query current metrics. Developers can then build applications on top of these data while remaining decoupled from the internal implementation details of Mantis.
Message guarantee levels vary by Mantis job: for example, at-most-once delivery versus at-least-once delivery via Kafka semantics. When asked about alternative architectures that could have included Spark Streaming, Mantis engineer Neeraj Joshi noted that the Kafka-based implementation allowed
more control over how we schedule the resources so we can do smarter allocations like bin packing etc. (that also allows us to scale the jobs).
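The difference between the at-most-once and at-least-once guarantees mentioned above comes down to whether a consumer records its position before or after handling a message. The following Python sketch illustrates that general idea only; it does not use the Kafka client API or any Mantis interface, and `FlakyProcessor`, `consume` and the three-message log are hypothetical names and data.

```python
class FlakyProcessor:
    """Hypothetical handler that fails once for each message in `fail_once`."""

    def __init__(self, fail_once, fail_after_effect):
        self.fail_once = set(fail_once)
        self.fail_after_effect = fail_after_effect
        self.processed = []  # stands in for the job's visible side effects

    def __call__(self, message):
        failing = message in self.fail_once
        self.fail_once.discard(message)
        if failing and not self.fail_after_effect:
            raise RuntimeError("crashed before handling " + message)
        self.processed.append(message)
        if failing:
            raise RuntimeError("crashed after handling " + message)


def consume(log, process, commit_first):
    """Read `log` from the committed offset, restarting after each crash.

    commit_first=True  -> at-most-once: offset is committed before processing.
    commit_first=False -> at-least-once: offset is committed after processing.
    """
    committed = 0
    while committed < len(log):
        try:
            for offset in range(committed, len(log)):
                if commit_first:
                    committed = offset + 1
                process(log[offset])
                if not commit_first:
                    committed = offset + 1
        except RuntimeError:
            continue  # simulate a consumer restart from the committed offset


log = ["a", "b", "c"]

lossy = FlakyProcessor({"b"}, fail_after_effect=False)
consume(log, lossy, commit_first=True)      # "b" is skipped after the crash

duplicating = FlakyProcessor({"b"}, fail_after_effect=True)
consume(log, duplicating, commit_first=False)  # "b" is handled twice
```

In this sketch the at-most-once run ends with `["a", "c"]` and the at-least-once run with `["a", "b", "b", "c"]`, which is why jobs that choose at-least-once delivery generally need idempotent or deduplicating downstream logic.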
Mantis uses a master-and-agent cluster model built on Fenzo, a resource manager that Netflix recently open-sourced as a Java scheduler library. Fenzo autoscales Mesos worker clusters by adding and removing instances. Scaling is based on resource-usage metrics, job-scheduling timelines and human interaction with jobs through resource-usage dashboards, and Fenzo uses the same signals to help assign EC2 resources dynamically. A job manager handles metadata retention, SLAs, artifact locations, job topology and lifecycle.
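The bin packing that Joshi mentions can be illustrated with a classic first-fit-decreasing heuristic. This is a generic textbook sketch in Python, not Fenzo's scheduling algorithm or API; the job names, CPU demands and the six-CPU agent size are invented for the example.

```python
def first_fit_decreasing(job_cpus, agent_capacity):
    """Pack jobs onto as few equally sized agents as this heuristic finds.

    job_cpus: dict of job name -> CPUs requested (hypothetical values).
    Returns a list of agents, each a list of the job names placed on it.
    """
    agents = []  # each entry: {"free": remaining CPUs, "jobs": [names]}
    # Place the largest jobs first; break ties by name for determinism.
    for name, cpus in sorted(job_cpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if cpus > agent_capacity:
            raise ValueError(name + " does not fit on any agent")
        for agent in agents:
            if agent["free"] >= cpus:
                agent["free"] -= cpus
                agent["jobs"].append(name)
                break
        else:
            # No existing agent has room, so "scale up" by one agent.
            agents.append({"free": agent_capacity - cpus, "jobs": [name]})
    return [agent["jobs"] for agent in agents]


jobs = {
    "alerts": 4,
    "dashboards": 3,
    "sps-by-title": 2,
    "zuul-errors": 2,
    "device-logs": 1,
}
placement = first_fit_decreasing(jobs, agent_capacity=6)
```

Here the twelve requested CPUs fit on two agents. Packing jobs tightly in this way is what lets an autoscaler release instances that end up empty.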
Mantis can execute streaming, non-blocking jobs with back-pressure awareness, data transformation, and asynchronous result storage. A job can be defined as single-stage for basic transformation and aggregation use cases, or as multi-stage for sharding and processing high-volume, high-cardinality event streams.
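A rough way to picture a multi-stage job is a first stage that shards events by key and a second stage that aggregates each shard independently. The Python sketch below assumes that shape; the function names, the CRC32-based routing and the (key, value) event format are illustrative choices, not details taken from Mantis.

```python
import zlib
from collections import Counter


def shard_for(key, num_shards):
    # CRC32 gives a stable hash across processes (unlike built-in hash()).
    return zlib.crc32(key.encode("utf-8")) % num_shards


def stage_one(events, num_shards):
    """Stage 1: route each (key, value) event to a shard chosen by its key."""
    shards = [[] for _ in range(num_shards)]
    for key, value in events:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards


def stage_two(shard):
    """Stage 2: aggregate values per key within one shard."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def run_sharded_job(events, num_shards):
    merged = {}
    for shard in stage_one(events, num_shards):
        # A key always lands in the same shard, so results never overlap.
        merged.update(stage_two(shard))
    return merged


events = [("title-1", 1), ("title-2", 1), ("title-1", 1), ("title-3", 5)]
totals = run_sharded_job(events, num_shards=4)
```

Because every event for a given key is routed to the same shard, each second-stage worker can aggregate without coordinating with the others, which is what makes the approach suit high-cardinality streams.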
When asked about Mantis job customization, user-defined jobs and internal implementation details, Netflix engineer Nick Mahilani noted:
Several reusable jobs exist that the users can submit with different set of parameters. For example, some jobs are parameterized to connect to different sources, perform grouping based on different keys and detect anomalies based on parameterized thresholds. Some jobs also accept parameters that are dynamically compiled into a template job.…
Users can develop new applications that can be submitted as jobs. They can focus on writing the jobs without worrying about scaling or resource provisioning concerns. A Mantis job is implemented by including the Mantis-runtime library and implementing a Java interface. The job is passed an RxJava Observable<MantisEvent> which job authors can transform using Rx operators. The results are plumbed into the next stage or the sink for other jobs to consume the transformed stream. To deploy the job, the users package the job as a .zip artifact, which is distributed to the Mantis cluster.
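The shape Mahilani describes, a stream passed through operators and parameterized by grouping key and threshold, can be approximated without RxJava. The sketch below uses plain Python generators as a stand-in for Rx operators; it does not reproduce the Mantis-runtime interface, and the event fields, function names and threshold value are assumptions.

```python
from collections import Counter


def error_events(events):
    """Operator 1: keep only error events (comparable to an Rx filter)."""
    return (event for event in events if event["type"] == "error")


def count_by(events, key):
    """Operator 2: group events by the chosen field and count each group."""
    return Counter(event[key] for event in events)


def anomalies(counts, threshold):
    """Sink: report the groups whose count reaches the threshold."""
    return sorted(group for group, n in counts.items() if n >= threshold)


def run_threshold_job(events, group_key, threshold):
    # The same job body is reused with different parameters.
    return anomalies(count_by(error_events(events), group_key), threshold)


events = [
    {"type": "error", "device": "tv-42", "title": "title-7"},
    {"type": "error", "device": "tv-42", "title": "title-9"},
    {"type": "play", "device": "tv-42", "title": "title-7"},
    {"type": "error", "device": "phone-3", "title": "title-7"},
]
by_device = run_threshold_job(events, group_key="device", threshold=2)
by_title = run_threshold_job(events, group_key="title", threshold=2)
```

Submitting the same job with `group_key="device"` or `group_key="title"` mirrors the reusable, parameterized jobs described in the quote.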
Mantis is reported to saturate the network interface cards on its servers for operational use cases while using very little CPU.