Netflix's goal is to predict what you want to watch before you watch it. They do this by running a number of machine learning (ML) workflows every day. Meson is a workflow orchestration and scheduling framework that manages the lifecycle of all these machine learning pipelines that build, train and validate personalization algorithms to help with the video recommendations.
Netflix development team recently wrote about Meson framework and how it helps with the ML pipelines. One of the goals of Meson is to increase the velocity, reliability and repeatability of algorithmic experiments while allowing engineers to use the technology of their choice for each of the steps in the pipeline.
Some of the technologies that play an important role in machine learning pipelines at Netflix include Spark MLlib, Python, R and Docker.
A typical machine learning pipeline that drives video recommendations includes the following steps:
- User Selection
- Feature Generation
- Model Training
- Model Validation
- Publish Model
Selecting a set of users is done via a Hive query to select the cohort for analysis. Then the data cleansing and preparation is done by a Python script that creates two sets of users for ensuring parallel paths. In the parallel paths, one uses Apache Spark to build and analyze a global model with HDFS as temporary storage. The other uses R language to build region (country) specific models. The number of regions is dynamic based on the cohort selected for analysis.
Model validation is done by Scala code that tests for the stability of the models when the two paths converge. The whole process is repeated if the model is not stable. Finally, a Docker container is used to publish the new model which is picked up by other systems.
They use resource manager tools like Apache Mesos to satisfy the resource requirements needed in the ML workflows. Mesos provides task isolation and abstraction of CPU, memory, storage, and other compute resources. Meson leverages these features to achieve scale and fault tolerance for its tasks.
Meson also consists of Scheduler and Executor components.
Meson Scheduler: This component manages the launch, flow control and runtime of the various workflows. Meson delegates the actual resource scheduling to Mesos by passing the memory and CPU requirements to Mesos. Once a step is ready to be scheduled, the Meson scheduler chooses the right resource offer from Mesos and ships off the task to the Mesos master.
Meson Executor: This is a custom Mesos executor which allows the team to maintain a communication channel with Meson. This is useful for long running tasks where framework messages can be sent to the Meson scheduler. This also enables to pass custom data.
Once Mesos schedules a Meson task, it launches a Meson executor on a slave after downloading all task dependencies. While the core task is being executed, the executor takes care of other tasks like sending heartbeats, percent complete, status messages etc.
Meson also offers a Scala based DSL that allows for creating customized workflows. There is also native Spark support in Meson which allows for monitoring of the Spark job progress from within Meson. It also has the ability to retry failed spark steps or kill Spark jobs that may have gone astray.
Netflix team plans to open source Meson in the coming months and build a community around it.