A Software as a Service (SaaS) system requires a robust Control Plane as its brain: it automates infrastructure provisioning, deployments, configuration, monitoring, fleet management, capacity allocation to customers, and more.
During the third day of QCon San Francisco, Sergey Bykov, an SDE at Temporal Technologies, presented "Durable Execution for Control Planes: Building Temporal Cloud on Temporal." In his talk, Bykov introduced the concept of Durable Execution with a real-world example of how his company uses it to build the Control Plane for Temporal Cloud, including lessons learned. The session was part of the "Architecting for the Cloud" track.
Bykov started his session with how to host a managed cloud service, explaining the tenancy models for single- and multi-tenant provisioning. A single-tenant model is simpler, with dedicated resources, better isolation, and a physical instance as the product. Multi-tenancy, on the other hand, is more complex, with shared resources, better economics, and a virtual instance as the product. He then shared the first lesson learned: managed multi-tenanted (serverless) services are what customers want.
Next, Bykov talked about the units customers get in cloud resources: units of product, isolation, capacity, and management. He discussed namespace as a service in Temporal Cloud and provided an example. He then explained the difference between the control plane and the data plane using an analogy from telecommunications, and pointed out the control plane's concerns around provisioning, configuration, monitoring, resource management, metering, and billing.
Regarding the data plane, Bykov explained the cell architecture, whose primary objective is to keep data plane resources as isolated as possible. This approach mitigates potential issues within the system, as problems arising within one cell will ideally not impact the rest of the network. Additionally, when deploying changes or updates, one implements the change within a single cell and thoroughly tests its impact, thereby reducing the potential blast radius of the modification. In the session, Bykov showed the Temporal cell architecture and provided an example of Temporal cells in AWS.
Bykov continued his session by discussing a control plane scenario that deals with provisioning, configuration, and monitoring. Provisioning can be done through a workflow-like process, underpinned by durable execution technology, that creates cells. Bykov described creating a namespace, the virtual resource a customer gets, as a sequence of steps: choose a target cell, create namespace artifacts (such as the namespace record in the database, roles, and user permissions), provision infrastructure artifacts, and validate the endpoints.
The lesson learned here is that not having to worry about failures, retries, backoffs, and timeouts greatly enhances developer productivity.
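As an illustration of that point, here is a minimal sketch of what such a provisioning workflow could look like with Temporal's Go SDK. The package, activity names, and parameters are hypothetical stand-ins, not Temporal Cloud's actual implementation:

```go
package provisioning

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// Hypothetical activities; real implementations would call databases and
// cloud provider APIs.
func ChooseTargetCell(ctx context.Context, customerID string) (string, error) {
	return "cell-us-west-2a", nil // e.g., pick the least-loaded cell
}
func CreateNamespaceRecord(ctx context.Context, cellID, ns string) error   { return nil }
func ProvisionInfrastructure(ctx context.Context, cellID, ns string) error { return nil }
func ValidateEndpoints(ctx context.Context, ns string) error               { return nil }

// CreateNamespaceWorkflow sketches the steps Bykov described: choose a
// target cell, create namespace artifacts, provision infrastructure, and
// validate the endpoints. Failures, retries, backoffs, and timeouts are
// handled by the platform's retry policy, not by this code.
func CreateNamespaceWorkflow(ctx workflow.Context, customerID, ns string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})

	var cellID string
	if err := workflow.ExecuteActivity(ctx, ChooseTargetCell, customerID).Get(ctx, &cellID); err != nil {
		return err
	}
	if err := workflow.ExecuteActivity(ctx, CreateNamespaceRecord, cellID, ns).Get(ctx, nil); err != nil {
		return err
	}
	if err := workflow.ExecuteActivity(ctx, ProvisionInfrastructure, cellID, ns).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, ValidateEndpoints, ns).Get(ctx, nil)
}
```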
Next, Bykov discussed another scenario: rolling out a new version of the Temporal software to all cells in a fleet. Here he explained the concept of deployment rings and how the rollout runs as a durably executed workflow.
Deploying a new version across deployment rings takes time, and Bykov shared the lesson learned here: long-running operations are trivial with Durable Execution.
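A rough sketch of how such a ring-by-ring rollout might be expressed as a durable workflow in the Go SDK; the DeploymentRing type, activity names, and the 24-hour bake time are illustrative assumptions:

```go
package rollout

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// DeploymentRing groups cells that are upgraded together; problems surface
// on a small blast radius before the rollout widens.
type DeploymentRing struct {
	Name  string
	Cells []string
}

// Hypothetical activities standing in for the real deployment machinery.
func UpgradeCell(ctx context.Context, cell, version string) error  { return nil }
func ValidateRing(ctx context.Context, ring, version string) error { return nil }

// RolloutWorkflow deploys a new version ring by ring, waiting between rings.
// Because the workflow is durable, a rollout spanning days survives worker
// restarts without any extra bookkeeping.
func RolloutWorkflow(ctx workflow.Context, version string, rings []DeploymentRing) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
	})

	for _, ring := range rings {
		for _, cell := range ring.Cells {
			if err := workflow.ExecuteActivity(ctx, UpgradeCell, cell, version).Get(ctx, nil); err != nil {
				return err
			}
		}
		// Durable timer: let the new version bake before the next ring.
		if err := workflow.Sleep(ctx, 24*time.Hour); err != nil {
			return err
		}
		if err := workflow.ExecuteActivity(ctx, ValidateRing, ring.Name, version).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```

The durable timer between rings is the point: the workflow can sleep for a day at a time without holding a process or thread hostage, which is what makes long-running rollouts trivial.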
In the last part of the session, Bykov talked about entity workflows: structured sequences of tasks and actions that define how data or objects are processed and handled. He referred to entity workflows as digital twins of cells, namespaces, customer accounts, and users.
The lesson learned here is that entity workflows are great digital twins.
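As a sketch of the pattern, an entity workflow in the Go SDK could be one long-running execution per cell that holds the cell's state and reacts to commands; the signal names, state shape, and event cap below are assumptions for illustration:

```go
package entity

import (
	"go.temporal.io/sdk/workflow"
)

// CellState is the digital-twin state this entity workflow maintains for one cell.
type CellState struct {
	Version    string
	Namespaces []string
}

// CellEntityWorkflow is one long-running execution per cell: it holds the
// cell's state, answers queries, and reacts to commands sent as signals.
func CellEntityWorkflow(ctx workflow.Context, state CellState) error {
	if err := workflow.SetQueryHandler(ctx, "get-state", func() (CellState, error) {
		return state, nil
	}); err != nil {
		return err
	}

	addNs := workflow.GetSignalChannel(ctx, "add-namespace")
	upgrade := workflow.GetSignalChannel(ctx, "upgrade")

	selector := workflow.NewSelector(ctx)
	selector.AddReceive(addNs, func(c workflow.ReceiveChannel, _ bool) {
		var ns string
		c.Receive(ctx, &ns)
		state.Namespaces = append(state.Namespaces, ns)
	})
	selector.AddReceive(upgrade, func(c workflow.ReceiveChannel, _ bool) {
		var version string
		c.Receive(ctx, &version)
		state.Version = version
	})

	// Handle a bounded number of events, then carry the state into a fresh
	// execution (continue-as-new) so the twin can live indefinitely.
	for i := 0; i < 1000; i++ {
		selector.Select(ctx)
	}
	return workflow.NewContinueAsNewError(ctx, CellEntityWorkflow, state)
}
```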
Lastly, Bykov provided a summary of his session:
- Managed multi-tenanted services (serverless) are what customers want
- The Control Plane is the brain of the service
- Control Plane operations take time and face failures
- Implementations gravitate to workflows
- Durable Execution is an obvious fit
- Entity workflows are natural digital twins
InfoQ interviewed Sergey Bykov after the session.
InfoQ: After your talk, an attendee in the audience asked how you get back to the code where the failure happened when provisioning the namespace.
Sergey Bykov: This question is critical to understanding the framework and the tradeoffs. If the worker process crashes/restarts while executing step 3 of a workflow, it will start executing the workflow function from the beginning. However, because all 'step' calls are wrapped in workflow.ExecuteActivity() and similar wrappers, all of them get intercepted by the framework.
The framework (SDK) will load the history of the incomplete execution from the server, which has steps 1 and 2 recorded as completed along with the outputs of their execution. Instead of re-executing these steps, the SDK will return the recorded outputs, and step 3 will be the first to execute.
We call this process Replay (hence the conference name). Workflow code gets the illusion that it re-executes completely. In the process, it reconstructs any memory state from those recorded outputs, even though only the remaining steps will, in fact, execute after the restart.
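Bykov's description maps onto a short Go SDK sketch; the Step1-Step3 activities below are hypothetical:

```go
package replayexample

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// Hypothetical step activities.
func Step1(ctx context.Context, in string) (string, error) { return in + "-1", nil }
func Step2(ctx context.Context, in string) (string, error) { return in + "-2", nil }
func Step3(ctx context.Context, in string) (string, error) { return in + "-3", nil }

// ThreeStepWorkflow illustrates Replay. If the worker crashes while step 3
// runs, the function re-executes from the top, but the SDK intercepts the
// ExecuteActivity calls for steps 1 and 2 and returns their recorded outputs
// from history; only step 3 actually runs again.
func ThreeStepWorkflow(ctx workflow.Context, input string) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	var out1, out2, out3 string
	if err := workflow.ExecuteActivity(ctx, Step1, input).Get(ctx, &out1); err != nil {
		return "", err
	}
	// On replay, out1 and out2 are reconstructed from recorded outputs,
	// giving the code the illusion of a complete re-execution.
	if err := workflow.ExecuteActivity(ctx, Step2, out1).Get(ctx, &out2); err != nil {
		return "", err
	}
	if err := workflow.ExecuteActivity(ctx, Step3, out2).Get(ctx, &out3); err != nil {
		return "", err
	}
	return out3, nil
}
```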
InfoQ: In your talk, you mentioned other companies building a control plane on Temporal. Why do companies like Redpanda, Netflix, Datadog, and HashiCorp choose Temporal?
Bykov: Over the years, many workflow systems have been built that use a Domain-Specific Language (DSL) to define workflow logic. They work for relatively simple use cases but gradually become challenging to maintain as complexity grows. It's very difficult to debug (or test) an issue in a YAML/JSON definition of logic when it gets executed by a generic runtime. The workflow-as-code approach removes this indirection and puts developers in complete control of the code's execution.
Even with workflow-as-code, there are many edge cases and limitations that aren't obvious from the start, especially as code evolves over time: timers, performance isolation, dead-letter queues, the determinism of re-execution, scalability, system protection (throttling), fairness of execution, etc. These companies realized that building such a solution in-house is a massive multi-year effort with no guarantee of success. The build vs. buy decision is straightforward here.
Of the available workflow-as-code solutions, Temporal is a clear thought and market leader, laser-focused on developing this approach into a new paradigm of building software: Durable Execution. With that, it makes perfect sense for these companies to leverage Temporal's investments in the platform, and in many cases its hosted cloud service, while focusing on their core competencies and delivering their unique value to customers instead of solving very general problems.