Key Takeaways
- Large, complex systems hinder development speed. "Platform as a Runtime" simplifies the environment, enabling quicker development cycles.
- Platform engineering goes beyond CI/CD. It streamlines the development process by simplifying code, automating integrations, and eliminating dependencies. This translates to significant developer productivity gains.
- Platforms promote organizational scalability by fostering standardized development practices. Codifying methodologies, tools, and best practices ensures consistency across teams.
- Progressing to a platform as a runtime allows organizations to reduce the microservices footprint and cost and manage a single platform version with its own lifecycle, separate from the business microservice lifecycle.
Introduction
Many companies turn to platform engineering to scale their development teams and to improve developer experience and engineering efficiency. However, platform engineering usually stops at the CI/CD pipeline. As systems become larger and more complex, we need to take the concepts of platform engineering to a higher level, the code level, by creating platforms and abstractions that reduce cognitive load, simplify and accelerate software development, and allow for easy maintenance and upgrades of the platform. This will reduce cross-company tasks such as fixing the infamous Log4J security vulnerabilities. And while we are at it, let's see if we can also reduce our cloud cost by reducing the footprint of each microservice. Let's move from "platform" to "Platform as a Runtime".
The impact of complex software systems on developers and organizations
Large and complex systems can hinder a company's ability to innovate and adapt quickly. These systems often demand that developers address large amounts of information and concerns, leading to cognitive overload. As an engineering manager, I've witnessed this firsthand. New feature development, regardless of its size, can be significantly slowed by the need to address a multitude of cross-cutting concerns, such as network contracts, regulations, and various non-functional requirements that exist alongside core business needs. This is especially true at Wix. Wix is an open platform that exposes many APIs to third-party developers, so all services need to work in the same way, and we have many guidelines on how to build a service and on the best practices for handling scale and being part of the ecosystem.
For instance, a software platform may require that every database change operation sends a domain event. Developers need to remember to define the domain event message and send it on every DB operation, which adds to the cognitive load, time, and complexity of their feature. Every system has many additional requirements of this kind: supporting multiple languages and currencies, complying with GDPR regulations, handling "delete" notifications, implementing best practices such as optimistic locking or a version field in addition to the last-update date on every database schema, and integrating with other sub-systems such as IAM or other components of the ecosystem. This ever-growing list of considerations and "best practices" can significantly impede the release cycle, especially as system complexity increases.
Why software gets complicated
Usually, software systems start small, but as they progress and become larger they become complex systems with intricate dependencies making it harder to understand how changes in one part might affect another.
Software systems are getting increasingly large and distributed across multiple servers and cloud components. Managing and maintaining these distributed systems adds another layer of complexity. Each component and feature has its own best practices and requires special knowledge. For instance, to send a domain event you need to understand how to use Kafka: its APIs, its delivery guarantees (at least once), and the best practices for using it.
The same goes for databases like MySQL or MongoDB, search engines like Elasticsearch, and even other internal services that you integrate with, like your feature flags system. Basically, you need to learn how to make the best use of every component you depend on.
Another point that contributes to the complexity is the lack of a standard way of developing software across teams and developers. For instance, one developer may define a database schema with a UUID primary key, while another uses a Long. One developer may implement the GDPR features for "delete" and "get my data", while another, under pressure from the business to release features quickly, implements only "get my data" without the "delete" functionality. There can even be different implementations of the same feature: one developer implements "GDPR delete" as a hard delete, another as a "soft delete", and a third as data anonymization without actually deleting the records. While these all might be valid solutions, when someone (e.g. the legal team) asks how you implement a GDPR delete, the answer should probably not be "it depends". Systems should behave in a predictable and consistent way.
It is almost impossible to ensure that all developers 100% comply with all the system's non-functional requirements. Even a simple thing like input validations may vary between developers. For instance, some will not allow Nulls in a string field, while others allow Nulls, causing inconsistency in what is implemented across the entire system.
Usually, the first step to aligning all developers on best practices and non-functional requirements is documentation, build and lint rules, and education. However, in a complex world, we can’t build perfect systems. When developers need to implement new functionality, they are faced with trade-offs they need to make.
In many cases, we look at trade-offs in and between three pillars:
Code - When choosing how to build our system, for instance choosing between writing code in a monolith or a microservice, we face several concerns that may affect our decision. How easy is it to understand the existing code and the domain(s)? Can we break an API, and what will be the effect on the system? How easy is it to refactor and test code? And how can we scale our engineering org so that multiple teams can write their own features with little or no dependency on other teams?
Deploy - In this pillar we make trade-offs in relation to the release lifecycle. Can multiple teams release new versions of their code to production whenever they want? How easy and quick is the deployment process? What are the risks of each deployment (the more code you deploy, the greater the chance of a bug)? Another thing to consider is keeping backward compatibility versus breaking APIs. In a monolith, for example, it is easy to refactor and break an (internal) API because you have control over the entire code base, as opposed to a microservices environment, where breaking an API can cause unexpected incidents due to its distributed nature.
Run - In this pillar we consider the operational aspects of our system. What are the performance requirements, and how easy is it to scale parts of the system? When we run in production, how easy is it to understand (monitor) the system? In case of an incident, can we quickly find the owner of the part of the system that fails?
While documentation is a necessary step to define how we would like to develop software and what are the recommended best practices, in reality, developers have a lot of freedom to choose what and how to implement them. Multiple teams will have different internal libraries that implement parts of the guidelines and system contracts in different ways.
This variety of implementations creates ever-increasing technical debt on the system as a whole. Every change in a cross-system requirement needs multiple teams to make changes, and different bugs have to be fixed in different implementations that basically do the same thing. Not so long ago we had the Log4J vulnerabilities, which required almost every team to work on a fix. Making sure that 100% of the code base was fixed was a tremendous task.
The need for standardization
Complex environments demand standardized coding practices.
While defining these standards and consolidating technology stacks are crucial, simply documenting them isn't enough. As I mentioned earlier, too much documentation can overload developers with information.
The solution lies in codification. We can translate these standards, guidelines, and best practices into an opinionated development platform. What we need to provide is a coding platform that automatically takes care of most of the system's cross-cutting concerns for any code developed within it, and that makes it very easy for developers to code within the guidelines, basically creating a golden path to quick product feature development.
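To make the idea concrete, here is a minimal sketch of two cross-cutting concerns mentioned earlier (a domain event on every database change, and a version field for optimistic locking) moved out of product code and into a platform-owned repository. Everything in it is hypothetical and for illustration only: `PlatformRepository`, `DomainEvent`, the in-memory storage, and the `publish` callback that stands in for a message-bus producer are not Nile's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class DomainEvent:
    entity: str
    action: str        # "created" or "updated"
    entity_id: str
    version: int


@dataclass
class PlatformRepository:
    """Hypothetical platform-owned repository: product code only calls save()."""
    entity: str
    publish: Callable[[DomainEvent], None]   # stand-in for a message-bus producer
    _rows: dict[str, dict[str, Any]] = field(default_factory=dict)

    def save(self, entity_id: str, data: dict[str, Any], expected_version: int = 0) -> int:
        current = self._rows.get(entity_id)
        current_version = current["version"] if current else 0
        # Optimistic locking: reject a write that was based on a stale version.
        if expected_version != current_version:
            raise RuntimeError(
                f"stale write: expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._rows[entity_id] = {"data": dict(data), "version": new_version}
        # The domain event is emitted by the platform, not by product code.
        action = "updated" if current else "created"
        self.publish(DomainEvent(self.entity, action, entity_id, new_version))
        return new_version


# Product code: no event definitions, no version bookkeeping.
events: list[DomainEvent] = []
repo = PlatformRepository("product", events.append)
v1 = repo.save("p-1", {"name": "Mug"})
v2 = repo.save("p-1", {"name": "Large mug"}, expected_version=v1)
```

The product developer writes two `save` calls; the event contract and the locking rule live in one place and can be changed there for every service at once.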
Take encryption of PII fields as an example. The platform should automatically handle encryption and decryption of the fields without the developer needing to learn, understand, or even use the encryption library. For instance, just by annotating a field as @PII, the platform would automatically encrypt the field as it is written to the database and decrypt it as it is read back, so developers don't even need to think about it in their code.
Since the cost of developing such a robust platform is very high, we try to limit our software stack as much as possible. Granting unrestricted freedom to deviate from the standard platform increases the system's complexity and maintenance burden, so any divergence should be carefully evaluated against the complexity it introduces.
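A rough sketch of how such field-level handling might look is shown below. It is an illustration under stated assumptions, not the actual platform: the @PII annotation is expressed here as Python dataclass field metadata, `to_row` and `from_row` are hypothetical names for the platform's write and read paths, and base64 is only a placeholder for a real cipher (it is an encoding and provides no confidentiality).

```python
import base64
from dataclasses import dataclass, field, fields


def pii(**kwargs):
    """Mark a dataclass field as PII (a Python stand-in for an @PII annotation)."""
    return field(metadata={"pii": True}, **kwargs)


# Placeholder transform ONLY: base64 is an encoding, not encryption.
# A real platform would call its encryption library here.
def _protect(value: str) -> str:
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def _reveal(value: str) -> str:
    return base64.b64decode(value.encode("ascii")).decode("utf-8")


def to_row(entity) -> dict:
    """Platform write path: transform every field marked as PII."""
    return {
        f.name: _protect(getattr(entity, f.name)) if f.metadata.get("pii")
        else getattr(entity, f.name)
        for f in fields(entity)
    }


def from_row(cls, row: dict):
    """Platform read path: restore every field marked as PII."""
    return cls(**{
        f.name: _reveal(row[f.name]) if f.metadata.get("pii") else row[f.name]
        for f in fields(cls)
    })


# Product code: the only PII-related work is the marker on the field.
@dataclass
class Customer:
    customer_id: str
    email: str = pii()


customer = Customer("c-1", "dana@example.com")
row = to_row(customer)                 # what would be stored
restored = from_row(Customer, row)     # what product code reads back
```

The point of the design is that `Customer` never imports or calls the protection code; swapping the placeholder for a real cipher, or rotating keys, would be a platform change only.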
The need for standardization comes to mitigate scaling challenges. Microservices is another solution to try and handle scaling issues, but as the number of microservices grows, you will start to face the complexity of a Large-Scale Microservices environment.
In distributed systems, requests may fail due to network issues. Performance is degraded since requests flow across multiple services via network communication as opposed to in-process method calls in a Monolith. Monitoring the system becomes harder since calls are distributed across multiple services. Security becomes a bigger issue because, with every microservice we add, we increase the attack surface. And let’s not forget the human factor: It becomes harder to maintain standards, quality, and protocols across multiple teams and services.
These are the obvious shortcomings, but hidden issues that we encounter in large-scale systems are cost and maintainability. Let me explain:
When writing a microservice, you usually use some kind of framework like Spring. You also build and package all the internal libraries and dependencies you need, for instance logging libraries and JDBC drivers, into your microservice. This means that over 90% of the code that runs in a microservice is actually the frameworks and libraries you package and deploy. The business logic you write in each microservice is less than 10% of the code, at best, depending on the size of the microservice. In many cases, we even saw that the business logic is less than 1% of the code that is packaged within a microservice.
All this code is duplicated and deployed hundreds and thousands of times in your production environment, increasing the footprint with every new microservice. This, in turn, increases your cloud cost and makes it harder to align the different frameworks and library versions.
At Wix, we operate over 4000 clusters of microservices, which is causing us some pain. So, we tried to mitigate these issues. We approached this problem by building Platform as a Runtime (PaaR).
To analyze the problem domain, we looked at how developers write code and choose technology stacks across three pillars: code, deployment, and runtime. We split the solution into two parts: Platform and Runtime.
Platform: Developer experience on autopilot
The platform focuses on the developer experience, by codifying best practices, contracts, regulations, and most importantly integrations into the code middleware components of our production environment. Imagine it as a highly customized framework tailored to your company's specific needs. It handles non-functional requirements, reduces boilerplate code, and minimizes cognitive load. When developers work within the platform, things simply "work as expected."
We internally called this project "Nile". Its goal was to streamline software development and bring the most value to developers by focusing on their experience.
This approach goes beyond traditional frameworks and platform engineering: we took platform engineering from the CI/CD level to the code level. Most companies offer frameworks that developers utilize, but they fall short of creating a platform that seamlessly integrates the framework with the organization's operational practices.
For instance, consider GDPR compliance. To fulfill a GDPR data deletion request, you typically subscribe to a Kafka topic and listen for "delete my data" requests. A basic framework might allow you to easily subscribe to the topic, but developers would still need to code the message processing and deletion logic. A robust platform, however, would automatically subscribe to the GDPR topic, process the message, and initiate data deletion from the database, all without additional developer intervention. The only thing a developer would need to do is annotate the PII fields, and the platform would do the rest automatically.
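The sketch below illustrates the platform side of that flow under several assumptions. The Kafka subscription is replaced by a direct function call, the message shape (`{"user_id": ...}`) and the `handle_delete_my_data` name are hypothetical, and the handler erases the annotated fields in place; whether the real platform hard-deletes rows instead is not specified here.

```python
from dataclasses import dataclass, field, fields

ERASED = "<erased>"


def pii(**kwargs):
    """Mark a dataclass field as PII (a Python stand-in for an @PII annotation)."""
    return field(metadata={"pii": True}, **kwargs)


# Product code: a plain entity with its PII fields marked. Nothing else.
@dataclass
class Order:
    order_id: str
    user_id: str
    shipping_address: str = pii()
    total_cents: int = 0


def handle_delete_my_data(message: dict, store: list) -> int:
    """Hypothetical platform handler for a "delete my data" message.

    Erases every PII-marked field on the records that belong to the user
    and returns how many records were touched.
    """
    touched = 0
    for record in store:
        if record.user_id != message["user_id"]:
            continue
        for f in fields(record):
            if f.metadata.get("pii"):
                setattr(record, f.name, ERASED)
        touched += 1
    return touched


# In a real deployment the platform would receive this message from the topic.
orders = [
    Order("o-1", "u-1", "1 Main St", 1200),
    Order("o-2", "u-2", "2 Side St", 800),
    Order("o-3", "u-1", "1 Main St", 500),
]
touched = handle_delete_my_data({"user_id": "u-1"}, orders)
```

Because the handler is driven entirely by field metadata, one deletion policy applies to every entity in every service, which is what makes the answer to "how do you implement a GDPR delete" consistent.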
Runtime: Optimized service footprint and deployment
The runtime component of PaaR focuses on optimizing service footprint and deployment strategy. Instead of bundling the entire platform and framework with your code artifact, the runtime holds the platform code and manages all network communication (incoming and outgoing). This eliminates the need to package the platform with every microservice and enables the platform to be released independently of the "product" artifacts. Each deployed artifact simply connects to the runtime, resulting in a smaller service footprint. Think of it as a runtime dependency as opposed to a build-time dependency.
By reducing artifact size, PaaR allows for greater density within nodes. The footprint of a guest (i.e., your microservice) is reduced dramatically since it is not bundled with all the frameworks and common libraries. A single runtime host can efficiently serve multiple guest services, creating a virtual monolith.
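A back-of-the-envelope model shows why density improves. The numbers below are purely hypothetical (they are not Wix measurements), and the model counts only packaged code size, ignoring any overhead the runtime host itself adds.

```python
def bundled_footprint_mb(services: int, framework_mb: float, business_mb: float) -> float:
    """Every service packages its own copy of the frameworks and libraries."""
    return services * (framework_mb + business_mb)


def shared_footprint_mb(services: int, hosts: int,
                        framework_mb: float, business_mb: float) -> float:
    """Frameworks are packaged once per runtime host; guests carry business logic only."""
    return hosts * framework_mb + services * business_mb


# Hypothetical inputs: 1000 services, 190 MB of frameworks and libraries,
# 10 MB of business logic per service, 50 runtime hosts.
bundled = bundled_footprint_mb(1000, 190, 10)
shared = shared_footprint_mb(1000, 50, 190, 10)
```

With these made-up inputs the bundled layout packages 200,000 MB in total and the shared layout 19,500 MB; the gap grows with the number of guests each host serves.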
In order to support a wide range of programming languages, we embarked on a 'Platform as a Runtime' initiative dubbed "SingleRuntime," which communicates with guest services using the gRPC protocol over a local network (localhost). This approach will enable us to develop in multiple languages while maintaining a unified platform.
While PaaR is still a work in progress, we have experienced significant success with Nile. The platform brings a lot of value to developers: we managed to improve our internal developers' velocity by 50%-80%. Our developers' experience has improved because they can now focus on building the business logic of their products instead of spending a lot of time writing boilerplate code and implementing all the non-functional requirements. They write and test less code and release products much faster than before.
The platform's impact is so significant that we as a company have decided it is worth it to rewrite all our legacy services (there are hundreds of legacy services) into Nile in the next year.
Another underrated benefit of adopting a single standard platform that does a lot of the heavy lifting for you is the improved product quality. Product developers are freed from repeatedly implementing non-functional requirements, as these are now provided by the platform and implemented according to best practices, in the most efficient way by the platform team. Additionally, any new feature added to the platform is automatically available and active on all services built within the platform, saving cross-company efforts.
One example is data locality. Only a couple of services supported data locality before we moved to Nile. As soon as we built data locality support into the Nile platform, hundreds of services that had not supported it gained support in a single day, without involving any product developers. Once they were compiled with the new platform, they got data locality support "for free". Without a unified platform, adding this support would have cost the company hundreds of person-weeks.
Should you build your own PaaR?
Developing a Platform as a Runtime (PaaR) solution is a substantial undertaking best suited for organizations facing significant scaling challenges. If your microservice environment is relatively small, in the low hundreds of services, alternative, more cost-effective solutions for scaling might be preferable. You can start by enforcing standard libraries, maintaining rigid control over third-party dependencies, and building rules to enforce standards. One of the things we did at Wix was to build a "generally available" (GA) enforcer that forces everyone to deploy their artifacts to production with the latest libraries and frameworks at least once every 2 weeks.
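The core rule of such an enforcer can be sketched in a few lines. This is not the Wix tool: the `Deployment` record and `stale_services` function are hypothetical, and the sketch assumes that every build picks up the latest GA libraries, so checking the age of the last production deployment is enough.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=14)   # "at least once every 2 weeks"


@dataclass
class Deployment:
    service: str
    deployed_on: date          # date of the last production deployment


def stale_services(deployments: list[Deployment], today: date) -> list[str]:
    """Return the services whose last production deployment is older than MAX_AGE."""
    return [d.service for d in deployments if today - d.deployed_on > MAX_AGE]


today = date(2024, 6, 30)
deployments = [
    Deployment("checkout", date(2024, 6, 25)),   # 5 days old
    Deployment("catalog", date(2024, 6, 16)),    # exactly 14 days old, still allowed
    Deployment("search", date(2024, 6, 10)),     # 20 days old
]
stale = stale_services(deployments, today)
```

What happens to a stale service (a warning, a blocked pipeline, a forced redeploy) is a policy decision that this sketch leaves open.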
Once you scale to thousands of microservices, you could start building a platform.
For organizations ready to embark on a PaaR journey, my advice is to prioritize building the platform first. Focus on automating and streamlining the development process and integrations by taking platform engineering to a higher level of abstraction: not just infrastructure, but also the software layer itself.
Focusing on business logic worked for us since the platform team’s customers are our own product developers.
Building a platform involves a critical review of tens of thousands of lines of code. We approached it by challenging ourselves on every line of code we evaluated, asking: "Does this line of code belong here?" The design goal of the platform is to keep only core business logic in the product service and codify everything else into the platform. As Steve Jobs once said: "The line of code that’s the fastest to write, the line of code that never breaks, that doesn’t need maintenance is the line you never had to write." As naive as it sounds, our KPI was the number of lines of code: we aimed to reduce as much as possible the code a product developer has to write that is not related to business logic.
An important lesson we learned is that the platform team needs to be in the right mindset. We had to have a value-driven platform team. While this could be its own topic, I will mention one thing we discovered: the most crucial KPI for the platform team is "developer adoption". If developers aren't using your platform, it might not be delivering real value, or not enough of it. This kind of thinking was crucial to the team. Collaboration with product developers greatly helps adoption, since they are involved in defining the platform's features and capabilities and in setting requirements that solve their real problems.
One last thing I would like to share: The path to achieving these goals was not easy. Aside from the technological challenges, there is also the human factor. Developers can be apprehensive of abstractions and unseen functionalities. In order to win the hearts and minds of developers, consistent communication about progress and ongoing education regarding the platform's inner workings are vital. This transparency demystifies the "magic" and empowers developers to debug and contribute effectively.