
When DevOps Runs Its Course - We Need Platform as a Runtime


Summary

Aviran Mordo describes how Wix.com is building its own Platform as a Runtime (PaaR) infrastructure that allows developers to develop faster and with higher quality.

Bio

Aviran Mordo is VP of Engineering at Wix.com. He has over 30 years of experience in the software industry and has filled many engineering and leadership roles, from designing and building the US national Electronic Records Archives prototype to building search engine infrastructure.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Mordo: For your next service, which one of you would build a monolith? Who would build a microservice? Who would use serverless? Why do you think we have so many answers? Why is there not one? This is what we're going to talk about. The reason why we had diverse answers is because it depends. Each technology, each stack that we choose serves a different purpose. Each one shines in its own way. The monolith is a very efficient way to run code, because everything runs in-process; it's super performant because there are no network calls. With microservices, you get ownership. They're geared toward scaling teams and engineering.

Of course, with serverless, you get easy deployment and scalability. Complexity is killing your software developers. "The growing complexity in modern software systems is slowly killing software developers. How can we regain control without losing the best of these technologies?" This is a quote by Scott Carey, who used to write for InfoWorld and is now at the developer leadership publication LeadDev. When we need to choose a technology, we're basically trading off between three pillars: how we code, how we deploy our code, and how we run or operate our code.

I'm Aviran. I'm VP of Engineering at Wix. This was my first computer. I've worked in a bunch of companies, from small startups to huge corporations, and even had my own failed startup. Eventually I ended up at Wix.

Wix - The Website Builder Platform

Who knows what Wix is doing? Let me set the context about what Wix is. Wix is the leading website builder platform in the world. You can come to Wix and, with easy drag and drop, build your website. We have vertical solutions: an e-commerce platform, an events platform, booking, restaurants. Everything that you need to build your own web presence, we provide. In addition to that, if you think about the website builder UI as an IDE, we also let you write code. We have our own serverless. You can write frontend code and backend code. You get databases. Everything is handled for you in a managed platform. Why is this relevant? You'll see in the next slide.

A little bit about Wix in numbers. We have over 250 million users building websites on our platform; about 7% of the internet's websites run on our platform; and a billion users, humans, not bots, visit those websites. We run at a fairly large scale: 4,000 microservice clusters across three data centers. We're also a global company, spread around the world: Europe, Israel, the U.S., and Canada. Back to our complexity issues. We are very complex. What is our responsibility as engineers? We are getting paid to deliver business value. If we run a successful company, we need to bring business value first. Good, successful companies deliver high-quality code fast. This is how you beat your competition.

Why Is It Complicated?

Let's talk a little bit about why it is complicated. When you build a monolith, you build a single service. It is fairly simple to start. Everything is there for you. All the code runs in the same process. Life is easy when you start with a monolith. This is actually my recommendation for startups: start with a monolith. Don't do microservices. Don't go crazy. You have one service. What are the pros and cons? These are some of the tradeoffs of a monolith. Coding is easy. Everything is accessible. It's easy to test because you don't have integration tests. You don't have end-to-end tests. It's easy to break APIs because refactoring is very fast. Simple topology.

As we scale, we get to the hurdles of a monolith. You start mixing domains. You get spaghetti code. It's hard to synchronize between teams. Who is the owner of a monolith that contains many domains? There are tradeoffs, but there are also benefits. Next, let's take a look at microservices and serverless. They're very similar. With microservices, when you go to distributed systems, you no longer run in a single process. You have your dependencies, other services that you're depending on. It's not as simple as a monolith anymore, but this is a textbook case. In the real world, you don't just have direct dependencies. You have something closer to this: indirect dependencies and long flows. If you are a company that is fairly big, like Wix, you get to something like this.

This is not Battlestar Galactica, this is a map of our production system from a couple of years ago. Wix has over 4000 microservice clusters. Did we mention complexity? Yes, this is a fairly complex system. When we talk about microservices and serverless, we trade off other things. In the code we want single responsibility and a single domain. It's easier to code like this, but we trade off the cross-cutting concerns that have to be replicated across the microservices. Testing becomes more complex. It's easy to deploy. Every team has their own independent lifecycle. With every deployment, since you have a long flow, you need to keep backward compatibility, because you don't know who is going to call you. Refactoring becomes hard. Breaking an API becomes harder. When we run, it's easy to scale. We have a clear owner of a service, but the topology becomes complex and performance gets hurt. There is no perfect system.

Code Complexity

I want you to join me on a journey that we started about 4 years ago, looking at all those issues and trying to tackle each and every pillar, and trying to figure out, how can we build something else, something that will solve most, if not all of our pillars from code, deploy, and run. Let's look at the code. We'll start with an example. I want you to meet Dana. Dana is a new developer, and she gets a task from her manager. She needs to write a task management system. Pretty simple system. You write your task. You assign a person. You have some status complete or in progress. Very simple.

I want you to think about what it would take and how long would it take you to write this simple service, so did Dana. She went to her manager. She got the task. Now she goes to her manager and said, "Yes, Aviran, I know how to build my task management system. I need to do domain modeling and design a good API." Then her manager just tells her, "Yes, but you need to think about the request flow. What comes in the request flow?" "I need to do authentication and authorization. Pretty simple. I can handle this task." "Don't forget to do the input validations and the object mapping between the APIs and your database. Of course, don't forget to call RPC. You need to figure out how to do that.

Of course, fetch the secrets, because you need to go to the database." Dana thinks about it. Ok, now I know I have to do a little bit more things, but fine, I can handle that. Now I can go and write my code. "You need to also worry about data access and domain events, because every action needs to send a domain event, and also error handling, because we need good error messages." Dana scratches her head and said, "I can do that also." Now I can go and write my code? "Almost, but you also need to consider GDPR and PII and caching, and, of course, logging.

Don't forget about logging, and how you test your systems." Dana is starting to get frustrated. That's a lot of things. I just need to write a task management system. It's pretty simple. Three APIs, that's it. Now I can start writing the code. Almost, because now you need to do scaffolding, and figure out how the build system works, and A/B testing, gradual rollout, and how to debug, and monitor, and a lot of concerns just to write a simple task management system. Someone talked about complexity, I think, in the beginning of the talk.

Now she's really mad, and she comes to her manager, how can you explain and how can you help me figure out all those things? We have best practices for every single one of them. How can we teach her? We will write a documentation. We wrote the framework. We wrote documentation from the framework. Now Dana knows how to use the framework. It was such great documentation. She learned a lot about it. We helped her when we wrote documentation to use identity, because identity is a fairly complex domain. We also helped her to understand the GDPR laws and what she needs to do and how to send webhooks.

We wrote a bunch more documentation on webhooks and error handling and databases and some more documentation. Now Dana is really upset, and we need to help Dana. How can we help Dana? We went to this lady who did some science, and we recorded this session. A fairly famous guy came and answered that question. I want you to hear what he has to say.

Jobs: The way you get programmer productivity is not by increasing the lines of code per programmer per day. That doesn't work. The way you get programmer productivity is by eliminating lines of code you have to write. The line of code that's the fastest to write, that never breaks, that doesn't need maintenance is the line you never had to write. The goal here is to eliminate 80% of the code that you have to write for your app.

How Can We Code Faster in a Complex Environment?

Mordo: I tend to agree with what Steve has to say. We embarked on a journey: how can we code faster in a complex environment? We started this journey, like I said, about 4 years ago. We sat with our CEO in a weekly meeting, and we looked at thousands of lines of code, and we tried to figure out which lines of code should not be there. We looked at how developers develop software, and we looked at the code that they had written, and we asked, should this line be there or not? If it shouldn't be there, if it doesn't help, if it's not business logic, because this is what we get paid to do, to build business logic, we should try to figure out how to remove this line of code.

Removing this line of code means building a platform. We sat down and we built a platform. We call this platform NILE. It's basically codifying all the guidelines and all the best practices into the platform. We have developers just develop inside the platform. Everything that we saw before, all the cross-cutting concerns, all the things that Dana had to do, we just codified into the platform. Some of the parts that we have in NILE are the framework and the search integration.

This is a good example of search integration. You write to your database and now you want to index your document, so the platform does that for you. Every time you update, insert, delete a record, it automatically integrates into Elasticsearch and updates the document. The developers don't have to do anything other than to annotate the domain object with searchable. That's it. The platform does a lot of the things for you.
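As a rough illustration of that idea, here is a minimal TypeScript sketch of a repository that mirrors every write into a search index when the domain type is flagged searchable. The names (`SearchIndex`, `PlatformRepo`) and the flag-based wiring are invented for this sketch; they are not NILE's actual API, where this is done with an annotation on the domain object.

```typescript
type Entity = { id: string; [key: string]: unknown };

// A toy search index standing in for Elasticsearch.
class SearchIndex {
  docs = new Map<string, Entity>();
  index(e: Entity) { this.docs.set(e.id, e); }
  remove(id: string) { this.docs.delete(id); }
}

// Repository that mirrors writes into the index when the domain type is
// marked searchable. The developer only flips the flag; the platform
// keeps the index in sync on every save and delete.
class PlatformRepo {
  private store = new Map<string, Entity>();
  constructor(private search: SearchIndex, private searchable: boolean) {}
  save(e: Entity) {
    this.store.set(e.id, e);
    if (this.searchable) this.search.index(e); // platform-side integration
  }
  delete(id: string) {
    this.store.delete(id);
    if (this.searchable) this.search.remove(id);
  }
}

const search = new SearchIndex();
const repo = new PlatformRepo(search, /* searchable */ true);
repo.save({ id: "t1", title: "Write the QCon talk" });
console.log(search.docs.has("t1")); // true: indexed with no developer code
```

The point of the real annotation-driven version is the same: the indexing call sites live in the platform, not in product code.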

Deployment

We built a platform to help with the code. Next pillar: let's talk about deploy. Those two projects went in parallel, and we learned from each other. We set out to build our own serverless platform. Now you ask me, why do you need a serverless platform? You've got AWS. It has great serverless. You are right, but maybe we can do better, or build something that is more custom to us. I'm not saying we can do better than AWS Lambda, but it is tailored toward our needs. Let's look at a typical software stack. We've got our virtual machines, and on top of that, we run some container.

Then we start to build our software on this stack. We've got our microservice framework, whether it's Spring or something else, whatever you choose, and then we build our own internal framework. This is the trusted environment framework: all the contracts between the services, all the cross-cutting concerns that we have. Everybody has these frameworks and libraries, the common libraries. On top of that, we build business logic. This is basically the pyramid of how we build software. What does a microservice look like? We take those things, the frameworks, the trusted environment, our own frameworks and libraries, and the business logic; we package everything together and we deploy it. Ninety-nine percent of the code that a microservice runs is not your code. It's not the code that you have written or that the product team has written.

Normally, if you're at small scale, that is not a big issue, but if you're running thousands of microservices, this starts to become a problem, because every microservice has a large footprint. We package everything, so most of the code that runs in our production environment is not code that we actually wrote. Frameworks and libraries are tied to the lifecycle of our products; the infra team doesn't have control over the lifecycle of the libraries. Everybody that builds, compiles, and packages takes the latest version of the framework, so we get multiple versions in production. You all remember the Log4j incident? How do you update 4000 microservices?

This is a common library that sits in the framework. If you have that kind of issue, you basically need to stop the whole company and make sure that 100% of your services take the patch. It's a very complicated and long process. You can't leave one service unpatched, because then you are open to the vulnerability. How do you manage that? If you're not that young of a company, you probably have some services that have no owner. The owners left. This is an issue. What about supporting multiple languages? We are a JVM shop. We write in Scala. What if we want to bring another language into the company? Now we have to build and duplicate our platform and our frameworks. We need to keep feature parity. It's a lot of work. It's very costly.

Platform as a Runtime (PaaR)

Let's imagine a perfect world. What would we want our platform to be? We want the code to be easy and fast, with minimum integration tests and no boilerplate. We want our deployment to be faster. We want a scalable system. We want it to be low cost. How can we achieve that? Remember, everything is about tradeoffs, and every system and every stack that we choose has its own pros and cons. Why not try to take the best of all worlds: the best of microservices, the best of serverless, the best of the monolith, and the best of a managed platform? Wix, our product, is basically a managed platform for our users. What is this thing that we're trying to build? This is where we came up with the Platform as a Runtime.

What I'm showing you is some of the things that we've built and we run, and some of the things that we are planning to build and working on them right now, so they're not there yet, but we have a lot of POCs, and we're building towards that. I owe you a little bit of explanation about the managed platform. Wix is a managed platform for our users. Like I say, you can write code on top of your website. Basically, you write on the browser or your IDE, you click save, and that's it. We handle everything for you. We handle the deployment. We launch your Kubernetes. We provision databases for you. Everything that you need, we handle.

All you have to do is click save. It's a low-code platform. We started with our first attempt, which is the Platform as a Runtime version 1. If I say version 1, there's probably going to be version 2 later on. This is what we tried. We took Node.js, because that's a different stack, we wanted to experiment with something other than the JVM. We built on top of the application framework, the service integration layer, the data services layer, and on top of that, you build your own business logic.

Let's do a breakdown. We chose Node.js because it's lightweight. It has the ability to load code dynamically. It's very easy to learn. The application framework is basically how we handle the HTTP headers, authentication, authorization, monitoring, logging, BI, the experiment system. This goes in the application layer. For the service integration layer, we took all the RPC clients; we work with gRPC. We took all the RPC and REST clients. We built all the libraries. The integration to all the clients is done there. If you build your application, your service, you don't need to do the integration. You don't need to do the lookups. You just get it out of the platform. The same goes for the Kafka integration. In the data services layer we put DynamoDB as a key-value store. What we did is we took everything except your business logic. We took the platform, we packaged only that, and we put it on the cloud.

Basically, we put the platform on the cloud. Where does your code go? Your code goes in without the platform, just against the interfaces, and we dynamically load your code into the platform. I said it's different from AWS Lambda, because you can say, "I can do that with Lambda." You can't do this with Lambda. Since it is a trusted environment, this is our company's code, we can just put more functions or small services in the same container, in the same platform, because it's trusted.

With lambda, it's a non-trusted environment, so you basically have to package everything together, all the platform you have to package. You cannot share the platform between different services. Since it is a trusted environment, it's our environment, I trust my own developers, we can do that. That gives us a bunch of benefits. The platform basically has no integrations, because all the integrations are provided to you by the platform. I have less testing. I don't need to do the integration testing. I can think about small functions because I don't need the overhead of a large package with all the frameworks. Deployment is very fast. Basically, I have zero boilerplate.

Developer Experience (Code Example)

Let's look at the developer experience. We took a lot of the concepts that we built with NILE; we looked at every line of code, and we asked, here too, should this line of code be there, or should the platform provide you with this functionality? Let's go back to Dana. Now Dana has a change request. She needs to add an API to retrieve the task details from the task microservice. For each task, she needs to retrieve the person that is assigned to the task, and return the details. She also has another task: she needs to write an audit log to a database.

If you think about it, Dana doesn't really know where to put this API, because it doesn't really belong in the task microservice. It doesn't really fit in the contacts microservice. It's a mesh. In a standard microservice world, you would actually stick it in one of your existing microservices, because the cost and the overhead of building a new microservice just for this one API is not cost effective. In a world of serverless functions, we can actually do it as a separate service, because functions are very small, without all of the artifacts. Let's see what Dana has to write. This is actual code. This is how we write code on this serverless platform. She imports the task server.

Remember, everything is integrated for her. She gets the client of the task server. She gets the RPC client for the contact server. That's it. This is what she has to do to get the clients. Then she exposes a new API, a new endpoint, which is a GET, a REST endpoint. She calls the task server to get the task. From the task, she extracts the contact ID. She calls the contact server, gets the contact details, and returns the task and the contact. That's it. This is literally all the code she has to write. This is working code. Now she needs to add a Kafka consumer.
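The endpoint flow described above could be sketched roughly as follows. This is a hedged approximation, not Wix's actual serverless API: the client objects (`tasksClient`, `contactsClient`), their method names, and the shapes of `Task` and `Contact` are all invented, and simple stubs stand in for the clients the platform would inject.

```typescript
type Task = { id: string; title: string; contactId: string };
type Contact = { id: string; name: string };

// Stubs standing in for the RPC clients the platform would hand to Dana.
const tasksClient = {
  async getTask(id: string): Promise<Task> {
    return { id, title: "Ship the PaaR demo", contactId: "c7" };
  },
};
const contactsClient = {
  async getContact(id: string): Promise<Contact> {
    return { id, name: "Dana" };
  },
};

// The entire business logic of the new endpoint: fetch the task,
// enrich it with the assigned contact, return both.
async function getTaskWithContact(taskId: string) {
  const task = await tasksClient.getTask(taskId); // in-process or RPC, the platform decides
  const contact = await contactsClient.getContact(task.contactId);
  return { task, contact };
}

getTaskWithContact("t42").then(r => console.log(r.contact.name)); // "Dana"
```

The noteworthy part is what is absent: no client construction, no service lookup, no auth handling, none of the cross-cutting concerns from earlier in the talk.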

It shouldn't be that hard. I said we'd look at every single line of code: should it be there, should there be boilerplate? This is how she implemented the Kafka consumer. She adds a Kafka consumer. She gets a data source from the context. She didn't have to write any connection string, no handling of connection pools, nothing, no boilerplate. She gets everything from the platform. This is it. This is everything she has to write. Everything is integrated, no boilerplate code. Everything runs. She's really happy. Once she pushes the code, within 1 to 2 minutes it's running in production. We've got minimal code. We don't package our framework. The deployable is very small because we don't package the framework, so it can also be very fast. I think we're in the right direction.
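The audit-log consumer could be sketched like this. Again a hedged illustration: the `context.dataSource` shape, the event type, and the handler name are assumptions for this sketch, with an in-memory stand-in for the database and for Kafka delivery; on the real platform the subscription wiring and connections come pre-integrated.

```typescript
type AuditEvent = { taskId: string; action: string };

// Toy "database" handed out by the platform context. No connection
// strings or pools ever appear in user code.
const context = {
  dataSource: {
    rows: [] as AuditEvent[],
    async insert(row: AuditEvent) { this.rows.push(row); },
  },
};

// The handler is all Dana writes; the platform owns the Kafka
// subscription and delivers each message here.
async function onTaskEvent(event: AuditEvent) {
  await context.dataSource.insert(event); // write the audit record
}

// Simulate the platform delivering one Kafka message to the handler.
onTaskEvent({ taskId: "t42", action: "completed" });
console.log(context.dataSource.rows.length); // 1
```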

Optimizing the Runtime

For the deployment pillar, we built the platform as a service. Taking this same concept, let's see what we can do with the run pillar. How can we optimize that? Run is the R in PaaR. This is the runtime. Let's look at some of the strategies that we can play around with, with this concept. We've got our serverless node. This is the serverless runtime. This is the platform. We can still keep ownership even with this strategy, because we can give containers to every team or to every business unit, so they can handle their own platform container. Even if the platform has an issue, we have a clear owner: the team whose functions are all running on the same container.

Nobody will complain about, your service interrupts my service and brings me down. Remember how Dana wrote the code: some of the functions are running in-process. They're dynamically loaded into the platform, but Dana doesn't care. She writes the code the same way, and the platform takes care of making a network call or an in-process call. It makes her life very easy. Another thing that we can do: let's say we need to scale. We have F4 that we need to scale out, and we can simply put it on another container, and the platform now knows how to load balance between those two functions across containers. Some of the calls will run in-process, and some of the calls will run out-of-process.

Another nice thing that we can do is optimize for function affinity. Let's look at this example. We have F2 and F5, and they frequently call each other. Think about what the JIT compiler does. It finds hot code and eventually compiles and optimizes it. We can do that at the network level, because now we can deploy F5 in-process into this container and replace the network call with an in-process call. We have a very efficient system where, if latency is an issue, we actually have a strategy to deploy in-process.
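The routing idea behind this can be sketched in a few lines: callers invoke a function by name, and the runtime resolves it to either an in-process instance or a remote one, so co-locating a hot function changes nothing in caller code. The registries and the string-based transport below are invented for illustration; they are not the actual PaaR internals.

```typescript
type Fn = (arg: string) => Promise<string>;

const localFns = new Map<string, Fn>();  // functions loaded into this container
const remoteFns = new Map<string, Fn>(); // functions reachable over the network

// The single call path user code sees. The platform picks the route.
async function call(name: string, arg: string): Promise<string> {
  const local = localFns.get(name);
  if (local) return local(arg); // in-process: no network hop
  const remote = remoteFns.get(name);
  if (!remote) throw new Error(`unknown function ${name}`);
  return remote(arg); // out-of-process call
}

// "F5" starts out remote; migrating it in-process swaps the route
// without touching any caller.
remoteFns.set("F5", async s => `remote:${s}`);
call("F5", "x").then(console.log); // remote:x
localFns.set("F5", async s => `local:${s}`); // platform co-locates F5
call("F5", "x").then(console.log); // local:x
```

This is the network-level analogue of JIT optimization the talk describes: the caller's code is identical in both cases; only the resolution changes.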

What did we gain by that? Deployment size: check. We deploy small functions without the framework. We have a single version of our frameworks and libraries because they have their own lifecycle. We're not tied to the lifecycle of a product team that deploys their microservice. If we have another Log4j incident, we can basically deploy the platform once. One team owns the platform, deploys it, and everybody gets it. I don't need a cross-company effort to deploy 4000 different microservices; I just deploy one platform.

Are we done? Great. Success. We're missing one thing. I have one more bullet that we didn't answer: adding an additional language. This was done with TypeScript. We are a JVM shop. We're a Scala shop. Most of our services run in Scala. We embarked on version 2, and this is something that is still being worked on. We don't have that running yet, but we have made a lot of progress. The serverless platform has been running for 2 years, with thousands of functions running this way. Developers love it.

Wix Single Runtime (Work in Progress)

The next thing that we're building is the Wix single runtime, and here we had to make some tradeoffs. We took the same concept of extracting the platform into what we call the host, and separated the business logic into different containers. We're trading off in-process calls for out-of-process network calls. Now the product teams build their business logic with a thin SDK layer, the guest SDK. All it does is know how to communicate with the host, with the platform. Essentially, it looks something like this. We've got a daemon set, which means one platform container per node. All the incoming and outgoing calls go through this daemon set. This is the platform, the PaaR Host.

Every function, or every microservice, since it can be larger than a function, runs in its own pod, but they run on the same machine. They run on localhost. All the communication between the pod and the host is over localhost, which is faster than actually going through the network. It's done with gRPC. Cross-host communication is done by the platform via the platform host. The tradeoff we took is trading in-process for out-of-process, but we gained the whole Kubernetes ecosystem for deployment and autoscaling. These are things we would have had to write on our own, reinventing the wheel, if we wanted to do all the deployment ourselves.

It is a larger footprint than the previous solution with TypeScript, but still 50% lower than if you had to package the entire framework inside your microservice, because you have the JVM footprint and so on. We did some benchmarks, I think with 1 core and 2 gigabytes of memory, and we realized that performance-wise we only lose about 2 milliseconds compared to a regular in-process microservice. That was before any optimizations we were trying to do. Two milliseconds is something that we can tolerate. Maybe you can't. Everybody has their own constraints. This is something that we can tolerate, but we think that we can gain that back, because in a distributed system, these things talk to each other.

While we lose those 2 milliseconds on a single request, when services talk to each other, and with this deployment strategy they probably will, the calls happen on localhost, so I think we can gain the 2 milliseconds back, and maybe even do better at some point. We ran a bunch of performance tests, and the benchmark showed about a 2-millisecond overhead at up to 15,000 RPMs, holding up to 50,000 RPMs. I don't think a lot of services have more than 15,000 RPMs, which is a large scale. We do, of course, have services that get millions of RPMs, and for those this would probably not be the solution; we can just package them as standalone microservices because of the scale.

Another thing we did is make it possible to switch from this PaaR setup to packaging the platform inside, just as a regular microservice. It's just a flag on the build system. That's it. A flag on the build config, and it will either package the platform for you or just use the SDK. Developers still don't have to do much: if it's a high-scale system, package everything together; if it's a standard service with relatively low RPMs, maybe up to 30k or 50k RPMs, which is still a lot, we can use this system. What have we gained? We've gained cost. We've gained a single framework. We can deploy security changes or legal constraints really fast.

Think of a new rule, a change to the GDPR rules: the platform takes care of GDPR for you. We have support for multiple languages, because we switched from in-process to out-of-process. Since it's a network protocol, all we have to invest in to support a different language is writing the thin SDK. We don't have to rewrite the whole platform or the whole framework. Investing in just the thin SDK layer is much cheaper than keeping feature parity between frameworks.

Complexity Solved

Eventually: we code with the platform, we deploy as a service, we run as PaaR. Or, as I like to say, you code like a microservice, you deploy like a function, and you run like a virtual monolith. We have taken the best of all worlds and created this Platform as a Runtime. What do we gain from that? We can bring business value fast, because you have to write a lot less code. From our measurements, we see that developer velocity increased by 50% to 80%, which is huge.

Also, we reduced our compute costs by about 50%. Why is that? Because we can push more density onto a single node: the footprint of a single service is about 50% smaller, so we can fit 50% more services on a single host, which is basically what you pay AWS for. You pay for a host. I think it's pretty good.

Questions and Answers

Participant 1: Can you talk a little bit about your rollout strategy when it comes to breaking changes of the platform? Because, if I understood correctly, you have one single version of your framework that cascades down to all services. Let's say there is a breaking change, how do you go about updating?

Mordo: What happens when there is a breaking change?

We don't break. We keep backward compatibility like gospel. It's a religion. We keep backward compatibility. You can't really break, and if you do, then it's a long process. Basically, what we do is create proxies, so the previous API proxies and translates to the new API, if we get to that.
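The proxy idea can be sketched in a few lines. Everything here is hypothetical and invented to illustrate the pattern, not Wix's API: suppose a new version of a task API returns a structured status object, while the legacy endpoint promised a flat string. A thin proxy keeps old callers working.

```typescript
// New API: returns a structured status object.
async function getTaskV2(id: string) {
  return { id, status: { state: "done", updatedAt: "2024-01-01" } };
}

// Legacy API, kept alive as a proxy over the new one: it adapts the
// new response back to the flat shape old callers expect, so the old
// contract never breaks even though the implementation moved on.
async function getTaskV1(id: string) {
  const t = await getTaskV2(id);
  return { id: t.id, status: t.status.state }; // old shape preserved
}

getTaskV1("t1").then(t => console.log(t.status)); // "done"
```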

Participant 2: How do you keep up to speed with your framework, with all the other architectural requirements? There's somebody coming in that wants a different type of database added to it, a different streaming system, a cache, whatever.

Mordo: How do we keep the framework up with the requirement that everybody wants something different: a different database, a different stream?

They don't. This is a highly opinionated system, and if you work at scale, you know that you need to take away a lot of those freedoms, because otherwise you will not be able to scale. With every change, you would have to maintain many different technology stacks. This is a decision that we make. This is what you have and this is what you build. Same thing if you think about AWS or Google Cloud. You wouldn't go to Amazon and say, I want you to build me this new database because it's a cool new shiny thing. No, you would use what the platform has to offer.

However, I recognize that there are some cases where you need to opt out of the platform, and that is fine. First, you come to the platform team to see if we can solve your issues. The platform is built in layers, so developers can take the lower levels of the platform; they don't get all the syntactic sugar and all the automation, but they're still using the same core libraries. That's one option.

In the rare case where they do need something else, I'll give you an example from Wix. Most of our services are business services, like the e-commerce platform, the booking platform, and restaurants. Those are business use cases. We are also a media company. Because we're a website building company, we handle hundreds of petabytes of data: images, videos, and transcoded audio. Those have different requirements from the platform. They need to do video encoding and image manipulation.

The Java stack is probably not the right tool for the job for the media platform, so the media platform has its own platform. For 80%, 90% of cases, though, stick with the platform. It gives so much value to developers, and this is from our experience: they really try to stay within the realm of the platform. They really hate to opt out, because they realize how much value the platform gives them.

Participant 2: It's similar to this notion of using a boring architecture.

Participant 3: How do you test your functions? Do you have some library or tool that helps test your API? You are building a REST API, and you use your SDK, so you don't have the whole environment to run. If you want to bring up your function and test your API, how do you do it?

Mordo: First of all, you don't have to have the whole environment. In Wix's case, it's literally impossible to run 4000 microservices on your laptop. You have your local runtime that you can actually test against. If you really want to do integration testing, you do it in production. You actually test in production. Since Wix is a multi-tenant system, you create your own test tenant, and you do not corrupt any other tenant. You can actually test the end-to-end flows. It's impossible to test locally, so you actually deploy. We have something called deploy preview. It's part of our CI/CD system. You do a deploy preview before GA. You deploy your specific artifact. It doesn't get any traffic.

On a test request, you can have a special header or something, and then we will route your test call into your new deploy artifact, which will interact with all the other artifacts. Since it interacts with other APIs, you would probably not corrupt any data, because, again, there is a strong tenancy. Part of the platform is that we prevent you from doing a tenancy mistake. It handles the tenancy for you. When you write your queries, we inject the tenant ID into your query. You are not allowed to tinker with the tenant. The platform actually injects the tenant from the authentication headers, so you do not corrupt any data. It's fairly safe.
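The tenant-injection idea described here can be sketched as follows. All names are illustrative assumptions, not Wix's actual data layer: the point is only that the tenant id comes from the authenticated request, never from user-supplied filter values, so a handler cannot query across tenants even by mistake.

```typescript
type Row = { tenantId: string; id: string; title: string };

// A toy multi-tenant table.
const table: Row[] = [
  { tenantId: "site-a", id: "1", title: "a task" },
  { tenantId: "site-b", id: "2", title: "b task" },
];

// The query helper the platform hands to a request handler. The tenant
// predicate is injected by the platform from the auth headers and is
// ANDed with whatever the user code filters on, so user filters can
// never widen the scope beyond the caller's own tenant.
function scopedQuery(tenantFromAuth: string) {
  return (filter: (r: Row) => boolean) =>
    table.filter(r => r.tenantId === tenantFromAuth && filter(r));
}

const query = scopedQuery("site-a");
console.log(query(() => true).length); // 1: site-b rows are invisible
```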

 


 

Recorded at:

Sep 26, 2024
