Wasm: What is Universal Compute Good For?

Summary

Sean Isom describes a framework for building universal applications using browser-based Wasm, server side Wasm, and what is coming next with edge computing.

Bio

Sean Isom is a Sr. Engineering Manager at Adobe working on optimization and efficiency for the Ethos project. He comes from a background in C++ and graphics and stumbled into cloud development from an experiment in using Docker to port desktop software to run as cloud services.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Isom: Welcome to Wasm, what is universal compute good for? My name is Sean Isom. I'm a Senior Engineering Manager at Adobe. I work on our Cloud Compute Platform. We're going to talk a little bit about WebAssembly, a little bit about edge, a little bit about cloud architecture. I hope you can take away from this talk how and why you can use WebAssembly to build a universal compute engine that allows you to bring your code closer to your data, increasing your efficiency and your application's performance.

What Is Universal Compute?

What specifically do I mean by universal compute? Imagine a world where you can ship your code to any device and run the code where it makes the most sense. Sounds great. This might sound familiar. People have tried different techniques for this before through various virtual machine technologies, IDLs, bytecode formats. One example that comes to mind for me specifically is some of the early days of Java, which I'm sure is something that a lot of people can empathize with. One of the early design goals of Java was to have a portable bytecode format that can run on a plethora of devices, abstracting away some of that underlying hardware and allowing developers to focus more on the business logic that matters to them than on the underlying infrastructure and the underlying host. I would call WebAssembly the next generation of this. It is simply a more modern tool that's going to allow us to run that code not only on any device, not only on any form factor, but where it makes the most business sense to the actual application in question. Because that's where you actually derive value from these kinds of large architectural shifts.

Thesis

Let's establish a thesis statement. Actually, first, I always like to start my talks with a quote, and this is a good one, "Compute is nimble, and can be easily moved anywhere. Data is heavy and moving it has a cost." This is from an individual named Glauber Costa, and actually comes from an article on The New Stack called, "What the heck is edge and why should you care?" It's a really great write-up. I highly encourage you to go read it and learn a little bit more about why people want to move code to the edge in the first place. Let's apply this to establish our thesis statement, and that is this: WebAssembly is the best tool for universal compute. Moving compute only makes sense when you improve data locality, performance, or cost, when you move some metric and move the needle in a positive direction. When you improve your data locality, you drive more efficient cloud native architectures in the first place. My goal really is to convince you that, yes, we can run the universal compute runtimes anywhere, but that there are heuristics and frameworks and techniques we can use to think about where that code should run, both at runtime and at design time.

Some may argue that compute isn't actually that nimble. Let's say you're building something around a specific cloud provider SDK: you're running an AWS lambda function, you need to pull data from S3. You're probably using the AWS SDK for whatever language you're running in. You're locked into that framework, and you're locked into that way of doing things. Rewriting that chunk of code so that it more generically pulls data from some blob store is still easier and cheaper than rewriting a lambda to pull from a different cloud provider, or from a proprietary piece of block storage. That data component is critical to understanding where your application should run, because, really, your application is just a set of code plus data. Compute is fairly nimble, with an asterisk: nimble if we design it from the outset to be nimble, nimble if we design it to be moved around, but we want to move it on the basis of where our data resides, and where that data is going to be closest to the user and the use case necessary to operate on that data.
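
As a minimal sketch of what "designing compute to be nimble" can look like in practice, consider the following Rust snippet. All names here are hypothetical: the business logic depends on a narrow blob-store interface rather than on a specific cloud SDK, so the same handler can be backed by S3, a proprietary block store, or a local cache.

```rust
// Hypothetical illustration: the handler depends on a narrow BlobStore trait,
// not on a specific cloud SDK, so the compute can move to wherever the data is.
use std::collections::HashMap;

pub trait BlobStore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

// One possible backing implementation: an in-memory store standing in for
// S3, a proprietary block store, or an edge cache.
pub struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl BlobStore for InMemoryStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.get(key).cloned()
    }
}

// The business logic only sees the trait, so it is equally at home in a
// lambda, a container, or a Wasm module on the edge.
pub fn handle_request(store: &dyn BlobStore, key: &str) -> usize {
    store.get(key).map(|bytes| bytes.len()).unwrap_or(0)
}
```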

Overview

We're going to talk a little bit more about universal compute and some of the background on WebAssembly. How Adobe uses WebAssembly, both on the backend and on the frontend. A little bit about how one should think about architecting for edge and architecting for moving compute across different devices. Some of the value prop of building code for WebAssembly in the first place. I'm going to go through three different architectural examples where WebAssembly can power this concept of universal compute, allowing you to move that around. Then we'll end on a discussion of scheduling, or where the code should run. I think there are a lot of useful heuristics you can apply at design time and at runtime, but you can also algorithmically schedule based off of some measured metric.

Universal Compute

This is a bit of a fascinating proposition, this concept of universal compute. I want to start with a bit of a background on WebAssembly. WebAssembly provides a universal runtime. It's a universal bytecode format. There's a set of associated tools and runtimes, like Wasmtime, which is one of the more popular ones backed by the Bytecode Alliance. WebAssembly is a set of standards and a set of technologies that allows you to write your code once and run it anywhere. I love this quote, and I think it's attributable to Tyler McMullen, the CTO of Fastly, which has been one of the pioneers in the WebAssembly space. Let's discuss the evolution of this over time. WebAssembly started in a web browser. That's why it has the name WebAssembly. A lot of the motivation for doing this was being able to run native code, run high performance code in a sandboxed way within the context of a web browser. You're bringing that core compute and that business logic to a new form factor to solve some business problem: to increase the number of channels you have to sell software to your users, to be able to integrate with more modern stacks, web-based UI frameworks, for example. It allows a common codebase to be able to work within the confines of a web browser in a fairly standardized way that is both sandboxed and isolated, but can also efficiently communicate with the host system itself.

Here in the example for a web browser, the host system is twofold. The host system is the web browser, but there are also underlying system processes as well. There are underlying compute capabilities that live outside of that web browser that are necessary. Over time, people started looking at WebAssembly holistically, and said, is this something that can run outside of the web browser? There's a well-defined set of standards for sandboxed, highly performant bytecode systems that can be AOT compiled for native hardware. You can imagine a flurry of use cases start springing up around running this outside of a browser. Maybe plugin systems for a game. Maybe some dynamic application, some function as a service, a method to write very specific chunks of sandboxed code and run them in a safe, performant way. All of these outside-of-the-browser use cases start appearing. Things like the Bytecode Alliance and WASI, the WebAssembly System Interface, sprang up to allow WebAssembly to work well outside of a web browser. This is really interesting. We've got web, we've got desktop: two different worlds? Maybe not. Maybe the way we should be thinking about this is that it's still the same WebAssembly, it's still the same code. It's still the same guarantees. It's the same application model that we've written for this more generic computer, let's call it that: a generic compute runtime. It's the same code running in a plethora of different form factors, independent of system resources, and it can run in both of those locations.
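
To make the out-of-browser side of this concrete, here is a minimal sketch of embedding a module with Wasmtime's Rust API. It assumes a file named guest.wasm that exports a function add_one, and the anyhow crate for error handling; details vary a little across Wasmtime versions.

```rust
// Minimal sketch of running a Wasm module outside the browser with Wasmtime.
// Assumes "guest.wasm" exports a function `add_one: i32 -> i32`.
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "guest.wasm")?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // The same bytecode could just as well be instantiated in a browser,
    // on a server, or on an edge node.
    let add_one = instance.get_typed_func::<i32, i32>(&mut store, "add_one")?;
    println!("{}", add_one.call(&mut store, 41)?);
    Ok(())
}
```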

We've got two, let's think of more. What about IoT devices? At the end of the day, these can be viewed kind of like a desktop computer, just a much more constrained compute environment. With the performance of WebAssembly, with the safety of WebAssembly, why not run that same code that's running in your web browser on your refrigerator, on your toaster? There's the same guarantees across these different form factors, performance, security, being able to run that same chunk of code in a more constrained form factor, be that a browser, be that a desktop, maybe a server, and that extends to edge compute as well. When we think about edge compute, often, there's a lot of different definitions. We associate this most with things like telcos, different edge cloud providers that are not centralized, but maybe provide some resource that's still multi-tenant, still programmable, and closer to the user than a centralized cloud region or data center. Although maybe not quite as far out as an endpoint device like some of these other three. Why not bring that same chunk of code that's running in a web browser, why not run that same function on a desktop? Why not run that same function on your toaster? Why not run that same function on an edge device? There's no reason you can't.

You can also bring this to the cloud. This is where we come full circle with traditional architecture between the client and the server, client and the cloud. There's a lot of different architectures for what we call cloud and a lot of different ways to define cloud. Things like AKS WASI node pool support, and a lot of the PaaS platforms have sprung up over the last year. There's really easy and straightforward and performant ways to run WebAssembly within the cloud, as well. This is where this concept of write once, run anywhere in universal compute becomes really cool. I don't have to write WebAssembly for this one specific thing, this one specific use case, or this specific device. I think the concept really shines by being able to abstract the system, abstract the host, abstract the language, and be able to run the same core business logic across form factors where it makes the most business sense. That's because WebAssembly is simply a lot more than just the runtime itself. It's the developer ecosystem. It's the mechanism we use to communicate across components. It's the mechanism to access that data in a standardized way, separating it from the compute to be able to manage those dependencies.

If you're running on the cloud, on the right, for example, you're probably modifying multiple containers. You multiply that across all of your different applications, all of your clusters. Let's use the Log4j vulnerability as an example. Anybody remember that from a few years ago? You've got to rebuild all those containers. You got to ship those sets of containers across all of your environments and all of your clusters. Maybe we should invert that model a bit and start thinking about dependencies in terms of capabilities. That's what allows us to have a better developer experience. Or we can just update the code and manifest in one place, and let the runtime itself take care of patching those libraries when they're abstracted away from the business logic. That's why I think WebAssembly is really turning out to be a much beefier, much stronger technology that allows us to not only run in multiple places, but support the ecosystem and the data for an application in multiple places. Specifically, in this talk, we're going to focus a little bit more on these boxes on the interaction between edge compute, cloud compute, and client hardware, and how they can interact with the same WebAssembly modules.

Wasm At Adobe

Let's give some preliminary background. Adobe has run WebAssembly successfully for many years now. Actually, let's take a step back and look at some of the predecessors, some of the previous attempts, things like asm.js. Has anybody heard of Google's Portable Native Client? These are some early attempts at running safe native code in the web browser. The reason we want to do that is because we can reuse some of these native codebases that a lot of large companies have, codebases with many decades of investment. This has kind of led and bled across different products. It started with Lightroom Web, which I think actually originally launched on asm.js, and spread to our more modern web stack products like Acrobat JS and the actual Photoshop beta, which runs in a web browser, mostly using the same common code. There's a lot of integration into UI frameworks, so WebAssembly is becoming the default. You see that in some of our next-generation products, like Firefly. This is a recently announced beta, which is a generative AI engine, which is a very hot topic for many organizations today. As we've adopted WebAssembly, it's gotten to the point where it powers a lot of our frontend products. The reason behind that is because it meets all of those guarantees that we talk about. It's a high-performance system. We can utilize those existing codebases. It's secure by default in a multi-tenant environment like a web browser.

What Is the Edge?

Speaking of multi-tenant environments, obviously, there are backing services behind those products as well. Adobe runs Kubernetes. As the complexity of our user-facing products grows, the size of the backend capabilities grows as well. I'm sure this is something most people are interfacing with; it seems like there are very few organizations today that are not running Kubernetes. This is not an exhaustive list of everything that's running in what I would call our cloud. You can imagine that centralized Kubernetes and container infrastructure powers most of our capabilities. We've got over 5000 production microservices, 5.5 million different running containers. No matter how you look at the metrics, we run a lot of compute on the backend to power those frontend products. You can imagine how we have a very high-performance end user product, and a very high-performance backend that's powering those products, but what sits between? This is where the concept of the edge comes into play. This is not just an edge talk. We're here to talk about WebAssembly. I think we have to give some definitions of this and talk about what lies between backend and frontend, in terms of being able to ship compute out.

Remember that original quote? What the heck is edge, and why does it matter? I love using this graphic to explain. This is from our friends at Cosmonic. I think edge means a lot of different things to a lot of different people. To me, edge is a good form factor for shifting compute, and allowing us to have better data locality. By edge, specific to this talk, I mean things in this graphic, like the regional or access edge. These are systems that are outside of maybe the major cloud providers, outside of corporate data centers. They're run outside of centralized locations. These are compute nodes that are just out further towards user endpoint devices. There's other definitions as well, like some might actually consider a web browser to be an edge device. That's fine. That's maybe an extreme example of, as a company, you're not paying anything for that compute. You're paying something for the data to get that WebAssembly module to the user. The runtime cost is, from a financial perspective, essentially free. There's no right or wrong answer on this. This is intended to say this is a spectrum. For this talk, we will define edge to be sets of services that live between centralized data centers, and that endpoint user device. There's a lot of different architectures, lots of different ways this can play out for different organizations.

Adobe: Service Architecture

Then, let's dig in on service architecture as well. What might a typical web service look like that's running on that backend? What are some of the different capabilities it's doing? What are some of the functions that we can use maybe to offset some of this compute on the backend? First of all, services come in all shapes and sizes. Within our multi-tenant platform, you can graph service utilization, resource, and capacity requirements like this. These are five dimensions that matter to us. There can be different dimensions that matter to different applications. Essentially, there, you are mapping some system capability, in this case, CPU, memory, network, disk throughput, and IOPS. You're mapping some fixed capability to a more dynamic client runtime that's not going to fill all of that. That's ok. If you were to 100% utilize your CPU, your accountants would probably be very happy, but then your application would probably have huge QoS problems. At the end of the day, that's not a good thing. If you fully utilize your memory, it's probably crashing and restarting. If your disk is full, it probably can't do much else. If you were to graph a bunch of different applications across here, what you'd note is that the shape of that yellow line in the middle, the shape of that utilization graph, is going to be all over the map. That's the blessing and the curse of multi-tenant systems: you have to play Tetris a bit. You have to bin pack. You have to fit. You have to utilize those backing resources effectively, to match the requirements of the application itself.

Let's dig in on a specific service from that graph. What the service does doesn't matter. I just want to use this as an example for the rest of the talk. By default, it's Java and Spring Boot. It's running in Kubernetes, in a containerized environment. It's going to be running at about 28% CPU utilization, 55% RAM utilization. It's going to have a pretty large RAM working set because of how Java allocates the heap in a container. It's going to have a workflow that looks something like this. It's going to be running some thread pools, so that you can take a thread off a thread pool and handle some incoming web request. It's probably a REST API request that's coming in from a frontend product. It might be coming in from another service. You might have a chain of services calling each other. There are a lot of different answers to this; it's a pretty typical pattern. You're going to take that input from the REST API, and you're going to transform it in some way. You're going to perform some data operation on that input. You're going to go out and request data from some dependency. It could be a database. It could be block storage. It doesn't really matter what the dependency is. There are very few services that take data, crunch a number, and spit something out without talking to some dependency. It's going to send that response. It's going to send that response synchronously back to the user, but probably also asynchronously back to another set of services. Maybe you've got an eventually consistent system and you need to update some system of record.

Edge: 1-3 Box Architectures

This is not supposed to be an exhaustive example. The idea is just to start thinking about: if I've got a bunch of containers on the backend that are serving frontend products that look like this, what can I do to run those in different environments? Let's think about how we can architect for edge, and how we can discuss cloud architecture in terms of boxes. I like boxes. I'm a pretty visual person. A lot of architecture is two boxes these days: you've got a client, and you've got a server, and a data center in this nomenclature. Your users just want to do a thing. They don't really care about the architecture behind the product that they're using. Let's say they hit a button in your product, and they want it to go perform an operation. There are already some decision points, made at runtime by the product or at design time by you as the developer, as to how to power that request. For example, on an endpoint user device, you have to think critically about, can I quickly run this functionality? Let's say I'm running some massive ML model. I'm running a generative AI model, like I said, all the hype these days. I probably don't want to pull down that entire model, load all of the weights into my local VRAM, spin it up, do the inferencing locally and use that result in another function. That's very wasteful and very silly, also, probably an extreme example. For a constrained compute environment, you might want to answer this question a little bit differently than in a desktop environment where you have a GPU. If you need to perform some convolution-based process, you have a very powerful parallel processing mechanism to do that. Maybe on a mobile device with 5% of your battery remaining, you want to call out to some web service to perform that same operation instead.

The other decision point is around data locality. Does this device have the data locally to do the thing? Let's use the example of an order history system. Every single order is probably stored in some database somewhere on a server. That makes sense, because the actual payment gateway lives on that server. Does the user device need the entire order history for the operation? Let's say that the function you're accessing, the user doing a thing, is trying to view their shipment status on their most recent order. You probably don't need that entire order history in local cache, maybe you could cache something more recent, closer to the user. There's plenty of other real-world examples. These are just some of these decision points that when you're writing client-side code you have to make.

Where edge becomes really powerful is as a place to run compute that is closer to the user than a data center itself. Are there existing compute functions that can be split off from a client application? Does it make more sense to run in a more powerful environment, utilizing the network and injecting latency, but still keeping that compute as close to the user as possible? Is there a way to split off functions on the other side, from the backend? Maybe we don't need all that round-trip latency for something that's running on the backend, if we have a device that's capable enough, using the WebAssembly model, to run that compute closer to the user. Going back to the order history example, if you have to pull the full order history every time for your operation, edge makes no sense, because then you're going from client to edge, your cache is cold, and you're going to the server, pulling from the database, and going back out. There's no point to caching everything. That's wasteful. If you can summarize that user profile, or if a subset of the data makes sense to keep warm, you can not only reduce the cost, but you can move that data one time and improve the performance and the user experience.
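
As a rough illustration of these decision points, here is a small, hypothetical placement heuristic in Rust; the inputs and thresholds are assumptions for illustration, not anything prescribed in the talk.

```rust
// Hypothetical design-time heuristic for the decision points above: where a
// request should run given device capability and where the data already lives.
#[derive(Debug, PartialEq)]
enum Placement {
    Client,
    Edge,
    DataCenter,
}

struct RequestContext {
    data_resident_on_client: bool,
    data_cached_on_edge: bool,
    client_has_capacity: bool, // e.g. enough battery, CPU, or GPU headroom
}

fn place(req: &RequestContext) -> Placement {
    if req.data_resident_on_client && req.client_has_capacity {
        Placement::Client // no network hop at all
    } else if req.data_cached_on_edge {
        Placement::Edge // warm cache close to the user
    } else {
        Placement::DataCenter // cold path: go where the system of record is
    }
}

fn main() {
    let req = RequestContext {
        data_resident_on_client: false,
        data_cached_on_edge: true,
        client_has_capacity: true,
    };
    assert_eq!(place(&req), Placement::Edge);
}
```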

Kubernetes: Running Wasm in wasmCloud

I do want to talk about how to run WebAssembly inside Kubernetes. This is a graphic from a talk I did at QCon EU this year. You can go watch the talk if you want more technical details on this. For the sake of this talk, I want to make it clear that you can easily run WebAssembly code for your service in Kubernetes. That's what's going to allow us to shift that compute between the client-side edge device and Kubernetes on the backend. This is a method that we have set up to do this based off of wasmCloud. There are other ways to do this. There's no universal solution. The basic idea is to utilize providers within the context of an existing Kubernetes namespace. It's going to be one of those blue boxes on the right. Utilize something called the service supplier, which is going to create a Kubernetes service and provision ingress, essentially, for an HTTP request. Hook that up to an HTTP provider that's going to communicate via NATS outside of the client's namespace into a pool of common wasmCloud hosts. These are the things that are actually running the actors or the functions that power the business logic. Then, also, you can communicate back out. If you need to send a result or if you need to access part of that service's existing capabilities that are already provisioned in the namespace, you can bidirectionally communicate via the lattice network back into the namespace for horizontal service-to-service communication, pulling data from dependencies like S3. There are a few links on how to do this at the bottom as well. Taking the previous Java example, let's say we've got a large container that's doing things. We split off the function that's serving that request to more of a function-as-a-service model. We've still got sidecars in the namespace. We're still running the compute the same way we were before. We have the capability to centralize that compute runtime within that pool of common hosts. Like I said, this is not intended to go deep into this, but I just wanted to say that there are ways to do this fairly easily, to run this compute on a variety of form factors.

WASI: Foundation Ready

The other important thing is not just running the code, but being able to manage the code's dependencies. This is where WASI, or the WebAssembly System Interface, comes in. More specifically, wasi-cloud, which is a new set of standards that are coming out of some of the WASI working groups. There are things like wasi-sql that are coming out that will make this possible. I bet this is an example that resonates with a lot of people. With most cloud architectures, at this point, you're probably running some SQL-based system somewhere. It's a very common capability. You need the ability to interface with some backing store or KV functionality, in order to introduce state into your code, into your logic. We want to build upon the original foundations of WASI, which allowed us to have a system interface to safely expose host resources to a WebAssembly module. Looking at some of the earlier work like wasi-random and wasi-fs, we're building off of those basic system capabilities for safely accessing system resources. We're leveling those up towards cloud native architectures as well. Maybe you care less about programming for a specific database like Postgres; you just need access to a stateful backing store. You've got some sort of shim, and you're programming for the wasi-sql interface, and you can plug that in and it can work in a variety of these different environments independent of the backing provider. This is what's going to allow WebAssembly to operate across these different compute environments. Not just in terms of raw system capabilities, like random numbers, these very coarse POSIX-style system capabilities, but higher-level cloud native capabilities or application-level capabilities as well.
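
The wasi-cloud interfaces themselves are still being standardized, so rather than guess at their exact shape, here is a hypothetical Rust sketch of the underlying idea: the module programs against a capability, and the host decides what actually backs it.

```rust
// Illustrative only: the wasi-cloud interfaces (wasi-sql, key-value, etc.) are
// still in flux, so this sketch uses a hypothetical trait to show the shape of
// the idea: code targets a capability, not a vendor.
use std::collections::HashMap;

pub trait KeyValue {
    fn get(&self, key: &str) -> Option<String>;
    fn set(&mut self, key: &str, value: String);
}

// Business logic written once against the capability...
pub fn remember_last_order(store: &mut dyn KeyValue, user: &str, order_id: &str) {
    store.set(&format!("last-order:{user}"), order_id.to_string());
}

// ...while the host decides what actually backs it: Postgres in the data
// center, a managed KV store on the edge, or an in-memory map in tests.
#[derive(Default)]
pub struct InMemoryKv(HashMap<String, String>);

impl KeyValue for InMemoryKv {
    fn get(&self, key: &str) -> Option<String> {
        self.0.get(key).cloned()
    }
    fn set(&mut self, key: &str, value: String) {
        self.0.insert(key.to_string(), value);
    }
}
```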

Use Case 1: Background Removal Service

Let's talk through some actual use cases. I think this will help people start thinking about why shifting compute makes sense. My colleague Colin Murphy gave this example of the background removal service at Wasm Day EU in 2022. It's a very simple concept. You've got an image here, in this case a signature, and you remove the background of it. That's it. I'm sure everyone has probably utilized some capability like this in some product at some point in their lives. This is traditionally, for us, a capability that's provided by a Java service that looks very much like the one I described earlier. Let's go back and take the actual function that does the image processing, that does the background removal, takes the input and transforms it, performs some operations on it, and sends the output back to the user for further use, shown here superimposed on top of the PowerPoint slide. Let's build the code that does that function as a WebAssembly module in the first place. The first step for that looks something like this. The logical answer is, let's run this logic on the client side. We can do this in a web browser. We can do this within a desktop product using some embedding engine for WebAssembly. There are also ways to do this on mobile. These are really cheap compute devices. That's the green box in terms of cloud spend.
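
As an illustration only, here is what such a function could look like as a Wasm export in Rust. The real service is far more sophisticated, so the pixel logic here is a stub and the export name is made up; the point is simply that one exported function can be instantiated in a browser, on an edge node, or inside Kubernetes.

```rust
// Hypothetical sketch of a background-removal function packaged as a Wasm
// export (e.g. built with `cargo build --target wasm32-unknown-unknown`).

/// Treat pixels close to white as background and make them transparent.
/// `pixels` is RGBA, 4 bytes per pixel, modified in place.
fn remove_background(pixels: &mut [u8], threshold: u8) {
    for px in pixels.chunks_exact_mut(4) {
        if px[0] >= threshold && px[1] >= threshold && px[2] >= threshold {
            px[3] = 0; // fully transparent
        }
    }
}

/// The Wasm entry point: the host passes a pointer and length into linear memory.
#[no_mangle]
pub extern "C" fn remove_background_rgba(ptr: *mut u8, len: usize) {
    // SAFETY: the host guarantees ptr..ptr+len is valid RGBA pixel data.
    let pixels = unsafe { std::slice::from_raw_parts_mut(ptr, len) };
    remove_background(pixels, 240);
}
```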

Maybe instead of running this back in a server-based environment, maybe instead of running a service for this, we run this on the client device itself. Shifting that compute closer to where the data is, because the image itself is probably in some local store or local cache. Maybe you are power constrained. Maybe you need online functionality for further processing. You can run the same code on the edge through things like wasmCloud or other edge provider specific implementations. You don't need a full featured server. A constrained compute environment will work. It's less constrained than the free environment that's running on the mobile device, but maybe very low power. You got pretty fast ingress, pretty cheap egress as well. Lower latency than going back to the server-based environment. You're still running pretty close to the user but you've offloaded that compute. You've exchanged that runtime cost for a network cost, and associated network latency, which, whether that's a bad thing or not for you, really depends on the application, depends on the use case. There still might be some reason to run this in a data center as well, to run this within Kubernetes. You can run this as part of a native application. The real power comes when you're able to run that same background removal function as a WebAssembly module within Kubernetes as well, using that same core common codebase. The developer experience is great because you don't really care about where the code is running. That takes us back to the very original points of the talk. That's one of the original design goals behind all these systems, is being able to abstract away the underlying compute hardware.

Why would we do this in the backend? Maybe you need to transform the image for further operations. Maybe you synchronously need to update a document store because someone else is depending on that data. Maybe you're composing a stack of these single images into an ML model. There's all sorts of different use cases. In that case, the data residency to the backend, to where those more powerful models live, is more important than the data residency locally, and therefore the network cost makes sense. There's no right or wrong answer to the question of, where do I run this code? WebAssembly allows us to choose the best place to run it, and to have that universal compute engine based off the actual product use case.

Use Case 2: Procedural Generation

Let's go to a bit more of a complex example. I personally love this one, not only because it's code I wrote, but also because it introduces the concept of programmatic and decision-based scheduling. This is an example that I gave at WASM I/O earlier this year about how to use Wasm to optimize your data. There are often large chunks of data that are generated through some vector processing. In this case, we're going to use 3D models as an example. You can replace the output 3D model with the set of instructions that are used to generate that 3D model in the first place, in this case, geometry operations. You can represent those instructions through a formal grammar. There's an example on the right: it's just a formal grammar that's been transpiled to C++, and it can be compiled down into a WebAssembly module, so that when you run the entry point of that module, it actually generates the output, it generates the data. This is super valuable, because instead of streaming what might be a very large dataset from some centralized data center, you can shift that generation to the client device, or you can run it on the edge. You can run that generation step at runtime, save some of the bandwidth for transferring what could be a huge binary object, and save some of that latency as well, while exchanging that for runtime costs. It's almost the opposite of the previous example.
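
To give a feel for the idea, here is a toy, hypothetical Rust version of "ship the recipe, not the mesh": a few grammar-like instructions that expand into geometry wherever the module happens to run. The real system uses a formal grammar transpiled to C++, so treat this purely as a sketch.

```rust
// Hypothetical sketch: a tiny set of grammar-like instructions that regenerate
// geometry on demand, instead of streaming the generated vertices themselves.
enum Op {
    Translate(f32, f32, f32),
    Quad(f32),             // emit a unit quad scaled by this factor
    Repeat(u32, Vec<Op>),
}

fn run(ops: &[Op], cursor: &mut [f32; 3], out: &mut Vec<[f32; 3]>) {
    for op in ops {
        match op {
            Op::Translate(x, y, z) => {
                cursor[0] += x;
                cursor[1] += y;
                cursor[2] += z;
            }
            Op::Quad(s) => {
                for (dx, dz) in [(0.0, 0.0), (*s, 0.0), (*s, *s), (0.0, *s)] {
                    out.push([cursor[0] + dx, cursor[1], cursor[2] + dz]);
                }
            }
            Op::Repeat(n, body) => {
                for _ in 0..*n {
                    run(body, cursor, out);
                }
            }
        }
    }
}

// A "program" a few bytes long that expands into many vertices at runtime,
// wherever the Wasm module is instantiated.
fn main() {
    let program = vec![Op::Repeat(
        100,
        vec![Op::Quad(1.0), Op::Translate(1.0, 0.0, 0.0)],
    )];
    let mut vertices = Vec::new();
    run(&program, &mut [0.0; 3], &mut vertices);
    println!("generated {} vertices", vertices.len());
}
```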

Let's see what this looks like in practice. On the client side, you've got a graphics engine that's hooked up to a GPU. It doesn't really matter which one. It's going to determine that data is non-resident. It's going to need to make a network call out to get that data. Why? We've all played video games where that is not necessary; you get everything installed on your device locally. Think of a larger example, think of something like a map. You could store a high-resolution map of the entire world on your cell phone, but that would be a pretty silly idea. There are always going to be use cases where we need high-resolution data that's not present on a local device. In this case, it's going to be a 2-box architecture: we're going to call out to the edge for that. That's running some service that can respond with graphics. That service then has an important decision to make. First, it has to check to see if the data is in cache locally. If not, it has to make sure the cache is hydrated. How that cache gets hydrated isn't super important for this example. Assuming that data is resident, what do we do? Do we send the procedural instructions? Do we send the grammar back to the user, back to the endpoint device, and then run that generation locally? Or do we run that generation step on the edge and send the generated model to the user? For one, we're introducing runtime compute at the edge. For the other, we're exchanging that for network bandwidth. The answer is, you can do either. There are different use cases for doing either.

Let's say you've got three users that are geolocated near each other, in some shared immersive experience, and therefore they're requesting the same tiles repeatedly. In that case, your code can determine that it's getting very frequent cache hits for the same data, so it probably makes sense to generate it once on the edge, rather than run that generation 10 times in quick succession, and send the actual data itself to the user. Maybe a user is running a very powerful device, and you can measure that through some empirical data. In that case, maybe it makes sense to send the grammars to the user and not cache the generated output on the edge. There are a lot of different cases for this, and you might want to actually make a programmatic runtime decision to decide which scheduling case makes the most sense. The end result is the same. Either we send the generated model to the user, or we send the grammar to the user and generate the model there; either way, it gets uploaded to that device's GPU. Then you see an output that looks something like that. The whole process starts all over again. There's no right or wrong answer as to where to run your code. What's important is that the generation for this is written in WebAssembly. It can run on a Wasmtime-based runtime wherever that can live, be it on the edge, be it on the client, be it on the server.
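
That programmatic runtime decision could be as simple as the following hypothetical sketch; the signals and thresholds are assumptions for illustration, and in practice they would come from the measured metrics discussed later in the talk.

```rust
// Hypothetical runtime scheduling decision for the case above: generate on the
// edge when many nearby users keep requesting the same tile, otherwise send the
// (much smaller) grammar and let a capable client generate it locally.
enum TileResponse {
    Grammar,        // send the instructions, the client generates the model
    GeneratedModel, // generate on the edge, send the expanded geometry
}

fn choose_response(cache_hits_per_minute: f32, client_perf_score: f32) -> TileResponse {
    // Thresholds are illustrative; in practice they would be tuned from
    // measured metrics and business weights.
    let hot_tile = cache_hits_per_minute > 10.0;
    let powerful_client = client_perf_score > 0.7;

    if hot_tile || !powerful_client {
        TileResponse::GeneratedModel
    } else {
        TileResponse::Grammar
    }
}
```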

Use Case 3: ML Inference at The Edge

Last example. This is also interesting, because this is a very hot architectural topic these days: how to do machine learning inferencing on the edge. This is a very important topic for a lot of applications. In this case, I'm going to use the example of user impression. Let's say a user does a thing in a browser; it doesn't really matter what kind of client. We want to personalize that user's experience based off of some data that we know about the user. Instead of sending that data all the way back to your data center, you identify who that user is based off of the metrics that are passed in from the client, and personalize that experience: you can look up that profile and run that inference model at the edge instead. This is becoming a pretty common pattern. You'd send those results, that personalization step, back to the user for immediate use after running the inferencing on the edge, and then you'd pass it back to the data center to allow further training or further processing of that data.

Let's describe this system in terms of capabilities. As a user, you do a thing. It doesn't really matter what the thing is. It hits the ingress for the edge, and you've got a profile service that just looks up whether there is data for that user in the edge cache. If there is, it can connect the user to their model, and it can run that inferencing model using tools from WASI on the edge. If it's not in the cache, that's fine: we can, in this case, still communicate somewhat synchronously with the data center and run that same profile service, written in WebAssembly and running in the backend in Kubernetes as well, but hooked up to a database instead. In that case, you can then hydrate the cache in the edge and send the data back to the user. This is a classic case for caching, but the idea is to get as much of the information to the edge node as possible, while still utilizing the backend capabilities, which could be for further retraining. Data scientists seem to always tweak their models. There are always going to be updates to the model as well. We still have a need for that data center, but we're able to write chunks of code and write modules that run regardless of the underlying host architecture, speed up the overall application, and maybe even reduce cost as well. This gives a high-performance user experience and increases the flexibility of the code to run in the place that makes the most sense, given that data locality.
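
Here is a hypothetical sketch of that flow in Rust; the cache, the origin fallback, and the "inference" stub are all illustrative stand-ins for the real profile service and model.

```rust
// Hypothetical sketch of the edge profile flow described above: look up the
// user's profile in the edge cache, fall back to the data center on a miss,
// hydrate the cache, then run inference locally on the edge node.
use std::collections::HashMap;

type Profile = Vec<f32>; // feature vector for the user

struct EdgeNode {
    cache: HashMap<String, Profile>,
}

impl EdgeNode {
    fn personalize(&mut self, user_id: &str, event: &[f32]) -> f32 {
        let profile = match self.cache.get(user_id) {
            Some(p) => p.clone(),
            None => {
                // Cache miss: the same Wasm profile service also runs in the
                // data center, backed by the real database.
                let p = fetch_profile_from_datacenter(user_id);
                self.cache.insert(user_id.to_string(), p.clone());
                p
            }
        };
        // "Inference" stubbed as a dot product between profile and event features.
        profile.iter().zip(event).map(|(a, b)| a * b).sum()
    }
}

// Stand-in for the round trip back to the origin; in reality a network call.
fn fetch_profile_from_datacenter(_user_id: &str) -> Profile {
    vec![0.2, 0.5, 0.3]
}
```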

Where Do I Run My Thing?

Three examples; maybe they've resonated with people, or maybe not, depending on your architecture and your systems. As an architect, as a developer, you probably have a lot of questions, like, where do I run my thing? My first answer is, build the code to run anywhere. Build the code for WebAssembly by default. You will probably pay a higher price for doing that, especially in certain languages, than you might for just deploying it in a container, or deploying it native. I'm not here to claim that you're automatically going to be able to build everything for WebAssembly and it's going to work super well the first time. Pay that slightly higher price, get it working in WebAssembly, test it, and make sure you have the capabilities to run it everywhere first. Utilize some of the upcoming wasi-cloud capabilities to unlock being able to run it in different locations at runtime. Consider what matters most to your business. Are you optimizing for total cost? Are you optimizing for user experience? Do you need failsafe levels of reliability? There are no right or wrong answers here. It's all about what matters to the application itself. Then consider not just the cost, but the cost of performance when you need those additional levels. Sometimes you'll pay a financial premium in order to have a better user experience. Sometimes you can tolerate a slightly degraded experience in order to keep your product margins.

These kinds of decisions lead us to three different categories or buckets. The cheapest way to run things, obviously, is to run it side by side with your data, so there's no latency. You can run totally distributed compute, based off the lowest latency as well. That's an extreme example. Completely private networks, not going over public internet, at all, those are going to be the cheapest and the fastest way to do things. How many applications can squarely fit in that green box? Probably very few things of actual substance. There's other options as well. Do I exchange some of that cheap compute for runtime compute? By runtime I mean online, I mean at an edge or a cloud provider. Can I use methods that reduce latency? Can I shop around for the cheapest egress? Can I shift compute around? Sometimes you need heavier capabilities. Database to edge replication is a cost and is expensive. As I outlined earlier, if you are moving everything from your database to edge, does the edge really make sense? Does it really provide that much value? Maybe, maybe not. I personally don't think so. If you're running a batch process, though, maybe round-trip latency doesn't matter. That's a good example of, I'm doing asynchronous offline processing and exchanging that runtime cost. I think this is just a heuristic framework for how people should be thinking about where to run their code. Nothing is going to squarely fall in one of these boxes, but for this application, pick out the individual items that matter to you.

Dynamic Scheduling

Lastly, this leads us to the concept of dynamic scheduling. Some of what I just described necessarily happens at design time, but I think the real power of WebAssembly is unleashed by shipping your code everywhere: by building for WebAssembly by default, abstracting away your dependencies through wasi-cloud, and being able to ship your code to all of these different form factors, which is going to enable you to make some of these decisions automatically. Shift that compute around dynamically at runtime, depending upon metrics and measured weights. These weights can be set for what matters to your business, things like execution time. You can run high-resolution timers. You can measure things, and you can say, here's the performance of this device as measured in real time. You can measure ping. You can measure latency. I'm sure everyone has played a game where it starts lagging, and then it gets better when the runtime switches you to a different multiplayer server. You can apply these same techniques within your code as well, but you have to weight it according to your application's business metrics. Some costs can be measured.

Then I want to leave with this: I think a dynamic scheduling framework can be built. This is a minimization problem. This is a basic set of components I would consider. You can use your own compute components. This is not designed to be an exhaustive list, but some of the factors I've seen that are common across application architectures. Let's say device CPU: I define this as the sum of the CPU nanoseconds times that weight, times that weighted business cost. Like I said, for real-time applications, every nanosecond counts, so you're going to want to assign a higher cost to that. Your edge cost, which I define as the sum of your resources running on the edge, times whatever rate that edge provider is charging you, times your edge accuracy, which really should be inverse edge accuracy. Essentially, what is your hit rate on your cache and your edge? Do you need to go back to the data center every time? If so, adding the edge is infinitely more expensive. Your storage cost, which is your storage weight times your bytes times whatever rate you're paying for that, plus any cost of replication. If you're duplicating data in multiple environments, you've got multiple data charges, while subtracting any deduplicated data. Maybe you can move it from the backend to the edge and it stays in the edge; it's data that isn't necessary to keep in cold storage, for example.

Latency, which is just the sum of your providers. I say latency plus a factor because some types of latency are better than others. Some are built-in, like if you need to communicate to a database on localhost always, that is a very defined factor. That is something that you're always going to have to do. Maybe you don't want to factor that in your application. Maybe if you're moving compute, I need to pull something from a data center in China while I'm here in New York, that's going to be an extremely expensive factor. We want to weight that heavier. I call this Kubernetes compute, really, this is just your data center compute. Like I said, I think probably everyone's running Kubernetes at this point. The sum of your resources plus the overhead. Remember our compute graph shape, we're mapping client services to backend nodes, times the inverse of that edge accuracy. Whatever the remaining is between your edge and your backend compute, that's going to factor higher on your backend. Then last, obviously, the egress cost, which I define as the weight times the sum across your different providers across every single hop, the sum of your bytes times your rate. That's a pretty easy one to measure and define. This results in the cost of performance. You want to minimize this cost. This is not intended to be an exhaustive list. You can define your own resources. These are what works for us.
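
Pulling those components together, here is a rough, hypothetical transcription of that cost-of-performance function into code. The field names, the weighting choices, and in particular how "edge accuracy" is folded in are assumptions of this sketch; tune all of them to what matters to your own application.

```rust
// Rough sketch of the minimization described above. The components mirror the
// ones in the talk; the exact weights and rates are application-specific.
struct CostInputs {
    cpu_weight: f64,
    cpu_ns: f64,                   // device CPU time consumed
    edge_resources: f64,           // edge resource-units consumed
    edge_rate: f64,                // what the edge provider charges per unit
    edge_hit_rate: f64,            // cache hit rate on the edge, 0.0..=1.0
    storage_weight: f64,
    stored_bytes: f64,
    storage_rate: f64,
    replication_cost: f64,
    dedup_savings: f64,
    latencies_ms: Vec<(f64, f64)>, // (latency, weighting factor) per hop
    dc_resources: f64,             // data center resources plus overhead
    dc_rate: f64,
    egress_weight: f64,
    egress: Vec<(f64, f64)>,       // (bytes, rate) per provider hop
}

fn cost_of_performance(c: &CostInputs) -> f64 {
    let device_cpu = c.cpu_weight * c.cpu_ns;
    // A cold edge cache makes the edge term blow up ("infinitely more expensive").
    let edge = c.edge_resources * c.edge_rate / c.edge_hit_rate.max(f64::EPSILON);
    let storage = c.storage_weight * c.stored_bytes * c.storage_rate
        + c.replication_cost
        - c.dedup_savings;
    let latency: f64 = c.latencies_ms.iter().map(|(l, f)| l * f).sum();
    // Only the requests that miss the edge land on the data center.
    let data_center = c.dc_resources * c.dc_rate * (1.0 - c.edge_hit_rate);
    let egress = c.egress_weight * c.egress.iter().map(|(b, r)| b * r).sum::<f64>();

    device_cpu + edge + storage + latency + data_center + egress
}
```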

Universal Compute (Recap)

This takes us back to our original thesis, the original topic of the talk: what is universal compute good for? We've defined universal compute as using WebAssembly to write your code once and run it everywhere, on a plethora of form factors. You can bring your code closer to where the data lives. Maybe the data that's important is on the backend. Maybe it's on the user side, and you want to shift that compute around to match the cardinality of your data. You've got to make sure that when you do have the data resident, you move your code closest to that use case for the data, which is often where your user is. Through techniques like edge caching, you can minimize that cost. Build your code for WebAssembly. It's the enabler of being able to run your code without assuming your runtime environment, through tools like WASI and wasi-cloud. Outsource your scheduling decisions to a minimum set of heuristics that matter to your application. Consider actually writing code that allows you to pick the best path and the best runtime environment itself. I think the future for WebAssembly as a compute engine is extremely bright. I'm excited about all the possibilities that universal compute is going to bring to drive down costs, increase performance, and give us the best user experience for the next generation of applications.

Recorded at:

Dec 05, 2023
