
Reducing Developer Overload: Offloading Auth, Policy, and Resilience to the Platform


Summary

Christian Posta discusses what developer pain looks like, how much it costs, and how Istio has solved these concerns by examining three real-life use cases.

Bio

Christian Posta is Global Field CTO at Solo.io and former Chief Architect at Red Hat. He is well known in the community as an author (Istio in Action, Manning; Istio Service Mesh, O'Reilly 2018; Microservices for Java Developers, O'Reilly 2016), frequent blogger, speaker, open-source enthusiast, and committer on various open-source projects, including Istio and Kubernetes.

About the conference

The InfoQ Dev Summit Boston software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Posta: I'm going to be talking about offloading auth, policy, and resilience to your internal developer platform. How many people are working on or use an internal developer platform at their company? I work for a company called Solo.io. Solo is based here in Boston, actually, in Cambridge. At Solo, we work on open-source cloud networking technologies. We are very involved in the Istio project, as founders and lead contributors. We lead a lot of the innovation happening in the Istio project, as well as in some of the surrounding periphery projects, some of which we'll be talking about.

Digital Experiences - Driven by APIs and Services

As you know, digital experiences are how we do things today, how we get around. I flew here to Boston. I got on my phone. I booked my travel. How many of you remember what it was like before that? You had to pick up a phone and call a travel agent to book your flights, to book your hotel. You had to call some random number to book a taxi, and hopefully that taxi showed up. Digital experiences have made our lives a lot easier today. We save a lot more time, and that's extremely valuable. APIs and services underpin the apps on our phones and the digital experiences that we use. These organizations, these businesses, see APIs and services, and the infrastructure to get those live and into production, as driving a lot of business value, as differentiating.

The developers, and the time they put into building out these new services and capabilities, are extremely valuable. The challenge is getting code and getting these changes into production. I work with organizations that experience these challenges all the time. A big part of it is the siloed nature of their organizations. They build their code. They stage it. They get it ready. Now you have to make changes to your API management system, to your load balancers, to your firewalls. The way these organizations have been structured isn't conducive to doing this very quickly. They've made decisions in their own silos. The integrations between those silos are very complex and brutal and expensive. If you want to get anything done, what do you do? You open tickets, and you sit and you wait.

Hopefully, these teams will go off to their UIs, point and click through these manual steps to get things done, make the changes, and then eventually you can get things into production. That's why you're probably seeing, and we see this too, working very closely with these organizations, that they're going down this path of building platform engineering teams. They're building internal developer platforms to try to bridge and work across these silos, to build internal APIs, to support automation, so that they can build the tools, the workflows, the UIs, the experiences for their developers to be able to self-service, and build and deploy their code into production as quickly and safely as possible. They build these paved paths or golden paths.

The platform engineering team and the platform itself is intended to be a business value accelerator. We want to improve the ability to get these APIs and these services out into production faster. We want to improve efficiencies. We want to maintain or improve compliance, do things like reduce cost, and lock-in, and so on. This is a trend that we're seeing.

Internal Developer Platforms

How many people have built their own house? It's not easy. It's not something I would want to do. I've lived through a remodel, and don't want to do that again. When you build a house, you don't start with laying electrical wire or putting in doors. You start with the foundation. You start building walls and a roof and so on. I think about internal developer platforms like a house. It's a foundation, a platform from which you can get value. Build value. Do things. Work from home. Raise a family. Raise your pets. Sleep comfortably, and then be productive the next day. Internal developer platforms lay that foundation so developers can actually get their work done and be efficient.

The organizations that we work with are trying to improve their cross-team and silo communications. They're adopting new technologies like containers and cloud, and building automations, CI/CD, putting in observability tools, and so on. There is something missing, and it is fairly glaring when we start working with them. Anybody care to take a guess what's missing from this list right here? Security is very important. Absolutely. These services need to communicate with each other.

Gregor Hohpe just released a book on platform engineering, platform strategy. He identifies the needs for an internal developer platform, what he calls the four-leaf clover: engineering productivity, the delivery pipeline, monitoring and operations, the runtime and compute, all that stuff. Obviously, very important. You need to solve the communication problem. You need to solve the networking problem. When we work with these organizations, we see that they've built the foundations, they've put in the containers and lambda and whatever else they're going to use, CI/CD, and they still have challenges getting code out into production.

T-Mobile did an internal review of why it was taking weeks to get changes out into production, even though they had adopted these modern platforms. You can go and take a look at the talk where Joe Searcy goes into the details of that research and how they built their platforms to solve some of this. They found that 25% of developers' time was spent on nonfunctional requirements, like routing, security, and some reliability concerns. Developers would open these tickets, try to get changes into production, and sit and wait. They also found that 75%, three-quarters, of their production issues and outages were caused by network misconfigurations.

Modernizing Networking

If you've not addressed the networking and communication parts of your platform, and you're still relying on some of the existing approaches that don't quite fit the way you're building your cloud platforms, then your house is not finished. It's probably not a good idea to live in it. Some of the outdated assumptions that we run into, especially in the networking and API management space, are those around how you specify policy. Oftentimes, this is implemented as firewalls and firewall rules and so on in these organizations. Things like, we're going to write our policy in terms of network location or network identity, things like IP addresses. If something in the network changes, those policies become invalidated.

In this very simple case, we're saying, this IP address can talk to this IP address. What we really mean is service A can talk to service B. If we start adding more workloads to the node that has service B, now we have this drift in policy. Service A can now talk to service C also because it's deployed on this IP address, but that's not really what we intended. This is a very simplistic example, and this can get very complicated, but as the network changes, the workloads change and shift and move, you get this policy bit rot, just like you do with code. Another one that frequently pops up in the cloud space is, these IP addresses are ephemeral. A host or a VM can go down and come back up and potentially come up with a new IP address. Or, in Kubernetes, pods will recycle and have new IP addresses. Policies written in terms of those IP addresses are going to be invalidated as IPs get reassigned.
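To make that drift concrete, here's a minimal sketch of an IP-coupled rule expressed as a Kubernetes NetworkPolicy (the names and addresses are hypothetical). The rule says "allow traffic from 10.0.1.5", but what we mean is "allow service A"; as soon as that IP is reassigned or shared, the policy silently stops meaning what we intended:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-from-service-a   # hypothetical rule name
      namespace: team-b
    spec:
      podSelector:
        matchLabels:
          app: service-b
      policyTypes:
      - Ingress
      ingress:
      - from:
        # Written in terms of network location, not workload identity.
        # If service A moves, or service C lands on this IP, the intent breaks.
        - ipBlock:
            cidr: 10.0.1.5/32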

Another one that we see is that these organizations have implemented API management systems to handle things like rate limiting and security and observability, originally intended for external APIs, and now they're using them for internal APIs. Now, to get changes into production, you have to open tickets. It's not uncommon to see organizations saying it takes a couple of weeks to make changes to their load balancers, to their API management systems, and so on. From a workflow standpoint, this causes bottlenecks. From a technology standpoint, these systems have tenancy models where certain misbehaving APIs will impact other APIs, and so on.

So from a technology standpoint, there are bottlenecks and issues as well. These systems can force unnatural traffic patterns. We've seen this many times, where workloads in a single cluster want to communicate with each other, one API calls another. To do that, they're forced out of the cluster, out through load balancers, out into some centralized API gateway, then through the gateway, back down through load balancers, and eventually back into the cluster. These workloads might have been deployed on the same host, but for them to get the policies and authentication, authorization, and observability, they have to go through these unnatural patterns.

One thing that I pointed out with the T-Mobile example is that developers sometimes say, "I don't want to open tickets. I don't want to deal with all this infrastructure stuff. I'll just write it in my code. I'll just write the security stuff, the load balancing, the service discovery stuff. I'll just put it into my code". That gets expensive, doing it across different languages and different frameworks, and it makes the business logic convoluted, because now you have this networking code in there. From a security standpoint, it's not that easy to get right. Using tokens, keys, usernames, passwords, putting that all over the environment, it's very easy for a mistake to creep in and create a security vulnerability. I blogged about this in much more detail, especially around using JWT tokens as service identity. Using JWT tokens for user identity, you log in, OAuth, whatever, that's all good.

For service-to-service communication and workload identity, there's a lot that can go wrong. What we need for modern networking is something that's not highly centralized. We need a distributed implementation. We need to tie it into our existing infrastructure as code, GitOps-style workflows, and automation. We want standard interfaces to be able to integrate with other pieces. There's not going to be one technology that does everything, so we need standard interfaces. We still need to do the networking stuff that we've been doing already: traffic routing, load balancing, authentication, authorization, rate limiting, observability. If you've built your platform and you don't have those capabilities, then your house is unfinished.

Finishing the House (Istio)

Like I said, I remodeled my house about 5 years ago now. It was a bit of a pain. Eventually, we did install new pipes, new electrical, new ACs, which matter a lot in Phoenix right now. To live comfortably in your house, you need those pieces. The analogy is, the locks on the doors are like the authentication/authorization for specific requests, for every request between services. You need load balancing, retries, timeouts, circuit breaking, zone-aware load balancing, things like telemetry collection, distributed tracing, logging. Then, like I said, integrating with other parts of the system. Maybe you have a policy engine like OPA, or Kyverno, or something, or your own homegrown one.

Maybe you have your own existing API gateway, and you need to integrate with that as well. We need nice, standard interfaces, just like when you're building a house, to assemble these components. It needs to work for all the applications in your environment, regardless of what language, regardless of what framework they're using. We don't want the developers to go off and re-implement this themselves. That's where things like a service mesh come into the picture. Something like Istio, I've been working on it for about seven-and-a-half years now, comes into the picture to solve this problem, transparently enabling things like mTLS, mutual authentication, telemetry collection, distributed tracing, and traffic control.

Anybody using Istio? Istio has been out for a bit, and has traditionally been implemented by deploying proxies next to the instances of your workload. If it's a Java app, next to your JVM. If it's a Python app, next to the actual process that runs the Python code. If it's in Kubernetes, actually inside the pod. Deploying this infrastructure-y stuff into your applications creates coupling and friction.

First of all, how do you onboard? How do you get applications on? Now you have to inject this thing into them. If you already have other sidecars, do those play nicely together? How do you do upgrades? You've got to restart all your apps because you've got a new sidecar. Then there's the performance, there's the overhead. Sidecars were a necessary evil to implement this type of functionality, but this is the last I'm going to talk about sidecars. We're going to talk more about the functionality of the service mesh, and we'll dig a little bit into how we implement this functionality without using sidecars. In September 2022, I think, we announced publicly in the Istio community an implementation of Istio that doesn't use sidecars. In May, we finally got it to the point where it's in a state that's usable in production. We have people using it in production already.

Demo

Let me just show you a quick demo of what that looks like. In my demo app, we have three apps in three different namespaces. We have web API. We have recommendation. We have purchase history. If I go into web API, you'll see we have an app running there. This signifies the web API team; there are different teams across the Kubernetes cluster. We don't have Istio. We don't have any sidecars or anything installed. Recommendation looks the same. Purchase history actually has two different versions, which we'll use later in the demo to illustrate routing. In the default namespace, there's a sleep app that we'll use as a client, at least in this next step. In httpbin, there's another sample application. Web API calls recommendation.

Recommendation calls purchase history. If we call web API through the sleep app, we'll see that, indeed, web API calls recommendation and recommendation calls purchase history here. We're going to come over here and we're going to actually start our demo. What we're going to do is we're going to install Istio. This will take a second, but you'll notice in the command here, I'm going to use the new ambient profile. This is going to install Istio's ingress gateway, which is just a default. It's an Envoy proxy for getting traffic into the mesh or into the cluster. It's going to install the Istio control plane, Istiod. Then it's going to install a couple components that will enable us to do the sidecar-less implementation of Istio. I'll talk a little bit more about what those components are.
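For reference, the install step shown in the demo boils down to something like this (a sketch; the addon manifests assume the samples directory that ships with an Istio release):

    # Install Istio with the sidecar-less ambient data plane
    istioctl install --set profile=ambient --skip-confirmation

    # Observability addons used later in the demo
    kubectl apply -f samples/addons/prometheus.yaml
    kubectl apply -f samples/addons/grafana.yaml
    kubectl apply -f samples/addons/kiali.yaml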

As part of the installation, I want to install Grafana. I want to install some observability apps so that we can see some of the tracing. We'll give it a second. We'll take a look. You can see here in the namespace list, Istio system now appears. I click into there. Let's see, things are still coming up. Grafana's up. Istiod, right here, the control plane, the ingress gateway, and a few other components, ztunnel. We'll wait for Kiali to come up. We'll use Kiali here in a second. The first step we're going to do is we want traffic into the mesh. We want it to come in through the Istio ingress gateway. We're going to apply some routing policies to allow traffic in through the ingress gateway. The ingress gateway is exposed on this external IP address.

Once we apply this policy, we'll end up using that IP address to make the call. The routing policy is very simple, match on a particular host, istioinaction.io. Then, route it to web API. Like I said, traffic comes into web API. Web API calls recommendation, which calls purchase history. We'll do that. If we actually make a call through that IP address back here, you can see that we're getting the correct response, and it goes through the Istio ingress gateway. What we're going to do is we're going to add our web API, recommendation, and purchase history, to the mesh, and we're going to do that by labeling each respective namespace. In this case, we'll also do the default namespace. There are some sample apps there. It will label the recommendation, and then the last one here, purchase history. That's it. Our apps are now part of the service mesh. There is no sidecar running here, as you can see. This is Istio Ambient mode.
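The routing policy and the namespace labeling look roughly like the following (a sketch, assuming the demo's service names and port; in ambient mode, the istio.io/dataplane-mode=ambient label is what adds a namespace to the mesh):

    apiVersion: networking.istio.io/v1
    kind: Gateway
    metadata:
      name: web-api-gateway
      namespace: istio-system
    spec:
      selector:
        istio: ingressgateway
      servers:
      - port:
          number: 80
          name: http
          protocol: HTTP
        hosts:
        - "istioinaction.io"
    ---
    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: web-api
      namespace: istio-system
    spec:
      hosts:
      - "istioinaction.io"
      gateways:
      - web-api-gateway
      http:
      - route:
        - destination:
            host: web-api.web-api.svc.cluster.local
            port:
              number: 8080   # assumed service port

Adding the namespaces to the mesh is then just the labeling step:

    kubectl label namespace default istio.io/dataplane-mode=ambient
    kubectl label namespace web-api istio.io/dataplane-mode=ambient
    kubectl label namespace recommendation istio.io/dataplane-mode=ambient
    kubectl label namespace purchase-history istio.io/dataplane-mode=ambient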

If we come over here to the Kiali console, yes, let's go ahead and port forward that. Let's do some port forwarding and get the Kiali console to come up. We don't have much traffic going through here, so let's get some traffic. We'll send about, I think it's 250 requests through the gateway, and that's good. We should take a look at our workloads that we have deployed here, web API, recommendation, and purchase history. We see that Kiali recognizes the workloads here. If we look at the Istio config, we can see the Istio config that's been deployed just not much going on, really, just allowing traffic into the ingress gateway. Then, lastly, if we click on the traffic graph? We still don't see the traffic. Give it one more run, generate the metrics.

The metrics end up going into Prometheus. Kiali scrapes Prometheus, and should show the traffic flow. There we go. We can see web API calls recommendation, which calls purchase history. What we also see is through the lines between the different services, we see this lock. If I click on these locks and look off to the right-hand side here, we see that Istio is enabled. We have mTLS enabled between these services. The services are using SPIFFE workload identity, which we'll talk about. We've done nothing more than just label the namespaces. We've already got the services into the mesh. They're protected by mTLS, so their traffic's encrypted. We have workload identity assigned here. That's pretty good for just labeling a namespace.

Istio Ambient Mode (High-Level Overview)

The way Istio Ambient works, just at a high level, is it deploys an agent to each of the nodes in the cluster. This agent binds directly into the pod's network namespace. What that means is traffic will leave the pod, but Istio Ambient, specifically the ztunnel, will have already done some things to that traffic. In this case, what it's doing is matching an mTLS certificate to that pod and enabling mutual TLS. Once the traffic leaves the pod, it is already encrypted. It is already participating in mTLS. It's being tunneled to whatever the destination is. Obviously, the other side will participate in the mTLS as well. I'm certain I'm going to get this question, so I'll draw a picture of exactly what the ztunnel is doing. Traffic is not going from the pod to the ztunnel. It is leaving the pod already having been encrypted by the ztunnel.

The ports are opened up inside of the network namespace of the pod, so we get the same behavior that we do with sidecars actually deployed into the pod, but without deploying sidecars. That's great for layer 4 connections and mTLS, but what about layer 7: things like request-based routing, retrying a request that failed, or JWT validation and authentication and authorization? For that, what Istio Ambient does is inject a layer 7 proxy, which we call the waypoint proxy, into the traffic path.

Since we already control the traffic with the ztunnel, if there are layer 7 policies, we can route it to a layer 7 proxy. That layer 7 proxy, we don't want to treat as some big, centralized gateway. What we want is better tenancy for it. In Kubernetes, the default is to deploy a waypoint proxy per namespace, so for each namespace, you have your own layer 7 proxy. If you need more fine-grained tenancy, you can deploy a waypoint proxy per service identity, or service account in Kubernetes, for example.
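Deploying a waypoint is driven by the Kubernetes Gateway API. A minimal sketch of what the default per-namespace waypoint looks like (istioctl also has a helper for this):

    # Helper form:
    #   istioctl waypoint apply -n default
    # which creates a Gateway resource along these lines:
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: waypoint
      namespace: default
    spec:
      gatewayClassName: istio-waypoint   # Istio provisions the layer 7 proxy
      listeners:
      - name: mesh
        port: 15008
        protocol: HBONE   # the tunneling protocol ambient uses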

You can control the tenancy of these layer 7 proxies. You don't get these big centralized API gateways, but you do have API gateway functionality. These proxies now live in the network somewhere, and they can be scaled. If you want high availability, you scale up multiple of these waypoint proxies. You can size the proxies more appropriately to the traffic that is destined, in this case, for pod B. With the sidecar approach, you just scale up more pod B's, but then you get more sidecars, more proxies.

We've done a lot of performance analysis and resource cost analysis, comparing it to sidecars, comparing it to other service meshes. This was a fairly large deployment. I think we were doing on the order of 112,000 requests per second through this environment and setup. We took measurements of the baseline, what ambient looks like, what other service meshes look like, what the sidecar looks like, for comparison. Because of this optimization, where if you just need mTLS, you don't have to inject sidecars, you don't have to do anything, this network path becomes extremely fast. It's just layer 4. If you need layer 7 capabilities and inject the waypoint, this also ends up being faster than the sidecar, because we're processing HTTP only one time.

The sidecar does it twice, once on each side. The cost of the network hop that you take to get to the waypoint, we've seen in our performance testing, is lower than having the sidecars perform the HTTP parsing and all the stuff that they need to do. Ambient ends up being simpler to onboard, simpler to upgrade and maintain, especially for security patches and all the things you have to do. It's a fraction of the CPU and memory resources that need to be reserved, because you don't run sidecars. Performance is improved, especially in the layer 4 only cases. Security actually gets a slight improvement as well. I'll leave you with this link: https://bit.ly/ambient-book. Lin Sun and I wrote a book, "Istio Ambient Explained", which goes into a little bit more detail about Istio Ambient. Go ahead and take a look at the istio.io website. Istio Ambient, like I said, just became available for production usage. It will eventually become the default data plane. Not right now, though; sidecars won't go away. That's the path that we're on with Istio.

Auth, Policy, and Resilience

Let's talk in more detail about auth, policy, and resilience, and how moving those to the platform makes a lot more sense, drives down costs, and so on. We'll look at a few examples. One is Trust Bank. I'm going to use public references, so if you're interested, you can go look them up. Trust Bank was a new digital bank starting up as a joint venture between a bunch of other big banks in Singapore, and they went from nothing to a million users in a very short amount of time. They were cloud first. They built on EKS and AWS, but for their networking components, they used Istio. The problems they were trying to solve were around compliance, security, and encrypted traffic. They started off with a handful of clusters; they wanted to add more clusters and deploy workloads into more clusters, but they didn't want downtime for their apps. They wanted to be able to move apps to different clusters, different regions. They started to encounter regulatory concerns around data gravity and that kind of thing as well. They needed to be able to control the routing.

From an authentication and authorization standpoint, they didn't want to force everything through these centralized systems. They needed a decentralized networking approach. That's where Istio came into the picture. We talked a little bit about what the existing assumptions and existing approach to defining policy look like, with IP addresses and firewalls and so on. What we want to solve is the service-to-service communication and authentication problem. How does service B know that it actually is service A that's calling it? That's where a specification called SPIFFE comes into the picture. SPIFFE is a spec for workload identity and how to get the credentials that prove you are a certain workload. It specifies what they call a SPIFFE Verifiable Identity Document, or SVID, which is usually in the form of an X.509 certificate. It doesn't have to be, it can be other formats, but X.509 is a common one.

Then there are workflows for how a workload gets that document: how does service A prove that it is service A? The way it works is, service A will request its verifiable identity document, an X.509 cert that says, I'm service A. The SPIFFE workflow and the implementations behind the scenes will then go and say, I need proof that you're service A. I'm going to go do a background check, basically. I'm going to attest that you are indeed service A. I'm going to check the machine that you're running on. I'm going to check the context that you're running in. I'm going to check the attributes assigned to you. If that all lines up and you really are service A, then I'll give you this document. That X.509 document is presented by service A to service B, saying, I am service A.

Service B can look at it and check the authenticity of that document. A common way to do that is TLS and mTLS. This is where Istio comes into the picture. Istio can automate the mTLS connection transparently for your applications. Istio implements SPIFFE, so they work very nicely together. Now, if it's cryptographically provable that workload A is workload A, and it's talking with workload B, and we know these identities, and these identities are durable, we can write policy about which services are allowed to talk with which other services. In regulated environments, this type of policy is extremely important. Istio allows you to write this type of policy in terms of workload identity, in a declarative format that can be automated and fit into your GitOps workflows and pipelines.

Demo

I'm going to jump into a quick demo that shows what that looks like. Again, we have web API calls recommendation, which calls purchase history. The first thing that we're going to do is lock down all communication in the mesh. We're going to deny all. In the real world, you can't just come in and shut off all traffic. You can iterate and incrementally add these authorization policies and eventually get to a zero-trust environment where you deny all. For this demo, we'll start by just locking down all traffic. The only thing we will allow is calls to the ingress gateway. We'll apply this authorization policy, and we're going to try to make a call to the gateway.
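The deny-all step is a single mesh-wide authorization policy in the Istio root namespace: an empty spec matches all workloads with an ALLOW action and no rules, so no request matches, and everything is denied:

    apiVersion: security.istio.io/v1
    kind: AuthorizationPolicy
    metadata:
      name: deny-all
      namespace: istio-system   # the root namespace applies mesh-wide
    spec: {}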

You can't really tell from the error message, but the call makes it to the gateway, and the gateway says, I can't route this to web API. Traffic can't proceed in the mesh right now; everything is denied. What we want to do is adjust the policy. We want to allow the ingress gateway to call web API. Using an Istio authorization policy, we're going to do that based on identity, not what cluster this thing is running on, or what IP address it's running on, or what pod it is. We're going to do it based on workload identity, the SPIFFE workload identity that I just described. We're going to allow that traffic for web API. If we allow it and now make a call, we should see traffic go from the ingress gateway to web API.
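A sketch of that allow policy, written against the SPIFFE identity of the ingress gateway rather than any IP address (the workload label and service account name assume Istio's default ingress gateway install):

    apiVersion: security.istio.io/v1
    kind: AuthorizationPolicy
    metadata:
      name: allow-ingress-to-web-api
      namespace: web-api
    spec:
      selector:
        matchLabels:
          app: web-api   # assumed workload label
      action: ALLOW
      rules:
      - from:
        - source:
            # Workload identity, not network location
            principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"]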

Not surprisingly, the rest of the traffic is still disallowed. Let's go ahead and add that. We'll add the policies to allow traffic between the rest of the chain of the services. Now, if I make that call through the gateway, everything proceeds. The calls work again. From the sleep app, which is in the default namespace, I shouldn't be able to call web API. Ingress gateway can. Other apps cannot. Let's take a look. I try to call it. That does not work, which is what we expect.

So far, this has been mutual authentication and policy enforcement based on identity, but we can be even more fine-grained than that. We can specify policies about what services can call what other services, and the specific endpoints, specific HTTP verbs they can call, specific headers that have to be present, or not present. We'll see a little bit of that in this next part. I want to allow the sleep service to call httpbin, but only at /headers, and only if this x-test-me header is present. Now we're looking at layer 7. We're looking at the request itself, the details of the request, and we're going to build some authorization policies in terms of some of those details. In Istio Ambient, if you're going to do anything with layer 7, you need that waypoint proxy that I mentioned.

This is that layer 7 proxy that gets injected into the network path. Those get deployed in Istio Ambient by default, one per namespace. I just created this waypoint proxy, and if I come into the default namespace, we can see sleep, which calls httpbin. Now we've included this new waypoint proxy, and we're going to apply that policy to allow sleep to call httpbin. It won't work if we call /ip; that was not allowed, that was not part of our authorizations. If we call it with the right header on the right path, the call will go through. We get very fine-grained layer 7 authorization policies, declaratively, with Istio.
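A sketch of that layer 7 policy: it targets the namespace's waypoint, allows only the sleep identity, and only for /headers requests that carry the x-test-me header (resource names assumed):

    apiVersion: security.istio.io/v1
    kind: AuthorizationPolicy
    metadata:
      name: allow-sleep-to-httpbin
      namespace: default
    spec:
      targetRefs:               # enforced at the waypoint, not the ztunnel
      - kind: Gateway
        group: gateway.networking.k8s.io
        name: waypoint
      action: ALLOW
      rules:
      - from:
        - source:
            principals: ["cluster.local/ns/default/sa/sleep"]
        to:
        - operation:
            paths: ["/headers"]
        when:
        - key: request.headers[x-test-me]
          values: ["*"]         # header must be present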

Traffic Control and Traffic Routing

The last section here goes into a little bit more of the traffic control and traffic routing. Intuit has given a number of presentations on how they've built their platform using Istio multi-cluster. One of their big use cases is, when they deploy a service and make an upgrade, they deploy the new version of that service into a different namespace. They might move it into a different cluster. What they need is globally-aware routing. When we deploy a new service, we don't want to take an outage. Another company gave a similar talk, but their motivation for needing that global routing and failover was data gravity and compliance.

GDPR, for example: you have to keep your data in a certain region, and if you want to access it, you have to go to that region. Istio is really good at controlling traffic: load balancing, being zone aware, being multi-cluster aware, and routing across multiple clusters. When we have traffic control down to the level of a request, we can also implement things like resilience. We can be resilient in terms of load balancing. We can be smart and cost-optimized in terms of load balancing. We can also do things like timeouts, retries, and circuit breaking, and offload that so app developers don't have to worry about it.

Like I mentioned, globally-aware routing. If a service is talking to a service in one cluster, but the destination service fails, we can fail over to a different cluster transparently, and still have mTLS, still have our authorization policies enforced, and do it in a smart way. We're not going to automatically fail out to a different region. We would try to prefer locality. We'll try to prefer zonal affinity. Then fail out to a different region as necessary. I mentioned circuit breaking. If I'm making a call to a service and it's misbehaving, stop calling it, or at least for a period of time, back off. Then, slowly, try to call back into it.
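Locality preference and circuit breaking are both declared in a DestinationRule. A minimal sketch (host name and thresholds are illustrative; multi-cluster failover needs additional mesh setup beyond this): the outlier detection section is the circuit breaker, ejecting a misbehaving endpoint and slowly letting traffic back in, and it's also what drives the locality-aware failover:

    apiVersion: networking.istio.io/v1
    kind: DestinationRule
    metadata:
      name: purchase-history
      namespace: purchase-history
    spec:
      host: purchase-history.purchase-history.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          localityLbSetting:
            enabled: true          # prefer same zone, then region, then fail over
        outlierDetection:
          consecutive5xxErrors: 5  # eject an endpoint after 5 straight 5xx responses
          interval: 30s            # how often endpoints are evaluated
          baseEjectionTime: 60s    # back off before trying the endpoint again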

Demo

This is the last part of the demo. Purchase history has two deployments, a v1 and a v2. If I call that, we see, in this instance, we had a response from purchase history v2. I call it again, also v2 it looks like. We should see it load balance. There we go. There's a v1, v1, v2, so it'll load balance about 50/50, which is what Kubernetes does automatically, out of the box. What we want to do is force all the traffic to v1. That'll be the production version. We'll write a default routing rule here that says 100% of the traffic should go to v1, 0% should go to v2. However, we might want to introduce v2 as a canary. We want to be very specific and fine-grained about what services can call the v2 version.

To do that, we'll add a match that looks for a specific header, and if it has this header, then we'll route it to v2. Again, we're doing Istio layer 7 stuff, so we're going to add a waypoint proxy into the purchase history namespace. We can see that down here at the bottom. Let's go ahead and apply that routing rule. Now let's start making calls. Actually, we have the deny-all authorization policy, so we have to allow traffic for the waypoint. Now what we're going to do is call the services 15 times. We're going to use jq to pull out the response bodies, and we see, 15 times in a row, the call ends up going to purchase history v1. For the canary part, you can't see it, but at the bottom here, we called the services with the header, which triggers that match in the routing and routes it to purchase history v2.
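The routing rule just applied looks roughly like this (the canary header name is hypothetical; the subsets select on the pods' version labels and, in practice, would live in the same DestinationRule as the resilience settings sketched earlier):

    apiVersion: networking.istio.io/v1
    kind: DestinationRule
    metadata:
      name: purchase-history-subsets
      namespace: purchase-history
    spec:
      host: purchase-history
      subsets:
      - name: v1
        labels:
          version: v1
      - name: v2
        labels:
          version: v2
    ---
    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: purchase-history
      namespace: purchase-history
    spec:
      hosts:
      - purchase-history
      http:
      - match:
        - headers:
            x-dark-launch:      # hypothetical canary header
              exact: "v2"
        route:
        - destination:
            host: purchase-history
            subset: v2
      - route:                  # default: everything else goes to v1
        - destination:
            host: purchase-history
            subset: v1
          weight: 100
        - destination:
            host: purchase-history
            subset: v2
          weight: 0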

What if purchase history is misbehaving? What if it's returning errors? I just deployed a version of purchase history that returns 500 errors half the time. As we can see here, we call it a handful of times. With Istio, we can do timeouts, retries, circuit breaking, that type of thing. Let's take a look at a retry policy that we want to add here. If you're calling purchase history v1, down here at the bottom, you can see, I want to implement retries. I'll retry up to three times on 500-type errors. If I apply this and then make calls, we should see the calls succeed every time. They might be failing in the background, and we can check the retry metrics, but the retries are kicking in and making it so that the call eventually succeeds.
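That retry policy is a small addition to the default route in the VirtualService sketched above (the per-try timeout is an assumed value):

    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: purchase-history
      namespace: purchase-history
    spec:
      hosts:
      - purchase-history
      http:
      - route:
        - destination:
            host: purchase-history
            subset: v1
        retries:
          attempts: 3            # retry up to three times
          retryOn: 5xx           # ...on 5xx-class responses
          perTryTimeout: 500ms   # assumed per-attempt budget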

Conclusion

Istio in general provides a lot of capabilities for fine-grained modern networking. Solving this networking challenge should be part of your modern internal developer platform. You can see that here in the stack. This is how it lines up in block diagrams. One thing I will say, and you'll notice here, Istio is buried down here in the bottom of the stack. This is networking. Application developers shouldn't have to know about Istio. The APIs, the workflows, the interfaces, the experiences that you build for your developer platform should enable things like maybe doing a canary release, or publishing telemetry and metrics to Grafana dashboards and so on, so that they can see what's happening with their services.

Things like security policy are probably not driven by developers, but if your workflow includes that, then your workflow can generate those authorization policies and the details around which services can call which other services or API endpoints. All of that stuff should be automated away. Developers shouldn't have to know about this. From a platform standpoint, what are the business outcomes of the platform?

Originally, I mentioned, you want to increase business value. You want to increase compliance. You want to make things more efficient, and reduce cost. From a value standpoint, your code on the developer's laptop or in CI does nothing unless it gets into production. You can't get value out of it there. Tools like Istio, implemented like this, allow you to do safer releases, canaries, blue/green deployments. A lot of network telemetry can be pulled back, distributed tracing, and so on, so you can make decisions about whether to go forward.

 


 

Recorded at:

Jan 29, 2025
