Well, at Netflix, Insight Engineering is the group responsible for developing Netflix's real-time operational insight systems, so we develop the platforms that help Netflix know, basically, what goes bump in the night and what we should do about it.
So let's use a different word and talk about monitoring. Traditionally every company has some sort of monitoring solution, and if you think about monitoring, monitoring is really about how is this server doing, how is that system doing, and so on. We tend to think about this a little more holistically: monitoring is just a first step in figuring out what we actually need to do to manage our operations. If you think about monitoring as observability, the ability to know in relatively fine detail how any given server in our environment is doing, how any given application is doing, then the job of Insight Engineering is to build the platforms that give us that observability, so monitoring platforms and telemetry platforms, and then to build on top of that the ability to create insight, in other words decisions, recommendations and self-healing systems, that kind of thing. So data is necessary, but it's not sufficient, to actually manage today's production environments.
Barry: And just give us an idea of the scope of this operation: how many people do you manage, how many people are required to do what sounds to me like a monumental task?
Well, our job is just to build the systems for everybody else to use. My group is about 10 people plus me, and they do the heavy lifting.
Barry: And before this interview we spoke about canary analysis, that’s one of your main thrusts, I want to find out all about canary analysis.
So canary analysis, this is what I actually came to QCon to talk about. Generally, canary analysis is a deployment pattern. In other words, you've got, let's say, version 1.0 of a server working in production, and you want to roll out a new version of that server to provide the same services you provided before, but with some enhancements of course, to, hypothetically speaking, 48 million streaming customers. You could just deploy the new version into production and hope for the best, or you could do what we call canary analysis, which is: let's say you've got 1000 servers running 1.0, let's deploy one server running 1.0.1 and shunt a proportional amount of traffic to that server, so now you've got 1.0.1 with about 1/1000th of the traffic that the 1.0 cluster is doing, and then you compare this canary to the baseline cluster and see, is it doing reasonably well, is it doing worse, better? If it's doing well, maybe let's increase our investment, increase our commitment to 1.0.1 by increasing the number of servers running 1.0.1 to, let's say, 100, so now approximately 10% of your traffic is going to the 1.0.1 cluster, and then you again compare the canary cluster to the baseline cluster. If it looks ok, then maybe you finally go to 1000 of the new server and 1000 of the old server, and at that point 50% of your traffic is being handled by the new systems, and if it's still looking ok, then you can shut down the old systems and you've moved forward to the new system environment. So it's not a test pattern, it's more a deployment pattern; it's a way to deploy changes in production that we think is a little more holistic, a little safer, gives you a greater degree of certainty and minimizes the risk to your customers.
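To make that staged rollout concrete, here is a minimal, self-contained sketch of the go/no-go comparison at each stage. The cluster sizes mirror the 1/1000th, roughly 10% and 50% stages described above; the metric names, values and the 5% tolerance are made-up placeholders, not Netflix's actual tooling or thresholds.

```python
# A minimal sketch of the go/no-go comparison at each rollout stage.
# Metric names, values and the 5% tolerance are made-up placeholders.

def canary_healthy(canary, baseline, tolerance=0.05):
    """Return True if the canary is doing at least as well as the baseline,
    allowing a small relative degradation defined by `tolerance`."""
    for metric in ("error_rate", "latency_p99_ms"):
        if canary[metric] > baseline[metric] * (1 + tolerance):
            return False
    return True

# Rollout stages from the description above: 1 canary against ~1000 baseline
# servers, then ~100, then 1000 (roughly 0.1%, 10% and 50% of traffic).
stages = [(1, 1000), (100, 1000), (1000, 1000)]

baseline = {"error_rate": 0.0100, "latency_p99_ms": 120.0}   # aggregated baseline metrics
canary   = {"error_rate": 0.0102, "latency_p99_ms": 118.0}   # aggregated canary metrics

for canary_size, baseline_size in stages:
    share = canary_size / (canary_size + baseline_size)
    verdict = "proceed" if canary_healthy(canary, baseline) else "roll back"
    print(f"{canary_size} canary server(s) (~{share:.1%} of traffic): {verdict}")
```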
Barry's full question: Now there is a certain difference between real canaries in mines and the pieces of software and servers that you deploy in real life. One that occurs to me is incompatibilities between version 1.0.1 and version 1.0: by putting out one unit with version 1.0.1, don't you have to worry not so much about how version 1.0.1 is going to operate on its own, but about how it will operate alongside version 1.0? Can this be a problem in doing canary analysis?
Yes, it can. For us, what we've found is that, generally speaking, because of our service-oriented architecture approach, where we don't have a monolithic service but hundreds and hundreds of interrelated services, we don't tend to see changes in a server come along with a required change in interface, because if you change the interface on the server then you also need to make sure that everybody else knows about that change. So we tend to think of software deployments as decoupled from interface changes. Let's say, for example, you have a new API that you want to support; generally speaking you wouldn't want to immediately deprecate the old API anyway, because whether you deploy one instance of 1.0.1 or a thousand, anybody who is talking to that server will need to migrate to using the new API. So irrespective of the canary analysis, what you would probably say is: my API endpoint today is /api/v1, the new server might also support /api/v2, but it should still support /api/v1, and so those two server versions should be able to work at the same time, whether or not you use canary analysis.
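The point about a new version keeping the old endpoint alive while adding the new one can be illustrated with a small sketch. It assumes a generic Python web framework (Flask here) and a hypothetical /titles resource; only the /api/v1 and /api/v2 paths come from the discussion above.

```python
# Sketch of a "1.0.1-style" server that serves both API versions at once,
# so old and new server instances can coexist during a canary rollout.
# The /titles resource and payloads are hypothetical.

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1/titles")
def titles_v1():
    # Old contract: both 1.0 and 1.0.1 serve this, so clients on the old API
    # keep working no matter which server instance handles their request.
    return jsonify({"titles": ["..."], "version": 1})

@app.route("/api/v2/titles")
def titles_v2():
    # New contract: only 1.0.1 serves this; clients migrate on their own
    # schedule, independently of how many 1.0.1 instances are deployed.
    return jsonify({"titles": ["..."], "version": 2, "paging": {"next": None}})

if __name__ == "__main__":
    app.run(port=8080)
```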
Barry's full question: Now you describe a procedure where you introduce one unit, one server and I imagine when you do that you learn all kinds of things, and then you say you gradually up the ante to maybe a hundred servers or half of the servers that you have, what kinds of things are you likely to learn as you up the ante that you didn’t learn when you introduced the first server?
Well, one of the problems with having just one server, and I talk about this in my talk, is that, at least in a public cloud environment (we run in Amazon), you get a certain degree of inconsistency between servers. Servers can be outliers: you are running virtual machines on hosts, and maybe you have one server that's running on a host that's not busy at all and has fantastic performance, or maybe that server is running on a host that's incredibly busy and therefore has terrible performance. Deploying one server is actually less good than deploying multiple servers because of that outlier effect. A single canary is good for figuring out whether or not you've got significant problems; one canary may be able to tell you whether or not you've got significantly high error rates, for example, and that you should pull back right then. But generally speaking we find that we have the best fidelity, the best ability to compare baseline to canary, when you have a group of canaries that you can look at the aggregate signal from, because then we can see the effect of any given outlier within that group.
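A tiny numerical illustration of the outlier point, with made-up latency numbers: a single canary that happens to land on a busy host looks alarming on its own, while the aggregate signal from a small group of canaries does not.

```python
# Made-up per-server p99 latencies; one canary happens to sit on a busy host.
from statistics import median

baseline_latency_ms = [100, 102, 98, 101, 99, 103, 97, 100]   # baseline cluster
canary_latency_ms   = [101, 99, 100, 250, 102]                 # canary group, one outlier

single_canary = canary_latency_ms[3]        # if this were your only canary: looks terrible
group_signal  = median(canary_latency_ms)   # aggregate over the group: looks fine

print(f"single outlier canary: {single_canary} ms vs baseline median {median(baseline_latency_ms)} ms")
print(f"canary group median:   {group_signal} ms vs baseline median {median(baseline_latency_ms)} ms")
```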
Barry: Ok, so now give us some of the details. I've got the general idea, deploy slowly, that's a two-word summary, but there must be more gotchas involved in doing this, because otherwise it wouldn't be given a formal name such as canary analysis.
You know, it's interesting. Whenever we do a talk about how we do things at Netflix, it's very easy for these things to become very aspirational. I have two analytics engineers with masters degrees in artificial intelligence working for me, and a PhD who's worked on this, and a whole bunch of platform people; we've invested a whole bunch of money in this and we have a really fantastic canary analysis story. But I would actually argue that the thing I want people to get is that canary analysis really is just "deploy slowly". It's a little more than that; I would argue that at minimum canary analysis is: deploy slowly, throughout the process of deploying slowly look at your canaries, have the metrics to understand whether or not your canaries are behaving well, and be able to segregate your metrics between canaries and baseline. That's it. I think most people probably have that ability, most people could probably use that ability to great effect, and if you just do that and you haven't until now, well, frankly, my job is done. Everything beyond that is a process of continuous improvement, so if you've got that, where do you go from there?
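As a sketch of that minimum viable version: the only real requirement is that every metric data point is tagged with the group that emitted it, so the canary and baseline series can be compared side by side. The in-memory store and metric names below are hypothetical, just to show the shape of the idea.

```python
# Tag every metric data point with the group (canary or baseline) that emitted
# it, so the two streams stay separable and comparable. Names are hypothetical.

from collections import defaultdict

metrics_store = defaultdict(list)

def record(metric_name, value, group):
    """Store each data point under (metric, group) so groups never get mixed."""
    metrics_store[(metric_name, group)].append(value)

# During the deploy, both groups keep reporting...
record("errors_per_sec", 0.2, group="baseline")
record("errors_per_sec", 0.3, group="canary")
record("errors_per_sec", 0.1, group="baseline")

# ...and the comparison is simply looking at the two series side by side.
for group in ("baseline", "canary"):
    series = metrics_store[("errors_per_sec", group)]
    print(group, sum(series) / len(series))
```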
Well, one thing is: how do you figure out which metrics matter to you? For us, what we found was that we have a system where any developer can instrument their application to send any number of metrics they want, it's entirely up to them, and that means we have some metrics that are really useful for figuring out whether your canary is doing better than your baseline, and some metrics that are frankly less useful. Metrics that are more business-oriented than application- or system-oriented tend to be less relevant to figuring out whether your canary is working at least as well as your baseline. So metrics like, for example, CPU on a machine, requests handled on a machine, errors on the machine, those are a better indicator of canary success than, let's say, the amount of dollars we collected in sales from a given machine in a given period.
Barry: And why would that be the case?
Because if you think about the traffic that the canary gets, you really want to look at the characteristics of what the machine does, the characteristics of the work and the quality of that work, and so metrics that describe the quality of the work the machine does are more useful than, for example, dollars, which are less relevant to version 1.0.1 of the server versus version 1.0. If you think about a server as essentially managing requests, taking a request, processing it and returning something useful, you really want to look at the rate at which you are doing the work and the speed at which you are doing the work, so throughput versus latency, and at the quality of what you are doing, for example error rates. Those tend to be more relevant to the immediate tactical decision of whether to move forward in the deployment than other things that are longer-term, more business-intelligence types of metrics.
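One way to picture that preference is a scoring sketch that weights system metrics (throughput, latency, errors) and ignores the business metric entirely. The weights, metric names and threshold below are invented for illustration; this is not Netflix's actual scoring.

```python
# Relative change of the canary vs the baseline for each metric (0.0 = identical).
# Values and weights are made up for illustration only.
relative_change = {
    "throughput":    -0.01,  # 1% fewer requests handled
    "latency_p99":   +0.03,  # 3% slower
    "error_rate":    +0.02,  # 2% more errors
    "sales_dollars": -0.10,  # noisy; driven by which customers happened to hit the canary
}

# System metrics carry the weight; the business metric is deliberately ignored.
weights = {"throughput": 1.0, "latency_p99": 1.0, "error_rate": 2.0, "sales_dollars": 0.0}

score = sum(weights[m] * abs(change) for m, change in relative_change.items())
print("canary deviation score:", round(score, 3),
      "-> promote" if score < 0.10 else "-> investigate")
```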
Well, all the time. I mean, look, no developer deploys something into production expecting it to break, generally speaking, and you have to remember that canary analysis is not a test pattern; it's not a replacement for any sort of testing. Generally speaking, by the time we deploy into production it has already passed all the testing: all the unit testing, the integration testing, the user acceptance testing, A/B testing and so on. So generally speaking, every time we deploy with canary analysis we expect the canary to pass and to be deployed. I've got stats in my talk about the frequency with which, frankly, our canaries don't pass, and every one of those you could consider to be a surprise, and frankly every one of those is also a success. Canary analysis proves itself every time a canary fails.
Yes, so we actually recently had a minor outage where a system passed through canary analysis, one of our most rigorous canary analysis processes, went into production 100%, and then some time later we found there was a metric that we should have been looking at but didn't, and so we ended up having to do a complete rollback. That can happen; canary analysis is not a panacea for all of our woes. It's meant to be a useful thing that gives you increased confidence in your deployment and in the quality of the code you're getting into production, but it's also an ongoing improvement process and an ongoing fine-tuning process. At Netflix we don't have anybody managing your canary configuration for you: Netflix developers are responsible for writing code, testing it, deploying it into production, responding to alerts, but also for configuring their own canary analysis parameters. So, for example, that team had to go back to their canary configuration and say: "You know, this metric is actually a little more important to us in terms of defining the success of the canary than we thought it was, so from now on we'll include it in our calculations."
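The kind of per-team configuration being described might look something like the sketch below: the team owns the metric list and thresholds, and after an incident they add the metric they had not been watching. All field names, metrics and values here are hypothetical, not Netflix's actual configuration format.

```python
# Hypothetical per-team canary configuration: the owning team decides which
# metrics count toward canary success and how much deviation is tolerated.
canary_config = {
    "application": "example-service",        # hypothetical service name
    "canary_duration_minutes": 60,
    "metrics": [
        {"name": "error_rate",       "max_relative_increase": 0.05},
        {"name": "latency_p99",      "max_relative_increase": 0.10},
        {"name": "requests_per_sec", "max_relative_decrease": 0.05},
        # Added after the rollback: this one turned out to matter more than expected.
        {"name": "cache_hit_ratio",  "max_relative_decrease": 0.02},
    ],
}

print("metrics watched by this canary config:",
      [m["name"] for m in canary_config["metrics"]])
```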
Barry: Right, thank you so much!
Yes, my pleasure!