Transcript
Cummins: I'm Holly Cummins. I'm a developer. I work in the IBM Garage. Today, I want to talk to you about the planet. The reason I want to talk about the planet is we have a problem. The earth is getting warmer. When we look at the projections for how much the earth is going to warm, often we see numbers like 1 or 2 degrees. That seems maybe a little bit concerning, but not too bad. We wouldn't notice that on a summer's day. We think, I'll eat more ice cream. I'll have longer summers. A bit of a shame, but it's fine, really.
It's really not. It's not just warmer. It's uncomfortably warmer. We're going to see much more drought and all of the food poverty that goes with drought, the suffering, and the human social disruption that goes with that. We're going to see more floods and the devastation that goes with that. We're not just going to see floods, we're going to see submersion. Islands are going to disappear back into the ocean. We're going to see more hurricanes and the devastation there, and fires, and the destruction there. It's not so good. I think with climate change, it can seem a bit abstract.
The Climate Service's Flood Risk Graph for Tokyo
Earlier this year, my team in the IBM Garage did a project for a startup called The Climate Service. What they do is quantify climate risk so that institutions which hold assets can look at their exposure. We replatformed them onto Kubernetes, and so we had to port all of the logic across. Of course, we were validating as we went. We were looking at the flood risk graph for Tokyo. It looked a bit ridiculous, because it just went up to 100 and then stayed at the top of the graph. We said, "We've made an error. There's some problem with our logic." The CEO, who knew the data best, wandered by, looked at our graph, and said, "No, actually, that is the graph." What it was showing is that by 2030, Tokyo is expected to have a once-in-a-century flood every year. That's not so good.
Technical Debt
How we got into this situation is a bit like technical debt. When we have a code base, we often accrue technical debt. We make decisions that seem like a good idea at the time, and sometimes they are. Eventually, though, we do have to pay it back. There are consequences. What we've got now is a case of environmental debt. We've been making these decisions because they're convenient, but really, we're just borrowing against the future. We're going to have to pay it back, and pay it back with interest. We're going to have to start drawing down some of that carbon, or we're going to be in a lot of trouble.
What It Has To Do With Me
So far, so existentially bleak. What does that have to do with me? I'm just a techie. What does that have to do with you? It turns out, it has quite a lot to do with us, because our industry, the tech industry, is a big contributor to climate change. When we think about industries and climate change, of course, we think about aviation. Aviation is the poster child for environmental irresponsibility. It contributes about two-and-a-half percent of worldwide emissions. Tech, of course, has a whole bunch of associated stuff outside the data center; there are devices everywhere. If we look just at the data center, we're looking at about 1% or 2% of worldwide emissions, depending on whose numbers you read, maybe even 3%. That's the same order of magnitude as aviation. We need to be looking at ourselves and our industry with the same critical eye that we turn on aviation.
Waste
Of course, none of us try to do this. This didn't happen because of malice. None of us are going, "I can use all my company's resources and it will destroy the environment as a bonus." These things just happen and creep up on us. One of the big causes is waste. Waste being the cause is actually good news, because waste, while not always easy to fix, is something we can fix while keeping all the good things that IT brings. It's quite a satisfying solution. One of the big sources of waste has to do with what we do with our workloads. If we have a workload, maybe a medium-sized workload, we need somewhere to run it. We put it on a physical machine or a virtual machine, or we put it in a Kubernetes cluster. Usually, the place we run it has a lot more capacity than the workload requires. Our little workload just ends up rattling around inside this machine, and all that computational capacity goes unused.
Utilization and Elasticity
This matters, because there are two key concepts that we need to think about when we're thinking about efficiency. The first is utilization. Utilization is how much of the computational capacity is being used by a workload. The higher the utilization, the better. The next is elasticity. Elasticity is how easily we can scale a workload up or down. Elasticity is what allows us to get high utilization, because if we can scale a workload up and down, we can always have just the right amount of headroom. The cloud is fantastic for elasticity; that was one of the really great things about the cloud, how much elasticity it enabled. But you don't have to have the cloud to get high utilization. When I joined IBM, about 20 years ago, there were all these rumors swirling around that we were going to get rid of our mainframe business, because, while it had been doing well for us for a long time, mainframes are not exactly the cool kid on the block. Even 20 years ago, they were not the cool new technology.
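To make the link between the two concepts concrete, here is a rough back-of-the-envelope sketch; all the numbers are invented for illustration. Elasticity is what lets provisioned capacity track demand, and tracking demand is what pushes utilization up:

```python
# Rough illustration of utilization; all numbers here are made up.

def utilization(used_cores: float, provisioned_cores: float) -> float:
    """Fraction of provisioned capacity a workload actually uses."""
    return used_cores / provisioned_cores

# A fixed-size box: a workload averaging 8 cores on a 64-core machine.
print(f"static box: {utilization(8, 64):.0%}")        # ~12%

# An elastic platform that tracks demand with ~25% headroom.
print(f"elastic:    {utilization(8, 8 * 1.25):.0%}")  # 80%
```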
Then, at the same time, our industry started to have a conversation about climate impact. We realized that mainframes are a phenomenally efficient way to run a workload. The reason they're so efficient is their utilization. A mainframe will run at about 95% utilization just because of how it schedules its duty cycles. An x86 box just cannot match that utilization, no matter how you stack its workloads. What it means is that if you run the same workload on a zSeries and on an x86, the mainframe will give you about 30% more performance for half the power, which works out to roughly 2.6 times the performance per watt. That's pretty cool. Obviously, we're not all going to switch all our workloads to mainframes, but it's an interesting model to have in the back of our minds.
Because one of the things that we see with some newer technologies is that the elasticity and the utilization could be better. If we think about our application running in its box, most modern technologies give you really good elasticity on the application, so you can scale it up. Every technology is also going to have some overhead. If it's Kubernetes, it's the control plane. If it's a physical box, it's the operating system, maybe a virtualization layer. There's going to be some overhead there. Then, when we use our elasticity to go up and down, there's a limit. With Kubernetes, for example, we can change the replica count manually super easily, or we can use horizontal auto-scaling. We have all these tools to allow the workload to scale up, and to scale down as well.
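As a sketch of what that application-level elasticity looks like, here are both options using the official Kubernetes Python client; the Deployment name, namespace, and thresholds are illustrative, not from the talk:

```python
# Sketch: manual scaling and horizontal autoscaling with the official
# Kubernetes Python client. Names and numbers are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

# Manual scaling: patch the replica count directly.
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="my-workload",
    namespace="default",
    body={"spec": {"replicas": 5}},
)

# Horizontal autoscaling: let Kubernetes vary replicas with CPU load.
autoscaling = client.AutoscalingV1Api()
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="my-workload-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="my-workload"
        ),
        min_replicas=1,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```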
Clusters
On the cluster side, it's a bit less good. Clusters are less elastic than applications. You can scale them up or down, but it's hard work. The other thing is that even if you do scale your cluster right down, your application will take more of the cluster, but the control plane will take proportionally even more, because that overhead doesn't shrink with the workload. Ideally, instead of shrinking our clusters, what we should be doing is getting more heterogeneous workloads onto our clusters. There is a barrier to this, and I think it's Conway's Law. Conway's Law says that your architecture ends up replicating your organizational structure. I think your cluster topology also ends up replicating your org chart, because you don't want workloads from other orgs on your clusters; there's a whole bunch of management difficulty that goes with that. You need to make sure that the costs flow back to the right place, and that's hard.
Then there are other concerns as well. In Kubernetes we have namespaces, and namespace isolation gives us quite a lot. It doesn't give us enough to overcome that Conway barrier, because there's still the problem of noisy neighbors. What happens if someone runs a workload and takes all my resources? There are scope collisions and subtle interaction problems like that. Of course, there's security. Security barriers are slightly porous within any environment like this, and sometimes slightly porous isn't good enough. If I've got my prod workload, I don't want it running in the same cluster as that weird experiment. They've got to be isolated. Having said that, we can do better than having everything in its own cluster. For sure, keep prod on its own. Then maybe think about whether we can combine dev, staging, and that weird experiment, and get our utilization up in some of our other clusters.
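One standard mitigation for the noisy-neighbor problem on a shared cluster is a per-namespace resource quota. A minimal sketch with the Kubernetes Python client; the namespace and the limits are hypothetical:

```python
# Sketch: cap what the "staging" namespace can consume, so one tenant's
# workload cannot starve the others on a shared cluster.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="staging-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.cpu": "8", "requests.memory": "16Gi", "pods": "40"}
    ),
)
core.create_namespaced_resource_quota(namespace="staging", body=quota)
```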
Zombie Workloads
If we do that, and we have lots of applications running in the same cluster, the application-to-control-plane ratio is really good. Are we winning? It depends what those applications are. If those applications are Bitcoin mining, then almost certainly we are still losing in environmental terms. Even if they're not Bitcoin, there are a lot of things that can go wrong, because our industry has a horrible problem with zombie workloads. Zombie workloads maybe once had a purpose; they maybe once were alive in a meaningful way. Now they're just lurching around consuming resources, not actually doing anything useful.
We've all been there. When I was learning Kubernetes, I created a cluster, and then I had too much work in progress. I got pulled away to other things and forgot about it for about two months. When I came back to it, I realized I'd spec'd the cluster fairly generously, and it was £1,000 a month, just sat there, using energy, contributing nothing. As an industry, that's all of us. One survey looked at 16,000 machines and found that a quarter of them were doing no useful work. That's 4,000 of those 16,000 machines doing no useful work. The authors of the study said, perhaps someone forgot to turn them off, which is just so sad, and so true. The thing is, even if you don't care about the money, even if your organization has more money than you know what to do with, it's not just money, it's energy. Eighty percent of data center energy comes from fossil fuels. We're just kicking out carbon when we run these wasteful workloads. What it means is that, as if we didn't have enough to worry about in 2020, we really need to worry about zombies destroying the planet.
Solutions
Is there anything we can do? Yes, actually, there is. I think we've all seen emails like this. I got one last week that said, in effect, "We don't know whose servers are whose, we don't know what they're for, and we feel like we need to turn some of them off. We don't know how to even figure out which ones. Could you please go and find your own stuff and turn it off?" It's not a super effective way of managing zombies. When the emails fail, usually what happens is it escalates to a meeting. I spent three hours in a meeting with the CIO of a bank, just going through the estate, trying to figure out what these workloads were and whether they still had value. I have to say, it wasn't the most exciting meeting I've ever been in. Actually, a three-hour meeting is the best case. Sometimes organizations work on this problem for years, just trying to hunt down the zombies and destroy them.
We can be a little bit more effective with tags. If, when people create a server, they give it tags, those tags help identify what it's for and when it might be suitable for deletion. The problem with tags is that people have to remember to do it, and they have to choose tags that will mean something to someone else. Then someone has to go back through, look at everything, and manually delete the things whose tags suggest they're eligible for deletion. It's still a fairly manual and fairly error-prone solution.
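As a sketch of what even partial tag hygiene buys you, here is a scan for untagged machines using AWS and boto3; the "owner" tag is an assumed convention, not a standard:

```python
# Sketch: find EC2 instances with no "owner" tag, the usual first
# suspects for zombie workloads. The tag name is an assumed convention.
import boto3

ec2 = boto3.client("ec2")
suspects = []

for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        if "owner" not in tags:
            suspects.append(instance["InstanceId"])

print("untagged instances to investigate:", suspects)
```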
Then, when the tags fail, what many institutions do is bring in the governance. This breaks my heart, because the beauty of the cloud is that it's so frictionless; it's so easy to get stuff done. When we start seeing things like having to fill in a whole bunch of forms before you can provision a server, we're almost back in the bad old days of it taking three months to provision one. The thing with governance, as well, is that I'm not sure it's the most effective way to solve the problem. I think what works best is making the right thing the easiest thing to do; then the right thing happens and people are happy. We should always be striving for that.
FinOps
One of the new things we're seeing in this area, and which I think is going in that direction, is FinOps. I like to think of FinOps as the discipline of figuring out who forgot to turn their servers off in your organization. Officially, it's more about making sure that there's accountability for costs, and that costs flow to the right part of an organization. Once you have costs flowing to the right part of an organization, I think the climate benefits follow quite naturally, because when we get the costs down, we get the energy usage down as well. I'll be interested to see what happens with this one.
Is The Cloud Zombie-Proof?
In general, I love the cloud. I think it brings so many benefits. But I have a sinking feeling that not only is the cloud not zombie-proof, it might actually be more vulnerable to zombies than what we had before. Because the thing about the cloud is that it makes it delightfully easy to provision hardware, but it doesn't, out of the box, give you any support for remembering to de-provision that hardware, for remembering to turn it off. We all suffer from the IKEA effect, the cognitive bias that says if we made something, if we stood a server up, that was work, and we quite like it because we made it. We don't want to shut it down, because what if we need it later? It would be so sad to have to redo that work. It's quite a nice cluster, because we made it.
GitOps and a Lease System
GitOps is going to help here. By GitOps, I really mean infrastructure as code. What infrastructure as code gives us is disposable infrastructure. We can have our server, spin it down, and get rid of it, confident that we can get it back if we need it. Then, when we're done with it again, we can spin it down again. It means infrastructure can spend a lot more of its lifecycle switched off. This gives us wonderful disaster recovery, of course, but it also allows us to be much more efficient. I started joking that spinning down clusters was going to be the new lights-off: on Friday evening, and maybe even every evening, we'd leave the office, turn out the lights, and then turn off the cluster. Then people started coming back to me and saying, "No, we already do that. It works great." I saw one statistic where an organization shut down its instances out of hours and reduced its cloud costs by 37%. That's a huge amount, and of course it's not just the money; it's the carbon and everything else. It's not that hard, either. Once you've got the automation right, you've got your few little scripts, and then you just start seeing the benefit.
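A minimal sketch of one of those "few little scripts," assuming an AWS estate and a tag that marks instances as safe to stop out of hours; the tag and the schedule are assumptions, not from the talk:

```python
# Sketch: stop every instance tagged "shutdown: out-of-hours" in the
# evening. Run from a scheduler (cron, EventBridge, etc.); a mirror
# script starts them again in the morning. The tag is an assumed
# convention.
import boto3

ec2 = boto3.client("ec2")

response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:shutdown", "Values": ["out-of-hours"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

instance_ids = [
    i["InstanceId"]
    for r in response["Reservations"]
    for i in r["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print("stopped:", instance_ids)
```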
We can do simpler things as well. A colleague told me a story, again about a bank, where they were able to halve their CPU usage by implementing a lease system. What this meant was that if someone provisioned something, it would expire in two weeks. They could renew it, of course. Just changing the default to "we'll clean up after you" got rid of all of this waste.
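A lease system can be as simple as stamping every resource with an expiry date at creation time and sweeping past-due ones. A sketch, again on AWS with an assumed tag name:

```python
# Sketch: a lease sweep. Anything whose "lease-expiry" tag is in the
# past gets stopped (not deleted, so owners can still renew). The tag
# name and date format are assumed conventions.
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2")
now = datetime.now(timezone.utc)
expired = []

for r in ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]:
    for inst in r["Instances"]:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        expiry = tags.get("lease-expiry")  # e.g. "2020-11-30"
        if expiry:
            deadline = datetime.fromisoformat(expiry).replace(
                tzinfo=timezone.utc
            )
            if deadline < now:
                expired.append(inst["InstanceId"])

if expired:
    ec2.stop_instances(InstanceIds=expired)
```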
Multi-Cloud Management and Traffic Monitoring
We're seeing new things as well. Multi-cloud management is an emerging area. It helps with moving workloads to the most economically sensible place, and it's also starting to have support for things like tracking your carbon, so you can see how much your estate is costing in the environmental sense. We're also starting to see things like traffic monitoring. If no traffic is going into a server, and no traffic is coming out of it, it's a pretty good bet that server has no value, so we can kill it.
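A crude version of that traffic heuristic, using CloudWatch's per-instance network metrics; the threshold and the one-week lookback window are arbitrary starting points to tune, not recommendations from the talk:

```python
# Sketch: flag instances with near-zero network traffic over the last
# week as zombie candidates. Threshold and window are arbitrary.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

def weekly_bytes(instance_id: str, metric: str) -> float:
    """Total bytes for one metric ("NetworkIn" or "NetworkOut")."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in stats["Datapoints"])

def looks_like_zombie(instance_id: str, threshold: float = 1e6) -> bool:
    total = (weekly_bytes(instance_id, "NetworkIn")
             + weekly_bytes(instance_id, "NetworkOut"))
    return total < threshold
```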
Micro-optimization
There are a few things to be aware of, though, because so far this is a very happy story: we can get rid of all this waste, a double win, a no-regret solution. We do need to be aware of micro-optimization theater. Micro-optimizations are optimizations that make you feel like you're really making a difference, but aren't actually making a difference. Back when we all traveled, I used to make a really big point of never taking a taxi to and from the airport. I would always take public transit. That was a big sacrifice. It took more time. It took more work. Once, I got followed, which was incredibly scary. I really felt like I'd risked my life to not take a taxi, and so I was a hero. Of course, if you look at the big picture, the carbon saving from skipping that taxi was negligible compared to the carbon of the flight. I felt like I had these hero points, but actually, I was fixing the wrong problem. That's what we see with micro-optimization: a lot of noise, but not that big a difference.
You might say, "Every little helps. Surely, it's better to take public transit." That is true, but you need to think about the opportunity cost. Because if you're spending time working away on something that has a fairly small benefit, that means you can't spend the time working on the thing that has the bigger benefit. You do need to be driven by measurement, really. Measure, don't guess, so you can focus on the optimizations that matter.
Jevons' Paradox
Another thing to be aware of when we think about efficiency is Jevons' paradox, which I like to think of as the highway problem. When municipal planners widen roads, they imagine that they're going to have these glorious, wide highways with just a few cars, and everybody's going to get to their destination really fast. Of course, we all know what happens. That's the case for a day, and then the new road fills up. We see the same with data centers. Data centers have got far more efficient over the last 10 or 15 years, and yet the energy usage of data centers is still going up, because all of those efficiency gains have been offset by increased usage. We need to not just focus on efficiency and doing more with the same resources; we need to make sure that we're actually doing less.
Unsolved Problem == Opportunity
Overall, these are a lot of problems. Some of them are technology problems, and I think we have an opportunity as techies to innovate, to bring our invention to these problems. There are so many things we can do that will make the world a better place. If you're designing systems, you really need to put in place the tools that support performance, that make them fast and lean; that pleases users as well. You need to make sure you have support for high utilization. You need to support multi-tenancy. You need to support elasticity. You need to provide the tools for de-zombification, giving users visibility into their workloads. Then, if you're a user of these systems, take advantage of those capabilities: get your utilization up, limit the sprawl, and go and hunt down those zombies. De-zombification makes a big difference.