Cloud Provider Sustainability, Current Status and Future Directions


Summary

Adrian Cockcroft explains what is available now in terms of green energy, and public roadmap statements and commitments that have been made by AWS, Azure and GCP.

Bio

Adrian Cockcroft has had a long career working at the leading edge of technology. He’s always been fascinated by what comes next, and he writes and speaks extensively on a range of subjects. Currently, he works as a Tech Advisor for Nubank, and is a Partner & Analyst at OrionX.net.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cockcroft: I'm Adrian Cockcroft. I'm going to talk to you about cloud provider sustainability, the current status and future directions. I also call this DevSusOps, so adding sustainability concerns to development and operations. Although, I'm focusing very much on what the developer concerns are in this space. You can find me at OrionX.net, which is a consulting firm that I work for. Also, on mastodon.social and DevSusOps@hachyderm.io. I just want to talk about, why does sustainability matter? Why is this hard? Getting the terminology and mental models right is actually much more difficult than you might expect. I'm going to talk about cloud provider similarities and differences. What are the measurements? Then, the interesting part at the end, things developers need to build and what we need to get so that we can build them.

Why Does Sustainability Matter?

Let's start with, why does sustainability matter? Fundamentally, we want to leave the world habitable for future generations. There's also regulatory compliance coming down through board level from governmental regulations. There are market transition risks, as the markets move from gas-powered cars to electric cars. There are physical risks to business assets, like flooding and high temperatures. There's green market positioning for your products. There's employee enthusiasm: people may not want to work for you if you don't have a green perception in the market. There's potentially reduced cost, now or in the future; it turns out renewable energy is now the cheapest form of energy, and that's driving quite a lot of the work here. Then, for some companies, it's your social license to operate. There are things we can do now. In development, you can optimize your code, just make things more efficient so you're using fewer resources to get your job done. In operations, you want to deploy that code with high utilization and more automation, and just be careful about cleaning everything up as well.

Measuring Carbon

I'm going to focus mostly on measuring carbon, and why that is hard. Fundamentally, we're just multiplying two numbers together: we need to figure out how much energy some code uses, and we have to figure out the carbon content of the energy powering the machine it's running on. That's pretty much it. It turns out there's a lot more to this. What I just described is called scope 2, the energy use. What also matters is scope 3: the energy used, and the carbon emitted, by everything it took to make that computer, deliver it, and put it in a building, plus the cost of creating the building, and all of that stuff. That's the supply chain. Then, if you recycle it at the end, there's the energy used to recycle things. There's a Microsoft document that gives a good summary of what's going on, https://learn.microsoft.com/en-us/industry/sustainability/api-calculation-method. You can see that from Microsoft's point of view, scope 3 is their upstream supply chain, the operations of their services is scope 1 and 2, and scope 3 also includes the downstream disposition or recycling.

Then, if you're a cloud customer, whatever the cloud provider reports to you (their scope 1, 2, and 3) rolls up and is reported by you as your scope 3. As the carbon content of energy tends to zero with more renewables, scope 3 starts to dominate, because the shipping and manufacturing footprint is not coming down as fast as the carbon content of energy. It's taking much longer to decarbonize silicon production and shipping, for example. It will get there in the end, but probably more like 2040 to 2050, rather than the 2025 to 2030 we're looking at for the carbon content of energy. However, in Europe and the U.S., current cloud provider energy is almost completely clean; they run a very high percentage of renewable energy. It's really only in Asia, which is where the silicon is manufactured, that the electricity is particularly high carbon. If you're trying to build APIs with your developers for supply chain automation, pulling in all the data from your suppliers, the Pathfinder Framework is a good place to look.

Cloud Provider Scope 3 Differences, and Methodology

There are some differences across the cloud providers for scope 3. AWS doesn't provide any scope 3 information yet. I wrote a blog post about it; there was a Computer Weekly story, and a LinkedIn page with a response to that story. Fundamentally, they are working on it, but they're taking their time. If you really need scope 3 from AWS, and you're a big enough customer, you can escalate, and they are providing estimates to people under NDA. That's true of pretty much anything you want custom from AWS: if you're big enough and you ask the right people, they will usually give you information under NDA that isn't publicly available. What they're working on is the APIs and all the infrastructure needed to provide more information to everybody, down to the individual customer. Microsoft, however, has had this for a few years; they have a detailed paper and API. It's a good paper. They include recycling, but they don't include transport and buildings yet, so their numbers are going to go up a bit when those are added. Google also has detailed scope 3, which I think includes transport and buildings, but doesn't seem to include end-of-life recycling; they claim that's estimated to be immaterial. There are going to be differences from provider to provider, but hopefully not too much. Most of the time, what we really care about is the silicon and the steel in the machine itself, and then probably a bit of the building it's in.

Then we look at the methodology. This is another diagram from Microsoft. The question is, how do you allocate the building and all of these other scope 3 things to the workload? What they do, as far as I can tell, is they take the proportion of the energy used by the different workloads or customers in the building, and then they apportion all the scope 3 proportionately to the energy use. Ideally, you'd want to say this machine uses this much rack, and it's using this much space in the building. That's a little too complicated. They just do a proportional allocation. You tend to find these kinds of simplifying assumptions quite a lot when you're looking at carbon calculations. This isn't bad, it's just you have to dig in and find out, because it may be that another cloud provider is using a different methodology to allocate scope 3 to workloads. That's one of the problems I'm going to highlight. It's just there are differences here.
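As a sketch of that energy-proportional allocation, with invented facility and workload numbers (this is not any provider's real methodology, just the shape of the calculation described above):

```python
# Proportional allocation of a facility's scope 3 (embodied) carbon to
# workloads, in proportion to each workload's share of total energy use.
# The numbers here are invented for illustration.

def allocate_scope3(facility_scope3_kg: float,
                    workload_energy_kwh: dict[str, float]) -> dict[str, float]:
    """Split facility-level scope 3 across workloads by energy share."""
    total_energy = sum(workload_energy_kwh.values())
    return {name: facility_scope3_kg * kwh / total_energy
            for name, kwh in workload_energy_kwh.items()}

shares = allocate_scope3(1000.0, {"batch": 300.0, "web": 600.0, "idle": 100.0})
print(shares)  # each workload gets scope 3 proportional to its energy use
```

A per-rack or per-floor-space allocation would be more precise, but as noted above, the providers appear to settle for this simpler energy-proportional split.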

There's some interesting information here from Cyril, one of the AWS sustainability leads. This is across Europe, for the same data center design placed in different countries; this isn't cloud, this is just a data center. If you look at the columns for scope 1, 2, and 3: scope 1 he's estimating is the same everywhere, which is mostly the fuel used for backup generators. Scope 3 he's estimating is the same as well. Scope 2 is all over the place. Poland obviously has a lot of fossil fuel in its energy mix, and it really dominates. If you're in Sweden or Norway, then your footprint is dominated by scope 3. This is just looking at it on the basis of grid mix; what the cloud providers do is buy additional renewable energy to offset that. We're developers, and scope 3 is about hardware; that's an Ops problem. Let's concentrate on the energy and its carbon content, because that's something we can control by using our machines more efficiently. It's a number that we might want to influence.

Energy

Let's look at energy. How much energy does a specific line of code use? There are a few things out there. There's JoularJX, which uses JVM instrumentation to come up with estimates for energy monitoring, almost by the line of code. Then, how much energy does a transaction use? There was a talk about the Green Software Foundation and SCI: if you can figure out the energy a workload is using and the transactions it's doing, you can start normalizing energy per customer, per transaction, those kinds of things. That's good, but you still need to know the underlying energy in order to calculate SCI. What I'm focusing on is the underlying information you would need from a cloud provider to do SCI at all. How much energy does a workload use? That would actually be interesting. Maybe you need to get down to an individual container: what does that use? How much energy does a specific cloud instance use, the one I'm running on right now? Unfortunately, no cloud provider tells you any of those things in enough detail to answer that question. It's a bit of a problem.
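For reference, the SCI score mentioned above combines operational and embodied emissions per functional unit. As I understand the Green Software Foundation's published formula, it is ((E * I) + M) per R; the arithmetic is trivial, and the hard part is obtaining E, which is exactly the gap discussed here:

```python
def sci(energy_kwh: float, intensity_g_per_kwh: float,
        embodied_g: float, functional_units: float) -> float:
    """Software Carbon Intensity: ((E * I) + M) per R.
    E = energy consumed, I = grid carbon intensity,
    M = embodied (scope 3) carbon share, R = functional units."""
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units

# 2 kWh at 100 gCO2e/kWh, plus 50 g embodied, over 10,000 transactions
print(sci(2.0, 100.0, 50.0, 10_000.0), "gCO2e per transaction")
```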

How could you measure it directly? If you have physical access to a machine, you can put a monitor on it. This is a little box that you can get for 20 bucks; you run your laptop or desktop machine through it and try to measure energy usage as you run different workloads, to see if you can figure out how much energy they're using. You'll find it's not very repeatable or reproducible. If you're trying to collect this data for real and use it, then I think you should understand how to do statistical analysis to make sure you're not reporting noise. That's what Gage R&R does: it extracts how much repeatability and reproducibility you really have in your data. Energy is a pain, so in practice we're just going to have an average hourly rate, which we'll get from somewhere.
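This isn't a full Gage R&R study, but a first-pass sanity check on repeatability can be as simple as looking at the spread across repeated runs of the same workload (the measurements below are invented):

```python
from statistics import mean, stdev

def repeatability_cv(measurements: list[float]) -> float:
    """Coefficient of variation across repeated runs of the same workload.
    A large value means you are mostly measuring noise, not the code."""
    return stdev(measurements) / mean(measurements)

runs_joules = [41.2, 39.8, 44.5, 38.9, 42.1]  # five runs of the same workload
print(f"CV = {repeatability_cv(runs_joules):.1%}")
```

A proper Gage R&R would further decompose that variance into repeatability (same setup, repeated runs) and reproducibility (different machines or operators), but even this crude check will flag numbers too noisy to report.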

Carbon Content

Let's try to figure out where the carbon content comes from. That depends. What's the grid mix for energy generation in that location? Where did you get the measures of the mix, and when did you get them? It turns out the grid mix isn't accurately available until a month or two later, when you get your bill from the energy supplier. Can you get hourly data? GCP has an hourly 24 by 7 measure that they use for generating their estimates, so they're calculating based on an hourly rate, but as far as I can tell they don't tell you what the actual hourly rate is. Unfortunately, these numbers also keep changing for up to a year after you use them, for reasons that I'll get into later. The grid mix isn't just a thing you can depend on. Then, what really matters for the cloud is how much private provider energy was used, and they don't actually tell you that. You also want to know how much bundled REC was used, and how much unbundled REC was used. These are all things that matter. The private energy is from the entity they contract with directly; the RECs are things that are traded.

Grid mix you get from the bill, a month or so in arrears, along with power purchase agreements. These are the Google wind farm, the Amazon solar farm, the battery farms that go with them, those kinds of things that are built under contract by the cloud providers. There's a lot of that happening. Renewable energy credits are more tradable things. We also care about the power usage efficiency. It depends where it's measured, but that's the ratio of energy coming into the building to what actually gets all the way to the computer and runs that CPU. The overhead is typically somewhere between 10% and 100%; in some parts of the world, maybe twice as much energy is coming into the building as is actually being used by your computers. For the cloud providers, it's typically in the 10% to 80% overhead range. To get the carbon, then, we need to know the power mix, the power usage efficiency, how much capacity of something we're using, and the emissions factor per unit of that capacity.
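Written out, the multiplication itself is straightforward; everything hard is in obtaining the inputs. A sketch, with invented numbers:

```python
def workload_carbon_g(it_energy_kwh: float, pue: float,
                      emissions_g_per_kwh: float) -> float:
    """Carbon = IT energy drawn by your capacity, scaled up by the PUE
    overhead, times the emissions factor of the power mix."""
    return it_energy_kwh * pue * emissions_g_per_kwh

# 10 kWh at the servers, PUE 1.2, power mix at 350 gCO2e/kWh
print(workload_carbon_g(10.0, 1.2, 350.0), "gCO2e")
```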

There are two completely different ways of reporting this: location based and market based. Location based uses the utility grid mix, 24 by 7; this is what Google is using. They're saying, every hour, how much energy is coming into the building, and what the grid mix was at that time. It doesn't take into account the fact that Google has a bunch of wind farms and solar farms out there. Market based accounts for those dedicated power purchases and the renewable energy credits bought in the same market. Market based means the electricity is actually connected: you can't count a power station in another country that's not connected to you; you have to use generation connected to the same grid, so the electricity could actually flow. Not all countries are fully connected. In the U.S., there are multiple grids with interconnects, and the interconnects don't have enough capacity to power everything across them.
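A simplified sketch of the two accounting styles. Real market-based accounting uses published residual-mix emission factors rather than subtracting kilowatt-hours like this, so treat it as illustrative only:

```python
def location_based_g(energy_kwh: float, grid_g_per_kwh: float) -> float:
    """Location based: all energy counted at the local grid mix intensity."""
    return energy_kwh * grid_g_per_kwh

def market_based_g(energy_kwh: float, grid_g_per_kwh: float,
                   contracted_renewable_kwh: float) -> float:
    """Market based (simplified): energy covered by PPAs and local-market
    RECs counts as zero carbon; only the residual is at grid intensity."""
    residual_kwh = max(energy_kwh - contracted_renewable_kwh, 0.0)
    return residual_kwh * grid_g_per_kwh

print(location_based_g(100.0, 400.0))      # counts everything at grid mix
print(market_based_g(100.0, 400.0, 90.0))  # 90 kWh covered by PPAs/RECs
```

The gap between the two numbers for the same workload is exactly why the talk asks providers to report both.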

Why It Matters

Why does it matter? AWS and Azure use regional market-based data. Google originally made the claim that since 2017 they've been 100% renewable, but that's a global market claim: effectively, generation in Europe and the U.S. was being counted against Singapore, which is a bit of a stretch. It was a good idea in 2017; it's not such a great idea now. The current data that Google publishes is location based. You can't really compare location-based and regional market-based data, but it's probably better for tuning work to use the location-based data if you can get it. Really, what we want is every cloud provider to give us both. Back to that formula: the power mix. What are the problems? The utility bill that tells you your power mix arrives a month or more in arrears, depending on where in the world you are; it takes longer the further out into the developing world you go. Power purchase agreements are contracts to build and consume power generation. Amazon has over 20 gigawatts of power purchase agreements, I think solar, wind, and increasingly battery storage, so that they can store solar power during the day and run on it overnight when the wind isn't blowing, things like that.

The cloud providers are now amongst the largest energy producers in the world. They have energy on a similar scale to a lot of utility providers that you would normally buy energy from. In Europe, renewable energy credits are called guarantees of origin, but they're slightly different. Renewable energy credits are purchases of renewable capacity on the open market from existing generators. You can only claim that energy once, but you can claim it later on. You don't have to claim it exactly at the time it was generated, which causes other problems I'll get to later. RECs can top up on top of PPAs. What you typically do as a cloud provider, you buy lots of PPAs, and then a little bit extra here and there, so you buy some RECs. Maybe you've contracted for a wind farm, but it hasn't finished being built yet, so you might buy some RECs to cover for that. You've got the commitment. You know you're going to get it, but you're waiting for them to finish building it, those kinds of things. Or just to get your percentage to where you want it to be. What's good is that you buy as many PPAs as possible, and then you buy some RECs to top that up.

Guarantees of origin and renewable energy credits are fairly similar, a little bit different in how they're regulated and traded. A local market REC in the same grid is good. When you go cross-border, or non-local market, that's energy that can't physically flow to you. You're assuming that the carbon you're saving in one part of the world is offsetting carbon being emitted somewhere else; you're connecting the two places through the carbon cycle, rather than through the electricity grid. This is also often used as a cheap carbon offset, and for greenwashing, and that has given RECs a bad name. It's really the cross-border and non-local market RECs, bought too cheaply, that have the bad name; local market RECs actually make much more sense and are reasonably valuable. There are some links down below if you want to dig into this a bit more.

You get grid mix. It changes every month. There's hourly data for 24/7 starting to appear. Google and Microsoft are publicly working towards 24/7. AWS hasn't said anything about this. The cloudcarbonfootprint.org open source tool doesn't include these PPAs or RECs. It just works off the grid mix, which is ok in some places, but it's increasingly becoming inaccurate. Then there's an interesting startup called FlexiDAO working in the energy trading space.

Power Mix Problems and Misconceptions

Here are some problems and misconceptions, which I'll try to illustrate. Let's look at the bottom two layers: you've got fossil fuel energy and renewable energy. That is the power mix reported as the grid mix, and used by Cloud Carbon Footprint. On top of that, you've got renewable energy from power purchase agreements for dedicated capacity, and a few credits. The power mix used by a cloud provider in a more emerging market is some mixture of these things. This is the case where the energy from the PPAs is never more than the cloud provider needs: you've got a wind farm, and you're going to use every bit that wind farm can possibly make, all the time, with none left over. That's what I mean by an emerging market. One of the problems here is the RECs on top, the blue area; this is where it gets complicated. Every REC that's claimed makes the grid mix worse for everyone else, because RECs take away from the grid mix. The other problem is that RECs are traded for up to a year: energy generated in one month can be claimed as a REC for up to 12 months afterwards. Twelve months later, someone could say, no, that was not green energy going into the grid mix, that is green energy I'm going to sell to somebody who wanted to top up what they were doing a year ago. It may sound a bit bogus, but these are the rules; this is how it works. What that fundamentally means is the grid mix is not a stable metric for 12 months, and at various points in time it can decrease and look worse, just to get that out there. There's an explanation at this URL, https://www.ecocostsvalue.com/lca/gos-and-recs-in-lca/, which goes into this. That's emerging markets: what I mean by that is you're still at a deficit, still depending very heavily on the grid.

Then, if we get into a more mature market where the cloud provider is generating so much renewable energy that they have excess left over, what are they going to do with that excess energy that they can't actually use because they're generating more during the day than they can consume? They sell it as credits. Some of the renewable energy credits are now being put back into the market, and sold from cloud providers to anyone else that wants them. Then, any credits that aren't sold, that excess renewable energy goes into the grid mix and it makes the grid mix look better than it was before. Now it's not clear what's going on, these boundaries are getting very blurry. It gets pretty complicated to understand. This is why this is difficult.

Then there's what I'm calling the Norway problem; follow the top URL, and it's shown below. In Norway, electricity is 2% fossil and 98% green. Great, let's move our data center to Norway and get green energy. Unfortunately, Norway sells 86% of the RECs for its green energy to other countries in Europe. The effect is that if you actually set up in Norway and run your data center or your company, for that 86% that's sold to Europe, you get the European residual mix back in its place to compensate, and that ends up at about 50%. The net of this is that even though almost all of the energy in Norway is hydroelectric, if you operate in Norway, you'll have to report about 50% renewable and 50% non-renewable, unless you go out and buy those RECs from the Norwegian energy companies that basically say, yes, you really are buying renewable energy. This may sound stupid, but this is the rule, fundamentally because selling these across borders is allowed. It's considered a bad practice nowadays, but this is what happens. This is complicated stuff.
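The arithmetic behind that roughly-50% figure can be sketched as follows. The 42% renewable share assumed for the European residual mix is a number I've picked purely to show how a 98% green grid collapses to about half once 86% of its RECs are sold abroad:

```python
# Illustrative arithmetic for the "Norway problem". All percentages are
# rough, and residual_mix_renewable in particular is an assumption.
local_renewable = 0.98         # physical grid mix in Norway
recs_sold_abroad = 0.86        # share of the green energy whose RECs were sold
residual_mix_renewable = 0.42  # assumed renewable share of the EU residual mix

kept = local_renewable * (1 - recs_sold_abroad)  # green energy still claimable
swapped = local_renewable * recs_sold_abroad * residual_mix_renewable
claimable_renewable = kept + swapped
print(f"reportable renewable share: {claimable_renewable:.0%}")
```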

Let's look at power usage efficiency. It's not that well standardized, but you can compare across similar data center designs. AWS doesn't publish PUE; people tend to point at a blog by James Hamilton that explains why it's complicated. Azure and Google do publish PUE numbers: Azure's are about 1.2, Google's are about 1.1. I wouldn't compare them against each other too closely, because the methodologies are different, but at least you can see that they are a lot better than most public data centers, which are 1.5 to 2. Then we're trying to figure out the capacity that we're using. Dedicated capacity is relatively easy to account for, but shared instances and network equipment are pretty difficult. This is hard for the cloud providers to report to you. If you are a multi-tenant SaaS provider, you have the same problem: you've got to figure out the capacity you use in your business, and how to provide common information to the people that consume your product. Everyone's got this problem. This allocation problem is a big, complicated area, and there are a lot of interesting algorithms needed here. This is a developer problem; some of you may end up working on it. Then the other thing we need to know is how much power an instance type, storage class, or service uses. That depends on utilization and overheads, and this data isn't available.

Takeaways

What can you do today? Fundamentally, using less stuff makes the biggest difference. It reduces your energy use, but it also reduces your footprint in terms of scope 3. The best thing you can do is use fewer computers, and use them for less time; just be efficient about it. High utilization and more efficient systems are the best things you can do. If you're looking at cloud providers, they are all really pretty similar. Different people might argue that one is better than another, but in fact they are all pretty much in the same place; there isn't that much difference. They're all buying enough renewables to cover most of their energy usage in the U.S. and Europe. They all have the same problems in Asia. Scope 3 is dominated by the same chip suppliers for everybody: they're buying the same SSDs, the same Intel chips, so it doesn't make much difference to the scope 3 carbon footprint of a machine. There are probably some differences in the efficiency of shipping and buildings, but the big dominant inputs that go into making a machine are not that different.

Use any cloud provider. There will be detailed differences, but it's not going to be that different. Try to minimize use of Asia regions for the next few years. Most of the providers have a roughly two-to-three-year timeframe to get Asia to decarbonize and get more renewable energy there; in two or three years' time, we should be pretty renewable everywhere in the world. Fundamentally, they are all much better than a typical enterprise data center. There are papers from all the cloud providers claiming roughly 80% to 90% reduced carbon compared to a typical enterprise data center. This figure from Google illustrates that: in the U.S. and Europe there are some pretty good green all-day numbers, but Singapore is 4% renewable, and Taiwan is 17%. That's what I mean about Asia being a problem. There just isn't renewable energy there, and there isn't the ability to put renewable energy in Singapore. Everyone's got that problem; this isn't a Google problem, this is an everyone problem, and everyone's working on it in different ways.

Measuring - Compare APIs and Schemas Across AWS, Azure, and GCP

Let's look at the APIs and schemas; I said this was a developer talk, so let's see what data you can actually get across AWS, Azure, and GCP. This is the AWS data, released about a year ago. The Customer Carbon Footprint tool is part of the billing console in an AWS account, and you get a monthly summary. For location, you get it for three different continents; you don't get it down to the region. You get it per account: account by account, you can see how much was EC2, how much was S3, and everything else is lumped into "other". It's pretty low specificity. Resolution is 0.1 of a metric ton of CO2e. A lot of you, when you run these reports in Europe, will just get zero. That's actually expected, pretty much: you've got less than a tenth of a metric ton of CO2e per month, because you're just not big enough to move the number off zero. This is scope 1 and market-based scope 2. It doesn't include scope 3, and there's some criticism around that, but they're working on it. It'll get there eventually.

Azure actually has an OData API, and there's a tool. Let's look at it through the API. There's a bunch of queries there, but fundamentally what you get out of a query is, again, a monthly summary; when you get a date, the day is always going to be one, and it just tells you which month it is. This time, they have regions as well as countries, and services as well as accounts. For every individual service, you can ask what the carbon for that service is. The resolution is a kilogram instead of 100 kilograms, so from looking at summaries of the data, it's 100 times better resolution than AWS. They also include scope 3, and it's market-based scope 2. It's a reasonable API, and OData is a fun thing you can write interesting queries with.
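OData queries are just URL query strings, so as a toy example, here's how you might build a `$filter` for a per-region, per-month query. The endpoint and field names below are placeholders I've made up to show the style, not Azure's actual schema; check the real API reference before writing anything like this:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and field names, only to illustrate OData style.
BASE_URL = "https://example.invalid/v1/emissions"

def monthly_emissions_url(region: str, year: int, month: int) -> str:
    """Build an OData query URL filtering one region and one month."""
    odata_filter = f"region eq '{region}' and year eq {year} and month eq {month}"
    params = {"$filter": odata_filter, "$orderby": "serviceName"}
    return f"{BASE_URL}?{urlencode(params)}"

print(monthly_emissions_url("westeurope", 2023, 5))
```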

The way Google works is you export your data into BigQuery, and then use the BigQuery API on it. This is the export schema. Again, a monthly summary. This time, they go below regions, down to each zone; you get data per zone. You also have projects on Google, so if you wrap a workload into a project, you can get project-specific data, at least at the monthly level. That's a nice feature. AWS never had the concept of a project, and only very recently came out with something that is a bit like one; I'm hoping AWS builds out more support for projects. Resolution is a tenth of a kilogram, 10 times smaller than the Microsoft data I've seen. They do scope 1, location-based scope 2 with placeholders for market-based (they're stating they're going to have both), and scope 3. This is a decent schema, and there are some good examples of what the numbers might look like.
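Once exported, the data is just rows you can aggregate. Here's a local sketch over sample rows shaped roughly like such an export; the field names and values are invented for illustration, not Google's exact schema:

```python
from collections import defaultdict

# Sample rows shaped like a monthly carbon export (invented values).
rows = [
    {"month": "2023-05", "project": "checkout", "zone": "europe-west1-b", "kgco2e": 12.3},
    {"month": "2023-05", "project": "checkout", "zone": "europe-west1-c", "kgco2e": 4.1},
    {"month": "2023-05", "project": "ml-train", "zone": "us-central1-a", "kgco2e": 87.5},
]

# Roll the per-zone rows up to a per-project total, the granularity the
# project feature gives you.
by_project: dict[str, float] = defaultdict(float)
for row in rows:
    by_project[row["project"]] += row["kgco2e"]
print(dict(by_project))
```

In practice you would express the same rollup as a GROUP BY in BigQuery SQL rather than pulling rows into Python.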

Workload Carbon Footprint Standard (Proposal)

Those are the measurements we can get, but what do we actually want? What I want is real-time carbon metrics for optimization. I want it to be just another metric like CPU utilization. I want it to be reported by the same tools we already use. I want whatever tool you're using to monitor your machines to report carbon as well. This is what I'd like to see. This is me proposing something I've made up. This is a proposal. It's not really been vetted by anybody. I'm launching that here. What I want to see is the same data for all cloud providers, but also data center automation tools, VMware, whatever, things like that. I'd like to see resolution, not monthly data, I want minutes, seconds, maybe. I want it to come out in CloudWatch, once a minute. I want to have a carbon number in CloudWatch as well as the CPU utilization. Country, region, and zone, I think that makes sense. I want to go down to containers, file systems, basically anything CloudWatch measures. Anything I can get a metric from that makes sense, I want to be able to get a carbon number for it. Actually, I don't really just want carbon, what I actually want is the energy usage, pretty high-resolution energy usage. I think millijoules or milliwatt seconds is probably good, because some of these things we might want to measure might be battery powered devices at the edge, which use very little energy. If you have millions of them around the world, they add up, so you want to understand that.
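To make the proposal concrete, here's one way the per-minute record might be shaped. This schema is my rendering of the wishlist above, not any provider's actual format:

```python
from dataclasses import dataclass

@dataclass
class WorkloadEnergyRecord:
    """One proposed per-minute sample for a resource. Hypothetical schema."""
    timestamp: str             # minute resolution, e.g. "2023-05-01T12:34Z"
    resource_id: str           # instance, container, file system, ...
    country: str
    region: str
    zone: str
    energy_mj: float           # millijoules; final once reported
    carbon_ug_estimate: float  # initial estimate, revised as grid data settles

sample = WorkloadEnergyRecord("2023-05-01T12:34Z", "i-0abc123", "IE",
                              "eu-west-1", "eu-west-1a", 5_400_000.0, 930.0)
print(sample.energy_mj / 1000.0, "joules this minute")
```

Millijoules keep the same unit usable from a battery-powered edge device up to a rack of servers; the carbon field stays an estimate because, as discussed, the grid data it depends on settles later.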

I want to have energy and carbon, because the carbon is going to be an estimate. I want location and market-based scope. What I mean by this is that energy is final: once we've figured out the energy, we're not going to revise it. Carbon would have to be an initial estimate, because it's going to be a guess at what that hour looks like. The data is there to do an estimate, but the numbers are going to move. What I'd want is an initial estimate that's roughly in the right ballpark, just so the numbers are there; perfectly good enough for tools and optimization algorithms, things like that. Then the audit-report quality that the cloud providers currently give us takes a couple of months to produce, and a couple of months is fine for that. Then, as I said, the RECs take up to 12 months to settle in the grid data, so you really want another update after 12 months that says, this is the final number that we're never going to change again; that might be useful for some cases. If we compare across all of these, this table has what I'd really like to have, to develop some cool tools on, in the first column, and then AWS, Azure, and GCP in the other columns.

What Tools Could We Build?

What could you build with this? I think the first thing is there's a lot of cost optimization tooling out there, which looks at the utilization of systems and the billing data, and combines them together; Cloud Carbon Footprint is probably the most common one. This should be in all the tooling: any tool that does cost or performance optimization should be able to optimize for carbon as a metric, as well as latency or utilization, or dollars, or euros, or whatever. Then, for SaaS providers, we need attribution and allocation tools. This is a big, complicated problem, and we're going to be building lots of tools here; there should be off-the-shelf tools that can plug into any provider and say, here's how you do allocation. Then I think it'd be interesting to have architecture planning tools. Say you're trying to decide: should you run your own Kafka on EC2? Should you use AWS MSK, which is Kafka as a service, or go to Confluent, or Redpanda, which is a Kafka-compatible clone? Or AWS Kinesis, which isn't the same as Kafka, but might be lower carbon? It's got a shared control plane, so it possibly is, but it depends. The question is, could we get all of these different services to either publish the data or give us a way to calculate what that comparison would look like? I really open this up to ideas; I'm looking for input. What would you want to build if you could get this fine-grained data and do something interesting with it?
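If providers published (or let us estimate) per-option carbon numbers, the core of such a planning tool could treat carbon as just one more column to minimize. The figures below are invented placeholders, not real measurements of these services:

```python
# Hypothetical architecture options with invented monthly cost and carbon
# figures for a fixed messaging workload.
options = {
    "self-managed Kafka on EC2": {"usd": 910.0, "kgco2e": 52.0},
    "AWS MSK": {"usd": 720.0, "kgco2e": 41.0},
    "Kinesis (shared control plane)": {"usd": 650.0, "kgco2e": 29.0},
}

greenest = min(options, key=lambda name: options[name]["kgco2e"])
cheapest = min(options, key=lambda name: options[name]["usd"])
print(f"lowest carbon: {greenest}; lowest cost: {cheapest}")
```

In this made-up table the greenest and cheapest options coincide, which is often plausible since both track resource usage, but a real tool would let you trade the two off explicitly.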

Measuring Climate Impact and Risk

Finally, there's climate impact and risk. OS-Climate.org, which I was involved in helping get started while I was at AWS, has a lot of open source software here. If you're trying to do transition analysis, there's economic modeling for business transitions. The other thing they look at is physical risk and resilience, like trying to understand where your customers are: if you're a retail chain and there's flooding, what happens when people can't get to you and you're shut down? What's the impact to your business of the increasing probability of bad weather events? It's all open source, so if you want to contribute to something, go take a look. There's lots of interesting data science and climate science, and the transition analysis models are pretty sophisticated. Some cool software here.

Things Devs Need to Build

These are the things I think developers need to build: energy usage instrumentation for applications; attribution and allocation algorithms; data lakes to collect energy and carbon measurements; instrumentation at the edge (IoT, mobile, whatever); energy usage dashboards and reports; supply chain carbon interchange protocols; energy-to-carbon models; and climate impact and risk models.

 


 

Recorded at:

Aug 23, 2023
