Transcript
Peterson: I'm Erik Peterson. I'm the CTO of a Boston startup called CloudZero. I started that company on August 8, 2016, so it's been almost an 8-year journey. I've been building in the cloud, really, since the beginning, since this thing existed. It's been my love affair. I started in the security industry. Spent a lot of time in that world. When the cloud came around, I just got to work, started building, and I haven't looked back. There was one big problem that I had personally, which was all the money that I saw my engineering team spending in the cloud without a care in the world. That bothered me. I decided to solve that problem by starting CloudZero. I want to talk about that problem. I want to talk about how it relates to all of us as engineers. I want to get into the details.
Every Engineering Decision is a Buying Decision
I want to start with this statement. It's a very important statement, I think, that needs to change how we all think of the world. I want to frame this whole conversation around this, which is that every engineering decision is a buying decision. This is this giant wave that is coming at us that we cannot escape. Our only option is to surf it. Otherwise, it's going to crush us. Every single person who writes a line of code in the cloud, you are spending money. If you're traveling and you're out there and you write an expense report, there will be more scrutiny on that expense report for that $20 lobster dinner, maybe $50 lobster dinner, than there will be for the line of code that spends a million dollars. That's the challenge that we all have a responsibility to focus on.
I think this problem is an existential threat to the cloud itself. I don't think the cloud is actually going to be successful if it doesn't make strong economic sense. We've all heard the naysayers, I love it: the cloud is not going to happen in our lifetime, it's still relatively small given all things IT. That's our responsibility as engineers. Unfortunately, this is where we're living today. Anybody remember the days before DevOps? You threw it over the wall, you gave it to ops, everything was good, you went home. This is the reality that most of us are living in right now. It works fine for us. It's finance's problem now. More importantly, it's actually our business's problem. If we can't build profitable systems, some of the companies that we work for might not exist tomorrow. It's pretty obvious, this is going to cost you a lot. We're still trying to figure this out.
What Is a Million-Dollar Line of Code?
What is a million-dollar line of code? My talk in San Francisco at QCon last year focused on just what a million-dollar line of code looks like. I wanted to give you all just a quick example. Anyone want to spot a million-dollar line of code in this little slice of Python? It's hard to see. It's pretty obvious, or maybe it's not. I've got some code that's writing 1000 bytes, sending it to DynamoDB. No problem.
The problem is, there's a limit: 1024 bytes. If you go over that limit, you're spending twice as much to make a write operation in DynamoDB. That means, because I'm off by just a few bytes, this code costs twice as much as it should. Easy to make that mistake, very hard to understand the context of it. How do I fix it? Just shrink it down. Get rid of that big long key. Make the timestamp a little bit easier to read. Cut my cost for this code in half. That was an actual million-dollar line of code running in production, running every day, handling billions of transactions, cut in half. That's the power that we as engineers have at our fingertips. The question, of course, is, how am I going to go about finding these lines of code? It's a little bit hard. That's what we're going to get into.
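As a rough illustration of the boundary described here (a sketch, not the actual code from the talk: the table name, attribute names, and payload are made up), this is the kind of before-and-after that crossing DynamoDB's 1 KB write boundary implies:

```python
import boto3

# DynamoDB bills one write unit per 1 KB (1024 bytes) of item size, rounded up,
# so an item of 1025 bytes costs twice as much to write as one of 1000 bytes.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")  # assumed table name


def approx_item_size(item: dict) -> int:
    """Rough estimate: attribute names and values both count toward the 1 KB boundary."""
    return sum(len(k) + len(str(v)) for k, v in item.items())


# Before: verbose attribute names and an ISO timestamp push the item just past 1 KB.
verbose_item = {
    "event_identifier": "9f7c2c1e-example",
    "event_creation_timestamp_utc": "2024-06-01T12:34:56.789012+00:00",
    "payload": "x" * 950,
}

# After: shorter keys and an epoch timestamp carry the same data under the boundary,
# halving the write cost for this (hypothetical) record.
compact_item = {
    "id": "9f7c2c1e-example",
    "ts": 1717245296,
    "payload": "x" * 950,
}

for name, item in [("before", verbose_item), ("after", compact_item)]:
    size = approx_item_size(item)
    write_units = -(-size // 1024)  # ceiling division
    print(f"{name}: ~{size} bytes -> {write_units} write unit(s)")

# table.put_item(Item=compact_item)  # the actual write, functionally unchanged
```

The only change is shorter attribute names and a more compact timestamp, which is exactly the fix described above.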
I've got a few axioms that I'm going to use to frame the conversation. The first one is that only observed costs are accurate. We've all tried to estimate costs. We've probably used a cost calculator at some point. We've gone to the thing on the cloud provider's website and maybe even did a little math in a spreadsheet. None of that was really accurate, though, when it got to production, was it? First, a little word from our Lord and Savior, from this fine young gentleman, Donald Knuth. He was very, I think, ahead of his time in many ways. Many of us have this tattooed on our minds already: premature optimization is the root of all evil. As we go through this, I want us to think about when it makes sense to actually start thinking about optimizing while trying to find these million-dollar lines of code.
Cloud Cost Challenges
Here are some of the challenges. The majority of cloud computing costs, they come from usage, not infrastructure. I may spin up 100 EC2 instances. Sure, that costs money, but what am I doing with them? What API calls are they making? The same piece of code can be both cheap and bankrupt me. That example I showed you, run once, costs a few pennies, run a billion times, costs a lot. The behavior of my application and how the world is interacting with it matters a lot, and unfortunately, it's probably just something like 3% of your code that accounts for 90% or 97% of that cost. The reality is that emergent behavior is what rules the cloud. It's not possible, actually, to predict the cost of a change.
A small change by a junior engineer who makes the timestamp a couple of bytes larger than it should be, marketing running a campaign that suddenly sends a million people to my website that I wasn't expecting, or ops going in and setting logs to debug: all of these things create downstream effects that I cannot predict. You will never be able to predict it. You can make an educated guess. You can try to forecast what you're going to spend based on past performance, but it is not indicative of what's going to happen in the future, and eventually it will be wrong. It will almost certainly bite you at the least appropriate moment.
One of the ways to get around this, and what I've seen folks do that makes solid sense but surprisingly isn't implemented everywhere, is this. We track quality. We track performance. We track staging and verify that everything is looking good before we push to production, or maybe you're like me, I like to test in production. We don't track the cost of that environment before we pull the trigger. My recommendation, now that we're observing our costs, is to pick a threshold. Pick a number, up or down by 10%, start somewhere.
Track the before and after. You can get really detailed. Get into the unit metrics. We're going to talk about that. Get into the cost of the services I'm using: the compute cost, the API transaction volume. How much log data am I producing? Or, just start simple at the beginning. Deploy all that to an isolated account, track the total cost, 24 hours later, if your cost is up by 10%, you have a problem. I'm sure a lot of people have worked with or know Twilio. This is no different than what Twilio does today, to make sure that they keep their margins in check. Engineering team writes a bunch of new code, pushes up their compute cost, that means their cost per API transaction goes up, profit goes down. If those costs exceed your threshold, roll back your changes.
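Here is a minimal sketch of that "pick a threshold, compare before and after" check, assuming AWS and the Cost Explorer API via boto3; the 10% threshold, the isolated-account assumption, and the one-day comparison window are illustrative starting points, not the speaker's actual tooling:

```python
import sys
import datetime as dt
import boto3

# Compare the first full day after a deploy against the day before it, for an
# isolated account whose whole bill belongs to the change being verified.
THRESHOLD = 0.10  # the arbitrary 10% starting point suggested in the talk


def daily_cost(client, day: dt.date) -> float:
    """Total unblended cost for one calendar day (Cost Explorer's End date is exclusive)."""
    resp = client.get_cost_and_usage(
        TimePeriod={
            "Start": day.isoformat(),
            "End": (day + dt.timedelta(days=1)).isoformat(),
        },
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])


def main() -> int:
    ce = boto3.client("ce")
    today = dt.date.today()
    before = daily_cost(ce, today - dt.timedelta(days=2))  # day before the deploy
    after = daily_cost(ce, today - dt.timedelta(days=1))   # first full day after the deploy
    change = (after - before) / before if before else 0.0
    print(f"before=${before:.2f} after=${after:.2f} change={change:+.1%}")
    if change > THRESHOLD:
        print("Cost regression exceeds threshold: roll back or investigate.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Cost Explorer data typically lags by up to a day, which is part of why the talk frames this as a 24-hour check rather than a per-commit gate.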
Axiom 2: an observed system is a more cost-effective one. I thought of all the ways I could throw a Schrödinger's cat example into this, but the reality, and this seems like obvious wisdom even though so few of us are doing it, is that you've got to actually pay attention if you're going to save money. Most engineers are not part of the cost conversation. Finance is in the mix. Maybe they're not even giving you access. You cannot build a well-architected system in the cloud without access to this information. It's like being asked to make the website more performant without being told how many milliseconds a transaction takes. You've got to demand this if you want to build a cost-effective system.
What Actually Costs You Money?
What actually costs you money? Not so obvious. Somebody turns on debug logging, suddenly you have terabytes of data flowing into logs. Maybe it's going into CloudWatch if I'm using AWS. Maybe it's going into Datadog, even more expensive. Maybe it's going someplace else, to a Slack channel, destroying some poor intern's soul. It's possible. All of that costs money. The semi-obvious: API calls cost money. You might be writing a ton of data to S3 and not thinking much about it. The API calls to S3 cost a lot of money. Maybe you have a billion calls just to check to see if the file is there, and you write the file once; that is a very expensive piece of code. If it's running slow, classic performance problems, of course, you use more compute.
Compute is a commodity. In some cases it's scarce, particularly if we're trying to buy those big AI instances these days. Things like reading and writing data, network traffic, even cross-region traffic, all of this costs money. Maybe we already knew this, but it's hard to observe. The thing that should be obvious, like I said, but isn't, is that unobserved systems, where you are not paying attention to the costs, are less cost-effective. What does cost-effective mean? There are a couple of different ways to think about that, because your cloud bill going up isn't necessarily a bad thing if your return on that investment is good. If for every dollar I spend I make $10 back, you might not actually be worried about spending money. What matters is efficiency: how well am I serving up the value that my product builds or delivers to the world, and what am I getting in return for that?
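As an illustration of the "billion calls to check if the file is there" pattern, here is a hedged sketch (the bucket, key, and polling interval are hypothetical) of how an innocuous-looking wait loop turns into a large request bill:

```python
import time
import boto3
from botocore.exceptions import ClientError

# Every head_object call is a billed S3 request (and keeps compute busy), so a
# tight polling loop multiplied across many workers is exactly the kind of
# quietly expensive code the talk is describing.
s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "reports/daily.json"  # assumed names


def wait_for_object_by_polling(timeout_s: int = 3600) -> bool:
    """Costly pattern: poll S3 until the object shows up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=BUCKET, Key=KEY)  # one billed request per iteration
            return True
        except ClientError:  # typically a 404 until the writer finishes
            time.sleep(1)    # a 1-second loop is ~86,400 requests per day, per worker
    return False


# A cheaper shape for the same requirement is usually event-driven: configure an
# S3 event notification (for example to SQS or Lambda) so the writer's single
# PutObject triggers the consumer, and the existence checks disappear entirely.
```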
I'm sure some of you have tried to calculate these things and guessed. Remember what I said earlier? Sooner or later, the emergent properties of the cloud are going to get you. Here's the proof. This is just one example across a lot of data that we've been tracking. The only thing that was introduced into this environment very intentionally was the ability for every single engineer on the team to suddenly know the cost of their actions. When they checked in some code, saw it run, they'd get a little note: cost went up, cost went down. Nothing else changed. Nobody told them what to do. Nobody forced them to do anything. All we did was create visibility. It was a fascinating experiment. From that action alone, we saw costs decrease and level off.
Actually, what we see now is that costs continue to grow as the business grows, but we're tracking something called unit cost and unit economics, which I'll talk about. If your finance team says, "Yes, we're looking at it. We've got it all covered," it doesn't count. The cost must be observable to the people who are actually building the systems. Imagine if, for every bug you ever created, you had to wait a month before you got the stack trace, or somebody gatekept it away from you: we would all be doing different jobs. The same is true for cost. The cost must be observable by those building the system.
You might be wondering if this is going to slow us all down. Maybe this is going to destroy our ability to innovate. Maybe premature optimization is the root of all evil, and this is just premature optimization. We're just piling more work on top of every developer who's already doing a million things. The truth is, constraints drive innovation. Cost constraints are practically the only constraint that we have today in the cloud. Think about it, thanks to the cloud, we all have infinite scale. I can bring up 1000 machines if I wanted to. I can write a single line of code to do that. I can put it in a Terraform template.
I have infinite scale, but I do not have infinite wallet. Constraints drive innovation. I'm sure some people are familiar with the theory of constraints and other thinking around this in the world of manufacturing. If not, go check it out. Think about some of the hardest or most amazing systems that have ever been built in the world under extreme constraints. My favorite is the Apollo 11 Guidance Computer. It's ridiculously constrained in terms of the resources that it had. I have a billion of those on my wrist right now. It put a person on the moon. Constraints drive innovation, and we've lost sight of these constraints.
The Cloud Innovator's Dilemma
This is what happens. We're living what I call the cloud innovator's dilemma. We all rush out, we adopt new technology. We start building rather rapidly on this, until somebody somewhere panics. I like to call that guy Mr. Bill Panic, he's right there. All innovation halts. This is actually more disruptive to velocity than anything else there is. Who here loves being called out of the flow moment that they're in to go fix a bunch of bugs, to go deal with bill panic? Cost cutting begins, the Dark Ages return. Stuff starts going crazy. Nobody is building anything cool, and you're wondering, when am I going to be able to get back to innovation again? The same curve happened in the security industry over a decade ago. If you remember, in the software security world, security was the responsibility of the security team, and that didn't work out so well over time.
Eventually, companies like Microsoft found themselves getting hacked practically every single day. They were getting tipped over. It was becoming a national security issue. Eventually, Bill Gates had to write the Trustworthy Computing memo back in 2002, and he said, Microsoft is going to stop innovating until we solve the security problem. In fact, we're going to innovate on security. Some people thought that might go on for a few months. It went on for two years before a lot of people got back to building new software. That's how dangerous this dilemma can be, until we get back to innovating, and then the cycle repeats itself.
How do we solve this cloud innovator's dilemma? Donald Knuth also said that we should not pass up our opportunities in that critical 3%. I think we need to end that cycle and leverage cost as a non-functional requirement. Non-functional requirements are super powerful in the world of software design. They help us understand what our stakeholders want. They help us understand what our business wants. I believe very strongly that cost belongs in that 3% as a design constraint. Those constraints, remember, drive innovation. Think about some of the best systems that have ever been built out there. I know this for a fact from my own experience. One of the reasons I started CloudZero was because very early on in my cloud journey, I went to our CFO. His name was Ed Goldfinger.
It was a perfect name for a CFO. I thought I was living in a James Bond film. I went to Ed and I said, "Ed, I need the company credit card because I'm just slightly smart enough to know that I probably shouldn't put my credit card number into that box on the AWS website". He said, "No problem, Erik. We trust you. We hired you guys to go build some cool stuff. Here's the credit card, but you've only got a budget of $3,000". I thought that was insane. I thought we would spend that in probably 2 seconds. I really didn't have anything to base my beliefs on, because we were just getting started. I said, challenge accepted. Then we started innovating. What happened is we actually threw out the entire design that we were working on, where we came back and treated it as a non-functional requirement, this is back in 2009 and 2010, and we ended up building probably what to this day I consider a truly cloud native system.
A system that can only exist in the cloud. We love to talk about cloud native as something that can move from place to place. I think cloud native means cloud only. The reason we had to build that way was because we had to leverage the full power of what we had. We had to sequence workloads, and turn things on and off, and be very dynamic about how we were doing it. For that $3,000 we did some pretty amazing things. We built a system that generated millions of dollars in revenue, for $3,000. That kind of efficiency is possible when you embrace it. If that constraint of $3,000 hadn't existed for me, we would have never done it. We would have gone and spent probably a million dollars building that thing.
That system can never move back to the data center, because it was built only for the cloud. These are the things that prevent bill panic. I want all of you to think about runaway cloud cost as a software defect from this point forward. It's not something that finance has to deal with. It's your code, your designs, your architecture, the decisions that you made, those buying decisions, that are responsible for that cost. Runaway cloud cost is a software defect.
The Possible Side Effects of Cost Constraints
Like I said, these constraints have some side effects. You might need to say goodbye to the data center forever. I know that might sound a little scary, but you're already locked into something, no matter where you're going. Lock-in is a lie. You're building to the lowest common denominator. You're not embracing the true power. You're spending a lot of money to do it. This is even true if you use Kubernetes or whatever new container thing comes out tomorrow. You've got to embrace some of the managed services and other capabilities that the cloud offers you, so that you can get out of the mode of running and building things that you can let somebody else do more efficiently for you.
Give your cloud provider the undifferentiated heavy lifting. You're probably going to have to do something more than just EC2. I know that might be tough, but think about it: for $3,000 you could build a system that generates millions of dollars in revenue, if you embrace it. If you've been building in the cloud for a while, your developer velocity might not take a hit, but it probably will; people are going to have to learn new things. It's a great community, all of us. If you're new to this, it'll take a little hit. If it's not improving, try harder. Squeeze that constraint. Hang on to the reality that constraints drive innovation and will push the team to think harder and faster about what they're building. I assure you, they will thank you for it. We love solving hard problems.
There's somebody who agrees with this as well: the CTO of AWS, Dr. Werner Vogels. He first started talking about cost as a positive property of the cloud in 2013. He got up on stage, he gave his keynote. There were five properties of the cloud, and one of them was that cost goes down. I think at that time, he was thinking the same way, that cost, when treated as a constraint, will drive innovation. Unfortunately, he and all of AWS got too excited about lift and shift and all the money they were making, and they forgot this journey. I think it's horrible. Lift and shift was probably one of the greatest lies that the cloud providers ever told. I don't blame them, but they allowed the conversation to shift away from the amazing things that we could build if we treated cost as a more important property of the systems that we were building.
In 2023, 10 years later almost to the day, Werner got back up on stage at re:Invent and shared these thoughts with the rest of the world. I hope it continues down this path. I hope all of the cloud providers continue down this path. I hope all of us realize the importance of this. It really does drive some amazing innovations. A lot of us are just living in a world of, it works, why would I want to mess with it? Ask ourselves, does it really work for the business? Does it really work for the companies that we're supporting? Are we building a more profitable system so that we can hire larger engineering teams and build more things? Chances are, if we're not thinking about it, we probably are not.
That leads me to axiom 4: how do we understand what actually makes sense to the business? Sometimes when we look at our cloud bill, it's a big number. Yes, I'd like to spend as much as a new Lamborghini every single day. What does that mean? Is that good? Is that bad? Is it scary? Cost is a horrible metric, just a big, scary number. How do I know what's good? It's lacking any context. The only reason we know whether response times down to the millisecond for websites are good or bad is because Google published their site rankings and said, if you get your latency down, it's better. They set the standard. Somebody needs to set the standard. Your business sets the standard. It's called a margin. If you're in the software business, you probably want to achieve something around 80% on your profit margins. The way you get at that is to understand the unit economics of the products that you're building.
Back in 2009 and 2010, when I was building that system, I had nothing to go off of in terms of how we were going to stay within this budget. I was building a security scanning tool, and I had $3,000. I figured out how many sites I needed to scan, divided the budget by that number, and came up with something that said, you can only spend a few pennies per scan. That's what drove how we started iterating on that architecture.
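Written out, that back-of-the-envelope calculation looks something like this; the scan volume is a made-up illustration, since the talk doesn't give the real number:

```python
# The monthly budget comes from the story ($3,000); the expected scan count is hypothetical.
monthly_budget_usd = 3_000
expected_scans_per_month = 60_000  # assumed demand

target_cost_per_scan = monthly_budget_usd / expected_scans_per_month
print(f"Target cost per scan: ${target_cost_per_scan:.3f}")  # $0.050, i.e. "a few pennies per scan"
```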
Leverage Unit Economics as the Bridge between Finance and Engineering
This is the bridge between you and finance. You've probably seen those people sneaking around, wondering what we're all spending our money on, and they speak a scary language sometimes. I think engineering and finance are two long-lost siblings. They just grew up in different parts of the world. They speak completely different languages. They're two very highly technical, specialized disciplines. You have more in common than you actually realize. They're just as scared of you. They really don't know what Kubernetes is. They really don't understand the cloud. They might be trying. What they want are metrics that actually make sense to them in terms of improving the bottom line, because you've got these two worlds of data that need to come together if you're going to build a truly optimized, profitable business.
You all as engineers are the key. What is unit economics anyways? Do I have to go back to school? Do I need to go get an economics degree? Hopefully not. I'm going to give you a crash course. It's very simple, as we can tell here. It's a system of using objective measurements to ensure your business is maximizing profit while delivering and developing cloud-based software and services. Perfect. No problem. We're all ready to go. It's going to take some data. It's going to take a little bit of thought about how we calculate this. Let's get into it. A unit metric is a representation of the rate of change for a demand driver. A demand driver can be something like, I'm DoorDash, and I deliver meals. People want to order meals.
What's my cost to deliver a meal when I look at my cloud infrastructure? I'm a credit card company. People want to do a transaction. What's the cost to do a credit card transaction? What's the cost for someone to sign up? These are demand drivers. These are things that drive our cost. It's the other side of the equation. The goal of a good unit metric is to present the incremental consumption of some resource in terms of that demand driver. How much money am I consuming to deliver that meal? It looks something like this: consumption on the top, demand driver on the bottom. That is my unit metric. Some great places to start are unit costs: total cloud cost, cloud cost per service, cloud cost per feature, per customer, per microservice.
Break it up. I can take other drivers. How much vCPU compute am I using? How many gigabytes am I storing? I can map that against how many daily active users I have. I was at the FinOps X conference. FinOps is an emerging discipline that's growing pretty fast. If you don't know about it, there are probably close to 15,000 people now in that community, which exploded out of nowhere. It's designed around building this bridge between finance and engineering. They had their conference in San Diego. I got up on the stage with my good friend Vitor from Duolingo. How many people use Duolingo? They know exactly what their cost per daily active user is, because we've been working on it together, so that they can build a more cost-effective app for when you sit down at night to make sure you freeze that streak.
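A minimal sketch of a unit metric as described here, consumption over a demand driver, with hypothetical figures for both:

```python
from dataclasses import dataclass


@dataclass
class UnitMetric:
    consumption: float    # e.g. dollars of cloud spend, vCPU-hours, GB stored
    demand_driver: float  # e.g. meals delivered, transactions, daily active users

    @property
    def value(self) -> float:
        # Consumption on the top, demand driver on the bottom.
        return self.consumption / self.demand_driver


# Illustrative numbers only; in practice the numerator comes from your cost and
# usage data and the denominator from product analytics.
cost_per_meal = UnitMetric(consumption=42_000.0, demand_driver=1_200_000)  # $ / meals delivered
vcpu_per_dau = UnitMetric(consumption=18_500.0, demand_driver=250_000)     # vCPU-hours / DAU

print(f"Cloud cost per meal delivered: ${cost_per_meal.value:.4f}")
print(f"vCPU-hours per daily active user: {vcpu_per_dau.value:.3f}")
```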
This is how you need to think about it. $25 million spent on EC2, who cares? It's a worthless metric. It's pointless. It's not helpful. Maybe it's curious. Maybe it sounds good at parties. If you're in the business of doing transactions, 25 cents per transaction, when you sell those transactions for $10, is a very profitable business. You can go spend $25 million all day, every day, and no one will care. In fact, everyone will be happy. In fact, finance might even come to you and ask, how can we spend more money? Which means more toys for all of us. That's what I was thinking back then when I had that impossible challenge. I want to go play with the cloud toys, I'm going to have to make this work. I want to equip all of you to do that. Don't think about how much money you're spending. That cost number is not useful. Think about the unit metric for your business. Start looking for the data that's going to allow you to calculate that.
The last piece I want to leave you all with is that engineering is responsible for cloud cost. Just like the security industry went through that transformation so many years ago, we are all in the midst of this transformation right now. Whether it was the economy, or COVID, or both, that was the cloud world's Pearl Harbor moment, its equivalent of Bill Gates's Trustworthy Computing memo that said, we're going to halt innovation until we figure this security thing out. That was really the before and after for the application security industry. After that, all the companies started to understand, I have to adopt security as part of my software development lifecycle.
Today that's the way it's done. There are tools built for all of us to bring security into that development process. We need to bring cost into our development process. These are very early days. We're still trying to figure this out. One of the challenges, though, like I said, is that in many ways it's almost impossible to calculate what systems are going to cost until we've observed them. I can take a Terraform template or a CloudFormation template, add up all the resources that are going to be created, and spit out a number. But until I know what the code and the users running on that system are doing, I won't have a very accurate estimate of what it's all going to cost. Engineering has to be responsible for cloud costs.
Runaway Cost is a Software Defect
We have to think about runaway cost as a software defect. How many people remember this chart? We've been living it since the days of IBM and Raft and Rational and who knows what. Those were good days. Maybe not. It's been pounded into our heads: the further you get away from the moment the code was created, the harder it is to fix that bug, almost exponentially harder. Maybe it is. It's a math thing. Now, thanks to the cloud, it actually costs a lot more. In 2022, somebody crazy came up with this number, which I still have a hard time believing, but all the bugs that we wrote, all those quality issues that we were responsible for, may have cost the economy $2.4 trillion, if you can believe that.
A lot of that stuff was probably built in the cloud, which means a lot of that stuff spent a lot of money in the cloud. I don't think that number is even included in there. If you take Gartner's projection of what the world's cloud spend was in 2022, somewhere around $500 billion, and Flexera's report that says about 30% of all cloud spend is waste, that means about $143 billion should be added to that number. Our software defects caused that to happen. I think it would be really amazing if we all walked around with a number above our heads indicating our total lifetime cloud spend. That's probably a Black Mirror episode or something. It's scary.
How many people have a million-dollar number hanging over their head? Anybody want to admit that? How many people have a multimillion-dollar number hanging over their head? How many people have written bugs that spent $30,000? No regrets. That was one of the challenges those years ago when Ed was beating me up at that company. I'd get dragged into his office once we got more of the engineering team engaged, and he would be like, what happened to those good old days when we were just spending $3,000? I'd love to say that that story continued, but eventually we started doing lift and shift, and we figured out that it costs a lot more. Then I felt like I was going to the principal's office every week. It was painful.
Sometimes I had to admit to him that it was my code, that single line of code, maybe even a million-dollar line of code, that was responsible for all this. It cut me deep. I wanted everyone on the team to feel that they could be part of making sure that never happened again. It's hard, because we need to catch these things sometimes when they're only a few dollars. You look at something running in staging, it's only costing $3 to run, but break that out, and that's where unit economics come in handy. How many users are we expecting? How many transactions are we expecting? What things are we expecting my application to do? Multiply that $3 by whatever that number is, and discover that the change we just made might be a few million dollars if we let it run. I've seen single lines of code cost over $50 million annualized. One line of code. That's how powerful this is.
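That staging-to-production extrapolation, written out with illustrative numbers (the $3 staging run comes from the talk; the traffic figures are made up):

```python
# A change that looks like pocket change in staging is only meaningful once it is
# multiplied by the expected production volume.
staging_cost_per_run_usd = 3.00
staging_transactions_per_run = 1_000  # hypothetical traffic driven during the staging run

cost_per_transaction = staging_cost_per_run_usd / staging_transactions_per_run  # $0.003

expected_production_transactions_per_year = 2_000_000_000  # hypothetical
annualized_cost = cost_per_transaction * expected_production_transactions_per_year

print(f"Cost per transaction: ${cost_per_transaction:.4f}")
print(f"Annualized cost if shipped as-is: ${annualized_cost:,.0f}")  # $6,000,000
```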
It's hard to find those lines of code because we have to watch them running in the real world. If we're breaking all these things out, we can zero in on the parts of the system that matter. If we are correlating them to the check-ins that we make and we make it part of our engineering process, then it should be very easy for us to zero in on. In fact, I think it is.
Buy Better or Build Better?
There are two ways to think about that. Most of us are probably living in an organization, or maybe have been living in an organization, where it's finance-led optimization. Finance-led optimization is really about buying better. This comes from the world of procurement, where we all started: discount optimizations, buying RIs. Maybe you're doing this as well. Maybe you have a shared responsibility. Fantastic. A lot of organizations are still learning their way: budgeting, forecasting, trying to understand what's going to happen the old way, and probably being very frustrated that none of the old ways are really working. What's more powerful is the other side, building better.
Let's write some better code that's going to improve our product and our cost to serve, our unit economics. Let's respond to anomalies when we see a change in the environment that triggers runaway cost. Let's address the root cause. Let's treat it like an incident. Let's go about reducing waste. We've got a love-hate relationship with rightsizing, because a lot of the time the opportunity cost is greater than what you get back for the time spent on it. It's also a smell that you might still be too in love with your architecture. You might be treating it like pets. The problem is that all of that should be automated if you're building the right systems. You shouldn't have to be thinking about that architecture, so that when the next compute architecture comes out, it's very easy to upgrade to it.
That challenge is not solved after the system is deployed. It's solved during design. You have to realize that everything you build will eventually rot and decay and go away. Another favorite quote of mine from Werner was that everything fails all the time, and in the cloud, it's guaranteed to happen all of the time. I almost felt like that was a feature: it created some evolutionary pressure on the environment and the systems we were building, so that we would create more efficient systems, or at least more resilient systems. We still push against it. We need to do both of these, because there's only so much that finance can do. We've got to partner with them. We've got to go back to that bridge.
Key Takeaways
I'll give you these five axioms. Take them with you. Every engineering decision is a buying decision. The world has flipped. The challenge for us now is to build systems with the least amount of resource consumption. It's not just good for what we're trying to do as businesses, it's good for the planet. We used to see optimization as maximizing the use of what I was given in the data center. You gave me 10 servers; if I didn't use those 10 servers, I was being wasteful. In the cloud, I have to solve a much harder problem: how do I use the smallest amount? It's an almost unsolvable problem in some ways.
Questions and Answers
Participant 1: You mentioned how it's easy to go through and determine the commits that may have caused these issues. How do you handle scenarios where it may take a couple days to figure out the real cost, because it has to go into production, has to get traffic, in that time window, maybe 200 commits have gone in?
Peterson: You're going to have to calibrate over time. You probably won't get it right the first time. It's like old load testing. In the good old days, we tested with a certain load, then we pushed into production, it fell over, and you realized you were synthesizing the wrong load, or you were doing something that wasn't testing real-world scenarios. Like anything, you're going to have to iterate. I like to say that you should commit and iterate over time. The mechanical part of this is that you really should try to carve out how you're going to break apart your spend, allocated or attributed to the teams or the products or the features. It's hard to do this with just tagging. You might need to take other approaches as well. That's your first task.
How can I break this out and isolate the system that I'm observing? That might mean you need to stand up an account for it. That might mean you need to tag it specifically. Maybe you have to apply a couple of different techniques. Really focus in on that first. Then once you're really clear on how you've been able to isolate that, now you've got staging versus production, figure out what moves the needle for it. Then those become your demand drivers that you're going to simulate against that environment before you push to production. Go through those processes. You've got to walk before you run.
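As one concrete example of that allocation step, here is a sketch of pulling last month's spend grouped by a cost-allocation tag with boto3's Cost Explorer client; the tag key "team" is an assumption and has to be activated as a cost allocation tag in billing before it appears in this data:

```python
import datetime as dt
import boto3

ce = boto3.client("ce")

# Previous calendar month (Cost Explorer's End date is exclusive).
end = dt.date.today().replace(day=1)
start = (end - dt.timedelta(days=1)).replace(day=1)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],  # assumed tag key
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]  # e.g. "team$checkout"; an empty value means untagged spend
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag_value or 'untagged'}: ${amount:,.2f}")
```

The untagged bucket is usually the first thing to chase down, since it is spend nobody has been attributed yet.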
Eventually, when you get there, you'll be able to understand if you should break the build and roll back within 24 hours or less. That 24 hours exists only because the cloud providers themselves typically don't deliver the data fast enough. One day I hope to break that; we call it the cloud cost sound barrier. We'll figure out how to break it soon, actually.
Participant 2: Are there any services that allow you or help you in tracking cost, the way that you describe them, by units?
Peterson: If you haven't done any of this, go look at what your cloud provider could give you for free before you get out of the gate. Amazon, Google, Microsoft, they all have a center for FinOps or cloud cost within their products. Sometimes you'll find certain limitations of what you can do there. It's the best place to get your feet wet. Then, when you get to the stage where you're struggling with some of that stuff, I'll steer you in the right direction. I'll even tell you to go use a different product if I think it's better for you.
Participant 3: I'm curious how the fourth and the fifth axioms correlate, the unit economics and engineering being responsible. Because, let's say my company does 20 cents per order, taking the example of DoorDash. If you translate that at a company level, even though an engineer uses some extra cloud compute in a day, it wouldn't really amount to even a cent per order. How do engineers feel responsible for those unit economics?
Peterson: This is where the partnership with finance is probably important, because they can maybe help you get some of that data to build that up. One very important task, after you start going down the path of building out some of the metrics that you should be tracking, is to verify that those metrics are actually correlated to how your company makes money. It's one thing to think about it as a correlation against cost, but more important is the correlation against how you make money. You may discover that your company has a horrible pricing model, that your code is unbelievably efficient, but you're just not selling it correctly, and it's not even your problem.
It's a very liberating day, because you can go to finance and be like, "We are badass, we've got this covered. It's the salespeople, they're just selling it wrong. You've got to raise the price." You've got to figure out how your metrics correlate against how you make money, not just against your cost. Some of these metrics are very helpful, though. Cost per vCPU can give you a very quick idea of just how efficiently your code is running across a whole cluster of machines. It's not a bad metric to start with. Think about how your company makes money.