
Cloud Native is about Culture, Not Containers

Summary

Holly Cummins shares stories of customers struggling to get cloud native and all the ways things can go wrong.

Bio

Holly Cummins is the worldwide development practice lead for the IBM Garage. As part of the Garage, she delivers technology-enabled innovation to clients across a range of industries, from banking to catering to retail to NGOs. She is an Oracle Java Champion, IBM Q Ambassador, and JavaOne Rock Star. She co-authored Manning’s Enterprise OSGi in Action.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Before we can start talking about what Cloud-native is about and isn't about and what we should be thinking about, we need to have a common definition or a common understanding of what even is Cloud-native? It turns out that this is where it gets a little bit tricky.

Some of you may know Daniel Bryant, chair of this conference. If you don't know him, you haven't been paying attention. Some of you may also have seen Bilgin, because he spoke in this room in this track earlier today. How many of you were in Bilgin's talk? One or two.

I feel a little bit guilty, because for a while now, every time Bilgin writes an article or gives a talk, I go around the internet saying, "He's wrong, he's wrong, he's wrong," which is a little bit uncool to do. I feel ok doing it because he knows way more than me about the subject. I feel like it's ok if someone knows more than you to tell them you're wrong, and it's extra ok now I feel because he's a colleague of mine.

He works at Red Hat, I work at IBM. I figure that's ok. I'm not sure whether he thinks the same. Every time I see him I feel a little bit guilty that I'm going around talking about him. What started me saying all over the internet that Bilgin was wrong was that Dan Bryant did this tweet. Bilgin had written an article. He writes really great articles, and there was this picture.

If you look at that picture, you can see that it traces out a progression from SOA to microservices to Cloud-native, and you can see that Cloud-native has loads of microservices all over, and the difference is in how things are wired. I looked at that and I thought, I'm part of the IBM Garage. One of the things we do is we write Cloud-native applications for our customers, and at that time I almost never used microservices in my apps. I was pretty sure I was writing Cloud-native apps. Then I thought, well, does that mean I'm doing it wrong, or is maybe the definition of Cloud-native a bit complicated?

I should say as well that the article of Bilgin's whose picture I didn't like was called "Microservices in a Post-Kubernetes Era", so it would be a bit ridiculous if he wasn't talking about microservices in that article. In lots of other places we see the same assumption that microservices equals Cloud-native and Cloud-native equals microservices, including in the Cloud Native Computing Foundation. If you look at their website, under what is Cloud-native, it says it's all about microservices and it's all about containers and you dynamically orchestrate them. That puts me in this really weird position, because not only am I saying Bilgin is wrong, I'm saying the Cloud Native Computing Foundation, what did they ever know about Cloud-native? I'm sure I know way more than them, right?

Well, no, obviously I don't. I'm on the wrong side of history on this one. I will admit that. I'm still going to die on my little hill, that Cloud-native is about something that's much bigger than microservices. Microservices are one way of doing it. They're not the only way of doing it.

In fact, you do still see this diversity of definitions with other people. If you go and you ask a whole bunch of people what Cloud-native means, some people will say born on the Cloud. I think this was very much the original definition of Cloud-native, back before microservices were even a thing. Some people will say it's microservices. A lot of people will say it's microservices.

Some people will say, oh no, it's not just microservices, it's microservices on Kubernetes, and that's how you get Cloud-native. Well, I think Cloud-native shouldn't be about a technology choice. Sometimes I see Cloud-native used as a synonym for DevOps, and a lot of the principles that we want to be applying are the same, but again, it's an interesting mushing together.

Sometimes I see Cloud-native used just as a way of saying we're developing this in 2020, we're going to use modern best practices, it's going to be good, it's going to be robust, we're going to take everything we've learned over the last 20 years and put that in and that's what makes it Cloud-native. Cloud is an afterthought.

Sometimes I see Cloud-native used just to mean Cloud. We got so used to hearing Cloud-native that every time we talk about Cloud we just feel like we have to stick the native afterwards, but we're just talking about Cloud. Sometimes, and this is another definition that I think is quite correct, sometimes when people say Cloud-native what they really mean is idempotent. The problem with this is if you say Cloud-native means idempotent, everybody else goes, "What?"

What we really mean by idempotent is rerunnable. If I take it, I shut it down, I start it up again, that's ok. That's a really important characteristic for services on the Cloud. We've got all of these definitions, so that starts to explain why we're not entirely sure what we're trying to do when we do Cloud-native.
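To make rerunnable concrete, here is a minimal sketch, with entirely illustrative names, of an operation that is safe to run twice, for example when an instance is killed mid-request and the client retries:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: an operation made safe to rerun by checking an
// idempotency key before doing the work. All names are illustrative.
public class RerunnableCharger {

    // In a real service this would be a durable store, not an in-memory map.
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    public String charge(String idempotencyKey, int amountPence) {
        // If we already handled this key (say the service was restarted
        // mid-request and the client retried), return the previous result
        // instead of charging twice.
        return processed.computeIfAbsent(idempotencyKey,
                key -> doCharge(amountPence));
    }

    private String doCharge(int amountPence) {
        // ... call the payment provider here ...
        return "charged:" + amountPence;
    }

    public static void main(String[] args) {
        RerunnableCharger charger = new RerunnableCharger();
        // Running the same request twice has the same effect as running it once.
        System.out.println(charger.charge("order-123", 4200));
        System.out.println(charger.charge("order-123", 4200)); // no double charge
    }
}
```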

Why?

This matters, because when we're thinking about technology choices, when we're thinking about technology styles, we really want to be stepping back from "I'm doing Cloud-native because that's what everybody else is doing" to thinking, what problem am I actually trying to solve? To be fair to the Cloud Native Computing Foundation, who do know what they're doing, they have this why right on the front of their definition of Cloud-native. They say Cloud-native is about using microservices to build great products faster. We're not just using microservices because we want to, we're using microservices because they help us build great products faster.

Really, we do have to take that step back and say, what problem are we trying to solve? Why couldn't we build great products faster before? I think all of us are guilty of this. Sometimes the problem that we're trying to solve is that everybody else is doing it, so we have a fear of missing out unless we start doing it as well. Well, "my CV looks dull" maybe isn't the right reason to choose technologies.

Why Cloud?

I think to really get to why should we be doing things in a Cloud-native way, we want to step back and say, "Why were we even doing things on the Cloud?"

Cost

Back when we first started putting things on the Cloud, cost was the main motivator. We said, "I've got this data center. I have to pay for the electricity. I have to pay people to maintain it. I have to buy all the hardware. Why would I do that when I could use someone else's data center?"

What made for a cost-saving between my own data center and someone else's data center is that in my own data center, I have to buy enough hardware for the maximum capacity, the maximum demand. If it's someone else's data center, I can pool resources.

Elasticity

Really, the reason Cloud saves you money is because of that elasticity. You can scale up, you can scale down. Of course, that's quite old news now. We all take that for granted.

Speed

Really, the reason we're interested in Cloud now is because of the speed. Not necessarily the speed of the hardware, although we do get that too with things like GPUs and quantum computers on the Cloud, but because we can get things to market way, way faster than we could when we actually had to print them onto CD-ROMs and mail them out to people, or even when we had to stand them up in our own data center.

Why Cloud-native?

Then the question is, well, that's great, but all of that we get just with Cloud. Why do we need Cloud-native? The reason we need Cloud-native is because a lot of companies found they tried to go to the Cloud and they got electrocuted.

12 factors

This led to the 12 factors, and the 12 factors were a set of mandates for how you should write your Cloud application so that you didn't get electrocuted.

But then going back to that original conversation, the 12 factors had nothing to do with microservices. They were all about how you managed your state. They were about how you managed your logs. They weren't anything to do with microservices, so really I think the 12 factors are all about being idempotent, but they didn't want to say "the idempotent factors" for obvious reasons, so they said the 12 factors. You definitely do need a synonym for idempotent.
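Two of those factors make the rerunnability point well. A minimal sketch, not from the talk: configuration lives in the environment, and logs go to stdout as an event stream, so any instance can be killed and restarted anywhere without losing anything:

```java
// Minimal sketch of two of the 12 factors in practice: config comes from
// the environment (factor III), logs go to stdout as an event stream
// (factor XI), and nothing is written to local disk. Names are illustrative.
public class TwelveFactorish {
    public static void main(String[] args) {
        // Factor III: the same artifact runs in every environment;
        // only the environment variables differ.
        String dbUrl = System.getenv().getOrDefault("DATABASE_URL",
                "jdbc:postgresql://localhost:5432/dev");
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));

        // Factor XI: write logs to stdout and let the platform collect them,
        // rather than managing log files that die with the instance.
        System.out.println("starting on port " + port + ", db=" + dbUrl);
    }
}
```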

Then when we think about containers, containers are so good. They solve so many problems. I think when we look at some companies and they're running 100, 200, 300, 400, 500 containers, we think, oh, my application only has six containers. I must be doing it really wrong. I'm not as good a developer as them over there.

Well, no. It's not a competition. The number of containers should be tuned to what your application actually needs.

Speed

When we think about Cloud, again we want to be thinking about that speed. The reason we want lots of containers is because we want to get things to market faster. If we have lots of containers and we're either shipping the exact same things to market or we're getting to market at the same speed, then all of a sudden those containers are a cost. They're not actually helping us. If we have this amazing architecture that allows us to respond to the market but we're not, then that's a waste. If we have this architecture that means we can go fast, but we're not going fast, then that's a waste as well.

How to fail at Cloud-native

Which brings me to how to fail at Cloud-native. For context, as Crystal said, I'm a consultant. I'm a developer with the IBM Garage. We work with startups. We work with large companies. We do lean startup, we do extreme programming, we do design thinking, we do DevOps, we do Cloud-native, all with the aim of helping customers get really good apps on the Cloud.

Along the way, we see a lot of customers who are on the journey to Cloud. Sometimes that goes well, and sometimes there are these pitfalls. These are some of the scary things that I've seen over the course of it.

One of these, as I mentioned, is the magic morphing meaning: I say Cloud-native and I mean one thing, and you say Cloud-native and you mean another thing.

So, what is Cloud-native?

Sometimes that doesn't really matter, but sometimes it does. It is a problem. If one person thinks the goal is microservices and the other person thinks the goal is to have, perhaps, an idempotent system, and one of those is paying the other one's salary, that can end up really badly for one of them. That's partly about the technical and architectural definition of what is Cloud-native. Does it mean idempotent? Does it mean microservices? Then we see a similar thing with why we are even trying to get to the Cloud.

If, as an organization, we want to go to the Cloud because we think it's going to allow us to get to market faster, but some of us are not quite on the same page and are only going to the Cloud to deliver at the exact same speed as before, but in a more cost-effective way, then we have a misunderstanding.

Microservices envy

Often what drives some of this confusion about goals is that we're looking at other people, and we're looking at the amazing things they're doing, and we want to do those things ourselves without really thinking about our context and whether they're appropriate for us. One of our fellows has a heuristic for when he goes in to talk to a client about microservices. He says, "If they start talking about Netflix and they just keep talking about Netflix, and they never mention cohesion and they never mention coupling, then they're probably not doing it for the right reasons, whereas if they're thinking about these more fundamental architectural issues, then they're probably starting from a good technical foundation."

Sometimes we talk to clients and they say, "Right, I want to modernize to microservices." Well, microservices are not a goal. No customer will look at your website and say, "Oh, microservices. That's nice." They're going to look at your website and figure out if it serves their needs, whether it has the features they want, all of these other things. Microservices can be a really good means to that end, but they're not a goal in themselves. I should say as well that microservices are a means, and they're not necessarily the only means to that goal.

We talked to a bank out in Asia-Pacific. They were having a problem with their ability to respond to their customers, and they were also having a problem because all of their COBOL developers were dying because they were old. The thing that was really driving them wasn't the aging workforce, though, as it sometimes is. In this case, they were getting beaten by their competitors because their software wasn't good, and the reason it wasn't good was because they were going really slowly and they had all this COBOL. They said, "Well, in order to solve this problem we need to get rid of all of our COBOL and we need to switch to a modern microservices architecture."

We thought, oh good. That sounds good. Then they added that their release board only met twice a year. It didn't matter how many microservices they had. Those microservices were all going to be assembled up into a big monolith release package and deployed twice a year. At that point, they're taking the overhead of microservices without the benefit. Again, going back to it's not a competition to see how many containers you have.

There's a really good reason to be wary of microservices in that context, because often what we end up with isn't a beautiful microservices architecture that we can release independently. What we end up with is a distributed monolith, and the reason this is really bad is because one of the things that a monolith has is compile-time checking for your types, along with synchronous communication. That may hurt your scalability, but it means that you don't get bitten by the distributed computing fallacies. If you take that same application and just smear it across the internet, and don't put in any type checking or any error handling for the distribution, you're not going to have a better customer experience, you're going to have a worse customer experience.

There are a lot of contexts in which microservices are really the wrong answer. If you're a small team, you don't need lots of autonomous teams, because each autonomous team would be about a quarter of a person. If you don't have any plans or any desire to release parts of your application independently, then microservices will allow you to release and scale independently, but those aren't problems you actually have.

In order to get security and reliable communication and discoverability between all of these components of your application that you've just smeared across a part of the Cloud, you're going to need something like a service mesh. At this point you might be either quite advanced on the tech curve or a little bit new to it. You either don't know what a service mesh is, or you say, "I know all about service meshes. So complicated, so overhyped. I don't need a service mesh. I'm just going to roll my own service mesh instead."

This is not necessarily going to give you the outcome that you hoped for because you have a service mesh, but you have to maintain it. Another good reason not to do microservices is sometimes the domain model just doesn't have those natural fracture points that allow you to get nice neat microservices. In that case, you might say, "You know what? I'm just going to leave it."

Cloud-native spaghetti

If you don't, then you end up with the next problem, which is Cloud-native spaghetti. If you look at the communication diagram for the Netflix microservices, I'm sure they know what they're doing. They've got it figured out. But when I look at it, it looks like nothing but spaghetti. I just think, oh, that needs a lot of really solid engineering to make it work. If you don't have that solid engineering, then you end up in this situation.

I went to visit a client, and they had a bunch of microservices, and they said, "Yeah, we have this problem, which is that any time we change any code at all, something else breaks." The dream of microservices is that they are decoupled, but decoupling doesn't come for free. Decoupling doesn't magically happen when you distribute things. All that happens when you distribute things is that you have two problems instead of one. We need to think really carefully about, have I actually made the system decoupled, or have I just got a distributed monolith?

I love this illustration. It's from "Cloudy With a Chance of Meatballs", and it is spaghetti in the cloud. That is exactly what you get.

In the case where they had really bad brittleness and connectedness, what had happened was that they had quite a complex object model. They looked at this complex object model and they said, "We know it's really bad to have common code between our microservices, because then we're not decoupled. So instead we're going to cut and paste this common object model across all six of our microservices, and it's got about 20 classes and it's got about 70 fields. Because we cut and pasted it rather than linking to it, we're decoupled." Well, no, you're not decoupled.

The dream of microservices is that you have all of these modules and each one has its own little domain, and they're quite distinct. If you have one big domain and you have lots of little microservices, then there's going to be a problem.

Does anybody know what this is from? My bad drawing of it? What I can see and what you can't see is that several people in the audience just went like that at the same time. I wish I had a picture of it.

I don't blame you for not recognizing this. This is the actual picture. This is the Mars Explorer. I like this picture because I don't know if you can see from where you are, but there's things there that really looked to me like they took a bin liner and they attached it onto a framework with duct tape and then it had a sad ending. It did not have a sad ending because they used a bin liner and duct tape. It had a sad ending for another reason.

Does anybody remember what happened to the Mars Explorer? Yeah, exactly. It crashed into Mars. That was the sad end of the Mars Explorer, and the reason was even sadder, because there were two control modules for it. One was on the Mars Explorer itself, which was semiautonomous. Then there was a control unit on Earth. About every three days the planets would align so that the probe came into view, and the control unit would send it some updates saying, "Oh, I think you need to shift a bit left," and, "Oh, you're going to miss Mars if you don't go a bit right," that kind of thing.

It was built as two different systems by two different teams. As you said, one of them was working in metric units. The other was working in imperial units. They thought these systems were nicely decoupled, but there was a very significant point of coupling between them that they hadn't realized, which was the agreement on the units.

The moral of the story in this case is that distributing the system did not help. Obviously in this case part of the system was on Mars, a part of the system was on earth. They had to distribute it. It's not like, well, if they just kept the Mars Explorer on earth everything would have been fine. It would have, but it would not have achieved the business goals of getting to Mars.
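The talk's fix for this is contract tests, below, but the unit mismatch itself also shows why implicit agreements are dangerous. As a purely illustrative sketch, not anything from the actual mission software: if the units agreement is encoded in the types, mixing them becomes a compile error rather than a silent runtime failure:

```java
// Sketch only: the real flaw was an implicit agreement about units between
// two teams' systems. One way to make such an agreement explicit is to put
// the unit into the type, so mixing units fails at compile time.
public class Thrust {
    public record NewtonSeconds(double value) {}
    public record PoundForceSeconds(double value) {
        NewtonSeconds toNewtonSeconds() {
            return new NewtonSeconds(value * 4.44822); // 1 lbf*s = 4.44822 N*s
        }
    }

    static void applyImpulse(NewtonSeconds impulse) {
        System.out.println("applying " + impulse.value() + " N*s");
    }

    public static void main(String[] args) {
        PoundForceSeconds fromGroundControl = new PoundForceSeconds(10.0);
        // applyImpulse(fromGroundControl);           // does not compile: wrong unit
        applyImpulse(fromGroundControl.toNewtonSeconds()); // conversion is forced
    }
}
```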

Microservices need consumer-driven contract tests

In this case the solution, the correct thing to do, is consumer-driven contract tests. How many of you are using consumer-driven contract tests at the moment? About two or three hands. I'm really interested by this one, because whenever I talk to anybody in my team or externally, I say, "Consumer-driven contract tests, consumer-driven contract tests." The uptake is always pretty low.

Are you using Spring Cloud Contract or Pact? Pact. Yeah. That's the one I like as well, because I think Pact is such a great product. I think it solves such an important need. It is also extremely difficult to wrap your head around the first time you use it, which I think might be one of the things that slows down Pact adoption. It is really important.

If you have this system and you've lost your compile-time checking and you have an API, then you need systematic verification of the API. You can't just use Swagger and say, "Well, the fields have kind of the same names, so we're good." You do need to say, "Is my behavior when I get these inputs the expected behavior? Are the assumptions I'm making about that API over there still valid?" If not, things are going to get really bad.
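This is roughly what a Pact consumer test looks like with pact-jvm and JUnit 5. It's a sketch: the service, path, and field names are invented for illustration, but the shape, where the consumer states its assumptions and the provider's build verifies them, is the real workflow:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslJsonBody;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

// Sketch of a Pact consumer test. The consumer writes down its assumptions
// about the provider's API; Pact records them as a contract that the
// provider's own build then verifies. Names are illustrative.
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "account-service")
class AccountClientPactTest {

    @Pact(provider = "account-service", consumer = "web-frontend")
    RequestResponsePact accountExists(PactDslWithProvider builder) {
        return builder
                .given("account 42 exists")                 // provider state
                .uponReceiving("a request for account 42")
                .path("/accounts/42")
                .method("GET")
                .willRespondWith()
                .status(200)
                .body(new PactDslJsonBody()
                        .integerType("id", 42)
                        .stringType("owner", "Holly"))
                .toPact();
    }

    @Test
    @PactTestFor(pactMethod = "accountExists")
    void fetchesAccount(MockServer mockServer) throws Exception {
        // The code under test talks to Pact's mock server. If our assumptions
        // drift from the real provider, verification goes red, not production.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(
                        URI.create(mockServer.getUrl() + "/accounts/42")).build(),
                HttpResponse.BodyHandlers.ofString());
        Assertions.assertEquals(200, response.statusCode());
    }
}
```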

I think a lot of companies are aware of this risk and they are aware that when they're doing microservices, there's an instability in the system and that in order to have confidence that these things work together, they have to do a UAT phase before releasing them. Then that leads us to the scenario which is not-actually-continuous, continuous integration and continuous deployment.

I talk to a lot of customers and they'll say, "We have a CI/CD." Well, CI/CD is not a tool you buy and put in the corner and say, "There, CI/CD." CI/CD is something that you have to be doing. It stands for continuous integration and continuous deployment, or delivery. Either way, if you're not doing it continuously, then it's not continuous.

Sometimes I'll hear things like, "I'll merge my branch into our CI system next week." I'm thinking, CI stands for continuous. If you merge once a week, that's not continuous. That's almost the opposite of continuous.

Then I hear everybody talking about CI/CD, CI/CD, CI/CD, and, "We're only going to release every six months, but we're doing CI/CD." No, you're not. You're doing the D, but you forgot the C part.

Which brings us to the problem. I'm there thinking, well, we're all using this word continuous to mean releasing every six months and merging once a week. Is it my definition of continuous that's wrong, or has the industry shifted so that continuous means really infrequent?

Then there is a question about, well, how often is it actually reasonable to be pushing to master? This is a conversation we have a lot ourselves. The reason it matters is because what we're really saying is I'm doing CI/CD, but how often am I integrating?

There are clearly definitions of continuous that would be ridiculous. If you pushed to master every character, that is technically continuous, but it is also ridiculous. If you push every commit and you aim to commit several times an hour, that's probably a pretty good point. If you push every few commits, so several times a day, that's pretty good. Once a day is ok. Once a week, I think that's getting really problematic. Once you get into once a month, it's bad.

When I joined IBM, which was about 20 years ago, just for context for this story, we had a build system and a code repository called CMVC. I was working on WebSphere, and there was a WebSphere build call that happened every day, including on Saturday, to discuss the build failures. Of course, you did not want to be on the WebSphere build call. The way to avoid being on the WebSphere build call, I was advised by the senior developers, was to save up all of your changes on your local machine for six months and then push them all in a batch.

Of course, at the time, I was little. I was like, ok, that doesn't seem like quite the right advice, but sure. Of course, now with hindsight, I realize the WebSphere build call had to happen, and the reason the build was always broken was because people were saving up their changes for six months before trying to integrate them all together. Of course, that didn't work.

I'm a big advocate of trunk-based development. The technical definition of trunk-based development is that you need to be integrating at least once a day for it to count as trunk-based development. I think that's ok. Clearly, if you're doing it less than once...well, "clearly", in my world, if you're doing it less than once a day, that's going to be bad. Again, bad-bad. I think somewhere between every commit and every few commits is a really good place to be. If you're doing test-driven development, then when you get a passing test is a really good point to push. This has a lot of benefits in terms of opportunistic refactoring and that kind of thing.

Then the next question, which is probably even harder, is how often should you release? Again, there's a spectrum that you could release every push. Many tech companies do this. You could release once every two years. Traditionally, this has been the model in our industry.

I think if you're doing it every push, you need to have a really good handle on feature flags, and you need to have really good observability and SRE and that kind of thing. If you're doing it once a sprint, I think that's fairly traditional. That seems ok. Once every two years, that was what we used to do in WebSphere. It's not ideal. Once a quarter is a bit sad. If you're doing it every push, you do need to have the right skills in your team, but it can be done and it can give really good results.

Again, what I like to do is that once you get a user story completed, that's a really good point to push. You still do need to have a bunch of other things in place to support that.

How often should you test in staging?

Then there's a next question, which is, how often should you test in staging before going to production? Another way of thinking about this, really, is that if you have testing in staging before the delivery, then you've got to be testing in staging every delivery. That's got to be continuous testing in order to support the continuous delivery. Otherwise, you end up with these big handoffs and these big UAT phases. Again, that takes a lot of work, but it does have results.

If you don't do that, then you end up in the scenario of, we've done all this work, it all works great, oh, but we can't actually release this code. Well, that's just value that's sat there. That's inventory that's not getting out to customers, that could be getting out to customers if the processes were more aligned. It makes me sad.

Then the conversation is, well, why can't we release this? What's stopping more frequent deploys? Again, often it's that fear about the microservices, that we have to do integration testing, probably manual integration testing, for our microservices. I've seen customers, I think they had about 60 microservices, and they had one single pipeline for all of the microservices, to make sure that there was no possibility that some bright spark of an engineer could release one microservice without releasing the other 59 microservices. This obviously defeats the value proposition of microservices, which is that they are independently deployable, but it was the way that they felt safest.

We also see a reluctance to actually deliver because of concerns about quality and completeness. Of course, these aren't ridiculous. You don't want to anger your customers. On the other hand, if you're not embarrassed by your first release, it was too late. You do want to get things out.

Then it comes back to that question: you've got this beautiful microservices architecture that allows you to go faster, and yet you're going really slowly. What you're missing is business feedback, and you're also missing technical feedback, and you would never, ever drive a car like that. If we defer our releases, that's often what we're doing. The feedback that we get from doing a release is good engineering. It's also just good business.

If our features are too scary, there are things we can do. We can have them actually in the production codebase, but with nothing talking to them. That's pretty safe. Or we can use feature flags so that we can turn them on and off, we can do A/B testing, that kind of thing, and we can do canary deploys to make sure that there's not something really horrible. This is all a way of allowing you to get that release out to production without it being scary.
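In its simplest form, a feature flag is just a conditional wrapped around the new code path. Here is a deliberately minimal sketch of that shape, with invented names; a real system would use a flag service that supports percentage rollouts for the canary and A/B cases:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Deliberately minimal sketch of a feature flag: the new code path ships to
// production dark, and flipping the flag (here, just an environment variable)
// is a tiny, instantly reversible change that is separate from the deploy.
// All names are illustrative.
public class CheckoutService {

    private static final Set<String> ENABLED = new HashSet<>(Arrays.asList(
            System.getenv().getOrDefault("ENABLED_FLAGS", "").split(",")));

    static boolean isEnabled(String flag) {
        return ENABLED.contains(flag);
    }

    String checkout(String basketId) {
        // The new path is in the codebase and tested, but dark until the
        // flag flips; if the dashboards go red, flip it straight back.
        if (isEnabled("new-pricing-engine")) {
            return newPricingCheckout(basketId);
        }
        return legacyCheckout(basketId);
    }

    String newPricingCheckout(String basketId) { return "new:" + basketId; }
    String legacyCheckout(String basketId)     { return "legacy:" + basketId; }
}
```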

But, of course, all of those things, the automated testing, the feature flags, the A/B testing, need really solid automation. Often when I start working with a customer, we have a question about testing, and they say, "Oh, our tests aren't automated." What that really means is that at any particular point, they don't actually know if the code works. It hopefully works, but they don't have any way of knowing whether that's true.

Even if all the engineers are the most amazing engineers, other things happen: systems that they depend on might behave in unexpected ways, or a dependency update might change behavior, and then something will break even if nobody did anything wrong. That brings us back to, we can't ship because we don't have confidence in the quality. Well, let's fix the confidence in the quality, and then we can ship.

I talked about contract testing. That is really cheap and easy and can be done at a unit test level, but of course, you do also need automated integration tests. You don't want to be relying on manual integration tests.

In all of this, I think we talk now about CI/CD more often than about the build, but in both cases, it is one of the most valuable things that you have as an engineering organization. It should be your friend, and it should be a pervasive presence. Sometimes the way the build works is that it's off on a Jenkins system somewhere, and someone who is a bit diligent goes and checks it every now and then, notices it's red, and goes and tells their colleagues. What's much better is a passive build indicator that everybody can see, that nobody has to open a separate page for, and that makes it really obvious when it's red and when that's a change. A traffic light works if you have one project. If you've got microservices, you're probably going to need something more, and even if you don't have microservices, you're probably going to have several projects, so you need something a bit more complete than a traffic light. The traffic lights are so cute, though.
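As a toy sketch of that passive indicator idea: Jenkins jobs expose a "color" field ("blue", "red", and so on) through their JSON API, so even a tiny poller can make a broken build impossible to miss. The URL and job name here are placeholders, and a real radiator would watch many jobs:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Toy sketch of a passive build radiator: poll Jenkins' JSON API and make
// red impossible to miss. The URL is a placeholder, not a real system.
public class BuildRadiator {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://jenkins.example.com/job/payments/api/json"))
                .build();
        while (true) {
            String body = http.send(request,
                    HttpResponse.BodyHandlers.ofString()).body();
            // Crude string check to keep the sketch dependency-free;
            // a real radiator would parse the JSON properly.
            boolean red = body.contains("\"color\":\"red\"");
            System.out.println(red ? "*** BUILD BROKEN: FIX ME FIRST ***"
                                   : "build green");
            Thread.sleep(30_000); // poll every 30 seconds
        }
    }
}
```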

If you don't have this, then you end up with the broken window situation. I've arrived at customers and the first thing I've done is I've looked at the build and I said, "Oh, this build seems to be broken." They've said, "Yeah, it's been broken for a few weeks."

Of course, there's so many bad things with this because it means you can't do the automated integration testing because nothing is making it out of the build. Nobody else knows whether their integrations are getting worse.

It means that nobody notices if they make things worse within that particular microservice, because the build is already broken, and it creates this culture where, when one of the other builds goes red, people go, "Well, yeah. Now we've got two red. Perhaps we could get the whole set, and then it would match if we got them all red." Well, no, that's not how it should be.

That's all on the tech side. That's how we as engineers manage ourselves and our code.

The locked-down totally rigid inflexible un-cloudy Cloud

But of course, particularly once you get to an organization of a certain size, you end up with another set of challenges, which is what the organization does with the Cloud. This is where we take the Cloud and we have the locked-down, totally rigid, inflexible, un-cloudy Cloud.

How do you make a Cloud un-cloudy? You say, "Well, I know you could go fast, and I know all of your automation supports going fast, but we have a process. We have an architecture review board, and it meets rather infrequently." It will meet a month after the project is ready to ship, or in the worst case, it will meet a month after the project has shipped. Then we're going through the motions even though the thing has shipped; if it was architecturally problematic, we would have discovered it already.

Someone told me a story about a client who came to them and said, "Oh, this provisioning software doesn't work." What had happened was we'd sold them this nifty software for provisioning, I think it was virtual machines, and it promised a 10-minute provision time. We told them, "Oh, it's going to be amazing."

Then the client was using it, and they thought they were going to get a 10-minute provision time, and actually it took them three months to provision a Cloud instance. They came back to us and they said, "IBM, your software is totally broken. You mis-sold it. Look, it's taking three months." Then we went in and we did some investigation, and it turns out that another part of the organization had put an 84-step pre-approval process in front of getting one of those instances.

All of the technology was there, but the culture wasn't there, and so the technology didn't work. This is sad. We take this Cloud, it's a beautiful Cloud, it has all these properties, it makes everything really easy, and then another part of the organization says, "Oh, that's a bit scary. We wouldn't want people to actually be able to do things. Let's put it in a cage!"

Well, and then let's add a whole bunch of paperwork. That old-style governance is just not going to work, as well as being really annoying to everyone. It's not going to give the results. It's not actually going to make things more secure; it's probably going to make them less secure. It's definitely going to make them slower and cost money, so we shouldn't be doing it.

I was talking to another client, a large automotive company, and they were having a real problem with their Cloud provisioning. It was taking a really long time. They thought, "The way we're going to fix this is we're going to move from Provider A to Provider B." That was going to work great for a while, but of course, at some point all of their governance team were going to notice that they were on Provider B and put the same regulation in place there, and they would have had all the cost of changing but not actually any of the benefit. It's a bit like looking at a stove, and I have sometimes been tempted to do this, confession, and saying, "Oh, that oven is really dirty. I'm going to move house so I don't have to clean the oven." But then, of course, the same thing happens to the new oven, so you need a more sustainable process than just switching provider to try to outfox your own procurement.

If the developers are the only ones changing, if the developers are the only ones going Cloud-native, then it's just not going to work. Of course, it goes the other way as well. If it doesn't have some governance around it, then Cloud becomes the mystery money pit. A lot of people have this problem where they look at their Cloud bill and they're like, "Hmm. Yeah, that's large, and I don't really understand where it's all going or who's doing it."

Of course, the Cloud makes it so easy to provision hardware, but that doesn't mean the hardware is free. Someone still has to pay for it, and it is your organization. Another problem is that it doesn't mean the hardware is useful.

When I was first learning Kubernetes, I started out and I created a cluster, but then I got sidetracked. I had too much work in progress. I came back to it after two months, and what I hadn't really realized until I came back was that I was paying about £1000 a month for this cluster that was just completely value-free. I think there had been a vacation in there as well. It was August. I was lying on the beach.

A lot of what our technology allows us to do is to make things really efficient that we shouldn't even be doing. If we have Kubernetes clusters that have no value, that's not good. And now, as well as being expensive, there's an ecological impact as well. Having a Kubernetes cluster consuming £1000 worth of electricity in order to do nothing is not very good for the climate.

For most of these problems, the problem seems like a technology problem and it's actually a people problem. I think this one is a little bit different. I think this one seems like a people problem and is actually a technology problem because this is an area where tooling can help. The tooling for this, it's getting more mature. There's more coming up and there is a lot of tooling that can help you.

Cloud to manage your clouds

Of course, this tooling ends up being on the Cloud, so then you end up in the recursive situation where you have to have some Cloud to manage your clouds. We have a multicloud manager that will look at your workloads, figure out, for the shape of each workload, which provider is financially optimal, and then do that move for you automatically. Then we'll probably start to see more and more software like this, where it looks at things and says, "By the way, I can tell that there's actually no traffic to this Kubernetes cluster that's been sat there for two months. Why don't you fire Holly?"

Microservices ops mayhem

One of the things that we see with this growing complexity in terms of the cost of Cloud is that there's also a growing complexity in terms of ops. We're using more and more Cloud providers. There are more and more Cloud instances springing up. We've got clusters everywhere, so how on earth do we do ops for this? This is where SRE comes in: site reliability engineering.

SRE

There's a whole bunch of goals for site reliability engineering, and I am not a site reliability engineer, but the goal of SRE is to give us that safety blanket, that reassurance, and that modern style of ops, which means that releases can be deeply boring and we can do them a lot.

The reason that we can do them a lot is because we have confidence in the recoverability, and it's the SREs who give us that confidence in the recoverability.

I've got another sad space story. This is a Russian space story. In the '80s there was a little space probe called Phobos, and they had an automated checker. At that time it was machine code, so there was no compilation or anything like that, but there was something that was the equivalent of a linter for the machine code.

Every time an engineer made a change, they'd run it through the automated checks and then do a push to the space probe. I think it was actually a Friday evening. They looked, and the automated checker was broken, and they said, "Oh, but I really want to do this change. I'll just bypass the automated checks and push my code to the space probe, because of course my code is perfect."

What happened was a very subtle bug. They had forgotten one zero in one of the instructions. The effect of it was this: the probe has those blue fins, which turn to orient towards the sun so that it can stay charged. What the one-character change did was disable that turning. Everything worked great for about two days, because they didn't have an automated battery indicator. Then at some point the whole thing ran out of power, and once it ran out of power, there was nothing they could do to revive it, because the whole thing was dead.

That is an example of a system that is completely unrecoverable. Once it is dead, you are never getting it back. You can't just re-image the space probe back to the way it was before, because it's up in space.

When we have systems like this, they're really unrecoverable. I think a lot of us believe that all of our systems are almost as unrecoverable as the space probe, and certainly I think a lot of our management think these systems are as unrecoverable as the space probe. That is clearly a bad situation to be in, even if it is sometimes necessary.

Where we really want to be is at the top end of the spectrum, where we can be back in milliseconds and we've got no data loss. If anything goes wrong, it's just, ping, it's fixed. Well, that's really hard to get to, but there are a whole bunch of intermediate points.

If we're fast in recovering but data is lost, that's not so good, but we can live with it. If we have handoffs and manual intervention, then recovery is going to be a lot slower. When we're thinking about deploying frequently and deploying with great boredom, we want to be confident that we're at that upper end. The way we get there: handoffs, bad; automation, good.

Ways to succeed at Cloud-native

That's a whole bunch of misery about all of the things that I've seen go wrong. Of course, it wouldn't necessarily be helpful to say everything goes wrong all the time, because a lot of the time things do go really right, and when we develop software in this way, it can be so much more pleasant for the engineers, it can go out faster, and we can spend less time on the toil and the drudgery and more time on the things that we actually want to be doing.

The way we achieve that is that when we think about Cloud-native and when we think about what characteristics of Cloud-native we want, we do have to have alignment across the organization. We can't have one set who are saying microservices and one set who are saying fast and one set who are saying old-style governance. That's probably not going to work, and when we make these decisions about what we're trying to achieve, we do want to be optimizing for feedback, so making sure that we get those feedback loops as short as possible because that's good engineering.

Questions and Answers

With that, I will leave it. I think I've got about four minutes for questions. Thank you.

Facilitator: Thanks, Holly. Any questions from the audience? Anyone? Ok.

Participant 1: I've heard a lot about feature management in other talks as well. It's based on postponing your real [inaudible 00:45:34] life a little bit, because now you push more often, but you still disable it.

Cummins: Yeah. I think that's right. I think feature management isn't a silver bullet, and there are two problems. One, as you've identified, is that you've got all of the technical pieces of the go-live right, but you're not actually getting any feedback. You're not getting any performance feedback, you're not getting any load feedback, and you're not getting any user feedback. Then the other problem is that every now and then you look at the post-mortem of an incident, and it says, "Yeah, we had a feature flag that covered 95% of the impact of this function."

There is that fear that an impact may not be caught by a feature flag, and then you actually do have a bit of a disaster. I think we do have to have that openness that it's not a silver bullet. But again, if we're clear about what we're trying to achieve, we can say, "Well, this means that I don't have to have a whole bunch of ceremony about my release, or certainly not at the technical level, because it is going live, and so I can save my ceremony for the decision to flip that switch. Because it's such a small switch flip, and because I'm so good at releasing, if the switch flip turns out to be an unmitigated disaster, then I've got the recoverability." Twenty seconds later I see all my dashboards go like that, and then I just flip it back again, and I can get that fix out really fast, because I've got the recoverability and I've got the boring release.

That one will be slightly more stressful because everything's just gone wrong, but at a technical level, it's still boring.

Facilitator: Any other questions? Ok. Thank you, Holly.

Cummins: Cool. Thank you.

 


 

Recorded at: Sep 22, 2020
