
Curating a Developer Experience - A Hands-on Guide for Platform Engineers


Summary

Andy Burgin explains what the customer experience team did in one of their projects, starting from scratch, and how they gathered feedback and recommendations from across the business.

Bio

Andy Burgin is a Principal Platform Engineer at Flutter UK and Ireland. He considers himself a Kubernetes and DevRel fettler. He is a small part of the organizing team for DevOpsDays London and ran the DevOps meetup in Leeds for almost a decade, hosting over 50 events. He's attended and spoken at a bunch of DevOps conferences and in his own words is "an all-round DevOps nuisance".

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Burgin: I'm going to talk to you about curating a developer experience. My name is Andy. I'm a big fan of Kubernetes and YAML. That's me as a Kubernetes resource, for the platform engineers amongst you. I work for a company called Flutter, UK and Ireland. We are home to some of the biggest brands in sports betting and gaming.

We are not the company that does the mobile app framework. I've had a few people ask me about that, but that's not us. I work as a principal platform engineer in the container platform squad at Flutter. I spend most of my time developing and building Kubernetes clusters, but with a focus on developer experience.

Developer Experience (DevEx)

I'm sure you have heard of developer experience. At the minute, there's a little bit of a vogue interest in it. I'm sure you have your thoughts and ideas on its perception. What I need to do for the sake of this talk is align everyone on where I'm coming from on this. Many people might think that developer experience is something to do with developer productivity. It might be something to do with developer portals, or indeed something called developer platforms, which I think are platforms.

I don't think there's a lot of difference between them. There's been a whole bunch of research recently. There's a lot of research being put out in the last 18 months by a company called GetDX, working with Nicole Forsgren of the DORA metrics framework, The State of DevOps Report, and the SPACE framework as well. It's interesting to see how the SPACE framework has evolved into this analysis of developer productivity. There are some very compelling numbers there to indicate that developer experience and developer productivity are very closely correlated. I'm not sure I'd buy into that. I see developer experience as getting people to use your stuff, and a byproduct of that is developer productivity.

That's mainly because I don't work in finance. For me, developer experience is more jigsaw pieces in the DevOps puzzle, creating empathy and collaboration between teams across those silos, and really solving the DevOps problem we've all been working on for at least the last decade. When I read things like the CNCF maturity model, to me, it just screams DevEx all the way through it; everything it talks about, I see as a DevEx enabler. For me, this is the journey we set off to solve in 2019. I'm going to talk about my experience and my company's experience of implementing DevEx.

Of course, we started this half a decade ago, long before any of this research was there. This was really the endpoint of the journey we were heading for, this compelling internal product. That's what we're trying to do. Ana's talk was very much from a strategic, non-individual-contributor perspective. This talk is very much about what you as a platform engineer can do to create a developer experience. I'm going to explain all the practical things we did. I've got a bit on antipatterns. It's really important that you understand those, as well as all the things we did.

The Problem (2016)

Every journey has to have a starting point. For us, we're going to go back to 2016 when it all began. If you'd come to QCon in 2016, you would probably have had conversations with people about running Docker in production. Did anybody have those discussions back then? Because you were a maverick if you were doing that. You were cutting edge. You were bleeding edge. Back at work, we wanted to do that, we wanted to run containers in production. Obviously, we didn't want to do it in a maverick way, we wanted to do it with a proper container orchestration platform.

This document on your screen is the initial proposal document for the Bet Tribe's next-gen hosting platform. If you look at it, it doesn't mention containers. It doesn't mention any tech, really. It talks about business value. Outside of the squad, there should be no humans involved in the value stream between feature and customer. That sounds a little bit like DevEx to me. Getting to a point where you are releasing code to your customers quickly and safely, without having to raise tickets with other teams for networks or storage or compute. After many POCs and evaluations, the small team that was put together in 2016 settled on Kubernetes. I think that was a very good choice. The platform itself, just as a bit of background from the technical perspective, predated all of the turnkey solutions.

There was no EKS, there was no GKE, no AKS. None of those tools were there. We built things the hard way, primarily built out of makefiles, Terraform, and Container Linux, still largely the foundation of it even today. It was built out initially in AWS and later on-prem. We worked with a customer to create this platform. Building platforms without customers is a recipe for shelfware, and we didn't want to do that. We worked closely with the team to implement a solution for them. We followed the platform as a product paradigm all the way through this. Not only was it a Kubernetes engine, it also had integrations and automation around firewalls, load balancers, storage, logs, monitoring, and a whole bunch of other stuff, DNS, certs, eventually. We rolled this out, and the platform worked. Slowly over time, we had other development teams come on board. It's going really well. We put RBAC in place at this point, because we're now multi-tenant, and it's all going swimmingly.

If we go fast forward to 2018, it's a really big success. Development teams preferred using the Kubernetes platform to the VM infrastructure and infrastructure as code. I joined the team. I was very surprised when I joined the team. None of the engineers in the platform team or in the container platform team were particularly happy. We'd got a product which was a great success, lots of adoption. Why are we not happy about this? It turns out, really, we were struggling with a whole load of growing pains. Our ways of working didn't scale. We were treading on each other's toes in the code bases. All sorts of unhappy engineers as a result.

In 2019, we made some changes. We split the container platform team into three squads, or three teams. We created an engine team who looked after the building and the running of the cluster. We had a capabilities team that looked after all of those integrations, so logging, monitoring, storage, firewalls, load balancers. Then, we created a little bit of an experimental team, this one here. We called it a customer experience team. You could equally have called it the developer experience team, because our customers are developers. They are the people using our platform. We were a bit of an experiment. Nobody's done this before. We're going to try, and if we fail, we'll do something else with it. We can fail safely if we need to. I got involved in heading up this team. We were basically given a blank sheet of paper, a free remit to go and do DevEx. Where do we start? What do we do? We've got to work that out.

Baseline - State of Container Platform Report 2019

The first thing we do is we think, what would The DevOps Handbook tell us to do? A lot of wisdom in there would tell us, you can't work out if you're improving things unless you measure them. We set out to create a State of Container Platforms Report, a very small version of The State of DevOps Report. To do that, we put together a survey, easy to fill in. Only a couple of checkboxes and a few options, and a freeform text field. We sent that out to roughly about 350 engineers. Not all of them are using our platform, so we're not expecting a massive return on that. We get 31 responses, so roughly about 10%, which we're quite happy with. We would have been happy with 5%. The format of the survey asked teams which area of the business they worked in.

We made it anonymous, because we wanted to hear the unpleasant as well as the pleasant, so we got honest feedback. We asked which clusters the teams were using, because we had a couple of different clusters. We asked which of those capabilities teams were using. Were they using the DNS automation, the firewall automation, the load balancers? Then we put in something called usability statements. These proved to be very useful. A usability statement goes something like, I find doing X on the cluster easy: do you strongly disagree, disagree, agree, strongly agree? There are about nine of these for various features of the cluster.

I'll go through the numbers when we do the year-on-year comparison. The TL;DR on this is the numbers were really good. We got lots of amazing feedback in the freeform fields. We did get some feedback that wasn't exactly negative, but was certainly constructive. They told us stuff we already knew: our documentation was a mess, and our onboarding process was somewhat complex. We knew that. We had something to aim at, and we've now got a baseline for what we do next.

2019 - 2020 (Forming)

What do we do? That's a very good question. What do we do as a team? We thought we should probably write that down. In my naivety, I found this thing called an OKR template. I didn't know what one was at the time, but I do now. I filled it in very badly. What we were going to use this for was really as a mission statement and areas of focus. You can see from here, we've said we're going to engage, empower, and support our customers. Under those three areas, there are various sub-areas we could focus on, and then some actual tangible things we could do. This was super powerful, because it made us as a team know what we were meant to be doing. If we were asked to do anything that really wasn't on that list, we really shouldn't be doing it. We had a really good starting point, and it was super important that we did that.

Who are our customers? Who are these development teams? We know the ones that wander up to our desk, in the days when you could wander up to desks and interrupt us. We knew the ones which would ping us on Slack. That's not everyone. There's quite a lot of engineers using this cluster. Who are they? We have a chatbot mechanism for raising support requests. We don't raise tickets; we go into the support channel and run a command that will summon a support engineer. You'll hopefully be able to exchange words and solve the problem, and then close off the help request. In the background, all that transcript goes into a Jira ticket. Nobody uses it, you use the chatbot interface. It's there as a record. We realized we've got 3 years' worth of these Jira tickets from when we created the platform. We thought, if somebody has raised a ticket, they must be one of our customers.

Those are the developers that are using the platform. We use that information. The first thing we did was have a bit of a run at analyzing those Jira tickets, and we created an FAQ. We didn't have one. We took 3 months' worth of tickets, identified the frequently asked questions, and wrote some frequently answered questions. We put that on our wiki and signposted it in the support channels, hopefully reducing a bit of toil.
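To make the ticket-mining step concrete, here is a minimal sketch of surfacing FAQ candidates from support tickets via the Jira REST API. The Jira URL, project key, credentials, and keyword list are hypothetical placeholders, not the team's actual tooling.

```python
# Minimal sketch of mining support tickets for FAQ candidates.
# The Jira URL, project key, credentials, and keyword list are placeholders.
from collections import Counter

import requests

JIRA_URL = "https://jira.example.com"        # hypothetical instance
JQL = "project = PLAT AND created >= -90d"   # hypothetical project key

def fetch_ticket_summaries(auth):
    """Page through the Jira search API and return ticket summaries."""
    summaries, start = [], 0
    while True:
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={"jql": JQL, "fields": "summary", "startAt": start, "maxResults": 100},
            auth=auth,
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        summaries += [issue["fields"]["summary"] for issue in data["issues"]]
        start += len(data["issues"])
        if not data["issues"] or start >= data["total"]:
            break
    return summaries

def top_topics(summaries, keywords):
    """Count how often each candidate FAQ topic appears in ticket summaries."""
    counts = Counter()
    for text in summaries:
        lowered = text.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts.most_common()

if __name__ == "__main__":
    tickets = fetch_ticket_summaries(auth=("svc-user", "api-token"))  # placeholder credentials
    for topic, count in top_topics(tickets, ["rbac", "ingress", "dns", "quota", "logging"]):
        print(f"{topic}: {count} tickets in the last 90 days")
```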

We now had a list of people that used our cluster. That really wasn't very useful. Where did they work? Which teams were they in? We went diving into the various wikis, and found some org charts. We were very surprised some of them were even up to date. What we did is we printed them off really big on multi-page A3, and then we glued them together and we stuck them on a whiteboard. Then what we did was we took some whiteboard pens, because we're super advanced, and we marked on everyone on the org chart that had raised a helpdesk ticket, and suddenly we could see the areas of the business that were using our stuff.

We'd already started doing some training at this point. In another color, we marked on there who'd had training. Suddenly, we got this overlay of who was asking for help and who'd already had the training. We could see where the intersection was, or wasn't, and where it should be. We could even see areas of the business that weren't using our stuff at all, which was good, because we could do some marketing to them. We created a very primitive form of CRM system, but it let us know who we should be talking to.

Which leads us on to the next thing, and the thing we as techies do not really like doing: we have to go and talk to these people. I don't think I've ever had a job interview where people have asked me what my engagement skills are like with other people, or how my meeting administration is. What I think is that, as techies, standups and retros, we are very good at them. We know exactly the format. We know what good looks like there. Meetings where we try to decide things and get stuff done? Maybe it's just the people I work with, but I don't find we're quite so good at those.

We don't tend to like sending out agendas. We don't tend to chair a meeting. We don't take minutes, and follow-up actions sometimes seem like they just happen by luck rather than judgment. I don't know if that's your experience. What we decided to do was create meetings with these customers we'd identified. We've got four tribes, so a meeting with each tribe every month. We went for monthly cadence rather than quarterly. We thought we'd be able to create some engagement and relationships if we did that. Four meetings each month. We had a set structure to this. We made it open invite. Anyone could come, even if they didn't work in that tribe. We wanted to make sure we had an open-door policy.

The format of the session would be, we'd go through the actions from the previous meeting, so we can see what people have done and whether they followed up on stuff. Then we have updates from the container platform team. What have we been working on? What features are coming to the cluster? Are there any maintenance windows? Are there any updates they need to know about? Then we flip it round, and we get feedback from the tribe. We ask for updates about what they're working on. Any problems they've had? Any blockers? Anything we can do. We also try and get a steer on capacity: what have you got coming up? Being a sports betting company, we've got various football and other sporting events that can require a lot of resource and cause spikes in traffic. We try and get a steer on that as far in advance as we can.

Then, we actually write up all of that meeting: we minute it, attribute who said what in the meeting, and write some follow-up actions as well. Then we make those minutes public so that anyone who was in the meeting can see what their follow-up actions were. If they weren't in the meeting, they'll probably get notified anyway that somebody has delegated something for them to do, and they probably should have come to the meeting to say no. More importantly, this was a mechanism we could take back to the engine team and the capabilities team, and we could talk to them about what people were using their stuff for.

What were they doing on the cluster? Is there anything we can help with? Occasionally we would create feature tickets to get work done to improve the cluster for our customers. Some things would go on our roadmap about future updates, and maybe larger pieces of work. Also, we could hook engineers up between the tribes and the capabilities or the engine team to go and work together. We were using it to create these feedback loops. The other thing about the minutes being in the open was that they gave our boss visibility of what we'd been doing. Remember, we're an experiment, so this gives us accountability. They can see what we're doing, and judge if we're delivering any value for them.

2019 - 2020 Roadmap

We also did some actual technical work as well, because we had certain things we needed to implement for our customers. We had some feedback that our onboarding process was difficult. We put some automation around that to simplify it a bit. It's still complicated even today. The main thing we did was work out, after 3 years of organic growth, who owned everything on the cluster, because it turns out that the spreadsheet someone created 18 months ago hadn't been updated quite as regularly as it needed to be. We went through the arduous process of finding owners for everything on the cluster. Rather than putting it in a spreadsheet, what we did is we put metadata on the namespaces, so we actually use that as a source of truth.

We put labels on for which tribe owned this workload, which squad owned this workload, and which Slack channel we needed to use if there were any problems with the workloads in there. This was incredibly useful, even though it was time consuming, because what it let us do was use that metadata programmatically in the cluster. The first thing we did was address the problems we had with our logging pipelines. They could get very noisy, particularly in test environments with noisy neighbor syndrome. We sharded our logging pipeline and created a pipeline per tribe. Then, the log shipping software would use the label for the tribe on the namespace of the workload to ship that log message to the correct pipeline. That worked really well. We used it for loads of other stuff as well.
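As an illustration of that metadata approach, here is a minimal sketch using the official Kubernetes Python client to stamp ownership labels onto namespaces and then group namespaces by tribe. The label keys, namespace names, and values are hypothetical, not the platform's real scheme.

```python
# Minimal sketch of stamping ownership metadata onto namespaces and reading
# it back. Label keys, namespaces, and values are hypothetical examples.
from collections import defaultdict

from kubernetes import client, config

OWNERS = {
    # namespace: (tribe, squad, slack channel) -- illustrative data only
    "checkout-test": ("bet", "checkout", "#bet-checkout-support"),
    "wallet-prod": ("gaming", "wallet", "#gaming-wallet-support"),
}

def label_namespaces(v1):
    """Patch each namespace with tribe/squad/slack-channel labels."""
    for ns, (tribe, squad, channel) in OWNERS.items():
        body = {"metadata": {"labels": {
            "example.com/tribe": tribe,
            "example.com/squad": squad,
            "example.com/slack-channel": channel.lstrip("#"),  # label values can't contain '#'
        }}}
        v1.patch_namespace(ns, body)

def namespaces_by_tribe(v1):
    """Group namespaces by the tribe label, so tooling such as log sharding
    or cost reporting can route per tribe."""
    grouped = defaultdict(list)
    for ns in v1.list_namespace().items:
        tribe = (ns.metadata.labels or {}).get("example.com/tribe", "unowned")
        grouped[tribe].append(ns.metadata.name)
    return grouped

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running in the cluster
    core = client.CoreV1Api()
    label_namespaces(core)
    for tribe, namespaces in namespaces_by_tribe(core).items():
        print(tribe, namespaces)
```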

Let's talk about numbers. It's the end of our first year. We've sent that survey out again. We got 31 responses to start with; this time we've only got 25. July 2020 was rather different to July 2019. There's a little bit of survey fatigue around at the minute. We're not too upset by that. Again, the overall feedback we get is positive. We start seeing these usability statements as particularly useful for trend analysis and as an indicator of what people think of the cluster. You remember these usability statements: I find doing X on the cluster easy; strongly disagree and disagree are red, agree and strongly agree are green.

Although there's an overarching green trend here, you can see in the second year, there's not too much of a difference. Most notably, our documentation, which is this one here, has been flagged as actually getting worse. We've probably got some work to do in the year ahead. These were incredibly useful to give us steers on what people thought of the individual features, rather than just what they were using.
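For illustration, here is a minimal sketch of the kind of tally behind those red/green trend charts: the share of respondents who agree or strongly agree with each usability statement, compared year on year. The statements and response counts below are invented.

```python
# Minimal sketch of turning usability-statement responses into a
# year-on-year "agree rate" per statement. Statements and counts are invented.

RESPONSES = {
    # statement: {year: {answer: count}}
    "I find deploying a workload on the cluster easy": {
        2019: {"strongly disagree": 1, "disagree": 5, "agree": 18, "strongly agree": 7},
        2020: {"strongly disagree": 1, "disagree": 4, "agree": 14, "strongly agree": 6},
    },
    "I find the platform documentation easy to use": {
        2019: {"strongly disagree": 2, "disagree": 8, "agree": 16, "strongly agree": 5},
        2020: {"strongly disagree": 3, "disagree": 10, "agree": 9, "strongly agree": 3},
    },
}

def agree_rate(counts):
    """Fraction of respondents who agree or strongly agree."""
    total = sum(counts.values())
    positive = counts.get("agree", 0) + counts.get("strongly agree", 0)
    return positive / total if total else 0.0

for statement, by_year in RESPONSES.items():
    trend = {year: round(agree_rate(c) * 100) for year, c in sorted(by_year.items())}
    print(f"{statement}: " + " -> ".join(f"{y}: {p}%" for y, p in trend.items()))
```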

2020 - 2021 (Norming)

Into our second year, we decide to change our tack a little bit here. Rather than reaching out to teams and saying, how can we help you to make the cluster better? We actually change the wording a little and say, how can we all work together to make this cluster better? Try and create a little bit of shared responsibility, because we want developers to do the right thing on the cluster. What is the right thing? We write that down. What we do is we work with the tribes, and we come up with a series of best practices, of what good looks like for a workload on the cluster. We break these into three areas: build, deploy, and run. We prioritize these by MoSCoW, so must have, should have, could have, won't have, that's what they are. We got some ranking on them.

What we realize is, for the run ones, there are about 20 of them, and we can codify these using Open Policy Agent. Then we can use Gatekeeper to record the status of each of the workloads against that policy. Then we wrote an exporter that would take that data and stick it in our metrics store, and we used those labels of tribe, squad, and namespace, so we could break the results down by them. We produced a dashboard which would allow us to look at the state of workloads on the cluster, and slice and dice it by the various checks or by the tribe or by the squad. Incredibly powerful. We could track workloads which were misbehaving and reach out to teams if we needed to. Just to be clear, on these best practices, we didn't write them. We collated them, we edited them, and we published them, but they were put together by the people using our cluster. This isn't us enforcing rules. This is us getting them to tell us what they think good looks like, getting buy-in because they wrote them.
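As a sketch of how such an exporter might look, the following reads Gatekeeper audit results from constraint custom resources and exposes per-namespace violation counts for a metrics store to scrape. The constraint kinds and port are hypothetical, and this is not the team's actual exporter.

```python
# Minimal sketch of an exporter that reads Gatekeeper audit results from
# constraint custom resources and exposes a per-namespace violation gauge.
# The constraint kinds listed are hypothetical "run" best practices.
import time

from kubernetes import client, config
from prometheus_client import Gauge, start_http_server

# Hypothetical constraint kinds (lowercased plural resource names).
CONSTRAINT_PLURALS = ["k8srequiredresources", "k8srequiredprobes"]

VIOLATIONS = Gauge(
    "bestpractice_violations",
    "Gatekeeper audit violations per constraint and namespace",
    ["constraint", "namespace"],
)

def scrape(api):
    """Read each constraint's audit status and update the gauge."""
    for plural in CONSTRAINT_PLURALS:
        constraints = api.list_cluster_custom_object(
            group="constraints.gatekeeper.sh", version="v1beta1", plural=plural
        )
        for item in constraints.get("items", []):
            name = item["metadata"]["name"]
            per_ns = {}
            for violation in item.get("status", {}).get("violations", []):
                ns = violation.get("namespace", "cluster-scoped")
                per_ns[ns] = per_ns.get(ns, 0) + 1
            for ns, count in per_ns.items():
                VIOLATIONS.labels(constraint=name, namespace=ns).set(count)

if __name__ == "__main__":
    config.load_incluster_config()  # assumes the exporter runs in-cluster
    custom = client.CustomObjectsApi()
    start_http_server(9100)         # scrape target for the metrics store
    while True:
        scrape(custom)
        time.sleep(60)
```

The tribe and squad breakdown can then be joined from the namespace labels described above when building the dashboard.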

Then, cost is a thing. We produced some rudimentary cost reports in Excel, because Excel is good. Again, we take the cost information out of AWS, and we use the usage data, which is broken down by tribe, squad, and namespace, because we've got those metrics. We can produce a retrospective report for the last 30 days, which is better than nothing. It's not ideal, but it means we can put the visibility of cost in front of our users. We can introduce the compliance metrics, as we call them, based on the best practices, and the cost information into our meetings.
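Here is a minimal sketch of that kind of retrospective 30-day cost report: pulling cluster cost from AWS Cost Explorer and apportioning it by each tribe's share of CPU requests. The cost-allocation tag and the usage shares are illustrative assumptions rather than the real reporting pipeline.

```python
# Minimal sketch of a retrospective 30-day cost report: pull the cluster's
# AWS cost from Cost Explorer and apportion it by each tribe's share of
# CPU requests. Tag key/value and usage shares are illustrative assumptions.
import datetime

import boto3

# Hypothetical: fraction of total CPU requests per tribe, taken from the
# usage metrics broken down by the namespace labels.
CPU_REQUEST_SHARE = {"bet": 0.45, "gaming": 0.30, "retail": 0.15, "shared": 0.10}

def cluster_cost_last_30_days(tag_key="cluster", tag_value="container-platform"):
    """Total unblended cost for resources carrying the (hypothetical) cluster tag."""
    ce = boto3.client("ce")
    end = datetime.date.today()
    start = end - datetime.timedelta(days=30)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Tags": {"Key": tag_key, "Values": [tag_value]}},
    )
    return sum(
        float(result["Total"]["UnblendedCost"]["Amount"])
        for result in resp["ResultsByTime"]
    )

if __name__ == "__main__":
    total = cluster_cost_last_30_days()
    for tribe, share in CPU_REQUEST_SHARE.items():
        print(f"{tribe}: ${total * share:,.2f} ({share:.0%} of CPU requests)")
```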

We can have a little section after we've all exchanged the information about what we're all up to, and talk about the actual data. We can do trend analysis on this. We can flag up workloads which may be costing a little bit more than they did the month before. Is there something up? We can flag up any workloads which are tagged as severe in terms of criticality and compliance metrics. Teams can go away and look at those. We can do trend analysis over time.

We're getting to the stage where we could start using SLIs, SLOs, and SLAs based off these numbers. We never got to that point, but we could have extended it into that. Our survey said, your documentation isn't very good, so we decided to fix that. Not a fun process, there were over 500 documents, but we went through them. We removed the duplicates. We updated the outdated ones. We added ones which were missing. We restructured it by maintainer, by customer, and by product. It was so much better afterwards, a lot of work, but glad we did it.

That brings us on to the survey of 2021. We're back up to 30 responses this time. It's gone back up, but it should be a lot higher. We've got a lot more people using the cluster. Despite our best efforts of pestering and nagging people, and nagging people's boss and their boss's boss, and their boss's boss's boss, we still only get 30 responses. We're a little bit disappointed with that, but, as you can see from the usability statements, it's getting greener, which is a very good statistical analysis technique. Things are improving. We're really happy about that.

2021 - 2022 (Storming)

Into our next year. This is more of what we call gardening rather than policing. We're very much encouraging teams to do the right thing. We're not going to come down heavy handed. We're going to encourage teams to help in this mission to make the cluster a better place for everyone. To do that, we decided to build some tooling that shifts left, that makes it easy to shift left. We're not just YOLOing things at people and saying, your problem now. We're actually building tooling which helps people. Our cost reports were in Excel: hard to distribute, immediately out of date, and only showing the previous month.

What we did is we pulled the cost data out of AWS into our metric store, and now we can build a dashboard which shows, by tribe, squad, and namespace, the actual cost metrics for each of the workloads on the cluster. It wasn't completely real time, it was about 24 hours out of date. It uses the same dashboard system that our developers use to build their own dashboards for their applications. We're moving the data source as close to them as possible, so they can use it if they want. We also discovered quite by accident that another team was using our compliance data. They have a little bot that runs in their platform team's Slack channel. As part of the output from that, as well as upcoming changes or anything important to them, they've included the compliance metrics based on our data, which they've pulled across. I thought that was cool. We didn't even know about that at the time.

What I haven't spoken about so far is developer tooling. When we created the cluster in 2016, there weren't many options for developer tooling. We built out some basic pipelines with that first customer. Really, culturally in the organization, there was a whole bunch of autonomy. We didn't really feel as though it was our job at the time to get involved with their SDLC. Rightly or wrongly, over time, the teams developed various ways to deploy their applications using their own tooling. What came out of this, through lots of evolution over a couple of years, was an inner source framework for a golden path, originally built by the Bet Tribe.

This wasn't built by us, even though we did have some input on it. This was built by a platform team in Bet, working with their development teams, to develop a platform for developers. It had a series of managed base images which were rebuilt weekly, with a couple of different language runtimes. Teams could use that base image to build another image with their application in. Then there's a Helm chart, which wraps all the complicated Kubernetes stuff, and will take that image and deploy it onto the cluster through pipelines, which are created as part of this. It abstracts away a whole load of technical stuff like service meshes, so they don't have to worry about that.

One of the really compelling things about this, though, was that when the team put this golden path together, they worked with the SLM (service lifecycle management) teams, and got parts of the acceptance into service process, which all workloads have to go through, pre-blessed. If you use the golden path framework, it meant that you didn't have as much red tape to do to get your application live. I think that's probably quite appealing to engineers, having to do less paperwork. This was adopted into inner source and was used by lots of teams around the business. Some cross-tribe adoption there.

Training is super important to the success of our platform. We'd already started doing some basic training long before we were doing any developer experience work. We did a one-day workshop where we taught you how to build a Kubernetes application. The first thing we insisted on you being told about was Twelve-Factor Apps, and that you should build your applications in that way. Then, we added another one-day workshop after the successes of that, where we gave you an application and we gave you access to our test cluster. If you were on the course, you would deploy this application to the test cluster, and we would teach you how to add storage, load balancers, monitoring, logging, and a little bit of load testing as well.

The theory being, if you had been on these two workshops, you would probably be reasonably confident about deploying an application to live on our clusters, only asking us for minimal support. We also did a theory course as well, because not everybody wants to spend two days of their life elbow deep in YAML. To date we've had 507 people through these courses. You could argue it's about 250 people, because some of them have done both the basic course and the advanced course, but that's still nearly 3,000 person-hours of training. We always send a survey out afterwards, to see what people thought.

The Net Promoter Score we get off the back of that is always incredibly high, usually somewhere between 90% and 96%. We're super proud of that. Our training got so popular at one point that we actually had a tribute act doing my course in another tribe, which was weird: they took our training material and ran it themselves. I think that was a compliment. We sent out a survey this time, and we have disappointingly only got 19 responses. We did try to get that number up. You can see from the usability statements that it's gone particularly well; it's a success, and so are our DevEx efforts.
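For reference, a Net Promoter Score is the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6). A minimal sketch of the calculation, with invented ratings:

```python
# Minimal sketch of calculating a Net Promoter Score from 0-10 workshop
# ratings: % promoters (9-10) minus % detractors (0-6). Ratings are invented.

def net_promoter_score(ratings):
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

post_workshop_ratings = [10, 9, 9, 10, 8, 9, 10, 10, 9, 7, 10, 9]  # illustrative
print(f"NPS: {net_promoter_score(post_workshop_ratings):.0f}")
```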

2022 - 2023 (Scaling Back)

Coming into the next year, what do we do? This might sound a little counterintuitive, particularly for those of you that are thinking about riding the DevOps hype curve at the minute, but we decided to scale it back. There were certain signs that this wasn't delivering what it did previously. Our customers were very different. They weren't like this anymore, like they were in 2019. We were more like this now. We worked quite a lot with these platform teams inside of the tribes. We didn't really work so closely with the end developers as we did. What that meant, if we looked at the evidence, the qualitative metrics, was that we could tell from the meetings that we were having the same conversations with the same people. It was with those platform teams. It wasn't with the developers anymore.

We'd taken all the low hanging fruit. We'd got all the easy wins, and a lot of the medium ones. Really, we were getting diminishing returns out of doing this. The thing is as well that, internally, we'd become a lot more operationally focused. We were a lot better at managing and running the cluster than we'd ever been. We knew if things weren't working out right for our customers and for the developers, and we could go and actually reach out to them. We could be a lot more proactive, rather than reactive as we were in the past. Those development teams and platform teams inside the tribes had also matured as well. They were getting very adept at running their applications, and not really needing as much help from us as they did previously.

What about numbers? Are there any numbers to back up whether this was the case, or even whether deciding at the beginning of 2023 to scale this back has actually had a negative impact? I haven't got any good metrics, unfortunately, but I have got some numbers. This is the number of helpdesk requests per week over the last 5 years. You can see, in the early days, as we went through adoption, we needed to provide a lot of support. More recently, that's decreased quite significantly. If we look at 2023, if not providing that developer experience function was causing problems, we'd have expected to see an increase again, and we didn't see that.

However, this is helpdesk requests. Helpdesk requests are of different complexities and take different amounts of time and effort to resolve. It's not a great metric. You might say, are you still getting workloads put on the cluster? Is there still expansion there? I don't have 5 years' worth of data. This is our CPU requests over time. Certainly, in the last few years, we keep seeing a 20% increase year-on-year in CPU requests. There's definitely increasing workloads. What you might also say then, and challenge me on, is there's more stuff, but is it new stuff? Is it not just people turning the replicas up? Is it not just the dials being turned up? If we look at the number of namespace creations, we can see that still happening.

Maybe not as much as it was in the early days, but we're certainly still getting new applications being deployed to the cluster. This is the takeaway: developer experience creates experienced developers and platform engineers. Don't forget that. They have their own frameworks and tools. They have their own domain specific knowledge and documentation. They're self-sufficient now. They don't need to rely on us as much as they did previously. The other thing to just remind ourselves about is we are a team that builds and runs a Kubernetes platform. We don't build workloads that deal with massive traffic spikes on Saturday afternoon, when someone scores a goal. That's what those teams do.

They are adept at building those kinds of applications to deal with spiky workloads, which a lot of teams would never see outside of sports betting. They got really good at that. We haven't completely stopped our developer experience efforts. We are still very much doing it proactively. We still reach out to teams if we see problems. We still monitor things like capacity asks, and the load tests that are happening around the Grand National. We're not averse to going and working with teams if they have some brittle applications, and helping them solve problems with those if we can.

What's Next?

What's next? What are we going to do next? Our platform is from 2016. There's not been a great deal of change in the actual structure of the cluster itself. We're now part of a larger organization. We've got a whole bunch of other brands we can work with who aren't using containers yet. We're looking to build out a container platform for all of the brands across the organization, one that's going to be more modern, more scalable, and more supportable than the platform we've got at the minute. To do that, we're not going to do everything ourselves quite as we did before. We've got extra functionality which we can lean on in our organization. There's a cloud engagement team that looks after all the AWS accounts and makes sure they're secure, and we want to be provisioning Kubernetes clusters in those accounts.

How can we do that securely? We need their help to do that, to make sure that we're still compliant and follow the governance that they put in place. There's a team called developer experience, which is slightly confusing, given the title of this talk. They're more of a tooling and automation team that exists in the wider business. What we need to do is expand their tooling and automation so it extends into a containerized environment, and allows teams that have never touched containers before to continue to use those tools and services to deploy containerized applications. Observability is a massive talking point in this industry. What we don't want to do is have metrics, logs, and traces which are just local to our cluster.

There's a centralized team running a platform for that, and we can use that service for our metrics, our logs, and our traces. That can provide visibility across the whole organization, across different platforms, not just our Kubernetes clusters. Then, we already work really closely with our service lifecycle management folks for compliance and for governance. They have other functions, such as capacity management. They reach out to teams and manage capacity. We can teach them about Kubernetes, and we can use their contacts and their ways of working to introduce container capacity questions into that workflow.

This might be something we're going to do. We're still working on this. It will mean that we can be deploying Kubernetes clusters on behalf of our customers, and managing them for them, so they don't have to deal with the operational complexity of managing a Kubernetes cluster themselves. It also opens the door for more DevEx. We're going to have a whole bunch of users which haven't used containers before, so we can build on all the stuff that we've done previously, and all the new research that's come out. I think that's super exciting. With all the evolution of the SPACE framework and the work that's been done by GetDX and Nicole, we can start bringing that into our workflows and the way we work, do things in a much more modern way than we were before, and leverage that experience.

Summary

These were the things we made and did that took us from a team that said, how can we help, through to a team that said, how can we all help, through to a team that was data driven. We built tooling that helped teams understand the workloads on the cluster, and also gave them visibility of the costs, and training to empower them to do that. Ultimately, we decided when to claw things back as well. We were brave enough to do that, and to understand when that was the right thing to do. There's a whole bunch of stuff here of things you can go and do in your organization. They're practical and you could go and do them. I just want to show some antipatterns of this. I have seen teams try and emulate our success a little bit, but not do things properly.

I see teams trying to do a little bit of DX, rather than actually committing to it. This was my job for 4 years. It's a lot of work. I don't think you can do it on the cheap or shortcut it. It takes effort, it takes rigor, it takes discipline, this process. Sending a survey round via email to everyone on a distribution list is super easy to do, but that survey needs to be right. You only get one chance of doing that. At least read "Accelerate". It's got a great chapter on how to build surveys and how to analyze results. If you have research functions inside your organization, go and talk to them. Doing research and understanding metrics is an actual job, a proper thing. Go and get help to do that if you can. Don't create a Slack channel, invite people in, and say that is your communication mechanism for developer experience.

In my experience, you have to go and meet with people, even if it's on Zoom, at least monthly, not quarterly, and create those engagements and those feedback loops. Build those relationships. Jessica talked about trust as a currency, and building trust, and how you can save it and spend it. That's where that happens, in those meetings. Training is super important. I've seen teams put in a whole lot of effort creating amazing screen recordings, but they go out of date so quickly. If the subject matter of that recording changes, editing is really hard. Creating a good video is really hard in the first place, but updating it is a real time-consuming effort. For us, we found updating a repo with some demo files in it and changing a few PowerPoint slides for the next training session much more efficient. I know that not everybody likes in-person training or webinar-based stuff, but we found it really worked for the majority.

Then there's my favorite team: we know what our customers need, they just know what they want, and it's not the same. If you see your customers or the users of your services as a problem, or an inconvenience, and give them names like PEBKAC and wetware, I don't know that DevEx is really something you should be doing. If, when you get asked for something, your first reaction is, what do you want that for? You want how much? Maybe this stuff isn't for you. You've got to actually want to support your customers and make this happen. I hear this: we've tried everything, they just won't engage with us, they won't talk with us. Two slides ago, I gave a big list of things you can do and how to do it, so you can do that. You just have to actually try.

Conclusion

This was a job, it took effort. It takes process. It's a bit of a weird job as well. It isn't for everyone. This is a sociotechnical problem to solve. It's about creating relationships, feedback loops, and engagement. It's a people problem. Installing tools is ace, but just installing a developer portal for the sake of it isn't going to solve your broken culture. That's not to say developer portals aren't applicable at all. For certain use cases, they're very applicable. If all you're trying to do is build relationships with your users, it might not be; then again, it might. The big takeaway is, understand when this is done, when you've reached the end of the journey. That's where we are now.

 


 

Recorded at:

Dec 10, 2024
