
Curating Developer Experience: Practical Insights from Building a Platform Team

Key Takeaways

  • Developer experience (DevEx) goes beyond productivity, encompassing aspects like ease of use, collaboration, and empathy. It addresses a sociotechnical problem by creating relationships and engagement.
  • Creating a positive DevEx involves understanding developer needs and challenges, fostering collaboration, and continuously using feedback loops to improve platform usability.
  • It's often a people problem, not a tooling problem. Tooling needs to help engineers "shift left" and genuinely empower users rather than being a burden or unused shelfware.
  • Practical steps like running annual surveys, creating usability metrics, and fostering continuous engagement with users can help the team track progress and identify areas for improvement over time.
  • As development teams mature, the DevEx function can ideally be integrated into day-to-day operations. Effective DevEx efforts can lead to self-sufficient teams with reduced reliance on platform engineers.

As a platform engineer, how do you help your customers go quicker? Which aspects of developer experience should you care about? More importantly, how do you curate an experience for them?

This article is about curating a developer’s experience. It describes my experience, what my company learned from implementing DevEx, and what you, as a platform engineer, can do to create a developer experience for the development teams that use your platforms. I will explain all the practical things my team did, what worked (and what didn’t), how success was measured, and how the team’s focus has evolved over the years.

Developer Experience (DevEx)

Many people think that developer experience has something to do with developer productivity, or perhaps with developer portals or developer platforms.

There has been a lot of recent research into DevEx, such as the DORA metrics framework, the State of DevOps Report, the CNCF maturity model, and the SPACE framework.

Some compelling numbers from the research indicate that developer experience and productivity are very closely correlated.

I see developer experience as getting people to want to use your platforms; developer productivity is a by-product of that. Developer experience adds more jigsaw pieces to the DevOps puzzle, creating empathy and collaboration between teams across silos and helping to solve the DevOps problem.

The Problem

Back in 2016, Flutter UK&I wanted to run containers in production using a proper container orchestration platform. The initial proposal document for the Bet Tribe’s next-gen hosting platform set out the goal:

Outside of the squad, there should be no humans involved in the value stream between the feature and the customer. In other words, release code to your customers quickly and safely, without having to raise tickets with other teams for networks, storage, compute, and so on.

The platform was built on Kubernetes, predating all of the turnkey solutions: there was no EKS, no GKE, no AKS. We built things the hard way, primarily with makefiles, Terraform, and Container Linux, which is still essentially its foundation today. It was built initially in AWS and later on-prem.

We built the platform as a product by working with a development team. The platform wasn’t just Kubernetes; it included integrations and automation for services like logging, monitoring, load balancers, firewalls, DNS, and many other functions.

The platform became widely adopted, with developers preferring it over the existing VM infrastructure and tooling. By 2018, we had many development teams using the platform, which was a great success.

When I joined the team in 2018, I was very surprised that none of the engineers on the container platform team were particularly happy. They were struggling with a lot of growing pains. Ways of working didn’t scale, and working with our codebases became problematic; we were victims of our own success.

In 2019, changes were made to the team. We split the container platform team into three squadlets. We created an engine team that looked after building and running the clusters. A capabilities team looked after all the integrations (logging, monitoring, storage, firewalls, load balancers, etc.). Then, we created an experimental team called "Customer Experience": as our customers were developers, this DevEx team was given free rein (within reason) to improve things.

Baseline

The DevEx team set out to create a State of Container Platforms Report using a survey sent to 350 engineers, of whom only a small number were using our container platform. We received a modest 31 responses.

The survey was anonymous, so the feedback was more likely to be honest; it asked developers which clusters and capabilities they used and included usability questions to measure ease of use. This data proved critical for year-on-year comparisons.
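
To illustrate how such responses can become a trackable metric, here is a minimal sketch that turns Likert-style usability answers into an average score per statement for year-on-year comparison; the statements and scores are invented for illustration, not the survey’s actual content.

    from collections import defaultdict
    from statistics import mean

    # Illustrative responses only: each dict maps a usability statement to a
    # 1-5 Likert score (1 = strongly disagree, 5 = strongly agree).
    responses_2019 = [
        {"The documentation is easy to follow": 3, "Onboarding a new service is easy": 2},
        {"The documentation is easy to follow": 4, "Onboarding a new service is easy": 3},
    ]

    def usability_scores(responses):
        """Average each statement's scores so years can be compared side by side."""
        by_statement = defaultdict(list)
        for response in responses:
            for statement, score in response.items():
                by_statement[statement].append(score)
        return {statement: round(mean(scores), 2) for statement, scores in by_statement.items()}

    print(usability_scores(responses_2019))
    # {'The documentation is easy to follow': 3.5, 'Onboarding a new service is easy': 2.5}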

The responses were very positive; we got amazing feedback in the freeform fields. We also received some negative feedback, but it was certainly constructive. This told us things we already knew: our documentation was messy, and our onboarding process was somewhat complex. We had something to aim for and a baseline to review each year.

Defining Our Purpose

Next, the team created a North Star, defining a mission statement and our areas of focus.

The team decided we would engage, empower, and support our customers. Under those three areas, there were various aspects we could focus on and some real tangible things we could do.

This was powerful because it helped us know what we were meant to be doing as a team. If we were asked to do anything that wasn’t on that list, we really shouldn’t be doing it.

Who Are Our Customers?

Next, we had to identify who our customers were. We knew some of them through daily interactions, but there were many we didn’t, so who were they? We realised we had three years’ worth of helpdesk tickets going back to when we created the platform. Anyone who had raised a ticket must be one of our customers; those were the developers using the platform.

We used these tickets to create an FAQ. We took three months’ worth of tickets, identified the frequently asked questions, and wrote some answers. We published them on our wiki and signposted them in support channels.

We now had a list of people who used our cluster, but it wasn’t very useful. Where did they work? Which teams were they in? We went diving into the various wikis and found some org charts. We printed them on huge multi-page A3 sheets, glued them together, and stuck them on a whiteboard. Using some whiteboard pens, we marked everyone in the org chart who had raised a helpdesk ticket, and suddenly, we could see the areas of the business that were using our platform.

Using another colour pen, we marked who’d had training. Suddenly, we got an overlay of who was seeking help and who’d already had our training. We could see where the intersection was or wasn’t and where it should be. We also saw areas of the business that weren’t using our platform, so we could do some marketing to them. We created a very primitive CRM system to help us know who we should be talking to.

Talk to our Customers

Next, we created meetings with the customers we’d identified. We worked with four tribes, holding monthly meetings with each; we chose a monthly rather than quarterly cadence to build better engagement and relationships. Meetings were open invite: anyone could come, even if they didn’t work in that tribe. We wanted to make sure we had an open-door policy.

The format of the session was:

  • Actions from the previous meeting
  • Updates from the container platform team
  • Feedback from the tribe: Did they have any issues or blockers, and how could we help? Were there any upcoming workloads (for capacity planning)?

Meeting minutes were documented and shared openly for four reasons:

  1. If you were in the meeting, you knew your follow-up actions.
  2. If you hadn’t been in the meeting, you could see which actions had been delegated to you.
  3. The discussion points were shared with the engine and capabilities teams, and we discussed how to improve the clusters for our customers and ourselves. We might create feature tickets to improve the cluster, put items on our roadmap for future updates or larger pieces of work, or even pair engineers with customers to work together. Most importantly, it created feedback loops between the container platform team and our customers.
  4. It made us accountable to our management; remember, we were an experimental team.

Roadmap

After three years of organic growth, we needed to identify workload ownership on the cluster. Our records were outdated, so we undertook the arduous process of finding owners for everything on the cluster. The new ownership information was stored as metadata on the namespaces in the cluster as a source of truth. We added labels for the tribe that owned the workload, the squad that owned it, and the Slack channel to use if there were any problems with it. Although time-consuming, this was incredibly useful because the metadata could be used programmatically by tooling.
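
As a rough sketch of what that metadata can look like, the example below applies ownership labels to a namespace with the official Python Kubernetes client; the label keys and values are hypothetical, not the exact schema we used.

    from kubernetes import client, config

    # Illustrative label keys and values; substitute your own ownership schema.
    ownership_labels = {
        "owner-tribe": "bet-tribe",
        "owner-squad": "checkout-squad",
        "slack-channel": "checkout-squad-support",
    }

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    # The patch merges the labels into the namespace, preserving existing metadata.
    v1.patch_namespace(
        "checkout-prod",
        {"metadata": {"labels": ownership_labels}},
    )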

The first example of this was addressing problems we had with our logging pipelines. They could get very noisy, particularly in test environments, where noisy neighbours could drown everyone else out. We partitioned our logging pipeline, creating one pipeline per tribe. The log shipping software used the tribe label on the workload’s namespace to route each log message to the correct pipeline.
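
A minimal sketch of that routing idea, assuming the shipper can read namespace labels and that each tribe has its own pipeline; the label key, topic naming, and transport are illustrative assumptions, not our actual shipper.

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def tribe_for(namespace: str, default: str = "shared") -> str:
        """Look up the owning tribe from the namespace's labels."""
        labels = v1.read_namespace(namespace).metadata.labels or {}
        return labels.get("owner-tribe", default)

    def send_to_pipeline(pipeline: str, record: dict) -> None:
        # Placeholder transport: a real shipper would write to Kafka, Elasticsearch, etc.
        print(pipeline, record)

    def ship(record: dict) -> None:
        """Route a log record to the per-tribe pipeline of the namespace it came from."""
        send_to_pipeline(f"logs-{tribe_for(record['namespace'])}", record)

    ship({"namespace": "checkout-prod", "message": "payment accepted"})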

In July 2020, we reran the survey. The overall feedback was positive despite a Covid-related dip to 25 responses. We started to see the usability statements as particularly useful for trend analysis and as indicators of what people thought of the cluster.

Although there was an overarching green trend, there was not much difference in the second year. Most notably, our documentation was flagged as actually getting worse. These statements were incredibly useful in guiding us on what people thought of the individual features, rather than just what they used the cluster for.

Norming

Into our second full year, we changed tack a little. Rather than reaching out to teams and saying, "How can we help you to make the cluster better?", we adjusted the narrative to "How can we all work together to improve our clusters?" We were trying to create a sense of shared responsibility.

We worked with the tribes to define a series of best practices, identifying what "good looks like" for a workload on the cluster. We split these into three areas: build, deploy, and run. We prioritised them using MoSCoW (Must have, Should have, Could have, Won’t have) to apply some sense of importance.

We realised that we could codify some of these best practices and collect metrics by tribe, squad, and namespace using the metadata we had already collected. We then produced a dashboard that allowed development teams to see the "compliance" status of their workloads, individually or as groups.
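
For example, a best practice such as "every container declares resource requests and limits" can be checked programmatically and exposed as a per-tribe, per-squad metric for a dashboard to consume. The sketch below assumes the ownership labels described earlier and a Prometheus-style metrics endpoint; all names are illustrative rather than our actual tooling.

    import time
    from kubernetes import client, config
    from prometheus_client import Gauge, start_http_server

    config.load_kube_config()
    v1 = client.CoreV1Api()

    compliance = Gauge(
        "workload_best_practice_compliant",
        "1 if every container in the namespace sets resource requests and limits",
        ["tribe", "squad", "namespace"],
    )

    def namespace_compliant(namespace: str) -> bool:
        """Check one codified best practice across all pods in a namespace."""
        for pod in v1.list_namespaced_pod(namespace).items:
            for container in pod.spec.containers:
                resources = container.resources
                if not (resources and resources.requests and resources.limits):
                    return False
        return True

    if __name__ == "__main__":
        start_http_server(8000)  # scrape endpoint for the dashboards
        while True:
            for ns in v1.list_namespace().items:
                labels = ns.metadata.labels or {}
                compliance.labels(
                    tribe=labels.get("owner-tribe", "unknown"),
                    squad=labels.get("owner-squad", "unknown"),
                    namespace=ns.metadata.name,
                ).set(1 if namespace_compliant(ns.metadata.name) else 0)
            time.sleep(300)  # refresh every five minutes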

To be clear, we didn’t write these best practices ourselves. We collated, edited, and published them, but they were put together by the people using our cluster. We weren’t enforcing rules; we got teams to tell us what they liked, and we got their buy-in because they wrote them.

We produced some rudimentary cloud cost reports in Excel showing retrospective figures from the last 30 days. The cost and usage information was obtained from AWS, using the cluster metadata to segment it by tribe, squad, and namespace.

In our meetings, we discussed the state of workloads and how they "complied" with the best practices. We could flag workloads that cost a little more than they had the month before and start having conversations based on data rather than opinions.

That brings us to the survey of 2021:

We were back up to 30 responses this time, although it should have been a lot higher. We were slightly disappointed with that, but things were improving, as you can see from the usability statements in the results, and we were pleased about that.

Storming

Into our next year, we moved to what we called "gardening" rather than "policing". We helped teams do the right thing. We encouraged teams to help in this mission to make the cluster a better place for everyone. To do that, we built tooling that shifted left, but in a way that helped teams.

Our Excel-based cost reports were hard to distribute, immediately outdated, and only showed a retrospective month of data. We automated the export of AWS cost data into our metric store, building dashboards that showed costs by tribe, squad, and namespace.

Cost metrics are now available in near real-time as data sources in the developers’ dashboard system, making it easy for them to incorporate cost data into their own dashboards.
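
As a sketch of the export step, AWS Cost Explorer can be queried daily and the grouped results pushed into a metric store; the cost-allocation tag name and the Prometheus push gateway below are assumptions for illustration, not our exact setup.

    import datetime
    import boto3
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    # Query yesterday's unblended cost, grouped by an (illustrative) cost-allocation tag.
    ce = boto3.client("ce")
    end = datetime.date.today()
    start = end - datetime.timedelta(days=1)

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "owner-squad"}],
    )

    registry = CollectorRegistry()
    cost = Gauge("aws_daily_cost_usd", "Daily unblended AWS cost", ["squad"], registry=registry)

    for group in response["ResultsByTime"][0]["Groups"]:
        # Tag group keys come back as "owner-squad$<value>"; untagged spend has an empty value.
        squad = group["Keys"][0].split("$", 1)[-1] or "untagged"
        cost.labels(squad=squad).set(float(group["Metrics"]["UnblendedCost"]["Amount"]))

    # Illustrative push gateway address; any metric store with an ingest API would do.
    push_to_gateway("pushgateway.example.internal:9091", job="aws-cost-export", registry=registry)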

Developer Tooling

But what about developer tooling? When we created the cluster in 2016, there weren’t many options for Kubernetes developer tooling. Over time, the teams developed various ways to deploy their applications using their own tools and frameworks, ultimately evolving into a golden path framework. To be clear, this wasn’t built by the Container Platform team; it was built by a platform team in Bet Tribe, working with their developers.

One of the framework’s compelling features is that parts of the acceptance-into-service process are pre-approved by our service lifecycle management (SLM) teams. If a development team uses the golden path framework, they don’t have as much red tape to get their application live. The framework was made an inner source initiative to encourage adoption by teams around the business.

Training

Training proved to be an essential and successful part of the DevEx offering. We ran a one-day workshop where we taught developers how to build a 12-factor application for Kubernetes. We added another one-day workshop where attendees were given an application and access to our test cluster; we taught them how to deploy it and add storage, load balancers, monitoring, logging, and load testing. To date, that’s over 3000 person-hours of training for 500 developers, with a net promoter score consistently over 90%.

Scaling Back

This year, our annual survey received only 19 responses, but they showed increased satisfaction with our usability statements, which were almost entirely green.

Despite the success we had in the previous years, we decided to scale back our DevEx efforts at the start of 2023, integrating them into our day-to-day work. The decision was based on several factors, but the key one was that our customers were very different from those of 2019. We were having the same conversations with the same people, and they were no longer the development teams; instead, we largely worked with the in-tribe platform teams.

Also, we were better at managing and running the clusters than we had ever been. We preempted faults and dealt with them for our customers, proactively contacting them. The development teams and platform teams inside the tribes had also matured; they were adept at running their applications and did not need the same level of support we had offered in the early days.

Analysis of our metrics showed that scaling back the DevEx function hadn’t led to an increase in support tickets, despite workloads and adoption still growing (albeit at a lower rate than when we started).

We learned from the decision that developer experience creates experienced developers and platform engineers. They have their own frameworks and tools. They have their own domain-specific knowledge and documentation. They’re self-sufficient now. They don’t need to rely on us as much as they previously did.

What’s Next?

Our platform dates from 2016, and there has not been a great deal of change in the overall structure of the cluster itself. We are now part of a larger organisation, with many other brands we can work with that aren’t using containers yet.

We’re looking to build a new container platform for all the brands across the organisation; it will be more modern, more scalable, and more supportable than the current platform. To do that, we will be working closely with specialist teams across our organisation.

The Cloud Engagement team looks after all the AWS accounts and baseline security. We want to provision Kubernetes clusters in developers’ accounts securely; to do that, we will need their help to ensure we’re still compliant and follow the governance they put in place.

Observability is a massive talking point in our industry. We don’t want to locally manage metrics, logs, and traces on our cluster. There’s a centralised team that is running a platform across the brands. We can use that service to consume our metrics, logs, and traces, providing visibility across the whole organisation and different platforms, not just our Kubernetes clusters.

We have worked closely with our service lifecycle management folks for compliance and governance. They have a capacity management function that reaches out to teams and manages capacity. We can teach them about Kubernetes and use their contacts and ways of working to introduce container capacity questions into that workflow.

We will be deploying and managing Kubernetes clusters on behalf of our customers, freeing development teams from dealing with the operational complexity of managing a Kubernetes cluster themselves. It also opens the door for more DevEx. We will have many users who haven’t used containers before, so we can build on all our work and utilise all the new research in the field.

Final Thoughts

  • This was a job, and it took effort and process. There may be shortcuts at your disposal, but creating a monthly cadence to meet with your users and build feedback loops is very important.
  • This is a sociotechnical problem to solve. It’s about creating relationships and engagement. Don’t be tempted to replace them with easier but less effective shortcuts.
  • It’s a people problem, not a tooling problem. Ensure any tooling helps shift left and genuinely empowers your users rather than being a burden or unused shelfware.
  • Understand when you are "done" and no longer adding value, then turn the dedicated DevEx function into day-to-day activities.

At QCon London, I gave a talk about Curating a Developer Experience, which goes into more detail on our journey.
