In this podcast, Daniel Bryant spoke to Michelle Krejci, service engineer lead at Pantheon, about the Drupal and WordPress webops company's move to a microservices architecture. Michelle is a well-known conference speaker in the space of technical leadership and continuous integration, and she shared her lessons learned over the past four years of the migration.
Key Takeaways
- The backend for the Pantheon webops platform began as a Python-based monolith with a Cassandra data store. This architecture choice initially enabled rapid feature development as the company searched for product/market fit. However, as the company found success and began scaling their engineering teams, the ability to add new functionality rapidly to the monolith became challenging.
- Conceptual debt and technical debt greatly impact the ability to add new features to an application. Moving to microservices does not eliminate either of these forms of debt, but use of this architectural pattern can make it easier to identify and manage the debt, for example by creating well-defined APIs and boundaries between modules.
- Technical debt -- and the associated engineering toil -- is real debt, with a dollar value, and should be tracked and made visible to everyone.
- Establishing “quick wins” during the early stages of the migration towards microservices was essential. Building new business-focused services using asynchronous “fire and forget” event-driven integrations with the monolith helped greatly with this goal.
- Using containers and Kubernetes provided the foundations for rapidly deploying, releasing, and rolling back new versions of a service. Running multiple Kubernetes namespaces also allowed engineers to clone the production namespace and environment (without data) and perform development and testing within an individually owned sandboxed namespace.
- Using the Apollo GraphQL platform allowed schema-first development. Frontend and backend teams collaborated on creating a GraphQL schema, and then individually built their respective services using this as a contract. Using GraphQL also allowed easy mocking during development. Creating backward compatible schema allowed the deployment and release of functionality to be decoupled.
Show Notes
Can you introduce yourself?
- 01:25 I am a software engineer at Pantheon, and the tech lead for the server-side of our infrastructure.
- 01:30 That essentially means the public-facing APIs of the services.
- 01:40 I've been here for about four years, initially as a QA lead, then moving on to leading the services team.
What was the pain point that made you consider microservices?
- 02:10 Pantheon is a webops platform, and our customers are Drupal and WordPress users who need to be able to scale.
- 02:20 Just because we run a PHP runtime environment, doesn't mean that our infrastructure is written in PHP.
- 02:25 We had this platform that was built to prove out the concept, and when I joined five years ago it was a Series B company that had proved its ability to scale container infrastructure.
- 02:40 They had done that at the cost of choosing technologies that would allow them to develop very quickly.
- 02:50 One of those choices was to put everything into a monolith.
- 02:55 There's a lot of wins, especially if you are an early stage start-up, to just have a single repository, a single platform, with no complicated deployment pipeline.
- 03:00 It's very clear who is deploying and what has been deployed, no thinking about how to organize your code, where dependencies are, and so on.
- 03:15 We then reached a point where we needed to scale out our platform, and that meant that we needed to scale our engineering team.
- 03:30 As we scaled our engineering team, we needed to be able to define discrete domains of responsibility and stabilise them.
- 03:35 I think the initial thinking behind giving me the QA lead title was really an ask to stabilise where things were at.
- 04:00 The problem that we had wasn't so much that things were breaking on the dashboard, but that feature development was gridlocked.
- 04:10 It was gridlocked because it wasn't clear whether something was a bug or a feature, and working that out is what QA tries to do.
- 04:25 When you've been a fast-moving startup, you have these great ideas, but you have to sit back and ask yourself which of these things is how it's supposed to be, and which is an accident of some prior decision that was made.
How did you get started with establishing quality within the system?
- 04:50 There was an approach that I didn't follow, which is to blanket everything with tests.
- 05:00 If we had PRs going through acceptance tests each time, we'd be making decisions on a one-off basis.
- 05:20 I've seen that break a lot of innovation pipelines and felt that wasn't right for us.
- 05:30 What I felt really strongly was that we needed to define the business domains, isolate them within services, and pull the monolith apart piece by piece into those services.
- 05:50 The quality of something isn't just the inputs and outputs of a function - it's any person in the company's ability to explain what it does.
- 06:10 If we haven't done the rigorous work of understanding what a test is for, it's not helpful to have a red or green test.
- 06:15 We talk about technical debt, but conceptual debt is the greatest threat to the company.
- 06:25 For us, microservices gave us the opportunity to have that level of conversation with the business.
- 06:35 Conway's law is that those who build systems are doomed - or destined - to produce designs that mirror the communication structures of the organisations that build them.
- 07:00 If it doesn't make sense to the company, it's not going to make sense in the system.
- 07:10 If there's a lack of specificity, that will show up in the design of the system itself.
I'm surprised that you were able to drive this transformation through a QA process.
- 07:45 I joined as a new employee and was given this title, but I was wary of ending up in a Cassandra role coming in - seeing problems and not being listened to.
- 08:10 They were receptive to the idea of me being able to make changes, and I met the CEO - I said that if this was going to be me sitting at the end of a QA conveyor belt, then I didn't want that.
- 08:30 I wasn't interested in taking on a position where I was doing some self-flagellation for the company and having my decisions overridden anyway.
- 08:50 I wanted to make sure they were serious about figuring out what was going wrong - and as it turns out, it was at the top of the conveyor belt rather than the bottom or the middle.
How did you get buy in to make those changes?
- 09:45 I had already built a reputation for myself; I had a public presence discussing pipelines and continuous integration.
- 10:10 When I was in the job market, my values as an engineer were clear.
- 10:20 What I was clear about with Pantheon when I joined was that I wrote my own job description with them, to ensure that I would have influence at the level I needed to make it successful.
- 10:35 I wanted to make sure that they wanted to get the job done.
- 10:50 Once you're in the job, you're not in negotiations any more, and if you then want to change course, that is a harder thing to do, because the status quo is so easy.
- 11:10 What I'm learning and trying to get better at is putting things in business terms and putting things in dollars.
- 11:20 In order to get to this place we want to reach, here is the trajectory and here are the costs that we're incurring.
- 11:30 I've been able to use some tools to measure time spent in toil and put a dollar amount on it (a back-of-the-envelope sketch follows this answer).
- 11:45 If we consider technical debt as financial debt, then it's a real thing on the books.
- 12:10 If it was thought of like that, it would be real in a way that doesn't require the emotional labor of engineers.
- 12:15 To an engineer, it seems like common sense, and therefore it's hard to ask for things that should be obvious.
- 12:25 Because it's invisible to the business, finding a way to make it visible as the debt it is - quantifying it and telling that story - is the only way I've succeeded in changing a company while I'm there.
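As a rough illustration of the toil-to-dollars calculation Michelle describes, here is a minimal sketch; every number in it is a hypothetical assumption, not a figure from Pantheon.

```python
# Back-of-the-envelope toil costing, as described around 11:30.
# Every number here is an illustrative assumption, not Pantheon's data.
ENGINEERS = 20                # engineers affected by the technical debt
TOIL_HOURS_PER_WEEK = 4       # hours each spends per week servicing that debt
LOADED_HOURLY_RATE = 100      # fully loaded cost of an engineering hour, in dollars
WEEKS_PER_YEAR = 48

annual_toil_cost = ENGINEERS * TOIL_HOURS_PER_WEEK * LOADED_HOURLY_RATE * WEEKS_PER_YEAR
print(f"Annual cost of servicing this debt: ${annual_toil_cost:,}")  # -> $384,000
```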
Do you have any tools for visualizing that?
- 13:10 I made this recently - a line graph of debt incursion, with boxes for all of the projects and the technical debt each of them takes on.
- 13:30 I tried to tie that into engineering labor and opportunity loss.
- 13:35 I stacked all of these boxes up for projects that we've done in the last year.
- 13:40 They each represent this dollar amount of (technical) debt.
- 13:50 It's necessary, but it needs to be tracked - and each project has a certain amount.
- 14:05 How many engineers do you know that are handling technical debt?
- 14:15 The more of that debt you have, the more labor you're spending on servicing that technical debt.
Has the move to microservices helped that?
- 14:35 No-one should think that moving to microservices should eliminate technical debt, but it might help identify where it is.
- 14:40 Our move to microservices created its own technical debt.
- 14:55 We had a Twisted Python monolith, which had a very creative implementation.
- 15:05 It had an early version of Cassandra, in which we were storing configuration for thousands of sites.
- 15:25 It was being served with a Backbone presentation layer.
- 15:35 We had rolled our own ORM for the project, throwing dictionaries around.
- 15:40 The conceptual cost was that we didn't know what the types were.
- 15:50 If you don't have to make schema changes, you can go really fast, but it has a cost.
- 16:00 The first service that we pulled out was the service ticket system, which handles customer service requests and replies and works with third-party services.
- 16:20 At the time, Kubernetes had just come out, and we made some mistakes because we were too ambitious.
- 16:45 It had levels of testing and validation that were simply unnecessary.
- 17:00 I had these ideas of changing everything with one system, and now it's a piece of technical debt.
- 17:15 At the time, I wanted it to be an exemplar of what a service should be.
- 17:30 In the end, we had to route it through our monolith as it was the API for the clients, as well as the aggregation and authorization layers.
- 17:50 Even though we pulled the service out, our monolith was still the gateway for that service.
- 18:00 The next thing we did was to move our front end assets into a Kubernetes deployment; the process for deploying our front-end was cumbersome.
- 18:20 It could take us up to 45 minutes to roll back a bad deploy.
- 18:30 With Kubernetes, we could roll back the deployment within 20 seconds.
- 18:40 I had come from a world with Ansible and Puppet and Chef, provisioning tools that would turn A into B and would migrate them on demand.
- 18:50 With containers, you have container A, you create a new container B, and then you delete A - to roll back, you do the reverse (see the sketch after this answer).
- 19:00 That transformed everything - and we had the confidence that if things were bad, they are bad for 20 seconds.
- 19:15 Then we could start to test it in isolation from everything else.
- 19:25 The front end could point at a staging instance of our back-end monolith, which decoupled the development cycles.
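As a rough sketch of the fast-rollback idea, the snippet below re-points a Deployment at a known-good image using the official Kubernetes Python client. The deployment, namespace, and image names are hypothetical; Pantheon's actual tooling is not shown in the episode.

```python
# Minimal rollback sketch using the official `kubernetes` Python client.
# The deployment, namespace, and image tag below are hypothetical examples.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a cluster
apps = client.AppsV1Api()

def roll_back_frontend(previous_image: str) -> None:
    """Point the front-end Deployment back at a known-good image tag."""
    deployment = apps.read_namespaced_deployment(
        name="dashboard-frontend", namespace="production"
    )
    deployment.spec.template.spec.containers[0].image = previous_image
    # Patching the pod template triggers a rolling update back to the old containers.
    apps.patch_namespaced_deployment(
        name="dashboard-frontend", namespace="production", body=deployment
    )

roll_back_frontend("registry.example.com/dashboard-frontend:v41")
```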
Was the monolith running on Kubernetes?
- 19:40 Because we use Cassandra and some other aspects, we haven't moved that over, and it would be quite difficult to do so.
- 19:50 It's not impossible, but it's not a high priority right now - but maybe something that could happen in the next six months.
- 20:05 Having a safety net to know that you can rollback, even if you don't need it, is a game-changer.
You can then iterate faster on the front-end because they weren't coupled to the back-end monolith deployment cycles?
- 20:30 Exactly - they don't have to concern themselves with the back-end; they don't need to think about creating an instance for themselves in order to work.
- 20:45 However, we've now got a slightly more complex setup - we have two services - and we could no longer just get by without a sandbox or developer environment.
- 21:00 The dashboard experience is crucial for a number of people - they want to see what is happening there.
- 21:10 Changing the API can have a significant impact between the front-end and the back-end.
- 21:15 We were missing tooling, because we had just complicated the whole development experience.
- 21:20 We created a tool that leverages Kubernetes itself.
- 21:30 Kubernetes is an orchestration tool, which provides the ability to label things.
- 21:50 We used a namespacing tool to provide a sandbox for every engineer - essentially, a clone of production.
- 22:15 Our deployment has a production namespace, which is special because it tells us a lot about what cluster and infrastructure it's going to run on.
- 22:25 We can copy the configuration of the containers into another namespace altogether.
- 22:45 We created a sandbox namespace, which is a clone of production, and when an engineer wants to do development we copy that sandbox namespace into their own environment (see the sketch after this answer).
- 23:00 Namespaces are a nice way of being able to contain entire ecosystems of services that we can test as if they were production.
- 23:15 We automated that and made it its own service - an internal tool that will spin up this environment for me.
- 23:35 Obviously there are certain differences between the production and sandbox instances, but making it intuitive was a smart investment into the tooling.
- 24:10 Investing time into tooling and replicable tasks paid off - without that, I don't know whether we would have been successful with everything that came after.
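A minimal sketch of the namespace-cloning idea, using the official Kubernetes Python client. It copies Deployment configuration only (no data); the namespace names and the shape of Pantheon's internal tool are assumptions.

```python
# Minimal sketch: copy the Deployment configuration (no data) from the
# production namespace into a per-engineer sandbox namespace.
# Namespace names are hypothetical; Pantheon's internal tool is not public.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

def clone_namespace(source: str, target: str) -> None:
    # Create the sandbox namespace (assumes it does not already exist).
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=target))
    )
    # Recreate each Deployment in the target namespace, dropping the
    # server-managed fields so the API server accepts it as a new object.
    for dep in apps.list_namespaced_deployment(namespace=source).items:
        dep.metadata = client.V1ObjectMeta(
            name=dep.metadata.name, namespace=target, labels=dep.metadata.labels
        )
        dep.status = None
        apps.create_namespaced_deployment(namespace=target, body=dep)

clone_namespace("production", "sandbox-michelle")
```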
How do you handle data, and are there PII concerns?
- 24:40 We don't copy any data from production - what we have been doing is mostly let the developers click around and make their own world.
- 25:00 That's not great, but it could be automated - and there are shared instances that point to the same shared monolith.
- 25:25 It depends where you are in the development stack.
- 25:30 Fast forward a couple of years, and now we're tackling the aggregation and authorization that I was talking about earlier.
- 25:35 We are using the Apollo GraphQL gateway - moving to GraphQL is forcing us to define our models and what we are actually using.
- 25:50 One of the advantages of the gateway is that we can start to strangle the monolith.
- 26:00 It can also serve mocks, which means that the front end and back end can work independently.
- 26:15 It allows us to have mock data in the staging environment.
How do you define the contract?
- 26:35 The question of where to put the schema is controversial - and we've already changed it three times.
- 26:45 It's lived in the service, in its own repository, or all in the same place - but how do you disambiguate production versus staging schemas?
- 27:00 We want this to be simple and intuitive, but it feels complicated.
- 27:10 We sit down with the front-end engineers to define the schemas (a minimal schema-first sketch follows this answer).
- 27:20 The point of doing microservices is that we can get to that work, and have some standardisation of what the services look like.
- 27:30 Our customer-facing services are Python 3 and Django with a SQL backend, templated with Kubernetes templates and CI templates.
- 27:40 We have a tool that allows us to spin up a service with our best practices, and just focus on the schema.
- 27:50 What you're building is the crucial thing.
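A minimal sketch of treating the GraphQL schema as the contract and serving mock data behind it. Pantheon's gateway is Apollo (Node.js); to stay in Python this sketch uses the ariadne library as a stand-in, and the `Site` type and its fields are hypothetical.

```python
# Schema-first sketch: the SDL below is the agreed contract; the resolver
# returns mock data so the front end can build against the contract before
# the real back-end service exists. Ariadne is a Python stand-in for the
# Apollo gateway, and the types/fields are hypothetical.
from ariadne import QueryType, gql, make_executable_schema, graphql_sync

type_defs = gql("""
    type Site {
        id: ID!
        name: String!
        planTier: String
    }

    type Query {
        sites: [Site!]!
    }
""")

query = QueryType()

@query.field("sites")
def resolve_sites(*_):
    # Mocked data standing in for the real service.
    return [{"id": "1", "name": "example-site", "planTier": "performance"}]

schema = make_executable_schema(type_defs, query)

ok, result = graphql_sync(schema, {"query": "{ sites { id name planTier } }"})
print(result["data"]["sites"])
```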
What's the flow for deploying an idea through to production?
- 28:15 The ideal is that we want to sit down with product and design and look at the business request.
- 28:25 We want to be able to identify which domains that feature touches - it almost always touches more than one service.
- 28:50 Let's say the change requires a new feature - we work together to design a schema to service that page.
- 28:55 We are able to start work on it immediately - there doesn't have to be a real back end yet; the response can remain a JSON dictionary of mock data that we can visualize.
- 29:15 We can then see what's missing from the back end, and the service that we need to fill out that contract is either an extension of an existing one or a new one.
- 29:40 It's up to the engineers on how to service that contract.
- 29:45 It's a clear definition of done with that schema, and they'll be able to work independently.
Is there a CI pipeline that it gets pushed down?
- 30:10 I was hired as a QA engineer, but I don't see that on the road ahead of us - and a staging area doesn't seem like the right thing for us.
- 30:20 Part of what the GraphQL schema gives us is backward compatibility, because changes are additive (see the sketch after this answer).
- 30:30 The back-end can start serving content without the front-end consuming it, and it can consume it when it's ready.
- 30:50 We have the opportunity in the sandbox environment to see what it looks like.
- 31:00 We have a setup where a PR spins up an environment so that product can look at it and test.
- 31:10 Approval happens against that environment - it's not a staging environment, because it's consuming production data.
- 31:25 We have the opportunity to roll back very quickly.
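A small sketch of what an additive, backward compatible schema change looks like; the type and field names are hypothetical.

```python
# Additive schema change sketch (around 30:20). Names are hypothetical.
SCHEMA_V1 = """
type Site {
    id: ID!
    name: String!
}
"""

# The back end can deploy this first: the new nullable field breaks no existing
# front-end query, so the front end starts consuming it only when it is ready.
SCHEMA_V2 = """
type Site {
    id: ID!
    name: String!
    planTier: String    # new optional field; existing queries are unaffected
}
"""
```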
What would be the one key piece of advice you'd like to pass on to others?
- 32:00 You need an easy win fast, and I'd recommend that you identify something that could work with a fire-and-forget event queue (a sketch follows at the end of these notes).
- 32:10 When you are looking at a monolith, it's handling so much besides data serving, so there's a lot to tackle.
- 32:30 For us, it was sending an e-mail to clients after an event happened, plus some internal tooling.
- 32:35 There were a couple of key areas where I could fire an event and have another service consume it because it didn't require networking or authorization.
- 33:00 It gave us the ability to start playing with patterns and getting some quick wins, in order to clean up some gunk.
- 33:05 I created a small number of services within a few months, and that really started to exercise those muscles.
- 33:15 Don't overcomplicate it: find the things that can give you quick wins, and build trust and confidence.
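A minimal fire-and-forget sketch of the pattern Michelle describes. The episode does not name the queue technology, so RabbitMQ via the pika library is used purely as an illustration; the queue and event names are hypothetical.

```python
# Fire-and-forget sketch: the monolith publishes an event and moves on;
# a separate e-mail service consumes it later. RabbitMQ/pika and all names
# here are illustrative assumptions - the episode doesn't name the stack.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="site.events", durable=True)

def emit_site_created(site_id: str) -> None:
    """Publish the event without waiting for, or depending on, a reply."""
    channel.basic_publish(
        exchange="",
        routing_key="site.events",
        body=json.dumps({"type": "site.created", "site_id": site_id}),
    )

emit_site_created("abc123")
connection.close()
```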