
Evolving Trainline Architecture for Scale, Reliability and Productivity


Summary

Milena Nikolic discusses how Trainline's systems architecture has evolved over the past five years to cater for change, as well as what's coming next.

Bio

Milena Nikolic is CTO at Trainline, where she runs a team of 400+ technologists focused on enabling greener travel choices, connecting people and places. Before that, she spent more than a decade at Google, working as an engineer and an engineering leader across a number of different product areas, most recently as an Engineering Director for the Google Play Developer Ecosystem.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Nikolic: I'm going to talk about the lessons we learned from scaling Trainline's architecture. Scaling will mean the usual things, like how we handle more traffic, but that's not the only thing we're going to talk about. I'm also going to talk about the things we have done to make it possible to have more engineers work on our architecture at the same time, so we can speed up the pace of growth and the pace of innovation. I'm also going to talk about scaling the efficiency of the platform itself, so that as the platform grows, it grows in a way that's efficient, and cost efficient. You'll get to hear about the business lens of productivity and team impact, the business lens of cost efficiency and financial impact, as well as actual traffic growth and being able to handle more business.

Trainline (The Rail Digital Platform)

How many of you know what Trainline is? We are Europe's number one rail digital platform. We retail rail tickets to users all around the world for rail travel in the UK, Spain, France, Italy, really most of Europe. In addition to retailing tickets, we also support users through the entire rail journey. We are with them on the day, helping them with platform information, helping them change their journeys and get compensation for delays if there are any disruptions, as well as handling any planned changes in terms of what they want to do with their ticket.

We provide all of that through our B2C brand, trainline.com, as well as a white label solution to our partners in the rail carrier space and to other parties in the wider travel ecosystem: your GDSs, your TMCs, anyone else in the travel ecosystem who wants to have access and book rail. I'm going to mention a couple of numbers, just as a teaser of roughly the size of the business, and then some numbers related to the traffic, although we will talk much more about that as we go through the presentation. Then also just the team. I think that gives you an idea of what the company is, and technically what we're trying to solve.

Just to give you an idea in terms of size of business, we're a public company. We're well established, profitable. Last year, we traded about 5 billion in net ticket sales. That gives you an idea of the scale in terms of technical impact as well. It's very difficult to pick one technical number that represents things, but I have decided to show searches: we do around 350 searches per second for journeys and for origin-destination pairs. We do that over something like 3.8 million monthly unique routes. If people look for Milan to Rome, or Cambridge to London, or wherever else, that gives you the problem space of search. Then we have about 500 people in the tech and product organization, of which the majority is tech, obviously. If that's not cool enough, this is cool, isn't it? I love this innovation.

This just gives you an idea of the problem space, in the sense that we know where each live train in Europe is at any given point in time. A, that's a lot of data. B, it's a lot of actions we need to take. Every time you see these yellow or red dots, it means a train is delayed or getting canceled or getting changed or something like that. Which means we need to do something for the customers we sold tickets for that train to, notifying them in a certain way, and all that. This is just a graphical representation of some of the scale of the problem and effectively what we're dealing with.

What's Difficult About What We Do?

I've been at Trainline now, in July it's going to be three years. I remember as I was joining, I was like, I love trains. Very cool. Just super happy to be doing this. I was like, what do 450 engineers actually do? Two of them build the app, two of them build the backend, what does everyone else do? Obviously, I knew there was a bit more, but I was like, I'm not sure this is actually such a hard problem to be requiring that many people. Then, I think over time, or certainly even within the first couple of months, it became obvious that certain problems were harder than I thought. I'm going to talk about three of them. It's certainly not an exhaustive list of everything that's hard about it. These are the few where I was like, actually, that's harder than I thought. The first one is aggregation of supply. We have more than 270 API integrations with individual rail and bus providers.

There is zero standardization in this space. You literally have one-off integrations. There is no standardized API model like, for any of you that know the airline industry, the GDSs such as Amadeus did 30 or 40 years ago, which involved standardizing all the airline booking APIs. That didn't happen in rail. It's a lot of very custom integrations. As you can imagine, that comes with problems of high maintenance cost, because everyone is constantly updating their APIs, and non-trivial integration work every time a new carrier needs to be added or a new rail company launches. Then, I would say even worse, journey search and planning, combining journeys over a very inconsistent and fragmented set of APIs, is a problem that needs to be solved, because everyone has different API access patterns and equally different limitations.

For example, we get things like look-to-book ratios, where we can only hit a certain search API within a certain proportion of how many bookings we make through that company, as well as sometimes pretty strict and very old school rate limits in terms of what we can hit. A lot of complexity comes from that core idea that is ultimately the purpose of Trainline, which is the aggregation of all of the supply. One more thing that's related to that is that Europe has 100 times more train stations than airports. Just in terms of scale, there's a lot here. Getting those journey searches to work, especially combined across multiple APIs, and getting them to work fast, that was the problem where, when I joined, I was like, actually, that is harder than I thought.
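
As a rough illustration of those supply-side constraints, here is a minimal sketch of how a per-carrier guard might combine a rate limit with a look-to-book ceiling. The class, parameters, and numbers are hypothetical, not Trainline's actual integration code.

```python
import time
from dataclasses import dataclass

@dataclass
class CarrierSearchGuard:
    """Illustrative guard for one carrier API: a crude rate limit plus a look-to-book ceiling."""
    max_searches_per_sec: float   # carrier-imposed rate limit
    max_look_to_book: float       # e.g. 300 searches allowed per booking made
    searches: int = 0
    bookings: int = 0
    _last_search: float = 0.0

    def allow_search(self) -> bool:
        now = time.monotonic()
        # Respect the per-second rate limit.
        if now - self._last_search < 1.0 / self.max_searches_per_sec:
            return False
        # Respect the look-to-book ratio (treat zero bookings as one to avoid dividing by zero).
        if self.searches / max(self.bookings, 1) >= self.max_look_to_book:
            return False
        self._last_search = now
        self.searches += 1
        return True

    def record_booking(self) -> None:
        self.bookings += 1

# Example: a carrier allowing 50 searches/sec and a 300:1 look-to-book ratio.
guard = CarrierSearchGuard(max_searches_per_sec=50, max_look_to_book=300)
if guard.allow_search():
    pass  # this is where the call to that carrier's search API would go
```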

The second one: whereas I think this level of aggregation of supply might be somewhat unique to the rail industry, transactions over a finite inventory are not. I suspect many of you have worked on something similar at a certain stage, like the classic Ticketmaster problem, where you're not just selling something unlimited. In my previous career at Google, I was running part of the Google Play Store. We were selling apps, which involves a certain complexity, but it's a digital product, so there is no limited inventory. We can sell as many apps as we want, or as many in-app products as we want. It's all digital, and you don't need to check the inventory or anything like that. Right now, we're selling seats on unique trains, in a unique class, with a unique fare. It's quite limited. You can understand that transactionally that's just much harder to solve, to make it reliable, to make it fast, and all that.

Right now, we handle about 1,300 transactions per minute at peak times. That's up from zero in COVID times. It's also up from, I think, probably something like 800 close to 3 years ago when I joined. It has been growing, some of it as part of rail travel recovery as the economy was recovering post-COVID, and some of it as we were growing, especially in Europe. Does that resonate? Did any of you deal with transactions over a finite inventory and get the complexity of that? I don't know if everyone else is thinking, super easy, I don't know what you're talking about. There is a layer of complexity that comes with it.
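
To make the finite-inventory point concrete, here is a minimal sketch of the classic pattern: check and decrement remaining seats in a single atomic statement, so two concurrent buyers cannot both take the last seat. The schema and the local SQLite database are illustrative assumptions, not how Trainline's e-commerce stack actually works.

```python
import sqlite3

# Hypothetical single-table schema for illustration only; a real system spans
# multiple services and carrier reservation APIs, not one local database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fares (train_id TEXT, fare_code TEXT, seats_left INTEGER)")
conn.execute("INSERT INTO fares VALUES ('TL123', 'ADVANCE', 2)")
conn.commit()

def reserve_seat(conn: sqlite3.Connection, train_id: str, fare_code: str) -> bool:
    """Atomically claim one seat: the UPDATE only succeeds while inventory remains."""
    with conn:  # one transaction: check and decrement together, no read-then-write race
        cur = conn.execute(
            "UPDATE fares SET seats_left = seats_left - 1 "
            "WHERE train_id = ? AND fare_code = ? AND seats_left > 0",
            (train_id, fare_code),
        )
        return cur.rowcount == 1  # 0 rows updated means the fare sold out

print(reserve_seat(conn, "TL123", "ADVANCE"))  # True
print(reserve_seat(conn, "TL123", "ADVANCE"))  # True
print(reserve_seat(conn, "TL123", "ADVANCE"))  # False: inventory exhausted
```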

Then the final thing that also ended up striking me at that early stage as harder is the speed at which people expect you to fulfill a ticket, to literally give them a barcode they can scan as they walk through the barrier gates. The expectations are high. It has to be within a second. It has to be basically instant. That doesn't apply to air travel: we usually buy that a couple of months in advance, at best a couple of weeks in advance, and usually that email with an actual ticket arrives a couple of hours later. It very rarely arrives instantly. We don't have that luxury.

Right now, about 60% of all of our tickets are bought on the day. Quite a lot of those are bought literally as people are walking into Charing Cross station, getting the ticket on Trainline and scanning it straight away. From the point of completing the transaction to people having the barcode they can scan at the barrier gates, there is interaction with industry-level processes so that the barrier gates can recognize the ticket is valid. Those expectations are pretty high.

3 Lessons on Scaling

That's a taste of some of the things that are hard. Now I'm going to get to the juice of the talk, which is three lessons we had on scaling. As I talk about this, keep in mind: the first one is going to be more around team and productivity, and ultimately the impact of architecture, both how it enabled us and how it slowed us down, as we were scaling team sizes and then changing our teams. The second one is going to be on cost efficiency and scaling the efficiency of the platform. The third one is going to be on scaling with growth in traffic and achieving higher reliability, or dealing with availability and reliability issues. For each one of the lessons, I'm going to start with a story. You need to follow me through the story, and then we'll eventually get to the lesson. Some of these are pretty obvious, some of these are a bit more nuanced. It's going to take a second to get there.

1. Scaling Team Productivity

When I joined back in July 2021, it was really interesting, I think we had probably about 350 engineers. They were entirely organized in this cluster model. Do people know what the cluster model is? It basically means organizing your teams around ownership of parts of the technical stack. We had an Android app team, an iOS team, a web team, a gateway or Backend For Frontend team, e-comm, a couple of different supply teams, order processing. All the teams were organized around that ownership of part of the technical surface, which is a way to organize teams. I've seen it in other places as well. At that size and scale, 350, and we were growing by another 100 I think over that year, it didn't work well. The productivity of the team was pretty low. This was very quickly obvious as I started looking: almost any project that you wanted to do involved at least 5, and often up to 10, different teams.

There was this very old school way of delivery managers managing massive, mega spreadsheets with dependencies, trying to plan and estimate everything up front, and who's going to hand over to whom, when, in which sprint. Obviously, everything would constantly get late, and everything would take three times longer to build. It was pretty slow. I was like, at this scale, this is not the right model for the team and what we need to do here.

In, I think, January 2022, we did a massive reorg of the team where we basically threw out the entire structure, and created this structure that's also fairly standard in the industry, called platform and verticals. We had the platform, or horizontals, owning the technical stack, and verticals having all the people with different skill sets, your Android and iOS and web and .NET engineers and Ruby engineers and everything you would need, shaped around clear ownership of product and business goals. That was the big change we ran. Then we had it for about two years. I think that served us well and served a certain purpose, and I'll come onto that in a second. It certainly improved alignment of the team to goals, and how much any random engineer would care about delivering something that makes a difference to the business.

It obviously came with other challenges. Models like these always come with this challenge of tension between platform teams and vertical teams, where vertical teams are trying to land stuff as quickly as they can, as they should, and platform teams are always pushing back and saying, no, you're hacking that up. No, you need to do it properly. Can you please also refactor this while you do it and get rid of that tech debt, and all that. I think that tension is an embedded feature of the model, and a good thing to be having. Sometimes it can be frustrating to people on both sides, as you can imagine. Then just now, a couple of months ago, we have again reorganized in a slightly less drastic way. We have reduced the ownership of the platform to just the very core services that are absolutely touched and shared by everyone.

Then the other 50% of the tech surface went into the verticals, which we then renamed diagonals. That's the model we're trying now, and that was because we felt we'd get the best parts of platform and verticals, but we'd also be able to streamline and remove some of that tension that was happening previously between platform and verticals. I could give a whole talk about this set of organizational changes, but that's not what I'm doing here. I want to talk about the architectural implications of this set of changes and what it means, because I think that's more relevant.

The key question here with all three of the models is effectively, who owns each part of the technical surface? Who's in charge of all the sustain work, the mandatory technical upgrades? Who's in charge of driving the technical strategy and the vision for that part of the code base, moving you to the new patterns in the latest trendy JavaScript framework? No, we're not doing that. That's definitely not happening. Who effectively does all that core work, both to sustain and to drive the technical roadmap, in addition to whoever builds all the features? Then the second question is, which product or business or tech goal is the team on the hook for? All three models have different answers to those two questions, and that just defines the problem space.

It's not impossible that if you ask different people at Trainline, they might give you slightly different red-green ratings on all of these, but this is roughly what my leadership team and I aligned on for the size we were at. A is alignment of engineering investment to business goals. P is productivity, how much we're getting out of the team. Q is the quality of the technical work that's being produced, in a way that translates to risk: are we adding technical debt? Are we making the platform worse, or are we making it better? That trend.

We felt the clusters were pretty poor for alignment, because no engineers cared about any of the product or business goals. They were just happy to tinker around in their part of the code, because they didn't own anything end-to-end. Productivity wasn't great end-to-end either. Everyone was very productive in their little silo, but nothing was shipped fast end-to-end, because there were too many teams involved and too many dependencies. Then quality was good, because people worked in a very constrained, small part of the overall technology surface. They were very good at it. They cared about making it good. The quality was green.

When we moved to platform and verticals, alignment became super crisp clear. There was a team of engineers on the hook to grow our sales in France. There was a team on the hook to improve monetization, and some that were on the hook to improve customer experience related to refunds or whatever. The goals were very clear, so alignment was perfect. Even on the tech side where you had teams in charge of improving reliability or cost efficiency or all that. Productivity was much better, but the problem with it, again, because of that tension, verticals would say, no, actually, we don't feel productive because platform is constantly pushing back on us. I think platform had a point to push back on some of those things, but that's why that one is yellow.

Then quality was actually also pretty good, because the platform team was policing all the contributions, so they were ensuring that quality. Now with the current model, I think we've slightly diluted the clarity of the alignment. However, I think we're really enabling productivity; we've essentially flipped the model from before. I have my favorites among these, but I think the most important point is that different things work better at different times for a company. If you run one model for a really long time, sometimes it is good to shift it a little bit, because you get a different balance. It's not necessarily that this one is the best and that one is the worst, this one is right and that one is wrong. I think each organization will have something that works really well and is the right thing for it at that time. Again, going back to my previous point.

Let's talk about what this means for a more technical audience. I think it's pretty easy to establish that customer and business needs don't respect architectural boundaries. Is there anyone that would disagree with that statement, or maybe agree? I think it's entirely fair to say customer and business needs don't respect architecture boundaries. You can design the platform in the most architecturally beautiful way, and within months, I promise, something will come up where you will be like, "That doesn't quite fit into this beautiful logical domain model that I've designed. Now we have to think about it." At the same time, org structure needs to evolve with the business, because business priorities change, and what you need to do simply needs to change. Along with the business strategy and org strategy, technology ownership will move as well.

Something where you thought that a certain set of e-commerce related functions will forever live with that one team, there now might be a different way to look at it. There might be recurring payments and one-off payments. There might be a completely different paradigm shift in terms of, everyone switching to Google Pay and Apple Pay or whatever it is. Technology ownership will move, and will need to move. If you're not moving it, I think you're being complacent, or the organization is being complacent and not getting the most out of its technical team.

I think the way to summarize all of this is Conway's Law. Who knows what Conway's Law is? It's basically this idea that technology ends up taking the shape of the organization that makes it. How you organize your teams is how your technology ends up being built. My insight through this set of changes was that Conway's Law, of course, applies. It's a fact. The sad thing is that there is no perfect reverse Conway maneuver. It's really hard to maneuver yourself out of the technical architecture that a certain organizational structure has designed. The actionable advice out of this is that it's really important to build technology, to build architectures, keeping that fact of ownership transfers and external contributions in mind. I think we need to plan for that.

Specifically, I think what that means is enforcing consistency. Every Eng manager or every architect will have their favorite pet tools and pet patterns or ways of doing path to production, or anything like that. I think the only responsible thing for leaders in the company is to say, "Sorry, we appreciate you have your own way of doing things, but this is the company way. This is how we do it." The main reason is that that person probably won't be there in a couple of years, and even that team won't be there; that entire structure will look different. If you end up with a technology surface that's a patchwork of a lot of different patterns, and different sets of tools, and different languages, and different whatnot, that's the thing that ends up slowing organizations massively.

Because there is no perfect reverse Conway maneuver, the best thing you can do is enforce consistency. Then it's easier to transfer Lego blocks rather than entire custom-built items, because you can just reassemble Lego blocks in a different way. That's the first lesson: consistency is king. As few languages and technologies as possible. Build technology for transferability of ownership and external contributions, even when it comes at the expense of the autonomy of individuals. I know engineers take that very personally, that autonomy. I did as well when I was writing code.

I think, ultimately, when you look from way higher up in the organization, that's the only thing that makes sense over 3 or 5 or 10 or 15 years. Trainline has been around for 20 years and we still have some stuff in production that was written 15 years ago. I think the best you can do is try to keep things as consistent as possible, with few languages, few technologies, and build everything for that kind of transferability.

2. Scaling Costs Sublinearly to Scaling Traffic

We're going to talk about costs. This is an interesting one, because when I was at Google, we never had to worry about production costs at all. Google had all these endless data centers, and it was all free. I was a new grad software engineer and I could deploy something that I now know costs probably hundreds of thousands per year, millions, maybe, just in terms of volume. I think Google has potentially changed that philosophy a little bit in the last couple of years. I'll give you one fact for Trainline: our AWS bill is about 25% of our overall software engineer compensation bill. It's non-trivial. It's a meaningful part of the business. I can easily make a case that we should spend 10 people's worth of time for a year if we're going to save 10%, if that ends up being ROI positive for the company. How many of you had to worry about production costs? Most people had to worry about it. That's good. This will hopefully be useful for you.

We realized that we had really been growing over the last couple of years, and it felt like the cost of the platform was growing, not miles faster than the traffic, but a bit faster than traffic. I like to run things in an efficient way. It's my engineering pride to do it that way. I was like, let's take a goal. We don't have to do anything super drastic. The CFO is not asking me for drastic cuts. It's just about making ourselves disciplined and making sure that we're running a tight ship. Let's take on a goal to drive down at least the annual run rate, if not the entire annual bill, by 10% in terms of production cost. Then we looked with my leads at how that would even look. What are the levers we have? We have a massive surface. We're very microservice oriented.

We have something like 700 services. We have more than 100 databases. I have opinions on whether that's right or wrong, whether that's too much. I think having maybe more microservices than engineers is an interesting one. It kind of works. The reason I say that is because when you look at the surface, it's not immediately obvious where the flab is. You don't know what's efficient and what's not. There is a lot of stuff, a lot of surface, a lot of jobs, a lot of databases, a lot of production platform and data platform, and a lot of different things. We looked a little bit at what levers we would have. We thought, ok, there's probably some unneeded data we could clean up. We could consolidate non-production environments; maybe we don't need seven of them, maybe we can drop one, and that gives us some savings.

We could probably review some old, low-value services that are either deprecated or orphaned; there might be stuff we can turn down there. We obviously instantly had this idea of, are we even right-sized? Are we overprovisioning the platform? We need to review the overall scale. We can look at the data retention policies. Then the tricky one: reviewing architectural choices that we made for services. I think the team has gone very hot on cloud functions, on lambdas, in the past year. I was like, is that actually efficient? Should we have longer-running services, or should we shift some of the long-running services to cloud functions, and so on.

This is not necessarily an exhaustive list. There's probably more one can do when you think about the overall cloud cost. These are some of the things that I could think about, or that we thought about as a team. Some of them are lower impact and risk; some of them are higher impact and risk. In hindsight, I wish we had been a bit more explicit about those things upfront, but we made this list.

Then we decided that we were just going to delegate this problem down. We have smart engineers. We have smart engineering leaders. We're going to go to all of our teams that own parts of the technology stack, teams in the platform, and we're going to tell them, each one of you is on the hook to drive 10% of the bill down for your area. We had good enough attribution, so we would know what was driving the cost. Can anyone guess what went wrong with that? I thought it was a perfectly valid plan. It's too hard to figure out top-down where the flab is. Individual teams know their systems much better, so we'll just split the bill and tell them, each one of you brings your part of the bill down by 10%, and that should be fine. The problem turned out to be, of course, that some parts of the platform were already very efficient, and others weren't. The reward for the effort is disproportionate across these areas.

That 10% for some teams meant they had to really dig in and either spend an unnecessarily large amount of engineering time getting the savings, or start doing risky things, which is actually what ended up being more of a problem for us. We were hoping they'd go review architecture choices, switch lambdas to long-running servers, and a few of these different things. Instead, most of them were like, I have a goal to drive this down by 10%, I can probably scale a bunch of things down and move from having 7 instances of a server to having 5. Which in some cases worked, and it was fine.

Equally, in the scope of three months, we had something like four outages caused by underprovisioned services. It's one of those where you roll it out and you're like, it seems like it's working. Tuesday morning is the peak traffic for us. You make some changes Wednesday or Thursday, and then a couple of days later, as we hit the peak, the autoscaling doesn't work fast enough or something else goes wrong related to being underprovisioned, and the platform goes down in the morning peak, at the worst time, exactly when we need to be up.
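
One way to think about what went wrong is that instances were sized against quiet mid-week traffic instead of the recent peak. A minimal sketch of that sizing check, with made-up throughput and headroom numbers, might look like this.

```python
import math

# Hypothetical right-sizing check: size instance counts against the recent weekly
# peak, not against whatever traffic looks like on a quiet Wednesday afternoon.
RPS_PER_INSTANCE = 120   # assumed sustainable throughput of one instance
HEADROOM = 1.4           # buffer for surges while autoscaling catches up

def min_instances(recent_peak_rps: float) -> int:
    """Floor of instances that survives the morning peak without waiting on autoscaling."""
    return max(2, math.ceil(recent_peak_rps * HEADROOM / RPS_PER_INSTANCE))

# A quiet mid-week view (400 rps) suggests 5 instances is plenty, but sizing against
# the Tuesday 08:30 peak (1300 rps) tells a different story.
print(min_instances(400))    # 5
print(min_instances(1300))   # 16
```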

Those are some of the stories of what happened. Ultimately, I think we did projects which did give us 10% savings, but they were possibly not the best value for money, because engineers ended up spending more time than they really should have. Then we also had these unintended consequences in terms of outages, where people went for what felt like a simple way to save money, but it really wasn't what we needed to do at all. The so-what of this one is: cost management is important for the long-term efficiency of the platform. However, understanding where the flab is in a very large technical system, especially a very large, fragmented, microservice-based system, does sadly require some centralized thinking. Just delegating that problem down doesn't work. That was my lesson, and I was very sad to learn it.

Equally, predicting which cost-reducing efforts are worthwhile can be tricky, especially for people who maybe don't have the full big picture but only know their part of the stack. The lesson here for me was: don't blindly push down cost saving goals to individual teams. There are more ways it can go wrong than right. If I were to do something like this again, I probably would have had a bit more of a centralized task force that can work with individual teams, but evaluate where the investment to save is actually worth it versus not. That's basically it. Manage system cost savings efforts centrally. I think fully delegating ends up backfiring.
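
A hypothetical version of that central evaluation could be as simple as ranking candidate savings by expected return and risk, rather than handing every team the same 10% target. The candidates, cost figures, and scoring below are illustrative only, not Trainline's actual numbers or process.

```python
# Hypothetical central triage of cost-saving candidates: rank by ROI discounted by risk.
candidates = [
    # (name, estimated_annual_saving, engineer_weeks_needed, outage_risk_0_to_1)
    ("delete unneeded data",             40_000,  2, 0.05),
    ("drop one non-prod environment",   120_000,  6, 0.10),
    ("downsize already-lean service",    15_000,  4, 0.60),
    ("lambdas -> long-running service", 200_000, 10, 0.20),
]

ENG_WEEK_COST = 3_000  # rough fully loaded cost of an engineer-week; made-up figure

def score(saving: float, weeks: float, risk: float) -> float:
    """Net annual saving after engineering cost, discounted by outage risk; higher is better."""
    return (saving - weeks * ENG_WEEK_COST) * (1 - risk)

for name, saving, weeks, risk in sorted(candidates, key=lambda c: -score(*c[1:])):
    print(f"{name:35s} score={score(saving, weeks, risk):>10,.0f}")
```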

3. Scaling Large Microservice Based Architecture

The last lesson. The first one was more about team productivity impact, and scaling that. The second one was scaling the efficiency of the platform. The third one is scaling for growth in traffic and reliability. I am going to briefly try to cover three big bouts of outages we had. In October 2021, I think I had barely passed my probation, or maybe not even yet, and Trainline goes down for something like four hours one day, and then it goes down for two hours the next day. Disaster, like really bad. What we realized happened was that, at that stage, the world was again recovering from COVID. We were recovering from close to zero traffic back to what was starting to be historical maximums, day after day and week after week. The entire time through COVID, software development didn't stop at all. Trainline didn't even lay off any people.

You had 18 months of 350 engineers writing code that was never tested at scale. What could possibly go wrong when that happens? Imagine batching up 18 months of changes, and then suddenly it goes to production. Things were loose everywhere. It wasn't one simple cause. The biggest one that we realized, in this particular instance, was that, because the platform is very microservice oriented, a lot of new services had been added. All of them were maintaining connections to the database, and especially as they were scaling at peak times, they were creating more connections.

Ultimately, the biggest thing causing the outage was contention on database connections. We were running out of pools. This is especially in the parts of the platform where we still have relational databases. We are on an absolute mission to move entirely to cloud native storage, but we actually still have plenty of relational databases. This was the bottleneck: where we have them hosted, even if it's a mega machine, it's still a single machine and cannot keep up with that many connections. That got tweaked and tuned. We survived that period, and we were good for a while.
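
The arithmetic behind that bottleneck is worth spelling out: per-service connection pools look harmless on their own, but multiplied across a large fleet at peak they can exceed what a single database host can hold. The numbers below are made up for illustration.

```python
# Back-of-the-envelope check, with invented numbers, for one shared relational database.
services_talking_to_db = 40        # services holding connections to this database
instances_per_service_at_peak = 8  # autoscaled instance count at the morning peak
pool_size_per_instance = 20        # connection pool size each instance maintains

demand = services_talking_to_db * instances_per_service_at_peak * pool_size_per_instance
db_max_connections = 5000          # what the single hosted database instance can sustain

print(f"Potential connections at peak: {demand}")    # 6400
print(f"Database limit:                {db_max_connections}")
if demand > db_max_connections:
    # The quiet failure mode: no single team exceeded anything obvious, yet the
    # fleet as a whole exhausts the shared resource exactly at peak traffic.
    print("Connection contention likely at peak")
```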

Then, I don't know what happens in the fall. Now I have a little bit of PTSD every time it comes to October and November, and I just don't know what's going to be next. Somehow completely unrelated, but exactly a year later, we ended up getting to this place where, again, the database was knocking out. I think the previous one was some old Oracle databases and this one was SQL Server, so it's not exactly the same thing. It was knocking out, and we didn't see it coming. We thought we had super good observability, we keep an eye on everything, and it just starts happening. It's not like when you ship a bad version, where you roll back within a couple of minutes and it's all safe. This is, again, hour-plus long outages until we switched to a new replica of the database. Just not good stuff.

It took a while to analyze what happened. Basically, over that year, we had been adding more of these features that are not purely related to the transaction of buying a train ticket, but are actually related to the journey itself, to traveling. You saw the nice map with trains moving; we have a lot of features related to that, where you can follow your journey, get a notification of the platform, of a change, all that. All of those additional flows related to the journey experience had actually been talking to the orders database, because they need to know what train you're on. That's how it was designed.

Most of our observability was centered on our transactional flows, not on all of those other flows. What happened over the year was that gradually the traffic mix changed, and the load on the orders database significantly increased. Because it happened gradually, we didn't really notice, especially because we didn't think that's where it was going to go wrong. That's how stuff goes wrong: you don't expect it to go wrong where it does. We had to do some more database-related work to fix that, and we survived this. That was that.

Then, just a couple of months ago, in November this year, we had a fairly insane bout of DDoS attacks that may or may not be attributed to some nation states, related to conflicts that sadly we have in Europe right now. Most of the travel sector is seeing this. We're not the only ones; many other companies are seeing it. As that was happening, we had to significantly tighten up our DDoS protections and change a few other things related to that. Then, just as we were feeling good about it, we were like, yes, they're hitting us so hard and we're holding up. We're not buckling at all. Then we just go down one day, for an hour and a half. That was the stuff that ended up in the press. It happened a couple of times in the scope of a month. As we dug in, initially it looked like we were getting DDoS attacked, but it wasn't that. There were sloppy retry strategies all around the stack. We didn't really have a coordinated retry strategy overall, like who retries what and where.

Even small issues like network blips or something going wrong could snowball into a situation where we end up DDoSing ourselves. Because something is down, the client retries, and the Backend For Frontend retries, and backend services retry, and all of that ends up creating 10x load that eventually brings the platform down. That was fun. It resolved eventually.
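
A common way to tame that pattern, sketched below with illustrative parameters, is to retry at only one layer, with exponential backoff, jitter, and a shared retry budget, so a blip degrades gracefully instead of multiplying load. This is a generic sketch, not Trainline's actual retry policy.

```python
import random
import time

def call_with_retries(call, max_attempts=3, base_delay=0.2, retry_budget=None):
    """Retry a zero-argument callable with exponential backoff and full jitter.

    retry_budget is a mutable single-element list shared across callers, so the
    process as a whole cannot amplify a downstream failure without bound.
    """
    budget = retry_budget if retry_budget is not None else [10]
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts or budget[0] <= 0:
                raise  # fail fast instead of piling more load onto a struggling dependency
            budget[0] -= 1
            # Full jitter: sleep a random amount up to an exponentially growing cap.
            time.sleep(random.uniform(0, base_delay * (2 ** (attempt - 1))))

# Usage sketch: let exactly one layer (say, the Backend For Frontend) own retries,
# while clients above and services below fail fast, so a blip cannot become 10x load.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 2:
        raise ConnectionError("network blip")
    return "ok"

print(call_with_retries(flaky))  # "ok" after one jittered retry
```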

What's common about all of these? This is the architecture lesson that I got out of it: none of these were caused by a single team, a single change, or a single regression. That's where it starts to get really difficult. For each one of them, it was a lot of straws, a lot of papercuts, that ultimately cause you to bleed to death. That's how it goes. It's not like you can point to this one change, and then with all your best DevOps processes you can very quickly detect it, and roll it back, and all that.

No, things build up over time: sloppiness in a certain area, multiplied by a large team, everyone chasing their own goals, and a very microservice, spread-out architecture like that. The so-what is: predicting a bottleneck in a large microservice-based system is really hard. I still don't know if we quite know how to predict it. You don't know what you don't know. Often, you're looking at the tragedy of the commons. It's not that one team gets it wrong.

Everyone just adds a few too many database connections. Everyone just adds a few too many retries. Everyone optimizes slightly wrongly for scaling, and eventually the whole thing bubbles up. The best lesson I could get out of this, and I think we're still trying to learn and figure out how we can handle this better, is: regularly review longer-term traffic mix or load changes. Trainline is pretty good at doing all the release-by-release, or day-by-day changes. All of that is automated. You have alerting, monitoring. All of that is pretty solid. You sometimes need someone to sit down once a quarter and look at how the mix on a couple of critical databases or services actually changed over the past six months, or something like that. What is that telling us? Where is the next bottleneck going to be? I think that's the one.
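
That quarterly review can be as simple as comparing each caller's share of load on a critical resource between two periods. The sketch below uses invented request counts against an orders database to show the kind of gradual mix shift that release-by-release monitoring misses.

```python
from collections import Counter

def mix_shift(previous: Counter, current: Counter) -> dict:
    """Percentage-point change in each caller's share of load on a shared resource."""
    prev_total, cur_total = sum(previous.values()), sum(current.values())
    callers = set(previous) | set(current)
    return {
        c: round(100 * current[c] / cur_total - 100 * previous[c] / prev_total, 1)
        for c in callers
    }

# Illustrative, made-up request counts against the orders database, by calling flow.
q1 = Counter({"checkout": 60_000_000, "refunds": 5_000_000, "journey_tracking": 10_000_000})
q3 = Counter({"checkout": 65_000_000, "refunds": 6_000_000, "journey_tracking": 55_000_000})

# journey_tracking quietly grows from roughly 13% to roughly 44% of the load: exactly the
# kind of gradual change that day-by-day release monitoring never flags.
print(mix_shift(q1, q3))
```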

That's something that we're now trying to put into action. Then, microservice fleet coordination is absolutely critical. Again, it's having to guide teams and have a really strong architecture function or principal engineering function, or whatever you want to call it: someone who's guiding teams, who's looking at the big picture on top of the individual ownership of which team owns which part. Guiding teams on things that could add up to create a lot of mess, like retry strategies, or scaling policies, or anything that touches any scarce resource that's downstream. That's that lesson. Observe over the longer term, coordinate the microservice fleet.

Recap of the 3 Lessons

Consistency is king. I think it's absolutely critical for productivity in the long run. If you're in a series A startup or a seed round that's just trying to churn something out very quickly, and you can throw away that technology, then you won't care. If you're genuinely building a business where the technology should survive for 5 or 10 or 15 years, then I think it's worth insisting on consistency, even if engineers won't love it. Manage system cost saving efforts centrally, because delegation just doesn't really work; people lose the wider context. Then, observe over the longer term and coordinate the microservice fleet to avoid many outages.

Questions and Answers

Participant 1: You've mentioned building and architecting towards changing structures and people, and that intuitively goes against optimizing for knowledge and change management in the near term. How do you balance that?

Nikolic: I think there is a question here of short, medium, long term. There is something people absolutely love: you build it, you own it. I really believe that. As a strategy, you should carry the pager and pay the cost of being paged in the middle of the night if you don't architect or build something properly. However, that really only applies for the first six months of a service, nine, maybe up to a year. After that, the person who built it is going to move on anyway. This is a fast-moving industry, talent moves around, people want a new challenge, all that.

The way we think about it is: you build it, you own it, but eventually, you need to be ready to hand it over. When we had the platform and verticals model, that was the approach. A vertical would build a new service and be on the hook for it for six or nine months. The relevant parts of the platform would advise through that period: what do they need, what are the criteria they need the service to pass for them to adopt it? The idea would be that within 6 to 12 months, they adopt it. There is that transition that forces you to hand over. This is hard. It's not like we always get it right.

This is the thing that causes contention in the teams, all that. I think you have to try to pull teams back out of that mindset, because I've seen too many times where you end up with a bus factor of one, and there is only one person who knows the service, and no one dares to touch it. That's not good. Especially if you want to have an agile organization that moves and really always focuses on the most important things, you need to have these Lego blocks that anyone is at least moderately comfortable picking up and saying, yes, I'm not an expert, but I'm comfortable picking this up, and I can take care of it and learn more, and all that.

Participant 2: I have a question about the first lesson that you mentioned, consistency. If you think about maybe 5 years ago, or maybe 10, when microservices came out, we were so proud that they could be built with different languages, that you had stuff owned by the team, autonomy. Now, after that time, we're talking about consistency, and we say it's king. Do you think that's the future? Do you see any downsides to it? How do you see that?

Nikolic: You already need to have different languages for the frontends. In our case, that ends up being three different ones, like the two on Android and iOS. You have that fragmentation of the skill set in the organization. Then on the backend, we have most of our stuff in .NET. Then we have some Ruby stuff that came through an acquisition; it was always too big to replatform and it has always stayed as its own thing. It's really tricky when you need to put together a cross-functional team to deliver something, even if it's simple, like one field being plumbed from the database to the UI. Something that's meant to be very simple, and you still need 10 people to do it, which I find almost laughable. It's ridiculous.

Because you need your Android pair, an iOS pair, a .NET pair, and a Ruby pair. Some of that complexity is hard to get around, because native platforms like Android and iOS have their own languages. That's why I would really try to bring it down. I know engineers don't love it, but I draw the line and say no. I do think it's important to leave a little bit of space for people to innovate and try new things. If the new thing works well and is approved, then there needs to be a path for making it the official thing for everything and everyone else. But just freely letting everyone choose their own thing very quickly becomes completely unmanageable. If the business becomes successful and there is no path, what do you do a few years down the line?


Recorded at:

Oct 22, 2024
