Hi, Ralph. It is a real pleasure to meet you. My name is Mike and I am one of the cofounders of Gilt and ex-CTO. We started Gilt in 2007, originally on a monolithic architecture and over time, as the company grew to over 1,000 people, we developed a microservices architecture. Today, at Gilt, there are over 300-400 microservices behind the apps and the web sites.
It has been really interesting building Gilt and experiencing the types of problems that we had early on, the types of problems as we scaled the business and, today, the types of problems that emerge with microservices architectures. So to start, I think one of the really compelling reasons why people today are adopting microservice architectures is that they solve a number of challenges that larger organizations have. Namely, we find that as teams want to move more quickly and have more control over their application, across the boundary of QA, across the boundary of operations, the question becomes “How do we enable that in the context of a larger organization?” One of the biggest benefits of a microservices architecture is that each team can own the software that it is developing, can manage when to deploy to production and can also manage how it is behaving in production.
And by keeping these applications independent, we gain a level of isolation that just wasn’t possible before. That in turn leads to higher reliability, higher uptime and the ability in general to scale individual parts of an application independently. When we combine that with the growth of the internet and the pace at which many businesses today need to scale, that ends up being critical, because we can scale, say, the user registration component or service of an application independently of the other applications. So that is a huge benefit and it made a number of challenges easier. But then we end up in a situation where engineering teams are now managing literally hundreds and hundreds of applications within their organization. At Gilt, we used to measure the number of microservices divided by the number of engineers, and that gave us a sense of how many of these things existed. When everything is perfect, it does not really matter, because the applications run for a long time. But say Java 6 is end-of-lifed (it is just one example) and we want to upgrade all of the applications to Java 7 or Java 8. All of a sudden, you have to go and touch every single one of those individual applications, and that becomes a tax on the development teams. That is an example of the type of problem that emerges with microservice architectures.
The other thing, and I think it may be kind of obvious to say but it is subtle in its implementation, is that everything in operations tends to need automation. With monolithic applications, one or two applications, we can get away with deploying things manually, setting up monitoring manually, responding to alerts manually. When you are running hundreds and hundreds of applications with a reasonably sized engineering team, all of a sudden every single task needs to be automated or it just takes too long. So, as organizations grow with microservice architectures, they implicitly have to make a much larger investment in automation and tooling. We think that over time that provides leverage, but it is a big, big investment in terms of what is needed to operate microservice environments.
Yes. I don’t think that there are any hard rules for deciding when an application is too big or too small. We have certainly seen both extremes, both at Gilt and at other companies. Applications getting too large, I think, ends up resulting in challenges of productivity or challenges of scaling. As an example, if it is difficult to add a feature to an existing application, is it difficult because the application has become so large that the testing process takes too long or the release process is too slow? That may be a sign that it is time to break a larger application into multiple pieces. On the “too small” side, I remember that we had an instance of a microservice that was built to handle a single operation: your ability to subscribe to a newsletter. That was just one function provided by a microservice and, over time, we think that was too small; eventually we ended up with a service for all subscription management, which then subsumed that particular feature.
In practice, it feels difficult to make that decision in advance, because one of the challenges is that you do not really know, when you are creating the service, how much that service is going to grow or what the needs are going to be. So you are making a bet. One thing that has proven to be true is that it is easier to combine multiple services into one than to take one service and break it up into two. So, in general, as we have approached this pragmatically, we say “Try to make the best decision by really understanding the context in which that service will exist and the API that it provides, but if uncertain, err on the side of creating a smaller service.”
Ralph: And after all, the architecture of microservices should provide a flexibility to do this afterwards.
Yes.
Dependencies become a big, big issue. I mean, they were already an issue regardless. What we have found with microservices is that over time, as we create services, part of what many development teams do is provide rich libraries for people to consume their services. And over time, technology changes. So, in 2008 we built services using, say, Java 6 and the Apache HTTP client. But then Netty came out and we started using a Netty client, and then it was Java 7 and then Java 8 and then, by the way, along the way, some teams switched to Scala 2.8, 2.9, 2.10 and 2.11 and JSON processing went from Jackson and Jerkson to Play JSON.
So these technologies are changing, and it seems that every year the teams are making the best decisions about the types of libraries to offer for their services, and that feels great. What happens at the end of that is that if you need to consume half a dozen services that were built over a period of 10 or 20 years, when you go to include those libraries, you are now pulling in what we call “the world”, meaning that everything that was ever built gets included into your project. So now you have a really big project and the chance of having a dependency problem just increases exponentially, right? Because you pull in dependencies for a version of a client library from six years ago, a more recent version from four years ago, you pull in Java 6, 7 and 8 (how is that going to resolve?), Scala is running on the JVM, and all of that has to interoperate. So we are basically set up to guarantee a problem.
A couple of techniques have really helped us mitigate managing the dependencies. The first, and probably the most important one, is to really focus on the design of the API for services, or to design a good schema for things like events, upfront, as first-class artifacts. One of the projects that we have built at Gilt, and a personal project of mine, is called apidoc.me. It is open source and free, hosted as software as a service, but the idea behind APIdoc is that you can describe your REST service in a JSON format and it will provide documentation of your services as well as generate client libraries using the native dependencies of the frameworks that you are using. So, for example, if you want a Ruby client, it will generate a client for your service in Ruby where the only dependency is the JSON gem, I think. So, very minimal dependencies.
Similarly, if you are using the Play framework, you can download a client library for Play 2.2 or Play 2.3. Those clients do not actually require any additional dependencies to be brought into the project, yet provide a nice way to consume the service. Other projects that we have seen, like Swagger 2.0, do something similar; they moved to a model where you describe your API first. In APIdoc we have experimental support for accepting Swagger input. In the event world we have seen this over and over again with Protocol Buffers, Avro and Thrift: all of those systems provide a way of describing your data as a first-class artifact and then use code generation to handle the serialization and deserialization. That has proven to be an amazing help, particularly when you have a couple of generated clients that do not pull in additional dependencies.
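To make the idea of a dependency-free generated client concrete, here is a minimal sketch of what such a client might look like in Python. The service name, endpoint and shape are hypothetical (this is not output from APIdoc itself); the point is that the client uses only the standard library, so including it pulls no new dependencies into the consuming project.

```python
import json
import urllib.request


class UsersClient:
    """Hypothetical generated client for a 'users' service.

    Only the standard library is used (urllib + json), so consuming
    this client adds no third-party dependencies to your project.
    """

    def __init__(self, base_url):
        # Normalize the base URL so path joining below is predictable
        self.base_url = base_url.rstrip("/")

    def get_user(self, user_id):
        # GET /users/:id and decode the JSON body with the stdlib
        with urllib.request.urlopen(f"{self.base_url}/users/{user_id}") as resp:
            return json.loads(resp.read())
```

A consumer would simply instantiate `UsersClient("https://api.example.com")` and call `get_user(...)`; there is nothing to add to the build beyond this one file.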
Going down that road, the development process changes, because we, as developers, now have to be very cognizant of the changes that we make to our APIs and to a schema. In particular, we have to be cognizant of backwards and forwards compatibility. To the extent that we manage those two things and minimize the number of times that we break existing applications, we can end up in a world where the number of dependencies that we have in our libraries is significantly reduced. Where we do have dependencies, they are short; we do not get these long chains where I depend on this library which depends on that library which depends on that library. That creates a much more manageable piece of software.
In that environment, when things break, it is much more likely that the author of the service will understand what broke, because they are probably the person that introduced the dependency into the project. One thing that we have learned concerns backwards and forwards compatibility. I think a lot of people understand backwards compatibility. The way we like to describe it is: imagine that you are receiving data and you store every piece of data in a database.
Three years from now, will the library that you are using to parse that data still work over all of the data that you have collected? If so, you have successfully designed a backwards compatible system. Forward compatibility is different, because you have a library that is actually accepting the messages. A year from now, if you change the schema of your message or your API, will that same library be able to process that request? If so, you have created forward compatibility.
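The backwards-compatibility test described above can be sketched as code: replay every payload ever stored through the current parser. The payload shapes and field names here are invented for illustration; the pattern is what matters.

```python
import json


def parse_user(payload):
    """Current parser. It tolerates fields added after old records were
    written (unknowns are ignored) and defaults fields old records lack."""
    data = json.loads(payload)
    return {
        "id": data["id"],                # present since the first version
        "email": data.get("email", ""),  # added later; default for old rows
    }


# Payloads as they might have accumulated in a database over the years
archived = [
    '{"id": "u1"}',                                          # oldest schema
    '{"id": "u2", "email": "a@example.com"}',                 # current schema
    '{"id": "u3", "email": "b@example.com", "name": "B"}',    # newer, extra field
]

# The backwards-compatibility check: today's parser must handle every
# payload ever written, not just the ones being written today.
users = [parse_user(p) for p in archived]
```

If this loop runs cleanly over all archived data, the system is backwards compatible in exactly the sense described above.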
Having both of those allows us to evolve the schema without seeing production issues and without having to pause and upgrade all of the infrastructure that uses your libraries. One of the things we experimented with in APIdoc is a nice feature for API management: if you watch an API that you are interested in, when the authors add features, you can receive a notification of things added to the API and whether those are breaking or non-breaking changes. We combine that with semantic versioning, which allows the developer to say “I am aware that I am making a breaking change, therefore I am going to increment the major version number of my API.” That provides a good signal to consumers that they are going to have to do some work to upgrade. Combined with a few practices, like making sure we support both versions of the API for some period of time, that gives us a way of managing dependencies and incrementally upgrading them without ever needing to coordinate multiple pieces of work.
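The semantic-versioning signal described above is mechanical enough to sketch. This helper is hypothetical (not part of APIdoc); it just shows how a breaking change maps to a major bump and an additive change to a minor bump.

```python
def bump_version(version, breaking):
    """Bump a semver-style 'major.minor.patch' version string.

    A breaking API change increments the major version, signalling to
    consumers that upgrade work is needed; a non-breaking (additive)
    change increments the minor version.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if breaking:
        return f"{major + 1}.0.0"
    return f"{major}.{minor + 1}.0"


# A type change is breaking: 1.4.2 -> 2.0.0
# Adding an optional field is not: 1.4.2 -> 1.5.0
```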
Ralph: So you have some kind of inventory or repository and, on the other hand, you try to design your APIs to be really tolerant, I would say. For instance, when receiving a JSON object with one more attribute than I need, I just ignore it.
Yes, and that is actually a very good example of forward compatibility. Some people will build client libraries that throw an error if they see an attribute that they did not expect. That happens a lot and, if you think about it, it is a nicer way to develop for today: if you receive JSON input or binary input that you didn’t expect, you get an error. The danger is that you are not forward compatible, because you have prevented the sender from ever, ever adding anything to the message. This is one of the subtle nuances, and so providing tooling that institutionalizes the knowledge that we do have to think about forward compatibility turns out to be really important.
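The contrast between the two parsing styles can be shown side by side. This is an illustrative sketch, with a made-up two-field message:

```python
import json

EXPECTED = {"id", "email"}


def parse_strict(payload):
    # Rejects anything unrecognized: catches today's mistakes loudly,
    # but the sender can never add a field without breaking this consumer.
    data = json.loads(payload)
    unknown = set(data) - EXPECTED
    if unknown:
        raise ValueError(f"unexpected attributes: {unknown}")
    return data


def parse_tolerant(payload):
    # Keeps only the fields it knows about and silently ignores the rest,
    # leaving the sender free to evolve the message: forward compatible.
    data = json.loads(payload)
    return {k: data[k] for k in EXPECTED if k in data}
```

When the sender later adds a `name` attribute, `parse_tolerant` keeps working while `parse_strict` starts failing, which is exactly the trade-off described above.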
In addition, it is somewhat subtle to actually understand the types of changes that result in breaking changes in your API. As an example, a lot of people will think “Well, I have an API and it returns a user, and instead of returning a user I am now going to return a car.” People understand that that is going to break the API. But say you have a user whose primary key is an ID, and initially it is typed as a UUID and you want to change that to a string. Is that a breaking change or not? Depending on how people are using the service, it could break code that is out there. It is our view that it actually is a breaking change: you have changed the data type in a non-compatible way, with no way of guaranteeing that existing clients can handle that change. Therefore it is a breaking change; increment the major version number, and provide an upgrade path or do not. One of the great things about tooling is that if we provide that feedback to the developer, we can actually learn together what types of things constitute breaking and non-breaking changes and get that feedback loop. That feedback loop is what enables us to be a little bit better in the future than we are today at minimizing the number of breaking changes, right? If we do not have a feedback loop, we never get better.
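A few of those rules can be encoded directly. This classifier is hypothetical, in the spirit of tooling that diffs two versions of an API spec; the field representation (a dict of name, type and required, with None meaning the field is absent in that version) is an assumption for illustration.

```python
def classify_field_change(old, new):
    """Classify a change to one field between two spec versions."""
    if old is None:
        # Adding an optional field is safe; a new required field breaks
        # existing clients, because they do not send it.
        return "breaking" if new["required"] else "non-breaking"
    if new is None:
        return "breaking"        # removing a field breaks consumers
    if old["type"] != new["type"]:
        return "breaking"        # e.g. uuid -> string: no guarantee
    return "non-breaking"        # same type, e.g. docs-only change
```

Running the UUID-to-string example through it yields "breaking", which matches the judgment above, and accumulating cases like these is one way to build the feedback loop the tooling provides.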
Yes. It is compile time versus runtime dependencies; in some application frameworks, like Node.js, it is all runtime. So, in general, even if people are coming from a world where they are uncomfortable with compilers and uncomfortable with type systems, with a compiler whole categories of errors become compile time errors, and in general we find that those are amazing, because you find out about them without having to write tests and without having to see them in production. We personally love compile time errors; we think they are great and we see runtime errors as a disaster. If we end up with a runtime error in production, something just fundamentally broke in our end-to-end process, and so we need to figure out what broke and then figure out how to get better at that.
Recently, I remember an example where we had an application that was communicating with multiple services. One of the services was using Ning client 1.7 and one was using Ning 1.8, and there was some method signature that changed which, due to the nature of the compiler (type erasure, I do not remember exactly), was not able to be caught at compile time. We did not catch it in our tests either, so it got to production and caused a production issue. It was a minor issue, but a production issue nonetheless. That is a problem. So, when we go back and dissect it, we say “Well, we had multiple versions of the dependent libraries in production. Let’s go solve that as a root problem.” But we love compile time checks. Those are amazing. The compiler is our friend.
I think one thing that we have learned is that it is not just that the architecture looks different; fundamentally, the organization is different from the beginning. When we were looking at this at Gilt, the journey started about four years ago, and four years ago we had more than one code base but largely a monolithic architecture. But really, the decision was that we wanted to empower the engineering teams to innovate, to be productive and to love what they are doing. So we started from the premise that we actually wanted to build an organization where the engineers were trusted to do great work. We wanted to trust our people to do great work.
Then, from there, the management philosophy and the architecture evolved. So it is profoundly different; it is a fundamentally different place to work, a different working environment, with different priorities. Not better, not worse, just different. From there, the things that became important were: who gets to make the decision that software is ready to be released? We said that we trust our people to make that decision. How do they release to production? That is something that we wanted to systematize. So early on we invested in continuous delivery, which created this pipeline to production, and early on we serialized all the releases.
So, anyone could queue up a release, but only one change would go to production at a time, and when that release was complete, the next change would go. As we have evolved, today Gilt has adopted Docker as one of the very few standards, as a way of describing how to run an application. That then enables tools like Elastic Beanstalk or CodeDeploy in AWS, or an internally developed tool named ION-Roller, which is also open source, to take over a Docker image and actually run it in the production environment. So automating that pipeline becomes really, really important.
Culturally, it is kind of interesting, but we can make decisions about what we allow and what we do not allow. One of the things that we decided early on is that you are not allowed to make a breaking release. You are just not allowed, and if you deploy something and you broke something, you just roll it back. You roll it back by re-releasing in most cases, and it is actually profound. It means you do not actually have to coordinate across all the different teams, because all the teams are working on making sure that they are moving the architecture forward.
As a micro example of that, we use a lot of Postgres at Gilt, probably 50-70 distinct Postgres databases in production, and we made the decision many years ago that we would deploy schema changes independently from application changes. That has turned out to be a really positive decision. We have an open-source tool called Schema Evolution Manager that helps us manage schema changes as software artifacts themselves. That is great, and we do not pretend that those changes can be rolled back; there is a more deliberate process for individuals to carefully deploy schema changes, because rollback is not guaranteed. What that then did was ensure that every single software change that is made can be rolled back, at least for some period of time, and that is powerful, right? Because now we can keep evolving the architecture, we are never breaking anything, we are never co-dependent on another release, and when you make a mistake, we can roll back quickly. Mistakes do happen, but they have limited scope, because we can deploy incrementally, we can measure and we can roll back very, very quickly.
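One consequence of roll-forward-only schema changes is that destructive migrations deserve extra scrutiny before they ship. Here is a deliberately simple sketch of such a safeguard; it is not part of Schema Evolution Manager, and a naive substring scan like this is only a first line of defense, not a real SQL parser.

```python
# Statements that cannot be undone once applied; a migration containing
# any of these gets flagged for careful manual review instead of the
# normal automated path.
DESTRUCTIVE = ("drop table", "drop column", "alter column", "rename")


def is_roll_forward_safe(migration_sql):
    """Return True if a migration looks purely additive (roll-forward safe)."""
    sql = migration_sql.lower()
    return not any(keyword in sql for keyword in DESTRUCTIVE)
```

Under this check, `ALTER TABLE users ADD COLUMN email text` passes, while `ALTER TABLE users DROP COLUMN email` is flagged, which mirrors the deliberate, careful process described above for changes that cannot be rolled back.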
Ralph: So that really sounds like the throughput of features must have gone up.
I think in general we see increased productivity, increased developer happiness or well-being; we have incredibly long tenure of software engineers, particularly in this environment where we are very competitive for software engineering talent. Gilt has very, very long tenure, and I would say it is a kind of 80-20 rule. Nothing is perfect. For about 80% of the things we work on, we feel that we can do them better or faster. And then there are the 20% of things like, as an example, a loyalty program we did that cut across probably 100 different services, because it touched every part of the consumer experience. That is harder to implement in a microservice architecture. If it were a monolith, you would open up one code base, you would build loyalty, you would deploy it, and it would be significantly easier. So, definitely, there are features that are harder because of the architecture, but I think that by choosing the boundaries between the services reasonably well, in general, on the whole, we feel like it is a much more efficient environment to work in.
Yes, there are plenty of disadvantages. It is interesting to think about, because I think we get attracted by the problems which microservices solve, which are significant. But then, when we think about it, the difference between managing one monolithic application and 10 services is maybe OK. But with 100 services, everything has to be automated. Everything! And if you get beyond that, it is even more important. And that is hard.
So, how do you monitor an application? Many people go into an application, they add some code for New Relic, they add some configuration for New Relic, done! Now, how are you going to do that 100 times? If you do it manually, that is going to take some time. If you write a script to do it, you are going to end up with configuration that is duplicated in each of the 100 repositories, and then what happens when you want to upgrade New Relic or add a new version? Now you are writing scripts that patch each repository to apply new configuration files.
And then how do you deploy them? And after you deploy them, how do you know the monitoring is actually working and there wasn’t some mistake in the template? So, it is a little bit subtle, but at the scale of a microservices architecture, everything has to be automated, and the level of investment required to achieve that is significant. This is one of the driving reasons why Gilt moved to AWS: AWS provides a lot of tools that automate the management of this sort of infrastructure. They won’t do everything, but it is a significant step forward from private infrastructure, even with cloud software on top. But I think that is the nuance. Say you have an environment with 1,000 microservices, most of them written in Java, and maybe you started five years ago; Java 6 is no longer supported.
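One common way out of hand-editing 100 repositories is to generate each service's monitoring configuration from a single template, so an agent upgrade becomes one template change plus a regeneration. The sketch below is illustrative only; the config keys are made up and do not correspond to New Relic's actual file format.

```python
from string import Template

# A single source of truth for the monitoring config. Upgrading the
# agent means changing this template once and regenerating everywhere,
# instead of patching 100 repositories by hand.
CONFIG_TEMPLATE = Template(
    "app_name: $name\n"
    "license_key: $key\n"
    "agent_version: $version\n"
)


def render_configs(services, key, version):
    """Render one config per service from the shared template."""
    return {
        name: CONFIG_TEMPLATE.substitute(name=name, key=key, version=version)
        for name in services
    }


configs = render_configs(["user-service", "cart-service"], "REDACTED", "8.0")
```

A follow-up step, as the text suggests, would be verifying after deployment that each rendered config actually produces working monitoring, rather than trusting the template blindly.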
Now we have to upgrade 1,000 microservices from Java 6 to Java 7. How are you going to do that? At the end of the day, someone has to go through the process of saying “We do nothing; we just let it run”, or systematically go one by one: upgrade the application to the next version of Java, incrementally release it, measure it, make sure that it is working, make sure that there is no performance degradation and, if there is, address the performance degradation, release it to production fully, and you are done with one. And now you get to do that 1,000 times more.
And the other piece is that the problem just gets worse, because every single day engineers come to work, and what do engineers do? They write code. And as they are writing new features, we know how to write microservices and we love all the benefits that come with them, so we just get more and more of these services. So when the time does come that things need to be upgraded, how do you actually manage it? What happens from an organizational point of view is that, over time, ownership becomes a real question and challenge.
If you have a ratio of four microservices per engineer, as an example, one way to do it is to assign every engineer four applications. A better way is to group engineers into teams or departments; at Gilt today it is a 20-25 person department that will own some software. Say a department owns 100 applications, but you have 25 people. So now, what does it mean to own something?
Does it mean that you are an expert in every aspect of the software? If so, you may need to carve out six months a year just to learn and become an expert in some of the software. And by the way, if I join the organization and you give me a piece of software written in 2008, maybe I will say that I need to modernize it before I can take ownership, because I do not understand the way it was written 10 years ago, right? So those things start entering the conversation as well. I think we see it everywhere, but it feels more evident to me in a microservices architecture that ownership is a question, and it needs to be built into the management of the organization to make sure that teams have sufficient time to truly own the applications that they are operating.
At a larger scale, we have seen an example of how Google solved this: they created a site reliability engineering group, which allowed teams to move their applications into another group that was optimized for running things at scale as opposed to feature development. That created a balance: “If you want us to run your software, here are a number of things you need to do. You need to instrument it, it has to run without exceptions, and we have to know how to monitor it. If you are doing all those things, we will take responsibility for that microservice off your plate and operate it for you”, which allows the development teams to go and work on something new. I think we have a few models of how this scales over time, but fundamentally, ownership will become front and center as something that needs to be managed in the organization.
If you think about when it makes sense to introduce microservices into an organization, one way to think about it is: what are the big pain points that the organization is experiencing at the moment, and how do you solve for those? I think a common pattern has been that the pace at which traffic can ramp for many internet businesses leads people, by necessity, to start introducing microservices. Just as an example, if you are lucky in your business and you have a large number of people coming to register, it will be easier to scale user registration if it is on dedicated hardware, on dedicated data stores that are optimized for that type of data, right? Take your user registration and put it on a key-value system, a Dynamo-style system, put it on MongoDB, scale it in the cloud. That would be easier to do than trying to scale it as part of a monolithic application that maybe is talking to a relational database.
That is a significant pattern that we have seen. The second is that, as the number of people contributing to a single code base increases, we start to see challenges around how long it takes for the tests to complete; when a test fails, who actually broke it; when there are sporadic failures in the tests, who is actually fixing those things as opposed to just re-executing the builds; and how QA happens. In a larger code base there is usually a centralized QA process, because there is no way to know whether or not a particular change had a ripple effect on other features that the application provides. So a lot of times we will see a release process that is centralized on another team, and then there is the question of how that release goes to production: who is notified when the release in production is having an issue, and how does that get assigned back to a developer? In some of the more monolithic architectures, we may see weeks or months go by between a line of code being written and that line of code actually running in production, and so you end up in an area where it takes a significant amount of resources just to figure out who needs to be involved to debug the issues.
If an organization is experiencing those problems, then they may decide that they would like to remove those problems and replace them with a different set, and microservices architectures are very good at solving those problems, right? Because, to some extent, if you combine a microservices architecture with continuous delivery, then every single time you merge a change to master, that change is going to production. If you couple that with an investment in understanding how your production runtime works, so that you are measuring how the new version is doing against the old version and managing traffic onto the new version, you are going to end up with a very robust infrastructure that is resilient to failure. You will get those benefits, and then it is for each individual organization to decide at what point the consequences of the current style warrant the adoption of another style that will come with its own future weight.
I think there is a mix. There are cases in which microservices make a lot of sense and there are cases where a monolith makes a lot of sense. To me, one of the distinguishing characteristics is really the team that is working on the software. Who are the people? If we are a group of four people working on a piece of software, and we work on that software for 10 years, it is probably fine, because we all know each other, we are all going to be experts, we are going to communicate well (there are only four of us), and so to a certain extent we are not going to feel like we are slow. I think we will be able to continue to develop software quickly. If somebody new joins the team, they may look at the software and say “Oh my, this is a monolith”, but that is really an expression of their lack of familiarity, and maybe if we have to scale the team, we will be inclined to pull things apart.
I think it really comes down to the teams that are actually working on it and, to a certain extent, the psychology of what it feels like to work there. Can new people join the team and be productive? That is kind of important. Do people feel empowered to create software and see the impact of their work? That is pretty important. And how do we provide feedback automatically to teams, so that when I make a change, I can see whether the change is working in production or not, whether it is having an impact on the business or not? That feedback loop is critical, at least for the types of people that we grew Gilt with, and I think that would drive the decision.
Today, because I am actually thinking about starting a new business, I think less about whether it is going to be a monolithic stack or a microservices stack, but I do think a lot more about what the APIs are, what the context of each of those APIs is, and making sure that those APIs are correct and that we do not cheat. I think that from the beginning, as a small team, if you are disciplined, you can get pretty far with that approach, and it leaves you with the ability to decide at a later date whether it makes sense to decouple applications further to scale them. I would say that today I think much more about the quality of the API and managing the API than about specific implementations of the software.
Ralph: Thank you very much for sharing your insights; it was very interesting and a lot of fun. Have fun at QCon!
Thank you very much, Ralph. It was a pleasure to be here.