At the Dublin Microservices User Group, Christian Deger presented “Highway to Heaven: Building Microservices in the Cloud”, the story of AutoScout24's journey from deploying code into a monolithic application using a traditional IT development process to utilising a microservice architecture built and deployed by cross-functional teams. This technical and organisational transformation enabled the business to react more rapidly to changing market conditions.
InfoQ recently sat down with Deger, an architect at AutoScout24, and asked questions about the process and challenges of driving a large-scale transformation effort. Deger shared the motivations for driving change within the organisation and technology, discussed the planning effort required, and also provided insight into the technology choices of Linux, Scala and AWS.
InfoQ: Welcome Christian! Could you introduce yourself and the core ideas behind your recent 'Highway to Heaven' microservice presentation please?
Deger: I joined AutoScout24 over 5 years ago as a .NET software developer, so I was able to see how the previous-generation tech stack, and the organization supporting it, worked. At the beginning of 2014 our new CEO, Greg Ellis, took over and questioned the future of our Windows .NET stack. At that time I was leading a team of developers, but the challenge of transforming our mature IT setup into a next-gen ‘web-scale’ IT platform lured me into the more technical role of an architect.
We started project Tatsu with one team in November 2014 and since then we have ramped up to eight teams working in our transformation project. The name Tatsu has its origin in Japanese mythology and means Flying Beast. It is also a roller coaster in California and therefore a very appropriate description of how we are experiencing our journey. As on a real roller coaster, it is always good to have something to hold on to. In our case that is a set of principles we came up with and constantly evolve. Those principles guide our discussions and keep us aligned. So the talk is about our approach, our principles and a collection of our experiences.
InfoQ: AutoScout24's migration from a monolith to microservices appears to be a large project. How much planning did you do upfront?
Deger: For AutoScout24 it is truly a large project, but we do not believe in big upfront planning. We of course did our due diligence for capacity and budget planning. But more important to us was to have a sound set of initial decisions. AWS and Linux were obvious choices. Picking the JVM and Scala wasn’t that easy. But with our .NET experience, this seemed like a good starting point. Language discussions amongst engineers cannot be avoided, which led to some heated debates about Scala at AutoScout24. It helped a lot that we also wanted to do microservices and could therefore evolve into a more polyglot setup.
We then tried to find partners that could support and coach us in these areas. Finally, we decided on a capability to be re-written as our first microservice. Everything after that we learned and adapted as the first team started working.
InfoQ: We also noted that you mentioned that in addition to moving towards microservices, you also chose to migrate to the cloud and switch from .NET/Windows to JVM/Linux. With hindsight, would you recommend others implementing such a transformation?
Deger: For us the whole transformation and the core decisions were a set of puzzle pieces that fit together very well. We could of course have done the changes in smaller steps, but we would have lost our window of opportunity. Knowing our industry, the next set of challenges is coming anyway. So I would not change any of the core decisions in hindsight.
Because we were changing everything at the same time, we decided to limit the impact on the organization by starting with one team and placing a strong emphasis on learning. Every two or three months, new teams were added, and we mixed more experienced engineers with newcomers to the project. This worked great for the first teams. But we also learned that knowing when to stop this team cell division is key. When the learning phase ends and delivery becomes the focus, we must allow the teams to stabilize.
From a technology perspective, we can all still learn a great deal about the JVM or Linux, but we succeeded in automating everything. Our complete infrastructure is code, including immutable servers. I didn’t hear once that one of our engineers would have preferred to solve those problems with the Windows technical stack.
InfoQ: Could you explain a little more about your reasoning to only build 'dev' and 'production' environments, and not include a staging environment?
Deger: When I started at AutoScout24, we had three staging environments and a production environment. It was a constant struggle to get a release candidate through all stages, because the environments had different configurations than production. Also, our staging ‘safety net’ didn’t always prevent us from having failures in production, because data, load and access patterns are different there.
Additionally, when releasing services individually, a staging environment for all services introduces the problem that you are integrating with different versions of other services than in production. They could be in the middle of their delivery cycle. So why go through all the trouble of having a staging environment? Wouldn’t it be simpler to only have one environment where all services are available?
We wanted to “Be Bold”, but not stupid. So we investigated what is needed to deploy directly into production with confidence. Two things can go wrong: the service itself breaks or a new feature does not work as expected, or the integration between services is broken.
- Dynamic feature toggles helped us to decouple code deployment from feature releases. This gives us full control over the release of new features. We built our own tool over two years ago. It is called FeatureBee (https://github.com/AutoScout24/FeatureBee) and we use it for canary releases, A/B testing and user acceptance tests. For example, everyone in AutoScout24 can enable a new feature for himself via our Chrome extension. Just recently Pete Hodgson explained the concept on Martin Fowler’s site: http://martinfowler.com/articles/feature-toggles.html
- Consumer-driven contracts (CDCs, http://martinfowler.com/articles/consumerDrivenContracts.html) help us to verify that a service is not breaking contracts with the versions of services currently running in production. Services run the current contract tests provided by their consumers in their pipeline, without needing to integrate with deployed services. We are currently moving to Pact (https://github.com/realestate-com-au/pact) for those tests.
- Shadow traffic to new services allows us to iron out all performance problems before going live. This initial approach does not help us with services that change state, or after the initial release. So we are looking into canary releases on an instance level with shadow traffic. For this to work we would need to flag data written by the canary so that it is not used in production.
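The toggle idea from the first bullet can be illustrated with a minimal Python sketch. It is not FeatureBee's API: the class `FeatureToggles`, its methods, and the feature and user names are all hypothetical. It only shows how a dynamic toggle might combine a percentage-based canary rollout with a per-user override, so that deployed code stays dark until the toggle state changes.

```python
import hashlib


class FeatureToggles:
    """Hypothetical in-memory toggle store; a real system would load state dynamically."""

    def __init__(self):
        self._rollout = {}    # feature name -> percentage of users (0-100)
        self._overrides = {}  # (feature name, user id) -> explicit on/off

    def set_rollout(self, feature, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def set_override(self, feature, user_id, enabled):
        # Lets a single user (e.g. an employee) opt in or out explicitly.
        self._overrides[(feature, user_id)] = enabled

    def is_enabled(self, feature, user_id):
        override = self._overrides.get((feature, user_id))
        if override is not None:
            return override
        # Unknown features default to a 0% rollout, i.e. off for everyone.
        return self._bucket(feature, user_id) < self._rollout.get(feature, 0)

    @staticmethod
    def _bucket(feature, user_id):
        # Stable hash so the same user always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
        return int(digest, 16) % 100


toggles = FeatureToggles()
toggles.set_rollout("new-search", 0)                      # code is deployed, feature is dark
toggles.set_override("new-search", "employee-42", True)   # one user opts in
print(toggles.is_enabled("new-search", "customer-1"))     # False
print(toggles.is_enabled("new-search", "employee-42"))    # True
```

Raising the rollout percentage step by step would then release the feature to more users without a new deployment.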
InfoQ: How important was the building of 'autonomous teams' alongside the technical changes that were required during the migration?
Deger: For us it was very important to build autonomous teams, because we believe autonomous teams allow us to scale our organization while still being fast. A freedom and responsibility culture is very powerful.
Our previous team setup was already partly cross-functional. The most relevant missing piece was that we still had handovers to operations. So we tried to embed former operations people into the product teams and make the whole team responsible for running and improving their services. We propagated the motto “We are all engineers now”.
We still need to closely watch how the different ‘T-shapes’ contribute to the overall capability of a team. It is very easy to slip back into old behaviors. When only engineers with an operational background pick up infrastructure stories, they become the single point of failure in that team. We need to make sure the “You” in “You build it, you run it” means the whole team. On the other hand, it was very good to see that many engineers took the opportunity to grow their skills.
InfoQ: I found the 'principles' section of the talk very interesting. What was the most important principle (or series of principles) that you and your team learnt during the migration?
Deger: Many valuable discussions were centered around the principle of ‘shared nothing’. Of course there are things we need to share, so this principle always needs some explanation. Obviously shared behavior should be a service and services shouldn’t have a shared data store. The importance comes from the shift in our default behavior. We are trained to be DRY or to standardize. A lot of engineers are eager to build a shared library. But a shared library couples services to that specific implementation and the underlying platform. We don’t want to optimize for efficiency, but for being fast.
For example, we share a unified logging infrastructure. This is a macro requirement, because we need to correlate events from all services. In addition to the infrastructure, the data contract for the events is shared. But the initial implementation of the event publishing was not. There were strong objections to just copying the single class. Very soon teams evolved their version for their needs without needing to coordinate. After many months the implementations stabilized and we finally decided to build one internal OSS library combining all improvements. New teams can now opt in to use that library.
Another example was our dashboard services. The first team had set up a ‘dashing’ server and the second team just added another dashboard to the same instance. By accident both teams used the same global state within that service for one of their metrics. After two days we found out that we were actually switching values between both dashboards every 5 minutes. A very bad example of coupling indeed. Now every team is running its own ‘dashing’ server.
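The dashboard collision described above can be reproduced in a few lines of Python. This is only an illustrative sketch, not the API of the dashboard server the teams used: `MetricStore`, the metric key `error_rate`, and the values are hypothetical stand-ins for whatever state the two dashboards accidentally shared.

```python
class MetricStore:
    """Hypothetical key-value state held by one dashboard server instance."""

    def __init__(self):
        self._values = {}

    def publish(self, key, value):
        self._values[key] = value

    def read(self, key):
        return self._values[key]


def shared_server_demo():
    # Both teams publish to one instance under the same key.
    shared = MetricStore()
    shared.publish("error_rate", 0.02)  # team A's job
    shared.publish("error_rate", 0.35)  # team B's job overwrites team A's value
    return shared.read("error_rate")    # team A's dashboard now shows team B's number


def separate_servers_demo():
    # One instance per team: the same key no longer collides.
    team_a, team_b = MetricStore(), MetricStore()
    team_a.publish("error_rate", 0.02)
    team_b.publish("error_rate", 0.35)
    return team_a.read("error_rate"), team_b.read("error_rate")
```

With a single shared instance the last writer wins, so each dashboard shows whichever team published most recently. With one instance per team, both values survive independently.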
A second, less controversial but very helpful principle is AWS first: “Favor AWS platform service over managed service, over self-hosted OSS, over self-rolled solutions”. We needed to decide what platform services would be part of our stack. So we reminded ourselves that we wanted to use AWS to take advantage of their managed services. Because we should avoid the undifferentiated heavy lifting, we should focus on our domain and be fast. We can still use one of the other options, but only with good reason.
The video for “Highway to Heaven: Building Microservices in the Cloud” can be found on YouTube, and the slides can be downloaded from Deger’s SlideShare account.