Increasingly, software developers have the ability not only to maintain and architect code, but also to extend their expertise to providing direction to the business. By using domain-driven design (DDD), developers can discover customer behaviors and recommend practices that change the nature of the business.
Jef Claes, a ‘professional codeslinger’ with a blog, was asked to refactor an application written by offshore developers into something more manageable and extensible for the in-house team. The newly refactored application would also have to be more in line with his company's business needs as an online gambling site. He chose to employ the DDD methodology and its architectural approach for the refactoring effort, yielding aggregates, entities, and value objects, and employing event sourcing as the new system's state mechanism. The resulting system revealed customer behaviors, which led to changes in business practice as well as cultural changes in the company.
InfoQ interviewed Claes about this project and how it benefitted the business.
InfoQ: You worked at an online gambling company, where your team refactored their monolithic application into DDD-based microservices. Could you briefly describe the project, its successes, and its failures?
Jef Claes: Saying that we refactored our application into microservices would be a bit of an overstatement. However, since the monolith we inherited was bursting at the seams and not very well structured, we did put quite some effort into searching for meaningful boundaries and breaking things up. Starting out, this can be as trivial as trying to model boundaries as namespaces or modules, which can eventually lead to extracting the whole thing and putting it in its own physical boundary.
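As a minimal illustration of that first step (the names here are hypothetical, not the actual codebase), a boundary can start out as nothing more than a namespace that other code only talks to through an explicit interface:

```csharp
using System;

// A sketch with made-up names: the Gaming boundary depends on an interface
// it owns, never on the internals of Wallets, which keeps a later physical
// extraction cheap.
namespace Casino.Gaming
{
    public interface IWalletGateway
    {
        void Debit(Guid walletId, decimal amount);
    }
}

namespace Casino.Wallets
{
    public sealed class WalletGateway : Casino.Gaming.IWalletGateway
    {
        public void Debit(Guid walletId, decimal amount)
        {
            // wallet life-cycle logic stays behind this boundary
        }
    }
}
```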
The first commitment to this project happened somewhere around four years ago, a few months after all land-based (thus licensed) Belgian casinos received the news that online gambling in Belgium would become legal.
The first version of the website that made it to production was mostly built by an offshore team, who had no feel for the domain at all. I can't blame them though; being handed a set of specifications with no access to a domain expert makes it close to impossible to build useful software. A good six months before releasing, a handful of local people were brought in - joining the death march. They made it possible to ship something, days before the official launch.
Once the website shipped, its main pain points became obvious after only a few days. Some of the domain concepts were not well thought out, or went unchallenged for too long, causing unhappy customers and lots of money leaking. Not all of this could have been avoided, though. Belgian regulation meant that most competitors also found themselves in unknown territory, making mistakes along the way and iterating until somewhat workable models emerged. I do feel as if there should have been something to learn from markets abroad; maybe the fact that most of those foreign websites are IP-blocked for Belgian players has something to do with it.

Customers would also often suffer through degraded performance and bad quality, causing games (and other non-gaming functionality) to be slow or to freeze completely. Basic query and index tuning bought us some time, allowing us to remodel certain hot paths and to move more and more compute out of the customer-facing application. A lack of helpful integration tests made for stressful months in which we, while trying to grow the business and competing in a tough market, searched for simpler, more robust and performance-friendly models, adding tests along the way.
InfoQ: You chose to implement DDD aggregates and entities in your new architecture. Why did you choose this approach as opposed to other, more traditional architectures?
Claes: I wouldn't necessarily classify aggregates and entities as an architectural concern, but more as a design choice which can be made locally in any component or module.
If you look at verifying a component's correctness, most behaviour can be verified by sending it a message and observing which messages come out, be it a response, a command, an event... I think of SQL statements as messages too. Even reading state from disk is as much request-reply as an RPC call over HTTP.
Sometimes, this can be all it takes to feel confident about the behaviour of your software. You send a command, observe the side effects and you're done. When you're working on a domain though, you will always come up with concepts, abstractions and models that live between the intent and its observable side effects. This is where the tactical building blocks of DDD come into play; they help you to materialize that model, keeping the mental model and the code close to each other.
In the existing code base, we did have quite some life cycles (wallets, games, bonuses...) which needed to be managed. There was quite some behaviour, but it was scattered all over the place. The anemic model was too expensive to maintain and very error prone, making it scary to touch. Not the place you want to be in when the business is constantly changing and trying out new things. So with each change we made, we started moving more and more behaviour towards the aggregates, away from the boundaries, extracting the obvious value objects where we could. Checking in 50 places whether debiting a wallet would result in a negative balance gets old really fast.
Defining the core properties (or invariants, if you will) of the model using aggregates and value objects makes talking and reasoning about the expected behaviour of your system so much easier.
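To make that concrete, here is a minimal sketch (hypothetical names, not the actual production model) of a wallet aggregate and an amount value object, enforcing the no-overdraw rule in exactly one place:

```csharp
using System;

// A minimal sketch, not the production model: the Wallet aggregate owns
// its invariant, so "a debit may never overdraw the balance" lives in
// one place instead of being re-checked at fifty call sites.
public sealed class Wallet
{
    public Guid Id { get; }
    public Amount Balance { get; private set; }

    public Wallet(Guid id, Amount openingBalance)
    {
        Id = id;
        Balance = openingBalance;
    }

    public void Debit(Amount amount)
    {
        if (amount.Value > Balance.Value)
            throw new InvalidOperationException("Debit would overdraw the wallet.");
        Balance = new Amount(Balance.Value - amount.Value);
    }

    public void Credit(Amount amount) =>
        Balance = new Amount(Balance.Value + amount.Value);
}

// A value object: validated on construction, immutable, compared by value.
public sealed record Amount
{
    public decimal Value { get; }

    public Amount(decimal value)
    {
        if (value < 0)
            throw new ArgumentOutOfRangeException(nameof(value), "Amounts cannot be negative.");
        Value = value;
    }
}
```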
We also tend to integrate with lots of external parties (think game or payment providers), which makes us approach our model from a lot of different angles. Having those core properties enforced deep within the model allows for a better night's sleep. External actors can send us anything they like, and we will do some validation on the boundaries - politely refusing some of their requests. But even when something slips through, the model will stay true, protecting its core properties.
InfoQ: Initially, you coded your aggregates to be too large, resulting in performance issues. Could you share your experiences with aggregates, including how to right-size them for performance?
Claes: Large aggregates in hot paths can end up being problematic when there's contention on writes, or when the amount of data read from or written to your data store is too heavy.
Make your aggregates as small as they can be, but not any smaller - as small as your invariants allow them to be. Use real-world analogies and scenarios to discover the true invariants. When potential invariants are revealed, be sceptical. There's a big difference between an invariant that needs to be strongly held and data that helps the aggregate make a decision but doesn't require strong consistency. For example, maybe you want to avoid overdrawing a balance at all costs, but you wouldn't care too much about someone being able to exceed their daily deposit limit once in a blue moon.
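A sketch of that distinction, reusing the hypothetical Wallet and Amount types from the earlier example: the overdraw rule stays inside the aggregate, while the daily deposit limit is only checked against an eventually consistent read model:

```csharp
using System;

// Hypothetical sketch: the limit check uses data that may lag slightly,
// so a deposit can occasionally slip past the daily limit - acceptable
// here, whereas overdrawing the balance is not.
public interface IDailyDepositReadModel
{
    decimal TotalDepositedToday(Guid walletId); // eventually consistent
}

public sealed class DepositHandler
{
    private const decimal DailyLimit = 500m; // made-up number
    private readonly IDailyDepositReadModel _deposits;

    public DepositHandler(IDailyDepositReadModel deposits) => _deposits = deposits;

    public void Handle(Wallet wallet, Amount amount)
    {
        // Soft check against possibly stale data.
        if (_deposits.TotalDepositedToday(wallet.Id) + amount.Value > DailyLimit)
            throw new InvalidOperationException("Daily deposit limit reached.");

        wallet.Credit(amount); // hard invariants stay inside the aggregate
    }
}
```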
InfoQ: Have you discovered any guidelines you will use in the future to limit an aggregate scope?
Claes: When implementing the model in a stateful fashion, I would probably try to avoid ORMs. Or to be more precise, I would avoid feature-rich ORMs. Being able to navigate through a graph, which basically walks through your whole database, is a great way to lose any sense of transactional boundaries. When using an ORM is a hard constraint, turn off lazy loading and avoid navigational properties going out of the aggregate boundary. When an aggregate does reference another aggregate, it's better to just stick to an identifier.
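For example (again with hypothetical names), a bonus that belongs to a wallet would hold only the wallet's identifier, never a navigational property:

```csharp
using System;

// Sketch: reference another aggregate by id only, so an ORM cannot
// lazily wander across the transactional boundary.
public sealed class Bonus
{
    public Guid Id { get; }
    public Guid WalletId { get; } // not: public Wallet Wallet { get; }

    public Bonus(Guid id, Guid walletId)
    {
        Id = id;
        WalletId = walletId;
    }
}
```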
When using events instead of state, their temporal nature makes it easier to reason about the behaviour of an aggregate over time. When an aggregate is backed by an event stream, you can visualize that stream as a timeline. Looking at that timeline, you can discover potential smells in streams that are either too fat or too long. I consider a stream fat when a lot of things happen in a short span of time, when there's a lot of contention. Models that can avoid contention tend to perform better. To optimize long-lived streams, you might want to look into snapshotting. Personally, it's something I'd rather avoid, because it does bring some accidental complexity; I would much rather search for a way to model my way out of that one. In a mature real-world domain, it's hard to find examples of long-lived processes. Closing the books is an analogy that doesn't get used often enough in software.
Prefer optimistic concurrency over pessimistic concurrency. Not only will fewer locks perform better, it will also cause problematic contention to surface faster.
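A rough sketch of what optimistic concurrency against an event stream can look like (the store interface here is an assumption, not a specific product's API): the writer declares the version it last read, and a mismatch fails the append instead of blocking other writers:

```csharp
using System;

// Hypothetical event store contract with optimistic concurrency.
public sealed class ConcurrencyException : Exception { }

public interface IEventStore
{
    long ReadVersion(string streamId);
    // Throws ConcurrencyException when expectedVersion no longer matches.
    void Append(string streamId, long expectedVersion, object[] events);
}

public static class OptimisticWriter
{
    public static void Execute(IEventStore store, string streamId, Func<object[]> decide)
    {
        while (true)
        {
            var version = store.ReadVersion(streamId);
            try
            {
                store.Append(streamId, version, decide());
                return;
            }
            catch (ConcurrencyException)
            {
                // No locks were held; contention surfaces here immediately,
                // and the command simply retries against fresh state.
            }
        }
    }
}
```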
InfoQ: How did you discover and implement your bounded contexts? Did you have to change course if/when you realized your bounded contexts were not correct?
Claes: There's a certain comfort in inheriting a big ball of mud; decisions can often be driven by the amount of friction you encounter changing things. Even when working in a small team, it's impossible to come up with a ubiquitous language that can be used to solve a variety of problems. Limiting yourself to one language constrains your options when trying to build a helpful model to solve a given problem.
Growing a business from scratch, we had to understand the different sub-domains first - the problem spaces - before we were able to decide on good bounded contexts. Since the domain experts were learning as they went as well (a land-based casino is quite different from an online casino - the devil is in the details), we tried to close the gap between business and IT as much as possible. We were involved in the day-to-day operational concerns (which is a lot easier when you're still small) and joined them on trips to international trade shows, visiting exhibitors and learning which problems they were trying to solve.
It was only when the sub-domains became apparent that we were able to work towards establishing impactful bounded contexts. Now when I switch between bounded contexts, it feels like walking through a door, entering a separate room where you can tackle a problem with an unbiased mindset, allowing you to use the right tool for the job.
We have mostly suffered from defining bounded contexts too late rather than prematurely. Defining them too late often sets you on a path of premature generalization and a set of wrong abstractions. This makes you struggle with accidental complexity until you set aside the time for some rework - you learn and you adapt. Creating space to refactor towards deeper insights is critical if you want to build a long-lived, healthy system. Now that the whole team has some maturity within the domain, we're getting a lot better at finding the right boundaries early on.
InfoQ: Did events help your project? Were there unwanted side effects to the eventing approach?
Claes: I might be a bit biased, but to me it feels like it has been the most influential change we made. It has had a big positive impact on the design of our components, the architecture of our system and the efficiency of our operational support.
Removing coupling between commands and their side effects has allowed us to move a lot of compute away from customer-facing APIs to background workers. A good example of this is earning loyalty points. When a player places a bet, depending on some variables, he earns loyalty points. Placing a bet and earning loyalty points used to be wrapped in a single transaction, resulting in a performance penalty that can be easily avoided by moving the side effect of a bet (earning loyalty points) out of that transaction. Using events, we are able to run a projection that calculates loyalty points and is even able to award them in batch, greatly reducing the amount of operations required. Loyalty points being awarded in an eventually consistent fashion, allows for more comfortable maintenance and deployments. If customers are aware that loyalty points are only awarded every few minutes, this component being down for a few minutes does no harm.
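A sketch of such a projection, with made-up names and a made-up points rule: it consumes BetPlaced events in the background and awards points per player in batch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical background projection: loyalty points are computed from
// BetPlaced events after the fact, outside the bet-placing transaction.
public sealed record BetPlaced(Guid PlayerId, decimal Stake);

public interface ILoyaltyStore
{
    void AwardPoints(Guid playerId, int points);
}

public sealed class LoyaltyPointsProjection
{
    public void Project(IEnumerable<BetPlaced> newEvents, ILoyaltyStore store)
    {
        // One batched write per player instead of one write per bet.
        foreach (var perPlayer in newEvents.GroupBy(e => e.PlayerId))
        {
            var points = perPlayer.Sum(e => PointsFor(e.Stake));
            if (points > 0)
                store.AwardPoints(perPlayer.Key, points);
        }
    }

    private static int PointsFor(decimal stake) => (int)(stake / 10m); // made-up rule
}
```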
Another happy by-product of using events (or event sourcing) is that having a history of everything that has happened in your system makes the life of support agents so much easier. They are able to see everything a customer has done in your system. That same history also has great use later on, when you want to use data analysis or data science to support decisions.
Something I underestimated is how much stress reading events (while replaying projections) can put on your database. Admittedly, we store our events in a non-specialized database. Once events started taking up a substantial amount of space, replaying projections caused a lot of pages from disk to be read into memory. This effectively pushes out other pages that should stay in memory, causing degraded performance. This is far from an insurmountable problem, but it's something you need to be aware of so your infrastructure can cater to that. To buy you some time, you can scale up by buying a bigger box or you can put a caching layer in front. Later on you might want to look into having a dedicated database server for your most data intensive components. Even if it then still doesn't fit comfortably on one machine, there's a plethora of other options to explore.
InfoQ: What did you learn from analyzing your events?
Claes: With a bit of work, clustering and visualizing event streams can get you to a view that tells quite a rich story. Being able to visualize player behaviour allowed us to categorize players into certain types and to treat each category differently. When analyzing players who stick out one way or another, visualizing their event streams can serve as a quick tool for finding devious behaviour.
On a larger scale, we generally don't analyze raw event streams, but stick to projecting temporal data to a normalized model, so that more commonplace tools and products can be used for predictive analysis and clustering of players.
InfoQ: What are some common pitfalls for a developer (who is working on a similar project) to avoid as part of an application refactoring exercise?
Claes: Event sourcing is a powerful pattern that allows you to capture all changes to application state in a semantically meaningful fashion. These events can be used to build projections or when integrating systems. When it comes to refactoring a stateful model to an event sourced one, it's quite the paradigm shift. You need to make sure that it's worth the effort and that your team is up to the task. You might even want to consider breaking off the important piece of the model to rewrite it from scratch, and make the migration an explicit action.
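For reference, the shape of the target model is simple enough; a minimal event-sourced aggregate (hypothetical names) rebuilds its state by replaying its history:

```csharp
using System.Collections.Generic;

// Minimal sketch of an event-sourced aggregate: no state is loaded from
// a row; the current balance is derived by replaying past events.
public sealed record WalletCredited(decimal Amount);
public sealed record WalletDebited(decimal Amount);

public sealed class EventSourcedWallet
{
    public decimal Balance { get; private set; }

    public static EventSourcedWallet Rehydrate(IEnumerable<object> history)
    {
        var wallet = new EventSourcedWallet();
        foreach (var @event in history)
            wallet.Apply(@event);
        return wallet;
    }

    private void Apply(object @event)
    {
        switch (@event)
        {
            case WalletCredited(var amount): Balance += amount; break;
            case WalletDebited(var amount): Balance -= amount; break;
        }
    }
}
```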
When you don't want to or can't afford to invest in the full paradigm shift there's a middle ground. You can try a hybrid approach in which you, next to persisting state, also persist the events that led up to a specific state. You can do this by running a projection in the same transaction or by extracting the state object and the events from the aggregate to persist both. This does entail the risk of introducing a bug which causes split-brain in which your events do not add up to your state.
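A sketch of that hybrid write, reusing the hypothetical Wallet from earlier (SaveState and AppendEvents stand in for whatever persistence code already exists): state and events are persisted atomically, though, as noted, a bug in either path can still make them disagree semantically:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical hybrid repository: persist the state row and the events
// that produced it in one transaction. Atomicity is guaranteed; semantic
// agreement between state and events is not.
public sealed class HybridWalletRepository
{
    private readonly IDbConnection _connection;

    public HybridWalletRepository(IDbConnection connection) => _connection = connection;

    public void Save(Wallet wallet, IReadOnlyList<object> pendingEvents)
    {
        using var tx = _connection.BeginTransaction();
        SaveState(wallet, tx);                      // the familiar state UPDATE
        AppendEvents(wallet.Id, pendingEvents, tx); // plus the events behind it
        tx.Commit();                                // all or nothing
    }

    private void SaveState(Wallet wallet, IDbTransaction tx) { /* existing persistence */ }
    private void AppendEvents(Guid id, IReadOnlyList<object> events, IDbTransaction tx) { /* INSERTs */ }
}
```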
Making this compromise when you're just getting started avoids a big-bang. If you take it slow enough, you can take your time to grow and understand the technical necessities. If you want to take advantage of projections and messaging, see if you can get away with a pull-based approach. This often ends up simpler than its push counterpart.
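A pull-based projection can be as simple as a worker polling the event table from a stored checkpoint (all names here are illustrative); replaying is then just resetting that checkpoint:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pull-based projection: the worker asks for events after its
// checkpoint, applies them, and advances the checkpoint - no push
// infrastructure needed, and a replay is a checkpoint reset to zero.
public interface IEventReader
{
    IEnumerable<(long Position, object Event)> ReadFrom(long exclusiveFrom, int batchSize);
}

public interface ICheckpointStore
{
    long Load(string projection);
    void Save(string projection, long position);
}

public sealed class PullProjectionWorker
{
    public void RunOnce(IEventReader reader, ICheckpointStore checkpoints,
                        string projection, Action<object> apply)
    {
        var from = checkpoints.Load(projection);
        foreach (var (position, @event) in reader.ReadFrom(from, batchSize: 500))
        {
            apply(@event);
            checkpoints.Save(projection, position);
        }
    }
}
```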
You might meet some resistance when you're trying to sell refactoring towards an event-driven approach. One of the perks of working on legacy is that you often have some space to work with. After all, it's just a table with events in it, right?
Once you get this far, make sure you get your domain expert on board. Stop talking about the tables in your database, but about the things that happen in your domain. Ask what needs to happen (commands) when something has happened (events). Look into event storming and try to set up a session with your most influential domain experts. In my experience, it doesn't take long before domain experts are all over the place. Do avoid the trap of using hard words such as projections, process managers, checkpointing, idempotency and so on.
Another neat little trick is to sneak as much data as you can from your events into management dashboards, reports and so on. Before you know it, they'll be standing next to your desk, bringing interesting questions and observations to the table.

Don't just obsess over the technicalities of events as a first-class citizen of your code base. Take your time to understand and to capture your domain. Domain-driven design and events go hand in hand. If you get the semantics wrong, you will end up with a system that's held together by brittle contracts that break constantly. This will not only kill your efficiency moving forward, but also your morale.
InfoQ: Can you quantify the impact on the business of the conversion to DDD events and the discovered customer types? Did the implementation result in substantial financial gains or cultural changes for the business?
Claes: This is a very hard thing to put a number on. I wish it were possible to take a peek at parallel universes. I'm not in a position to disclose absolute numbers, but we've seen very substantial relative growth since we moved towards making events a first-class citizen. We could have pulled it off without them, but progress, from an output and learning perspective, would have been much slower.
The biggest cultural change must be the willingness to experiment. Events can make a system grow organically, with different parts coming and going. In a living, unplanned system, there is no shame in implementing an experiment and throwing it out when it fails. The data needed to fully analyze the impact of an experiment is often simply not available when you don't have a complete log of customer behaviour.
About the Interviewee:
Jef Claes is a professional codeslinger, domain linguist and shuffler of data. In the past he has mainly worked in public safety and banking. For the last two years, he's been active in the online gambling domain, working for a small Belgian company providing software to a quickly growing online casino operator. When it comes to buzzwords, he often associates himself with DDD(BE), CQRS, C#, FP and F#.