Transcript
Abedrabbo: I'll be talking about our experience in introducing data mesh at CMC Markets. My name is Tareq. I'm a Principal Core Data Engineer. CMC Markets offers an extensive online financial trading platform that has been basically running successfully for a few years. My plan is to give you some context about us, where we started. Then, dive into five lessons we learned along the data mesh journey.
Context: Starting Point
CMC Markets is by its nature very data driven. Every part of the business, whether it's trading, risk, or marketing, relies heavily on data. We are now undertaking an ambitious transformation program. The goal is to enable the business to build new products and to innovate in an agile way. Some of the key principles include the adoption of the public cloud, which is AWS in our case. We are also adopting a cross-functional squad delivery model. Self-service is an important aspect of that. The goal is to allow teams to be autonomous, to work in parallel, and to innovate. Data is a core part of the transformation. One way of looking at it is that while we are building new products on the cloud, most of the invaluable data sits within existing systems on-premise. There is a gap to be bridged there.
Data and Transformation
When we started this journey and looked at the data situation at CMC Markets, what we found is that the data was mostly decentralized, which is a good thing, but siloed. What that means is that the knowledge of the data and the ability to do work with that data are often coupled in a single team. As a data consumer, you first need to find the data you need, and then ask and wait for some work to be done on this data for you. This would result in queuing and other inefficiencies, for example, unplanned work for the data owners. We also observed that there were limited conventions, standards, and norms around data shared across teams, for example, naming, labeling, common data formats, and so on. Some common questions for the business are: where's the data I need? How can I actually understand the data when I find it? How can I trust it? Then, if I have some new data source, how do I make it available for new products? In a nutshell, how do we innovate at scale when data is a bottleneck?
Just to give you a visual representation: this is a situation where we have a new product. At the top of the diagram, in green, we have the product owner who first needs data to build their product. Then we have a few teams, with data siloed within each team. Some teams own their data, and some teams also hold data from other teams. The first thing is to look for the data. Once you identify what you need, you then need to get data that has been shaped for the needs of its owners exported to you, because there has been no prior thinking about what would make it easy for new consumers to consume.
Lesson Learnt 1: What Data Goes Inside the Data Mesh?
We learned a few lessons along the data mesh journey, and I would like now to share them with you. The first fundamental question in the data mesh is: what data goes inside the data mesh, and what data doesn't? There are obviously a few opinions that we observed in the community. For us, we think all the data should go, so we have this idea of data neutrality. Data itself is neither analytical nor operational; use cases can be. When people talk about analytical or operational planes, this often does not describe the data itself. It describes the way the data is actually organized in software systems, and the people around that data in organizations. To decide what data needs to go into the data mesh, we introduced two key ideas. One is that we borrowed Pat Helland's idea of data on the inside and data on the outside, and adapted it to our context. We also identified and introduced the idea of fundamental data sources.
The first thing is to understand what data goes in the data mesh. Make this distinction between data on the inside, which is any data that is private to a team and not needed by the rest of the organization, versus data on the outside. This data is typically of common interest to the rest of the organization. It needs to be identified as such. Then we need to think hard about how to make it consumable. This goes through things like the schema, the description, the documentation, and the quality attributes of this data. This data then needs to be exposed and be part of the data mesh. This distinction allows consumers to find the data they need, but also allows teams to maintain autonomy and decide on the best possible representation and technology to support the work they do within their teams.
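To make that a bit more concrete, here is a minimal, purely illustrative sketch in Java of the kind of descriptor a team might publish alongside data on the outside. The field names are assumptions for the example, not part of CMC Markets' actual implementation.

```java
// Hypothetical sketch only: the kind of contract a team might publish
// alongside a dataset it exposes to the rest of the organization.
public record OutsideDataset(
        String name,           // e.g. "tradable-prices"
        String owningTeam,     // who answers questions about this data
        String schemaRef,      // where the published schema lives
        String description,    // human-readable documentation entry point
        String classification, // e.g. "internal", "confidential"
        String qualityTier     // agreed quality/SLA attributes
) {}
```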
The other key concept we introduced is identifying fundamental data sources. In our domain, you can think of these as, for example, the tradable prices or the trades. Rather than observing the existing data silos and just adding some technology to consolidate them, we did a mapping exercise to follow the data flows within the business. We identified these fundamental data sources, and then we followed them from the outside in.
Lesson Learnt 2: Data Discovery - An Essential Ingredient
Now that we know what data sources, or datasets, we need to focus on in the data mesh, let's talk about data discovery. Early on, we understood very quickly that data discovery is really essential to building a successful data mesh. You can think of a data discovery capability as something needed for people: it gives data consumers a starting point to find, on their own, the data they need, and to self-serve it. It also decouples them from the data producers, or the data owners, so they can actually self-serve. For data sources, you need a way to onboard new data sources onto the data mesh. If the data is decentralized, you need to centralize the metadata so that it can be found, queried, and searched by people. Data discovery is actually a subset of a bigger concept, metadata management, which is very useful for things like data governance.
In terms of the concrete implementation, we looked around and found a couple of interesting products. We chose Amundsen, which is an emerging but really good open source project that started at Lyft. It has a great community and great traction. The fact that it's open source allows us to extend it and customize it; we have already built some custom data ingestors. It has a very simple architecture, which is a real plus that makes it easy to deploy on the cloud. We did have to do some work to make it fit within our approach to infrastructure and deployment. Another key point about Amundsen is that it's backed by a graph database. The metadata is natively represented as a graph, so for anything that requires connectivity, like lineage, or complex ad hoc insights into the data, you can access the graph database and query it directly. This is just a very simple screenshot, to give you an idea of how Amundsen looks. It is a really usable, functional interface. This is a key point because the main target for a data discovery platform, for us, is people.
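To illustrate what querying the metadata graph directly might look like, here is a minimal sketch assuming a Neo4j backend reachable over Bolt, using the Neo4j Java driver. The connection details, node labels, relationship type, and table name are assumptions for the example, not a description of the actual metadata model or deployment.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class MetadataGraphQuery {
    public static void main(String[] args) {
        // Placeholder connection details for the metadata store behind the discovery tool.
        try (Driver driver = GraphDatabase.driver(
                "bolt://amundsen-neo4j:7687", AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {

            // Node labels and the relationship type are assumptions about the graph model;
            // check the actual schema before relying on them.
            var result = session.run(
                    "MATCH (t:Table)-[:COLUMN]->(c:Column) " +
                    "WHERE t.name = $table " +
                    "RETURN c.name AS column ORDER BY column",
                    Values.parameters("table", "tradable_prices"));

            // Print the columns of the requested table as a simple ad hoc insight.
            result.forEachRemaining(row -> System.out.println(row.get("column").asString()));
        }
    }
}
```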
Data Discovery: Approach and Challenges
In terms of our approach to introducing Amundsen, again, we identified this as an early requirement, and we started iteratively. Don't wait until you have all your data sources onboarded or identified; we started with what we had and then refined as we went along. Amundsen is not just a tool to store the schema. We store a few other things, for example, the data owners, the classification, the quality attributes, how to access the data, documentation, and so on. In terms of the approach, because the whole organization has a broad range of data sources, we decentralize the curation of the metadata and work with other people and other teams. However, because we're just starting the data mesh journey, we centralize the implementation of the technical workflow within the core data team for the moment. One very important distinction between data discovery and a data catalog is that the focus is humans, not other ETL processes, although Amundsen does have an API. The approach is evolutionary. We are not trying to find the perfect taxonomy of data and then impose it on the business. Really, the focus is on something that is usable by a wide range of people who do not necessarily understand the data sources but need them, rather than on being extremely comprehensive.
This is just to quickly bring together the ideas of the data, the metadata, and conventions, standards, and norms. This is literally a screenshot from a brainstorming session, and one way of looking at where data mesh fits among the different approaches. In a data mesh, the data is decentralized, the metadata is centralized, and the conventions, standards, and norms are shared, but they are emergent and collaborative. In a monolithic data platform, both the data and the metadata are centralized, and the conventions, standards, and norms tend to be shared in a top-down way. In data silos, which is our starting point, the data is decentralized; however, the metadata is decentralized and not shared, as are the conventions, standards, and norms. This is one useful way we found to look at our implementation and to guide it.
Lesson Learnt 3: Natively Accessible Data - Complex but Vital
The third point I'd like to cover is the importance of natively accessible data for the data mesh. I would like to use an example to illustrate that, which is what we are implementing right now. One very important goal of data mesh is to make data natively accessible on the cloud, for the new products, as well as on-premise. Now, there are reasons why there are systems that are built on-premise, and these reasons include things like colocation and latency. If you are to build really great products on the cloud, you need to make the data available at an acceptable latency and in a form that is consumable by the new products. We started with the tradable prices stream, which is basically a low-latency data stream that we generate at CMC based on market prices. Again, we had to bridge this gap between having this low-latency stream on-premise and making it available on the cloud. The technology we picked for the job is Aeron, which is a low-latency and reliable messaging transport. The approach is to bridge the gap by building the bridge with Aeron. The bridge itself will be high fidelity, so it will preserve the semantics of the data to enable a variety of use cases on the cloud that are decoupled from each other and decoupled from where this data stream is produced. Once we have the data available on the cloud, we can use it to build custom views as required by the different classes of applications, for example, an event log or a snapshot of the latest prices.
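To give a flavor of what such a bridge could look like, here is a minimal sketch, assuming an Aeron media driver is already running next to the bridge process and a Kafka cluster is reachable from it. The channel, stream ID, topic name, and serializer choices are placeholders for illustration, not CMC Markets' actual configuration.

```java
import io.aeron.Aeron;
import io.aeron.FragmentAssembler;
import io.aeron.Subscription;
import org.agrona.concurrent.BackoffIdleStrategy;
import org.agrona.concurrent.IdleStrategy;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class PriceBridge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Aeron.connect() assumes a media driver is already running on this host.
        try (Aeron aeron = Aeron.connect();
             KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             // Channel and stream ID are placeholders, not the real on-premise configuration.
             Subscription subscription = aeron.addSubscription("aeron:udp?endpoint=0.0.0.0:40123", 1001)) {

            // Reassemble fragmented messages and forward each price tick unchanged,
            // preserving the payload so cloud consumers see the same semantics.
            FragmentAssembler assembler = new FragmentAssembler((buffer, offset, length, header) -> {
                byte[] payload = new byte[length];
                buffer.getBytes(offset, payload);
                producer.send(new ProducerRecord<>("tradable-prices", payload));
            });

            IdleStrategy idleStrategy = new BackoffIdleStrategy();
            while (true) {
                // Poll for new fragments, backing off when the stream is quiet.
                idleStrategy.idle(subscription.poll(assembler, 20));
            }
        }
    }
}
```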
Let's have a look at the diagram that illustrates this. On the left-hand side, we have the data coming from the outside. At the bottom of the diagram, we have the pricing engine generating this fundamental data source of tradable prices, which is consumed by another system, the trading engine. Then we have the Aeron bridge that takes this data and, in a reliable and performant way, makes it available on the cloud, where it feeds into Kafka, for example, for an event log, or DynamoDB for a snapshot. The new product on the cloud can then consume this data. Obviously, to complete the picture, we add Amundsen data discovery, so we can advertise all the different data sources on Amundsen, and new consumers can decide which data sources they want to consume. They can use the Kafka one, for example, or, if they have really low-latency requirements, they can use the native Aeron data source.
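As a companion sketch for the snapshot view described here, the following shows how a consumer of the event log might upsert the latest price per instrument into DynamoDB using the AWS SDK for Java v2. The table name, attribute names, and the assumption that a Kafka consumer calls this method for every price event are illustrative only.

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

import java.util.Map;

public class LatestPriceSnapshot {
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Overwrites the stored item for the instrument, so the table always holds
    // the latest price per key. Table and attribute names are assumptions.
    public void upsert(String instrument, String price, long timestampMillis) {
        dynamo.putItem(PutItemRequest.builder()
                .tableName("latest-prices")
                .item(Map.of(
                        "instrument", AttributeValue.builder().s(instrument).build(),
                        "price", AttributeValue.builder().n(price).build(), // numeric string
                        "updatedAt", AttributeValue.builder().n(Long.toString(timestampMillis)).build()))
                .build());
    }
}
```

A Kafka consumer reading the event log would call upsert for each message it receives; if out-of-order updates were a concern, a conditional write on the timestamp could guard against overwriting a newer price with an older one.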
Lesson Learnt 4: Handling Cross-Cutting Concerns
I covered most of the data-centric and technical aspects. I'd like to go through a couple of organizational aspects of the data mesh. One is handling cross-cutting concerns. In a data mesh, there are many cross-cutting concerns: these include data engineering, tooling, and cloud engineering, as well as some higher level cross-cutting concerns like data protection, security, and governance. The challenge, obviously, is to handle all these concerns consistently across deliverables that are autonomous and independent. Given the skill sets that are needed, we cannot really have a perfect spread of these skills in every team, so we cannot have a data engineer with knowledge of all these things in every team. What we do in this case is take advantage of the squad model: we have identified SMEs who can be part of the squad when they need to contribute their skills. From the other end, you have senior stakeholders who have some obligations, for example, data protection. These people can also participate in the iterative delivery process so that we can transform the requirements early on into actionable things we can include in our delivery.
This is just a logical view of the data mesh now, to put it all together. We can see that we now have data domains. There's a distinction between data on the inside, which is private to the domain, and data on the outside, which has been shaped in a way that is consumable by other products. The metadata is basically centralized in a data discovery platform. You have data owners who look after the data. Then you have a product owner who can interact with the data discovery platform. In the middle, you have the SMEs, who can facilitate and support the different activities that require their expertise around data, for example.
Lesson Learnt 5: Data Mesh as a Product
The final lesson is about data mesh itself as a product. Data mesh is a journey; it's not where we started. At the same time, it's not just one recipe, it's not rigidly defined, and it's not just a technology implementation. We have to take multiple iterative steps, and we also need to imagine these steps and the milestones in between. At the same time, we have different people working on things that are part of the bigger data mesh. We need to ensure that there is alignment in the approach, and that the different capabilities we build for the data mesh, for example, the data discovery or the Aeron bridge I showed earlier, are cohesive, so that we don't end up building different things that do not add up to something cohesive together.
One way of looking at it is that data mesh itself can be seen as a slightly different type of product. What we're doing now is actually introducing product ownership around the data mesh to make sure that this cohesion is achieved, and that we can assist every team that is joining the data mesh on their journey. Again, this is a very problem-solving, engineering-heavy role rather than a traditional product ownership role. It's definitely not the data owner role that also exists in the data mesh.
Data Mesh + Transformation
I'll say a word about the approach to data mesh and transformation. We think we are lucky to have this transformation context, because data mesh requires a really broad scope to be effective. It's not just about changing some technology, or changing some parts of the business; multiple parts of the business need to be involved at the same time, and there is a shift in mindset. Again, a transformation context is ideal for that. In general, we found that the data mesh and our transformation initiative share the same ethos: a decentralized, collaborative approach whose goal is to allow scale and autonomy at the same time. Obviously, data mesh has a strong Domain-Driven Design foundation; this is the whole idea of the business needing to reorganize itself around data domains. Again, transformation offers a great context for data mesh to be effective, rather than just being localized or superficial in terms of technology.
Questions and Answers
Nardon: How do you convince people to use data mesh, and what are the benefits for using data mesh? What is data mesh? What's the difference from a data mart or a data warehouse? Maybe you can elaborate a little about, what is the advantage of data mesh? How did you convince your company to use this new architecture?
Abedrabbo: I think the first important thing to say is that data mesh is not just a technical solution or a way of structuring one dataset; it's more of an architectural and organizational approach. There's obviously Zhamak's presentation, which is the source of authority and detail on the concept. The main idea is to allow for data decentralization, and not to have a single data platform that could quickly become a bottleneck. This is a common pattern that we see everywhere, including on newly built cloud environments; building some monolithic data platform is quite common. One way of thinking about it is that this is very similar to the move we saw on the services side, from monolithic applications to decentralized microservices. The ethos is quite similar.
Specifically at CMC Markets, there is the transformation context, and I can't emphasize that enough, because we've seen attempts elsewhere at implementing data mesh by just adding some technical consolidation on top of existing datasets. This is good, but it's not exactly data mesh, because data mesh is also a way of decentralizing and achieving autonomy for the teams that need to build things with data. At CMC Markets, data is essential. We tried to understand and address the pain points, so we didn't just bring in data mesh as the coolest idea; we tried to respond to an existing problem. The starting point is a bit different from other organizations where you have data in monolithic platforms; we had silos, so there was already decentralization. This decentralization was not well managed, in that there was no common language across the data domains that people could use. The impact was concrete. We did this exercise of analyzing some of the products and seeing that to build new products that need this data, you need to queue. Again, from a lean perspective, you are queuing, you are generating unplanned work, and there's a lot of waste. And the way to make the data available for a new product to consume is often ad hoc, clunky, and requires shifting around big datasets or creating connections to consume the data.
There was this gap, and this is where the data mesh came in. The short-term ambition is to enable this agility and autonomy for the emerging products, and in the long term it is to enable the organization to utilize the data better. There are many areas of the organization that we haven't started working with yet that will benefit from that. Again, in our context, we identified these fundamental data sources. We started with one, pricing, which is fundamental for the business and a key enabler for some of the new products. We are now focusing on concrete deliverables, the bridge I described and Amundsen, so that we can build trust and confidence in the approach and move it from theory to practice.
This is the backdrop. It is an investment, and we are lucky with the context of the transformation, because sometimes with these things there is no immediate, visible benefit. I think we made the case in terms of benefits that would not be possible otherwise and that will actually support the transformation. The point is, data mesh has at its core a collaborative, decentralized way of working, which is what people prefer these days in terms of building products.
Nardon: What was used for metadata management, cataloging to enable self-service? Did you have the concept of data grading, so people knew how and where data could be used?
Abedrabbo: When we started with data mesh, we had this initial idea, or maybe ambition, that was not realistic: that we could just map all the data sources in the business and identify the right ones, and then it would be easy because there would be five or six. Very quickly, we discovered this is not actually realistic. There are a couple of reasons for that. One is that there are some long-term and clearly fundamental data sources, like pricing, but there are other data sources that are actually useful in the short term, and that might be retained or not; they might move. We needed something incremental. We needed to expose these things and find a starting point. This way, we understood that rather than waiting to hit this idealized view of all the data sources, let's have something. We looked at a couple of things: we looked at Amundsen from Lyft, and we looked at the LinkedIn one.
We chose Amundsen for different reasons, but one strong reason is the fact that it's backed by a graph database, and I'm a big fan of graph databases; I've worked with them quite extensively earlier in my career. Graph databases allow you to represent connectivity in a native way. Amundsen then acts as a frontend on top of metadata stored natively as a graph, and you can also go into the graph and query it directly to get ad hoc insights and understand how the data relates to each other. We're using Amundsen and data discovery as a means to surface the data sources we want to onboard into the data mesh in an iterative way, and for us also to discover and understand them.
What we're doing now is engaging with the different data owners and working with them to onboard some of these attributes, in terms of grading. Some attributes are obvious because they can be ingested directly from the data source; a database table will have a schema in the database, and so on. Some of the other attributes are more cross-cutting, like data privacy or data classification. Because we have a highly regulated business, and because there are people in the business who are accountable for these obligations around the data, like data protection, we are working with them to identify a common language that we can then apply across the different data sources. This is where we are now.
Amundsen gives you this metadata management. It's immediately useful for a specific use case, which is discovering the data you need to build a product, for example. Let's say you have a product idea. You look into Amundsen and you can find two or three data sources that support your idea. That is fine. There might be more data sources in the business that are not in Amundsen yet, so the answer might be good enough or not. If you look at data governance, though, where we need metadata management to support it, we need broader coverage of data sources, because if you have a data governance query, for example, identifying all the datasets with specific user information, you cannot be happy with a partial answer. You need everything at that stage. This is why we are introducing data discovery to support the transformation work and new products, onboarding datasets one by one, and iterating through that. We are expecting this to be a journey rather than a one-off. How we won people over is, again, we found people in the organization who were trying to build new products but couldn't, because they didn't know what data they needed and where it was. We took their use case and made it part of the data mesh, basically.
Nardon: How do you join datasets in the data mesh?
Abedrabbo: Currently, from a high-level perspective, the data owners need to think about the data sources they need to expose, and expose them. Because we are not starting from a clean slate, we are starting from an existing organization, some of these datasets might be very clear, like pricing: it's a well-identified stream with a well-defined format. Some of them might already be different datasets joined together. We accept that there will be some of that for now, so not every dataset needs to be very pure. We will iterate over that, and what is needed and what is not will emerge. If you do need to cross different things, then again, it's your responsibility as the consumer to identify the data you need and then build a view that properly joins this data together. We don't mandate anything; people are free. This might look like an overhead, but it's not, because you just build the views you need, as opposed to the current situation where you need to find the data and get someone to do it for you, and if someone else needs the same thing, then potentially the same work needs to be done slightly differently again. This is where we are with it now.
Nardon: What do you mean by natively accessible? This is a crucial concept in data mesh.
Abedrabbo: Again, just to explain it by example, think about the fact that we have existing data sources on-premise in physical data centers. There are different reasons for that, and this will be the case for a long time, because of colocation, latency, and what have you. Then we have these new products on the cloud. Any product that needs to be built will need a massive amount of data that now exists mostly on-premise. There are two ways of doing it. One is that you go and ask for the data, move the data around, and build something custom, which is not viable at all because it requires a lot of people and a lot of effort, and it just doesn't scale very well. The other way is that we identify the datasets you need. Again, we are assuming that in a cloud context you have a more flexible environment in terms of technology, so you can try different things. For pricing, for example, we are building this bridge to be able to stream the prices in a low-latency and resilient fashion. Then, as a new product on the cloud, you will be able to consume this on the cloud directly rather than doing ad hoc integration from the cloud to on-premise. This is roughly the idea: the data needs to be available in a way that can be consumed by the different applications, rather than this being a problem every time. In our case, it means on the cloud.
Nardon: How did you handle data deduplication when sourcing from multiple sources?
Abedrabbo: I need some clarification: is it for the pricing data specifically? Because it depends on the data source; there are different techniques. For pricing, we have some guarantees from the data sources. Actually, the problem is not duplication in this case; the problem is more latency and how recent the data is. The risk is to get data that is out of date or to have a gap, so ordering is more of an issue. Specifically for the datasets we're working on now, it's not an issue, but for other datasets there are different techniques that range from ordering and delivery guarantees to idempotent writes and deduplication filters. We will use the right technology to support the right use case, when it matches.
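As a small illustration of one of those techniques, here is a hypothetical sketch of a sequence-number-based deduplication filter; the idea of a per-source, monotonically increasing sequence number is an assumption for the example, not something taken from the actual data sources.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: accept a message only if its sequence number is strictly
// higher than the last one seen for that source, which drops both duplicates
// and stale replays.
public class DedupFilter {
    private final ConcurrentMap<String, Long> lastSeen = new ConcurrentHashMap<>();

    public boolean accept(String sourceId, long sequenceNumber) {
        boolean[] isNew = {false};
        lastSeen.compute(sourceId, (key, previous) -> {
            if (previous == null || sequenceNumber > previous) {
                isNew[0] = true;  // strictly newer than anything seen before
                return sequenceNumber;
            }
            return previous;      // duplicate or out-of-date: keep the old high-water mark
        });
        return isNew[0];
    }
}
```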
Nardon: How do you convince data owners to provide to the data mesh? I can see how to convince people who need the data, but people who need to provide data mostly see it as only more work, and more feeds to take care of.
Abedrabbo: This is the idea of good data citizenship. A data mesh is like building a town or a country, what have you: everyone needs to do their bit so that everyone can benefit. The transformation context gives this drive that we are trying to build better things, and so on; that's just the backdrop. Actually, there isn't really more work; there's work that is decoupled from the consumer. Currently, as a data owner, because you have a dataset that has been structured for your own needs, and you don't know what consumers want, and they don't know how to access the data, every time someone needs the data, they need to come to you and ask you to do some work to find the data they need and actually expose it for them. Moving towards a situation where you think about the data and expose it is a different type of work. The whole idea is to decouple the demand from doing the work. It will require some investment upfront. Some people are interested in it because it will enable them to do things that are not possible now. We will work with data consumers who need the data and have pain points, and then identify data sources. In time, we believe it will actually result in a reduction of ad hoc work, unplanned work, and queuing, which are sources of waste in a lean system. It's not automatic; there's a lot of talking to people in implementing a data mesh. There's a lot of that, actually.
Nardon: I can imagine a shift like that. It's not easy.
Abedrabbo: It is a shift. It's exactly that. Yes.
Nardon: When would you use data mesh going directly to the data source versus calling a services API to get that information?
Abedrabbo: This is again about how to make the data accessible, I think, and in what form. Data mesh is not a single location or a single account on the cloud; there will be multiple data sources. Take, for example, the pricing stream, which is just a stream of data. One common request is to have a snapshot of the latest prices. In this case, we can expose that as a view. In my example, we could use something like DynamoDB, because you have a key-value structure with the keys and the latest price. People can either read directly from the view, or they can take the data and transform it further if that is their need. Again, we do not police how people use the data. Our job in that case is to make sure that the data is delivered with acceptable performance, latency, resilience, and so on, and that people can find it and access it. In some other cases, people might want to expose or consume the data through an API. Again, we do not mandate that; it depends on the use case. It's very important to say that, as core data engineering, we do not own the data mesh; we are building the capabilities, and some of these capabilities are often longer term. We are trying to enable this decentralized approach, so we do not mandate how the data needs to be accessed. Any means that is appropriate for the dataset is fine.
Nardon: Where is a data mesh not a good solution?
Abedrabbo: It's really not a good solution if you have no need to scale the use of the data across the organization, or if you have no issues with data being scattered across the organization or siloed in one place. It is an investment, and therefore you need to have a driver for it. In our case, it's the need to build things and to scale. Autonomy is another big one, in parallel, because as a data-centric organization we will never be able to build a data team that covers all the requirements, since every requirement is a data requirement in our organization. The better bet is to invest a bit in the approach and infrastructure to enable others to self-serve, basically.
The main thing is, data mesh is a journey; it's not where we are starting. It's going to be interesting. I'm not the only one; I have a colleague running a data mesh user group. Hopefully, we'll be talking more about it and sharing more as we mature along the journey. It's still early stages, but we'll be sharing more about it as we go.