
Data Mesh: an Architectural Deep Dive


Summary

Zhamak Dehghani introduces the architecture of new Data Mesh concepts such as data products, as well as the planes of the data platform in support of computational governance and distribution.

Bio

Zhamak Dehghani works with ThoughtWorks as the director of emerging technologies in North America, with a focus on distributed systems and big data architecture and a deep passion for decentralized technology solutions. She founded the concept of Data Mesh in 2018, a paradigm shift in big data management toward data decentralization, and has been evangelizing the concept with the wider industry.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Dehghani: My name is Zhamak. I head emerging technologies at ThoughtWorks in North America. About three years ago, my colleagues and I came up with a new approach to managing analytical data. It's called data mesh. Data mesh was born out of sheer frustration at not getting results from building yet another data platform, yet another data lake, yet another data warehouse. It's been three years since then, and we have learned quite a bit. What I want to share with you is a deep dive into the architectural aspects of data mesh. If you're getting started, hopefully you take some lessons away about how to think about the architecture, how to break up that architecture, and where to get started in terms of the technology that you need to deploy.

Analytical Data

Data mesh is a paradigm shift in managing and accessing analytical data at scale. Some of the words I highlighted here are really important. First of all, it is a shift; I will justify why that's the case. Second, it is an analytical data solution. And the word scale really matters here. What do we mean by analytical data? Analytical data is an aggregation of the data that gets generated by running the business. It's the data that fuels our machine learning models. It's the data that fuels our reports, and the data that gives us a historical perspective. We can look backward and see how our business, services, or products have been performing, and then look forward and predict what the next thing a customer wants is, make recommendations, and personalize. All of those machine learning models can be fueled by analytical data.

Today's Great Divide of Data

What does it look like? Today we are in a world with a great divide of data. The operational data is the data that sits in the databases of your applications, your legacy systems, your microservices; it keeps the current state of your stateful workloads. Analytical data sits in your data plane: your data warehouse, your data lake. It gives you a historical perspective, an aggregated perspective. You can correlate all sorts of data from across your business, get a historical view, and train your machine learning models on it. That's the definition and the current state of the universe.

Scale

What do we mean by scale? I think this really matters, because this is where we're really challenged. If you deal with an organization that's constantly changing, and I argue that any organization today is constantly changing, your data landscape is constantly changing. The number of data sources pouring data into this analytical view of the world is constantly increasing, and so is their diversity: every new touchpoint with your customers generates data, and every new touchpoint with your partners generates data. You have ambitious plans and a proliferation of use cases for that data, so you have scale on the consumer side as well. There's a diversity of transformations: there aren't just one or two or three pipelines, there are a ton of transformations that need to happen to satisfy very diverse use cases. And the speed of that change needs to increase. Data mesh tries to address these concerns.

Paradigm Shift

Why is this a paradigm shift? Because it challenges assumptions that we have made for over half a century. It challenges the assumption that the architecture of analytical data, like the ones you saw with the lake and the warehouse, or the lakehouse, or any form of those existing architectures, has to be monolithic. It challenges the assumption that data has to be centralized under the control of a centralized team so that we can get meaningful use out of it. It challenges the assumption that if you're struggling with scale, the only way forward is dividing your architecture along technical partitions: one team dealing with pipelines, another dealing with ingestion, another dealing with the technologies that serve the APIs and data services at the other end. These are all technical decompositions. It challenges the fact that we have designed our organizations around these technical tasks, a separate data platform team or data lake team or data warehouse team, apart from the business domains where the data gets used or consumed. The reason is scale: the problem of scale will not be addressed with this centralized solution when the number of sources grows and the number of use cases grows. As Thomas Kuhn, who coined the term paradigm shift, says, we can't reject one paradigm without putting another in its place.

The Four Principles

Here's the solution: data mesh. At a 50,000-foot view, this is a new way of thinking about how to decouple your architecture, how to decouple your data, and how to decouple your data ownership around domains. For folks who are familiar with domain-driven design, this should be fairly natural and organic, because you have probably already broken down your microservices around domains: the ownership of those services and of the technology maps to your current business, to the domains of your business. Why can't we just do that for analytical data? Why can't we just break down the boundaries of that monolithic lake based on the domains? The moment you do that, you'll find yourself in a lot of trouble. The very first problem is silos of data: now we have 50 different lakes instead of one, all siloed from each other, and they don't talk to each other. So one of the architectural principles of data mesh is that we look at data through a very different lens; we call it data products. Data shall be served as a product, which means it has some form of baseline usability built into it: the behavior that makes the data usable and interoperable with the rest of the data products and domains, and makes it natively accessible to the user. If a user comes and finds your domain, whether that user is an analyst or a data scientist, they have a very different access mode, and their access mode will be satisfied right then and there, accessing the data in the domain.

The other problem you find yourself with is that the cost of infrastructure suddenly grows. Every team needs to maintain its own very complex infrastructure to run those data pipelines and serve those data products. We need to think about our data infrastructure a bit differently, in a self-serve way. We need to create new logical layers of infrastructure that make it easy to build, serve, and discover these data products. Finally, in a distributed architecture, we need to think about how we are pragmatically going to make these data products interoperable and secure, without compromising privacy. How can we do all of those things now that we've decoupled these data products across the enterprise and smeared them all over the place?

This federated computational governance is the fourth pillar. Each of these pillars has an architectural implication. At a very high level, the way you can imagine this architecture across an enterprise landscape is through this lens: you still have an operational data plane and an analytical data plane, which store and provide access to very different kinds of data. One is the data on the inside, the data that is stored in the microservices' databases. The other is the data on the outside: data products that expose your data and your historical data over time, as an infinite log of your data or an infinite series of snapshots, however you want to expose them, and both now live together within your domains. One domain team, and I'm going to use the example of digital media streaming, your podcast release team, or your artist payment team, or your artist management team, will take care not only of the applications, but also of the analytical data.

Data Mesh

To do that we have this multi-plane analytical data platform. I'll go through some of the capabilities to get a feel for what these new platforms are that we have to build to self-serve these domains and to build these data products. We will have a set of global governance policies: how are we going to control access to each individual data product in every domain? How are we going to make the schema language consistent across these domains? How are we going to make the identifiers of the data that crosses these domains consistent so we can join them? There's a bunch of global policies that we're going to automate and embed into each of these data products, so that, practically, we can still have an ecosystem that plays nicely together.

Data mesh, at a very high level, is both an organizational and an architectural shift, and really a technology shift. Let's focus on the architecture. The moment we distribute our data across the boundaries of domains, and apply the strategic design patterns that Eric Evans introduced right at the end of his seminal book, we find ourselves with a very different view of the world. We start pulling that big monolithic data apart around the functions of our business, not the technical functions of our monolithic data warehouse or data lake. Then every domain, every bounded context, with its team and its systems, has two lollipops. I use these lollipops as an indication of interfaces to that domain: capabilities and APIs, or whatever mode of interface you want to provide to the rest of the organization. Now we have two interfaces, two sets of capabilities. One, a lollipop with an O in it, is your operational capabilities, your APIs: release a podcast. Then you have this D lollipop, which gives an interface into your domain, an access point, an endpoint, for sharing the analytical data from that domain.

Let's look at an example. You have various domains here. You have your user management domain, where people build the microservices or applications that register our users, the subscribers to our podcasts and media streaming playlists. That domain will have an API or a system that registers users. Now we'll have something else: a log of user updates, forever and ever. Maybe we want to publish these user updates on a daily batch basis, or on an event basis. Those are all choices that make sense for your organization and for that domain. Let's look at podcasts. We have a podcast domain; it probably has a podcast player in it, the application that people actually run, a podcast microservice, whatever applications you have. You have the operations of creating a podcast and releasing podcast episodes. In addition to that, having access to podcast listeners' demographics is pretty useful analytical data. To get those demographics, you will probably have to use data from your users. So there is still a flow of data, serving data and consuming data, between the domains. Or top podcasts daily: what are my top 10 daily podcasts? These are some of the analytical data that the domain now provides.
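The talk has no code, but the "two lollipops" idea can be sketched in a few lines. This is a minimal, illustrative Python sketch (the class and field names are invented, not from the talk) of one domain team owning both its operational capability and its analytical interface:

```python
# Hypothetical sketch: one domain exposes an operational API (the O lollipop)
# and an analytical log of updates (the D lollipop). Names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class User:
    user_id: str
    country: str
    birth_year: int


class UserManagementDomain:
    def __init__(self) -> None:
        self._users: Dict[str, User] = {}
        self._update_log: List[dict] = []  # append-only analytical log

    # Operational interface: serves the running application.
    def register_user(self, user: User) -> None:
        self._users[user.user_id] = user
        self._update_log.append({
            "event": "user_registered",
            "user_id": user.user_id,
            "country": user.country,
            "birth_year": user.birth_year,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    # Analytical interface: the immutable history of user updates over time.
    def user_updates(self) -> List[dict]:
        return list(self._update_log)


domain = UserManagementDomain()
domain.register_user(User("u-1", "BR", 1990))
print(domain.user_updates())
```

Whether the analytical side is published as daily batches or as events is, as the talk says, a per-domain choice; the point is that the same team owns both interfaces.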

A New Architectural Quantum

If you think about data as a product, we now have this hexagon that I casually drop all over my diagrams: the data product as a new architectural quantum, a new unit of architecture. Architectural quantum is a phrase that evolutionary architecture introduced. It's the smallest unit of your architecture that has all of the structural components to do its job. In this case, I need all of the structural components, the code, the storage, the services, the transformation logic, everything, so that I can provide that top daily podcasts data to the rest of the organization. It's the smallest unit of your architecture. What does it look like? If you zoom into a domain, you've got your microservice or legacy application, the box that implements your O lollipop, your operational APIs. Adjacent to it, you have one or many data products within that domain, and each has two very key interfaces. One interface is the input data ports. A port is a set of APIs and mechanisms by which you will be receiving or pulling data, however you have implemented that API, to get data into the data product. Then the data product operates on that data and provides it as analytical data on its one or many output data ports. That essentially becomes the interface for serving that data and its metadata.

Let's zoom into the previous example. The user domain has a user registration service, and right next to it, perhaps, a user profile data product. This user profile data product just receives data from the user registration service, but it serves two different output data ports. It gives you near real-time user profile and user registration updates as events. And because of the consumers and the needs the organization had, it also provides monthly dumps of the user profiles: what are the user profiles we have this month versus last month versus the month before? Now let's zoom into the podcast listeners' demographic data product. It has two input data ports. It gets the podcast player information to know who's listening to the podcast, and it gets the demographic information, the age, the geographical location, all of that, from the user profiles, then transforms them and serves them as podcast listeners' demographics. Marketing would love this data. They can do so many things knowing the demographics of the people listening to different podcasts.
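As a hedged sketch of that data product (names and shapes are made up for illustration), the transformation between its two input ports and its output port could look like this in Python:

```python
# Illustrative only: the 'podcast listeners' demographic' data product joins
# two inputs and serves one aggregated output. Field names are assumptions.
from collections import Counter
from typing import Dict, Iterable, List


def listeners_demographic(
    play_events: Iterable[dict],      # input port 1: podcast player events
    user_profiles: Dict[str, dict],   # input port 2: user profile data product
) -> List[dict]:
    """Join play events with user demographics and count listeners
    per podcast and country."""
    counts: Counter = Counter()
    for event in play_events:
        profile = user_profiles.get(event["user_id"])
        if profile is None:
            continue  # unknown listener; skip rather than guess demographics
        counts[(event["podcast_id"], profile["country"])] += 1
    # Output port: an aggregated, analytical view of the same facts.
    return [
        {"podcast_id": p, "country": c, "listeners": n}
        for (p, c), n in counts.items()
    ]


plays = [{"user_id": "u-1", "podcast_id": "pc-9"},
         {"user_id": "u-2", "podcast_id": "pc-9"}]
profiles = {"u-1": {"country": "BR"}, "u-2": {"country": "DE"}}
print(listeners_demographic(plays, profiles))
```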

What is this hexagon? What is this data product? Looking at it from the outside in, before jumping into the mechanical bits, it's a thing that needs certain characteristics to be a product. It has to be easily discoverable, independently. Easily understandable, with good documentation. It has to be addressable, independently. Interoperable with the other ones. Trustworthy: it has to have SLOs, so it can tell us at what intervals the data gets produced, what the variance is between when the data was captured and when it was processed, and what its statistical shape is. Those are the SLOs it needs to guarantee for somebody to actually trust this data. It has to provide the data in a polyglot form, because it has to natively provide access for a very large spectrum of access modes. Data analysts use spreadsheets, so they want to hook their spreadsheets, data warehouse, or reporting system into a SQL engine, some port that can run SQL queries locally within this data product. Data scientists, on the other hand, may be totally happy with semi-structured JSON or Parquet files dumped on some Blob Storage, as long as they can traverse that storage across time. They have a very different mode of access, so it's very important that the same data is presented and can be accessed in polyglot formats.
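A rough sketch of the "trustworthy" characteristic: a data product might publish its SLOs as machine-readable metadata so a consumer can decide whether to rely on it before ever reading the data. The field names below are invented for illustration, not a standard.

```python
# Hedged sketch: SLOs a data product might publish. Names are assumptions.
from datetime import datetime, timedelta, timezone

published_slo = {
    "interval": "hourly",          # how often new data lands
    "max_freshness_minutes": 60,   # max lag between capture and availability
    "completeness_ratio": 0.99,    # share of source events expected to appear
}


def is_fresh_enough(last_processed_at: datetime, slo: dict) -> bool:
    """Would a consumer trust this product's data right now?"""
    lag = datetime.now(timezone.utc) - last_processed_at
    return lag <= timedelta(minutes=slo["max_freshness_minutes"])


print(is_fresh_enough(datetime.now(timezone.utc) - timedelta(minutes=20),
                      published_slo))  # True: within the freshness SLO
```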

Data Product's Structural Elements

Then this data product, this quantum, definitely needs code among its structural elements: code for the transformation, code for serving the data, code for embedding the policies. Who can access this data product? What anonymization do we need? What encryption do we need? There are a ton of other transformations that make sense for analytical use cases but not so much for microservices APIs. Do we need to apply some differential privacy? Do we need to serve some form of synthetic data that represents the statistical nature of this data, but not the data itself, so you can see the forest but not the trees? These are policies that can be embedded and executed by the data product, locally. Of course, we need the data itself. And beyond the polyglot data access, we have a bunch of metadata.

We have the SLOs. We have the schema, the syntax and semantics. We have the documentation; hopefully, you have computational documentation. We have the policy configurations. We also need to depend on an independent segmentation of our infrastructure. If each of these data products can be deployed autonomously and serve its data, then the physical infrastructure needs to provide a logical decomposition. The data products don't necessarily own the infrastructure, but they are able to utilize a sharded, separated slice of infrastructure for just that data product without impacting others.
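One way to picture "policies embedded and executed locally" is policy-as-code sitting in the serving path of the data product. The following is a minimal sketch under assumed requirements (pseudonymize identifiers, coarsen ages), not the talk's actual implementation:

```python
# Sketch of a privacy policy executed locally by the data product before any
# output port serves records. The policy itself is an assumption.
import hashlib
from typing import Iterable, Iterator


def pseudonymize(user_id: str, salt: str = "per-product-secret") -> str:
    """Replace a direct identifier with a stable pseudonym."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]


def apply_privacy_policy(records: Iterable[dict]) -> Iterator[dict]:
    """Drop raw IDs and keep only coarse demographics:
    the forest, not the trees."""
    for r in records:
        yield {
            "listener": pseudonymize(r["user_id"]),
            "country": r["country"],
            "age_band": (r["age"] // 10) * 10,  # e.g. 34 -> the 30s bucket
        }


raw = [{"user_id": "u-1", "country": "BR", "age": 34}]
print(list(apply_privacy_policy(raw)))
```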

Multi-Plane Data Platform

Then think about all of the infrastructure and the platform needed to enable building these hexagons, serving them, and discovering them. What does that look like? Our thinking so far is that we need three different categories, I'll call them planes, of programmatically served capabilities at the infrastructure level. Down at the bottom, you have what I call a utility plane, and you have this already. You have a way of orchestrating and provisioning your Spark job, if you're using Spark for transformation. You have a way of getting a Kafka topic. You have a way of provisioning service accounts and storage accounts. These utility-layer capabilities already exist, along with their APIs, and hopefully you have written your infrastructure as code so that someone can call an API or a command, or send a spec down to an API, to provision them. That's not enough. Think about the journey of the data product developer, from the moment they say, "I think I need an emerging artists data product, because I want to know what the emerging artists listen to," or, "I want to run a machine learning model across a bunch of data from different sources, from the social platforms and from my players, to see who the emerging artists are." They go through a value stream. They go from mocking their data product, maybe working with synthetic data first, to sourcing and connecting to upstream data products, or microservices, or whatever they need to get their data in. They need to explore that. Once they think they have a case, they've got to bootstrap building this data product for real, and deploy it. Then maintain it over time, and hopefully, one day, retire it.

To do that without friction in this decentralized model, so that any generalist developer can do it, we have to create a new plane, a new set of APIs, that allows provisioning and managing the lifecycle of data products: declaratively create them, read them, version them. Now we're talking about this hexagon, this unit of architecture, being declared and provisioned independently, rather than as all the separate bits and pieces we need. That's great.
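To make that lifecycle idea concrete, here is a hedged sketch of what such a "data product experience plane" API surface might look like. Everything here (class names, fields, methods) is an assumption for illustration; the real plane would delegate provisioning to the utility plane.

```python
# Hypothetical lifecycle API for data products: declare, deploy, retire.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataProductSpec:
    name: str
    domain: str
    input_ports: List[str]
    output_ports: List[str]


class DataProductExperiencePlane:
    """Thin lifecycle API; in a real platform each call would ask the
    utility plane to provision storage, topics, compute, and so on."""

    def __init__(self) -> None:
        self._registry: Dict[str, DataProductSpec] = {}

    def deploy(self, spec: DataProductSpec) -> None:
        self._registry[spec.name] = spec

    def list_products(self) -> List[str]:
        return sorted(self._registry)

    def retire(self, name: str) -> None:
        self._registry.pop(name, None)


plane = DataProductExperiencePlane()
plane.deploy(DataProductSpec(
    name="emerging-artists",
    domain="artist-management",
    input_ports=["play-events", "social-mentions"],
    output_ports=["emerging-artists-monthly"],
))
print(plane.list_products())
```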

Once we've done that, there are a ton of capabilities that only make sense when provided to the users of this interconnected mesh of data products at the global level, but without centralization. When you design your architecture, every time you find yourself building something completely centralized, without which you cannot use the data products, something has gone wrong with the architecture; we're back to centralization again. I'll give an example. Knowledge graphs are the new hot thing. What are they? A way of semantically connecting different data points together and being able to traverse this graph of relationships in the data. An emerging artist is an artist that gets paid this amount, and these are the artists that emerged last month: a graph of relationships. How do we do that if all of this polyglot data is separated? We need a schema system in which each local data product defines a schema that has semantic links to the other data products it relates to. Once that mesh of linked schemas emerges, you need a tool at the global level to browse and search those things. That's the mesh experience plane that you've got to build.
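As a sketch of "locally defined schemas with semantic links" (the representation below is invented; a real mesh would use a proper schema language), each product's schema can point at the products it relates to, and a mesh-level tool can traverse those links without centralizing any data:

```python
# Assumed representation of locally defined schemas with semantic links.
schemas = {
    "podcast-listeners-demographic": {
        "fields": {
            "listener": {"type": "string",
                         "same_as": "user-profile.user_id"},  # semantic link
            "country": {"type": "string"},
        },
    },
    "user-profile": {
        "fields": {
            "user_id": {"type": "string"},
            "country": {"type": "string"},
        },
    },
}


def linked_products(product: str) -> set:
    """Mesh-level traversal: which other data products does this one link to?"""
    links = set()
    for spec in schemas[product]["fields"].values():
        if "same_as" in spec:
            links.add(spec["same_as"].split(".")[0])
    return links


print(linked_products("podcast-listeners-demographic"))  # {'user-profile'}
```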

When you think about that federation, the governance, what happens to these policies over time? You probably start with just access control, and over time you keep adding to them. First of all, they need to be built programmatically, automated, built into the platform, and then embedded, as a sidecar, as a library, whatever makes sense for your architecture, and accessed by every data product. An analogous situation is the sidecars we use with service mesh in the operational world, which embed routing policies, failover policies, discoverability policies, all the policies that make sense for microservices APIs. We have an analogous situation here: we need a way of injecting these policies and executing them at the granularity of every single data product.

Data Product Container

Then, with that in place, what other architectural components get packed into this logical, autonomous unit of a data product? Input and output ports need, first of all, a standard. You have to have a set of standards so that you can connect these things together. You might say, for my organization, I will provide SQL access, and this is the API for running your SQL queries on my data product, or this is the interface for getting access to the Blob Storage.

Maybe these are the only two types I provide. Or, this is the way to get to the events. You need standard APIs for exposing your data and also for consuming data. Then we introduce a new port, a new set of APIs; I call them control ports. Whether you configure these policies centrally and push them down, cached in every data product, or configure and author them locally within the domains, you need a way of receiving and changing these policies and then executing them. So you have a policy engine. You have a data product container that holds this new logical thing with all of those components in it: your code, your storage, and the APIs for accessing it.
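The control port and policy engine can be pictured with a minimal sketch, assuming a simple in-process engine (the talk does not prescribe one): the control port receives policy configuration, and the product checks every output-port access against it.

```python
# Illustrative control port + policy engine. Names and shapes are assumptions.
from typing import Callable, List


class PolicyEngine:
    def __init__(self) -> None:
        self._policies: List[Callable[[dict, dict], bool]] = []

    # Control port: receive or update policy configuration.
    def configure(self, policy: Callable[[dict, dict], bool]) -> None:
        self._policies.append(policy)

    # Executed locally on every access to an output port.
    def allow(self, caller: dict, request: dict) -> bool:
        return all(policy(caller, request) for policy in self._policies)


engine = PolicyEngine()
engine.configure(lambda caller, req: caller.get("team") != "external")
engine.configure(lambda caller, req: req.get("port") in {"sql", "files"})

print(engine.allow({"team": "marketing"}, {"port": "sql"}))   # True
print(engine.allow({"team": "external"}, {"port": "sql"}))    # False
```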

Let's bring it all together. We've got domains. Each domain has one or many applications, which might be services or apps, and one or many data products running on your infrastructure. We need some form of a container; I'll talk about how we are implementing that. Each of those data products has input ports and output ports. Data is the only thing that flows between the two; no computation happens in between. The computation happens within the data products. That computation could be the transformation. You might want a pipeline in there, that's your choice, or you might actually want a microservice in there to do the transformation; it's your choice how you implement it. You have a lot of other things, too. You've got the data itself. You've got the schema. You've got the documentation. You've probably got other services running in there.

When you think about your mesh, there are three layers of interconnectivity. The first layer is the flow of data between output ports and input ports; it's essentially your lineage. The second layer is the links between your schemas, the semantic links between schemas, so that you can join things; you need a schema language that allows a standard way of semantically linking fields together. The third layer is the linking of the data itself: this listener is this user, if listener and user are separate domains. You also have a different port, I'll call it the discoverability or discovery port, which is essentially your root API. If I access the pink data product here, its root API, the root aggregate for it, should give me all the information I need to discover what is in it. What are the output ports? What are the SLOs? Where is the documentation? Where do I actually get the data? How do I authorize myself? And there is the control port, the set of APIs for configuring the policies, and maybe some highly privileged APIs that you want to embed there that only the governance folks can execute.
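A hedged sketch of what a discovery port (the root API of a data product) might return; every field and URL below is illustrative, not part of any standard named in the talk:

```python
# Assumed shape of a data product's root "discovery" response.
def discovery_port() -> dict:
    return {
        "name": "podcast-listeners-demographic",
        "domain": "podcast",
        "documentation": "https://example.internal/docs/listeners-demographic",
        "slo": {"interval": "hourly", "max_freshness_minutes": 60},
        "output_ports": {
            "sql": "jdbc:example://warehouse/listeners_demographic",
            "files": "s3://example-mesh/listeners-demographic/",
            "events": "listeners-demographic-updates",
        },
        "lineage": {"inputs": ["podcast-player.play-events",
                               "user-profile.monthly-dump"]},
        "how_to_authorize": "https://example.internal/access-request",
    }


# A consumer, or the mesh experience plane's search, only needs this root
# endpoint to learn what the product serves and how to get access to it.
print(sorted(discovery_port()["output_ports"]))
```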

What is the platform? We talked about three planes. The plane I think is really important to focus on, if you're getting started, is the data product experience plane. You might have an endpoint, your lollipops, your data product deployment API, through which you declaratively deploy your data products. That plane then uses the utility plane to provision, say, the Blob Storage, because my output port happens to use Blobs. On the mesh experience plane at the top, you have mesh-level functions, search and browse. Those are the APIs you probably want to expose up there.

Blueprint Approach

How do we go about architecting this thing? Here I want to share three tips with you. Start with your data product. Start with designing what this data product is; I have given you some pointers on how to think about it, its components and elements. Once you start from there, you can figure out: if I want to provide all those usability characteristics, and I want to delight my data analysts and any other kind of data user, what are the APIs my platform needs to provide? Then you figure out the implementation. We've been around this block three times already; we've built three or four different versions of this experience plane. We started very primitively with a bunch of scripts and templates, and now we're quite sophisticated: I'll show you a spec file, and we pass a spec file to an API in a command. Go top-down, with an inverted hierarchy of purpose, from your data products, and then design the platform around them.

Learn a lesson or two from complex systems. Think about how birds fly. Birds don't have an orchestrator or a central system that manages them. They just have local rules: follow the leader and don't run into the other birds. That's all a bird needs to know. From the interaction of all the birds following those local rules emerges this beautifully complex flock that flies long distances. Apply the same complex adaptive systems thinking. You don't need a highly elaborate schema that covers all of the domains; you need a standard way of defining the data globally, with schemas defined locally, and then you connect them.

Affordances

Think about the affordances. What are the capabilities a single data product needs to provide? It needs to serve data. It is secured locally; it's a decentralized model. These capabilities, these affordances, should be available to the users locally. You manage the lifecycle: I need to be able to independently deploy this thing over time without breaking all the other data products and without upsetting their users. As we said, I need to pass the code into this lifecycle management, and I need to pass a model spec that says what this data product is, because we want to abstract away the complexity of how to configure it; we want a declarative way. This is a few lines of a much longer model spec for a data product: what are its input ports and output ports, and what are some of the policies it has? Check this into the repo of that data product, and pass it to your experience plane API on your platform of choice. I've borrowed this diagram from Eric's team, which uses Azure for the implementation. That plane then passes it to the utility plane to configure all of the bits and pieces you need. It's going to be a little bit ugly at the beginning. It already is, because we don't have that beautiful, simple, autonomous unit that you can deploy and that does all of the things I talked about.
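The spec shown on the slide isn't reproduced in this transcript, so the following is an invented illustration of what a few lines of such a declarative data product spec could look like, expressed here as a Python dict (a real spec might well be YAML); every name and value is an assumption:

```python
# Illustrative data product spec, checked into the product's repo and passed
# to the experience-plane API, which asks the utility plane to provision it.
data_product_spec = {
    "name": "emerging-artists",
    "domain": "artist-management",
    "input_ports": [
        {"from": "play-events", "mode": "stream"},
        {"from": "social-mentions", "mode": "daily-batch"},
    ],
    "output_ports": [
        {"name": "emerging-artists-monthly", "type": "files",
         "format": "parquet"},
        {"name": "emerging-artists-updates", "type": "events"},
    ],
    "policies": {
        "access": ["artist-management", "marketing"],
        "pii": "pseudonymize",
        "retention_days": 730,
    },
}

print(data_product_spec["output_ports"][0]["name"])
```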

When we started with microservices, we had one thing: the Unix process, this beautiful Unix process. Then we had Docker containers, or other forms of containers, that encapsulated our autonomous units of architecture. We don't have that in big data. Our autonomous unit of architecture, the data product, looks a little bit like a Frankenstein creation, because you have to stitch together a lot of different technologies that weren't meant to be decentralized. It is possible, though, and I'm hopeful that the technology will catch up.

Let's look at another of those affordances. Let's say you're designing your output ports; you're designing how you serve the data. To me, these characteristics of the analytical data you serve are non-negotiable. Your data needs multimodal access; if not, all of your data gets dumped into some warehouse somewhere else, and we're back to centralization. I highly recommend designing these APIs to be read-only, and temporal, time-variant. If the data is not temporal, then somebody else has to turn it into temporal, time-series data, and you want that to happen locally here. Most importantly, it should be immutable. The last thing you want is data for a snapshot in time changing, and the cascading effect of that. The read function of this output port is always a function of time: from what time to what time. We could talk at length about how to deal with time windowing and temporal information; I think we're lacking standardization in this field.
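A minimal sketch of that non-negotiable read interface, assuming a simple in-memory store (the storage and API names are illustrative): read-only for consumers, time-variant reads, and an append-only, immutable history.

```python
# Sketch: a read-only, temporal, immutable output port. Assumed design.
from bisect import bisect_left
from datetime import datetime, timezone
from typing import List, Tuple


class TemporalOutputPort:
    def __init__(self) -> None:
        self._records: List[Tuple[datetime, dict]] = []  # append-only

    def append(self, at: datetime, record: dict) -> None:
        """Producers only ever append; existing snapshots never change."""
        self._records.append((at, dict(record)))
        self._records.sort(key=lambda r: r[0])

    def read(self, from_time: datetime, to_time: datetime) -> List[dict]:
        """The read function is always a function of time: [from, to)."""
        times = [t for t, _ in self._records]
        lo, hi = bisect_left(times, from_time), bisect_left(times, to_time)
        return [dict(rec) for _, rec in self._records[lo:hi]]


port = TemporalOutputPort()
t0 = datetime(2021, 11, 1, tzinfo=timezone.utc)
t1 = datetime(2021, 11, 2, tzinfo=timezone.utc)
port.append(t0, {"podcast_id": "pc-9", "listeners": 120})
print(port.read(t0, t1))
```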

Once you decide what data you want, at what time and for what duration, then you have a choice of saying, actually, I want this data in the form of graphs, or semi-structured files, or events. This is its spatial topology. Polyglot data means you might have many different spatial topologies of that time-variant data. An app developer who is looking at the same data, like media streaming usage or play events, may want Pub/Sub access to an infinite log of events. Your report folks probably don't want that; they want SQL access so they can run their queries locally on the data product. Your data scientists want something else for their exploratory work; they want columnar file access. The definition and design of these output ports needs to satisfy these affordances: serving temporal, time-variant, polyglot data for the same core semantics of the information.
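As an illustrative sketch of "polyglot output ports" (the three adapters below are assumptions that stand in for a real event broker, SQL engine, and Parquet files), the same time-variant records can be exposed in different topologies for different consumers:

```python
# The same data served through several output-port shapes. Illustrative only.
records = [
    {"at": "2021-11-01T10:00:00Z", "podcast_id": "pc-9", "user_id": "u-1"},
    {"at": "2021-11-01T10:05:00Z", "podcast_id": "pc-9", "user_id": "u-2"},
]


def as_event_stream(rows):
    """For app developers: a log they can subscribe to, event by event."""
    for row in rows:
        yield row


def as_sql_rows(rows):
    """For analysts: tabular rows a local SQL engine could expose."""
    return [(r["at"], r["podcast_id"], r["user_id"]) for r in rows]


def as_columnar(rows):
    """For data scientists: column-oriented structures, as Parquet would be."""
    return {key: [r[key] for r in rows] for key in rows[0]}


print(next(as_event_stream(records)))
print(as_sql_rows(records)[0])
print(list(as_columnar(records)))
```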

Resources

If you want to learn more, here are a few references. My semi-recent article on Martin Fowler's site gives you a flavor of the architecture discussion we had here. There is a wonderful community, a grassroots movement, that started around data mesh learning; please join. I will also have deep-dive, certified multi-day tutorials, hosted by DDD Europe, that go deeper into the design of each of these architectural aspects.

When Not To Use Data Mesh

Nardon: When is it not a good idea to use data mesh?

Dehghani: Data mesh is a distributed system, a decentralized system, and it comes with complexity: managing these independent units and keeping them in harmony. If you don't have the pain points I mentioned, the scale, the diversity of sources, the diversity of the origins of those sources, very diverse domains and use cases, multiple teams that need to be autonomously responsible, then building a data mesh seems like a hell of an over-engineering effort. Really, the motivation for data mesh was organizations that are growing fast. They have many domains, many touchpoints, and great aspirations. Hopefully, your organization gets there someday. If you're not there yet, I don't think it makes sense.

It's also a matter of when. Right now, we're really at the innovator and early adopter end of the adoption curve for something new. The technology that you can just buy off the shelf, or use as open source, is limited. You still have to do quite a bit of engineering to stitch what exists today into this distributed mesh. If you don't have the engineering capacity or bandwidth, or your organization prefers to buy things and integrate them, perhaps right now is not the right time, even though you might have the scale.

 


 

Recorded at:

Nov 26, 2021
