
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Architecting for Data Products


Summary

Danilo Sato discusses what constitutes a data product, the different types of data products, how data products support data architecture at different levels, and the skills and team topologies needed.

Bio

Danilo Sato, VP and Global Head of Technology - Data & AI, advises executive leaders on topics ranging from data strategy and governance to building the products and platforms that bring the strategy to life. He is a member of Thoughtworks' Global Technology team. Author of DevOps in Practice, Danilo was named by DataIQ as one of the UK's 100 most influential people in data in 2022 and 2023.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Sato: The track is about the architectures you've always wondered about, and I work in consulting, so sometimes, not always, I have lots of stories about architectures you probably don't want to know about. I'll try to pick the things that are interesting. One of the benefits of working in consulting is I see lots of things, and what I try to do with these topics is less about zeroing in on a specific solution or a specific way of doing things; we like to look at things more in terms of, what are the patterns, what are the different ways that people can do it, maybe using different types of technology stacks?

What is this Organization's Business?

To start with, because we're talking about data architectures, I want to start with a question. I'll show you a data architecture, and I'll ask you to help me work out: what is this organization's business? This organization has a bunch of data sources: they've got ERP systems, a CRM system. They do their finance. They get external data. Then they ingest their data into a data lake or storage. They do that through streaming. They do that through batch. They do a lot of processing of the data to cleanse it, to harmonize it, to make it usable for people. Then they do serving of the data. They might put it in a data warehouse or a data mart where lots of consumers can use that data. They put it on a dashboard to write reports, to train machine learning models, or to do exploratory data analysis.

Which company is that? Has anyone seen anything like this before? I call this the left-to-right data architecture. Probably most people will have something like this, or a variation of this. What I like about using data products, and some of the ideas in data mesh, is that we can talk about data architectures a bit differently. I'll describe a different one now, and I'll use hexagons to represent data products, and some squares or rectangles for systems. This company has data about content. They call it supply chain, because they have to produce the content. They add metadata to the content.

Then they make that content available for viewers to watch. When they watch that content, they capture viewing activities. What are people actually watching? What are they not watching? When do they stop watching? When do they start watching? Then they do marketing about that to get people to go watch more of their content. They have what they call audiences. When we're looking at different types of audiences, one of the things that marketing people like to do is segment the audience so that we can target the message to reach people that might actually be more interested in a specific type of content than another one.

Which company is this? There are still a few of them, but at least the industry should be much easier to understand now, because we're talking about the domain. In this case, this was a project that we've done with ITV, which is here in the UK. I don't think I need to introduce them, but it's a broadcaster. They produce content. They distribute that content. They actually have a studios business as well. There's way more to it, but this was a thin slice of the first version from when we started working with them to build the data mesh. The first use case was in marketing. What we were trying to do is bring some of that data they already had into this square here.

I didn't put the label, but it was the CDP, the customer data platform, where they would do the audience segmentation, and then they could run their marketing campaigns. The interesting thing is, once you do segments, the segments themselves are useful data for other parts of the business to know about as well. We can describe architecture in a more domain-specific way. These are the things that hopefully will be more meaningful than the things underneath. Because if we zoom into one of these data products, you'll see that it actually has a lot of the familiar things.

We'll have ingestion of data coming in. It might come in batches. We might do it in streaming. We might have to store that data somewhere. We still do pipelines. We still process the data. We have to harmonize and clean it up. We will need to serve that data for different uses. That can be tied in to dashboards or reports, or, more likely, it could also be useful for the downstream data products to do what they need. When we look inside the data product, it will look familiar, with the technology stacks and maybe architectures that we've seen before, but data products give us this new vocabulary, this new tool for describing the architecture, one that's less about the technology and more about the domain.

Background

For this talk, I was considering what I should talk about, because those two, to me, are very related. We could look inside, at how we architect for a data product. The other thing that I'm quite excited about is that you can actually architect with data products, which is more about that bigger picture, looking at the bigger space. This is what I'll try to do. I am the Global Head of Technology at Thoughtworks. I've been here for over 16 years, but you see, my background is actually in engineering. I joined as a developer, and I fell into this data and AI space while here at Thoughtworks.

I was here when Zhamak wrote the seminal article that defined the data mesh principles, I was an early reviewer of the book, and we've been trying to apply some of these ideas. The reason why I picked the topic to be data products and not data mesh is that what we found is it's a much easier concept for people to grasp. You'll spot hints if you've read the data mesh book or the article; there are lots of things here that come from that. Data product is one way that makes it easy for people to engage with that content.

Shift in Data Landscape

To tell you a little bit about the history, why we got to where we are, why it's hard to architect or to design data architectures these days: there's been a shift. Way back, we were able to categorize things like this: we've got things that happen in the operational world, data that we need to run the business, with operational characteristics, and then we have this big ETL to move data into the analytical space where we'll do our reporting or our business intelligence. That's where a lot of those left-to-right architectures came about. We can use different ways to do those things, but in terms of how we model things, like databases, when I came into the industry, there weren't many options.

Maybe the option was, which relational database am I going to use? It was always tables. Maybe we design tables in a specific way if we're doing transactional workloads versus analytical workloads. That was the question: is it an analytical workload? Is it a transactional workload? Which relational database do we use? Early in the 2000s there were maybe a few new things coming up, probably not as popular these days, but some people tried to build object databases or XML databases. Things have evolved, and these days, this is what the landscape looks like. This got published. Matt Turck does this every year. This started off as the big data landscape, I think 2006 or something like that.

Then it's been evolving. Now it's the machine learning, AI, and data landscape, which he conveniently calls the MAD Landscape. It is too much for anyone to understand all of these things.

The other thing that happened is that we see those worlds getting closer together. Whereas before we could think about what we do operationally and what we do for analytics purposes as separate concerns, they are getting closer together, especially if you're trying to do more predictive or prescriptive analytics. If you train machine learning models, ideally you want the model to be used, backed by an operational system, to do things like product recommendation or to optimize some part of your business. That distinction is getting blurrier, and there are needs for getting outputs of analytics back into the operational world, and the other way around.

A lot of these analytics need good quality data to be able to train those models or get good results. These things are coming together. These were some of the trends the original idea for data mesh came from. We were seeing lessons we had learned about how to scale the usage of software. A lot of the principles of data mesh came from how we learned to scale engineering organizations. There are two in particular that I'll call out; they are principles in the book as well. One is the principle of data as a product. The principle means, let's treat data as a product, meaning: who are the people that are going to use this data? How can we make this data useful for people?

How can we put the data in the right shape that makes it easy for people to use? Not just something that gets passed around from one team to another, where you lose context of what the data means, which leads to a lot of the usual data quality problems that we see. In data mesh, at least, the data product is the building block for how we draw the architecture, like the ITV example I showed earlier on. The other thing around that is that if we're treating data as a product, we bring the product thinking as well.

In software these days we talk about product teams that own a piece of software for the long run; there's less of that build-and-throw-over-the-wall for other people to maintain. Data products will have similar long-term ownership, and that can help drive alignment between how we treat data, how we treat technology, and how we align that with the business.

The other big concept that influenced a lot of data mesh was domain-driven design. The principle talks about decentralizing ownership of data around domains. What Zhamak means by domains in that context comes from Eric Evans' book on domain-driven design, which originally was written about tackling complexity in software. A lot of those core ideas: how do we think about the core domain? How do we model things? How do we drive ubiquitous language? Do we use the right terminology that reflects everything from how people talk in the corridors all the way to the code that we write? How do we build explicit bounded contexts to deal with problems where people use the same name but mean different things in different parts of the organization? A lot of that comes from there, applied to data.

Modeling is Hard

The thing is, still today, modeling is hard. I like this quote from George Box: "Every model is wrong, but some of them are useful." To decide which one is useful, there are no blanket criteria that we can use. It will depend on the type of problem that you're trying to solve. We've seen on this track, I think most people said, with architecture, it's always a tradeoff. With data architecture, it's the same. With the way that we model data as well, there are multiple ways of doing things. This is a dumb example, just as an illustration.

I remember when I started talking about these things, there were people saying, when we model software or when we model data, we should just try to model the real world. Like, this thing exists, let's create a canonical representation of the thing. The problem is, let's say, in this domain, we've got shapes with different colors. I could create a model that organizes them by shape. I've got triangles, squares, and circles, but I could just as easily organize things by color. You can't tell me one is right and the other one is wrong if you don't know how I'm planning to use this. One might be more useful than the other depending on the context. These are both valid models of this "real world". Modeling is always this tradeoff.

How Do We Navigate?

We've got trends from software coming into data. We've got the landscape getting more complex. We've got some new thinking and principles from data mesh. How can we navigate that world? The analogy that I like to use here is actually from Gregor Hohpe. He talks about the architect elevator, or the architect lift, which would be the right terminology if you're here in the UK. He wrote a book about it, and he talks about how an architect needs to be able to navigate up and down. If you imagine a tall building representing the organization, you might have the C-suite at the top, you might have the upper management, the middle management, the lower management, and then you might have people doing the work in the basement somewhere.

As an architect, you need to be able to operate at these multiple levels, and sometimes your role is to bring a message from one level to the other, to either raise concerns when something is not feasible, or to bring a vision down from the top to the workers. I like to apply a similar metaphor to think about architecture concerns. I'm going to go from the bottom up. We can think about data architecture within a system: how we store or access data within an application, a service, or, in this talk, more within the data product. Then we've got data that flows between these systems. Now it's about data being exposed and consumed by other parts of the organization.

Then you've got enterprise-level concerns about the data architecture, which might be: how do we align the organizational structure? How do we think about governance? In some cases, your data gets used outside of the organization. Some big clients of ours actually have companies within companies, so even sharing data within their own companies might present different challenges.

Data Architecture Within a System

I'll go from the bottom up to talk about a few things to think about. Starting, like I said, from the bottom, the first level is within a system. But which system are we talking about? Because in the operational world, if we're serving transactional workloads with lots of reads and writes and lots of low-latency requirements, the choice there is not so much about data products; it's about which operational database I'm going to use to solve that problem. When we talk about data products in data mesh, you will see, in the book, the definition is a socio-technical approach to manage and share analytical data at scale.

We've always been focusing data products on those analytical needs. It's less about the read/write choices; it's more about using data for analytical needs. The interesting thing is that some folks that work in analytics are quite used to breaking encapsulation. I don't know if you've got those folks in your organization, but they're like, just give me access to your database, or just give me a feed of your data, and then we'll sort things out. We don't do that when we design our operational systems. We like to encapsulate and put APIs or services around how people use our things.

In analytics, at least historically, it's been an accepted way of operating where we just let the data team come and suck the data out of our systems, and they will deal with it. You end up with that disparate thing, and when we break encapsulation, the side effect is that the knowledge gets lost. This is where we see a lot of data quality problems creep in, because the analytics team is going to have to try to rebuild that knowledge somehow with data that they don't own, and the operational teams will keep evolving, so they might change the schema. We create a lot of conflict. The data team doesn't like the engineering teams, and vice versa.

On the operational database side, like any other architecture choice, it's about understanding what data characteristics you need, or, another way to think about it, the cross-functional requirements. For data, it might be: how much volume are you expected to handle? How quickly is that data going to change? Aspects around consistency and availability, which we know is a tradeoff on its own, and different access patterns. There will be lots of choices. Think about all the different types of databases that you could choose; I'll just quickly go through them. You've got the traditional relational database, which is still very widely used, where we organize data in tables.

We use SQL to query that data. You've got document databases where maybe the data is stored as a whole document that we can index and then retrieve all at once. We've got those key-value stores where, like some of the caching systems that we build, we just put some data in with a key so we can retrieve it fast. We've got those wide-column style databases like Cassandra or HBase, where we model column families, and then you index the data in a different way depending on the types of queries that you're going to make. We've got graph databases where we can represent the data as graphs, relationships between nodes and entities, and infer or navigate the graph with a different type of query.

Some people have search databases, like Elasticsearch or things like that, where we're trying to index text to make it easier to search. We've got time-series databases, so maybe you've got high-throughput temporal data that we might need to aggregate over different periods of time because we're trying to do statistics or understand trends. We've got columnar databases, which are popular with analytical workloads, where the way the data is structured and stored is organized around columns, so it's easier to run analytical queries against it. This thing keeps evolving. I was just revisiting these slides, and actually now, with all the new things around AI and new types of analytics, there are even new types of databases coming up.

Vector databases are quite popular now: if you're trying to train an LLM or do RAG on the LLM, you might want to pull context from somewhere, and where we store our embeddings will be in a vector database. Machine learning data scientists use feature stores, where they can encapsulate how they calculate a feature that's used for training their models, but they want that computation to be reused by other data scientists, and maybe even be reused both when they're training the model and at inference time, where they will have much stricter latency requirements on it.
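To make that concrete, here is a toy sketch of what a vector store does for RAG-style retrieval: keep embeddings keyed by document id, then rank documents by cosine similarity to a query embedding. Everything here is illustrative; real vector databases add approximate nearest-neighbour indexes (such as HNSW) to make this fast at scale.

```python
import numpy as np

# Toy in-memory "vector store": document embeddings keyed by id.
store = {
    "doc1": np.array([0.1, 0.9]),
    "doc2": np.array([0.8, 0.2]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query: np.ndarray, k: int = 1) -> list:
    # Rank stored documents by similarity to the query embedding.
    return sorted(store, key=lambda d: cosine(query, store[d]), reverse=True)[:k]

print(nearest(np.array([0.2, 0.8])))  # -> ['doc1']
```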

There are some metrics databases coming up: how do we compute KPIs for our organization? People are talking about modeling that. Even relations: how can we store relations between concepts? How can we build a semantic layer around our data so it's easier to navigate? This landscape keeps evolving, even if I leave out all the logos of the vendors that are actually doing this.

There are more things. We could argue afterwards: actually, Elasticsearch claims to be a document database as well, not just a search database. I can use SQL to query a graph database as well, because there's a dialect that's compatible with that. There are lots of nuances around that. There are some databases that claim to be multimodal: it doesn't matter how the data is going to get stored, you can use one product and it will give you the right shape that you need.

You have to choose whether you're going to use the cloud services that are offered, whether you're going to go open source, or whether you're going to get a vendor. The interesting thing is, sometimes you don't even need a full database system running. When we talk about data, sometimes it's: I just want a snapshot of that data to be stored somewhere in the lake. It might be a file, a Parquet file, just sitting on a storage bucket somewhere.
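As a minimal sketch of that snapshot style of serving, assuming pandas with pyarrow and s3fs installed, and a purely hypothetical bucket path:

```python
import pandas as pd

# A snapshot "output" can be as simple as a Parquet file in object storage.
df = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["a", "b", "a"]})
df.to_parquet("s3://my-lake/snapshots/segments.parquet")  # hypothetical path
```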

Data Product

Now I'll shift more to the data product, which is more on the analytical side. One of the questions I get a lot from clients is: what is a data product? Or, is this thing that I'm doing a data product? For the way that I like to think about it, I'll ask you some questions here. If you've got a table, or you've got a spreadsheet, do you think that's a data product? The answer is probably not, but there are other ways that we can have data. Maybe it's an event stream or an API, or I'm plugging a dashboard into this; does this become a data product now? Who thinks that? It doesn't stop there, because if we're going to make it a data product, we need the data to be discoverable and the data to be described.

There will be a schema associated with the data and with how that data gets exposed, especially on the output side, and we want to capture metadata as well. That makes it more robust. Now, is that a data product? It doesn't end there. We want the metadata and the data product to be discoverable somewhere. We're probably going to have a data catalog somewhere that aggregates these things so people can find them. Also, like I said before, in the zoomed-in version, there will be code that we write for how to transform, how to ingest, how to serve data for this product. We need to manage that code. We need to choose what kind of storage for this data product we're going to use, if we're doing those intermediary steps. As we do transformations, we probably want to capture the lineage of the data, so that's going to have to fit somewhere as well.

More than that, we also want to put controls around quality checks: is this data good? Is it not good? What is the access policy? Who is going to be able to access this data? We're even talking about data observability these days, so we can monitor: if we define service level objectives, or even agreements, for our data, are we actually meeting them? Do we have data incidents that need to be managed? This makes a little bit more of the complete picture. We want the data product to encapsulate the code, the data, and the metadata that's used to describe them. We use this language of input ports, output ports, and control ports. The input and output ports are how the data comes in and how the data gets exposed. It could be multiple formats. It might be a database-like table as an output port. It might be a file in storage somewhere. We might use event streams as an input or output. It could be an API as well.

Some people talk about data virtualization. We don't want to move the physical data, but we want to allow people to connect to data from other places. There are different modes of how the data could be exposed. Then the control ports are the ones that connect to the platform, to either gather information about what's happening inside the data product, to monitor it, or maybe to configure it.

The thing is, because it gets complex, what we see when we're trying to build data products is that a key piece of information comes into play: the data product specification. We want to create a versionable way to describe what that product is: what are its inputs, what are its outputs? What's the metadata? What are the quality checks? What are the access controls? All of it described to the platform so it knows how to manage and provision this data product.
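As a rough illustration of what such a specification might contain, here is a minimal sketch as plain data. Every field and name is illustrative rather than any particular standard; in practice this would typically live as a YAML descriptor in version control.

```python
# A minimal, versionable data product specification, sketched as plain data.
# All names and fields are invented for illustration.
audience_segments_spec = {
    "name": "audience-segments",
    "version": "1.0.0",
    "domain": "marketing",
    "owner": "audience-team@example.com",
    "inputPorts": [
        {"name": "viewing-activity", "type": "event-stream"},
    ],
    "outputPorts": [
        {"name": "segments", "type": "table", "format": "parquet"},
    ],
    "slos": {"freshness_hours": 24},
    "qualityChecks": ["segment_id_not_null", "row_count_above_threshold"],
    "accessPolicy": "marketing-analysts-only",
}
```

The other term that we use a lot when we talk about data products is the DATSIS principles.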

If we want a checklist: is this a data product or not? The D means discoverable. Can I find that data product in a catalog somewhere? Because if you build it and no one can find it, it's not discoverable. If I find it, is it addressable? If I need to actually use it, can I access it and consume it through some of these interfaces that it publishes? The T is for trustworthy. If it advertises its quality checks, the SLOs, or the lineage, that makes it easier for me, if I don't know where the data is coming from, to understand: should I trust this data or not? Self-describing: all that metadata about where it fits in the landscape, who's the owner of that data product; the specification is part of the data product. It needs to be interoperable.

This gets a little bit more towards the next part of the presentation. When we talk about sharing data with other parts of the organization, do we use a formal language or knowledge representation for how we allow people to connect to the data? Do we define standard schemas or protocols for how that's going to happen? Then S is for secure, so it needs to have the security and access controls as part of that.

Data Modeling for (Descriptive/Diagnostic) Analytical Purposes

Let me quickly dive into modeling approaches when we talk about analytics. I put this in smaller caps because this is maybe for the more descriptive, diagnostic types of analytics; it's not anything machine learning related. There are different ways to do modeling for analytics. We can keep the raw data. We can use dimensional modeling. We can do data vault. We can do wide tables. Which one do we choose? The dimensional model is probably one of the most popular ones. We structure the data either in a star schema or a snowflake schema, where we describe facts with different dimensions. Data vault is interesting. It is more flexible. You've got this concept of hubs that define business keys for key entities that can be linked to each other.

Then both hubs and links can have these satellite tables, which are actually where you store the descriptive data. It's easier to extend, but it might be harder to work with, as the learning curve is a little bit higher. Then we've got the wide table, like one big table: just denormalize everything, because then you've got all the data that you need in one place. It's easy to query. That's another valid model for analytics. This is my personal view; you might disagree with this. Think about some of the criteria: performance of querying, how easy it is to use, how flexible the model is, and how easy it is to understand. I put raw data as kind of the low bar there, where, if you've just got the raw data, it's harder to work with because it's not in the right shape.

Performance for querying is not good. It's not super flexible, and it might be hard to understand. I think data vault improves on that, especially on model flexibility, because it's easy to extend the model, but maybe it's not as easy to use as a dimensional model. The dimensional model might be less flexible, but maybe it's easier to use and to understand. Many people are used to it. Performance might be better because you're not doing too many joins when you're trying to get information out of it. Then, wide tables probably have the best performance, because everything is in a column somewhere, so it's very quick to use and very easy, but maybe not super flexible, because if we need to extend it, we have to add more things and rethink how that data is being used.
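To make the dimensional model concrete, here is a minimal sketch with invented table and column names: a fact table joined to a dimension table and then aggregated, which is the typical star-schema query shape.

```python
import pandas as pd

# Hypothetical star schema: a viewing fact table plus a content dimension.
fact_viewing = pd.DataFrame({
    "content_id": [1, 1, 2],
    "viewer_id": [10, 11, 10],
    "minutes_watched": [42, 17, 55],
})
dim_content = pd.DataFrame({
    "content_id": [1, 2],
    "genre": ["drama", "sport"],
})

# Typical analytical query: join the fact to a dimension, aggregate by it.
print(
    fact_viewing.merge(dim_content, on="content_id")
    .groupby("genre")["minutes_watched"]
    .sum()
)
```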

Data Architecture Between Systems

I'll move up one level now to talk about data architecture between systems. There are two sides to this. I think the left one is probably the more important one. Once we talk about using data across systems, the decisions that we make have a wider impact. If it was within my system, it's well encapsulated; I could even choose to replace the database or come up with a way to migrate to a different database. But once the data gets exposed and used by other parts of the organization, now I've made a contract that I need to maintain with people.

Talking about, "What's the API for your data product?" The same way that we manage APIs more seriously about, let's not try to make breaking changes. Let's treat versioning more as a first-class concern. That becomes important here. Then the other thing is how the data flows between the systems. Because there's basically two different paradigms for handling the data in motion. Thinking about the data product API, because it needs more rigor, one of the key terms that are coming up now, people are talking about this, is data contracts.

We're trying to formalize a little bit more: what are the aspects of this data that are in a contract now? We are making a commitment that I'm going to maintain this contract; it's not going to break as easily. Some of the elements of the contract: the format of the data, so, is my data product going to support one format or maybe multiple formats? Is there a standard across the organization? Examples might be Avro, Parquet, maybe JSON or XML, Protobuf, things like that. There are different ways that you could choose. The other key element is the schema for the data, which helps people understand the data that you're exposing.

Then, how do we evolve that schema? Because if we don't want to make breaking changes, we need to manage how the schema evolves in a more robust way as well. Depending on which format you chose, there are different schema description languages or different tools that you could use to manage that evolution of the schema. The metadata is important, because it's how it hooks into the platform for discoverability, to interoperate with the other products. Then, discoverability: how can other people find it? I mentioned before having something like a data catalog that makes it easy for people to find. They will learn about what the data is, how it can be used, and then decide to connect to the API.
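As a sketch of the schema element of a contract, here is a small, hypothetical Avro record schema expressed as a Python dict; the record and field names are invented, and the comment marks what a safe evolution looks like.

```python
# A minimal, hypothetical Avro schema for a data contract's output port.
viewing_activity_v1 = {
    "type": "record",
    "name": "ViewingActivity",
    "fields": [
        {"name": "viewer_id", "type": "string"},
        {"name": "content_id", "type": "string"},
        {"name": "started_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        # Adding an optional field with a default is a backward-compatible
        # evolution; renaming or removing fields would break consumers.
        {"name": "device", "type": ["null", "string"], "default": None},
    ],
}
```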

When we talk about data in motion, like I said, there are basically two main paradigms, and they're both equally valid. The very common one is batch, which usually means we're processing a bounded dataset over a time period. We've got the data for that time period, and we just run a batch job to process that data. When we're doing batch, you're usually going to have some workflow orchestrator for your jobs or for your pipelines, because oftentimes it's not just one job; you might have to do preprocessing, processing, joining, and things like that. You might have to design multiple pipelines to do your batch job. Then the other one that's popular now is streaming.

We're trying to process an infinite dataset that keeps arriving. There are streaming engines for writing code to do that kind of processing. You have to think about more things when you're writing these jobs. The data might arrive late, so you have to put controls around that. If the streaming job fails, how do you snapshot so that you can recover when things come back? It's almost like you've got a streaming processing job that's always on, and it's receiving data continuously. There's even, for instance, Apache Beam, which tries to unify the model so you can write code that can be used for either batch processing or stream processing.
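Here is a minimal sketch of that unified model, assuming the apache-beam package is installed: the same counting transform can run over a bounded (batch) or an unbounded (streaming) source, and only the input and the runner change.

```python
import apache_beam as beam

def count_views(events):
    # The same transform works for bounded (batch) and unbounded (streaming)
    # collections; windowing would be added for the streaming case.
    return (events
            | "KeyByContent" >> beam.Map(lambda e: (e["content_id"], 1))
            | "SumPerContent" >> beam.CombinePerKey(sum))

with beam.Pipeline() as pipeline:
    bounded = pipeline | beam.Create([
        {"content_id": "drama-01"},
        {"content_id": "drama-01"},
        {"content_id": "sport-07"},
    ])
    count_views(bounded) | beam.Map(print)  # ('drama-01', 2), ('sport-07', 1)
```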

Flavors of Data Products

The other thing, thinking about between systems: I was trying to catalog the different flavors of data products that end up showing up in the architecture. When we read the book and the first article from Zhamak, she talks about three types of data products. For example, we've got source-aligned data products. Those are data products that are very close to, let's say, the operational system that generates that data. They make that data usable for other parts of the organization. Then we may have aggregate data products. In this example here, I'm trying to use some of the modeling that I talked about before.

Maybe we use a data vault type of modeling to aggregate some of these things into a data vault model as an aggregate. The dimensional model, in a way, is aggregating things as well. This is where it gets blurry: the dimensional model might also act as a consumer-aligned data product, because I might want to plug my BI visualization tool into my dimensional model for reporting or analytics. I might transform, maybe, from data vault into a dimensional model to make it easier for people to use, or maybe into a wide table format. Then the other one I added here: let's say we're trying to do machine learning now. I need to consume probably more raw data, but now the output of that trained model might be something that I need to consume back in the operational system.

We've got these three broad categories of flavors of data products. What I've done here: there are more examples of those types of products, so I was trying to catalog a few based on what we've seen. In the source-aligned category, we've got the obvious ones, the ones that are linked to an operational system or applications. Sometimes what we see is a connector to a commercial off-the-shelf system. The data is actually in your ERP that's provided by vendor X. What we need is connectors to hook into that data to make it accessible for other things. The other one I put here is the CDC anti-corruption layer. CDC stands for Change Data Capture, which is one way to do that: pulling the data out of the database from somewhere.
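As a sketch of what that anti-corruption layer might do, using Debezium-style operation codes and invented vendor column names: translate a low-level row-change event into a stable, domain-shaped record, so downstream consumers never see the vendor's table layout.

```python
from typing import Optional

# Hypothetical anti-corruption layer over Debezium-style CDC events.
# Vendor column names (CLM_NO, SRC, AMT) are invented for illustration.
def to_domain_claim(cdc_event: dict) -> Optional[dict]:
    if cdc_event.get("op") not in ("c", "u"):  # creates and updates only
        return None
    row = cdc_event["after"]
    return {
        "claim_id": row["CLM_NO"],
        "channel": "web" if row["SRC"] == "W" else "phone",
        "amount": float(row["AMT"]),
    }

event = {"op": "c", "after": {"CLM_NO": "C-123", "SRC": "W", "AMT": "250.00"}}
print(to_domain_claim(event))  # {'claim_id': 'C-123', 'channel': 'web', ...}
```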

This is a way to try to avoid the leakage, where we can at least keep the encapsulation within that domain. We could say, we'll use CDC because the data is in a vendor system that we don't have access to, but we'll build a data product that acts as the anti-corruption layer for the data, for everyone else in the organization. Not everyone needs to solve the problem of understanding the CDC-level events. Or, external data: if I bought data from outside, or I'm consuming external feeds of data, maybe they show up as source-aligned in my mesh. The next one is aggregates.

Aggregates I keep arguing about with my teams as well, because lots of things are actually aggregates when we're trying to do analytics. In a way, it's like, is it really valid or not? Just as an example, sometimes aggregates are closer to the producer. The example I have is, we were working with an insurance company, and you can make claims on the website, or I can call on the phone. What happens behind the scenes is that there are different systems that manage the website and the phone. You've got claims data from two different sources, but for the rest of the organization, maybe it doesn't matter where the claim comes from, so I might have a producer-owned aggregate that combines them and makes it easy for people downstream to use that. We might also have analytical models that are more like aggregates across domains.

When we're trying to do KPIs, usually, business people ask about everything. You somehow need access to data that comes from everywhere. The common one is consumer-owned. When we're trying to build, maybe, a data warehouse model, or something like that, you aggregate for that purpose. Then the other one is a registry style. This one comes up when we're trying to do, for instance, master data management, where, let's say, we've got an entity resolution piece in the architecture, because we've got customer data everywhere. I aggregate into this registry to try to do the entity resolution, and then the output, how I clustered those entities, can be used either by other operational systems or for downstream analytics.

Then in consumer-aligned, there are many of them. The data warehouse is a typical one. I put operational systems here as well because people talk about reverse ETL, where we take the output from analytics and put it back into the operational system. To me, that's just an example of using output from another product to feed back into the operational system. This was the example from ITV at the beginning, where we were trying to feed that audience data into the CDP. It's actually aggregating things and putting them into the consumer side for marketing.

I think metric stores, feature stores, and building machine learning models are examples of things that are more on the consumer side. Or if we're trying to build an optimization type of product, that might need data from other places, and then the output might be the optimization. The interesting thing, if you think about this now, is that data products were born with the purpose of being analytical data products, but now we're starting to feed things back into the operational world, which means the output ports can actually be subject to operational SLAs as well. This is a question that I get a lot.

If I'm only using data products for analytics, how does it feed back into the operational world? In my view, that's a valid pattern. The output port can be something that you need to maintain just like another operational endpoint, one that gives you, let's say, model inference at the speed that the operational system needs.
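As a sketch of that pattern, here is an output port exposed as a low-latency inference endpoint. The framework choice (FastAPI) and every name here are assumptions for illustration, not anything prescribed by the talk.

```python
from fastapi import FastAPI

app = FastAPI()

def predict(features: dict) -> float:
    # Stand-in for a real trained model loaded from a model registry.
    return 0.42

# An "output port" served as an operational endpoint, subject to the
# consuming system's latency SLAs rather than analytical ones.
@app.get("/recommendations/{viewer_id}")
def recommend(viewer_id: str) -> dict:
    return {"viewer_id": viewer_id, "score": predict({"viewer_id": viewer_id})}
```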

Data Architecture at Enterprise Level

That leads me to the last level, the enterprise level. Here we start getting more into organizational design. How is the business organized? How do we find those domains? A lot of that strategic design from DDD comes into play here, because we need to find where the seams between the different domains are, and who's going to own these different domains. One question that comes up a lot when we talk about data mesh specifically: data mesh is very opinionated that if you need to scale, if you've got a lot of data and a lot of use cases, the centralized model will become a bottleneck at some point.

If you haven't got to that point, maybe a centralized model is a good way to start. If you're trying to solve the problem of "my data teams are a bottleneck", then the data mesh answer is: look at a decentralized ownership model for data. A lot of that domain thinking and data as a product is what enables you to decentralize ownership. This is a decision you can make. When we start mapping that to different domains or different teams: data products usually sit in a domain, and a domain will have multiple data products.

The data products can be used both within that domain and across different domains. We will draw different team boundaries around that. A team could own one or multiple data products, which is another question I get a lot: if I'm going to do all this, do I need one team for each data product? Probably not. The team might be able to own multiple of them. Sometimes they can own all of the data products for a domain, if it's something that fits in the team's head. The key thing is they have that longer-term ownership of those data products.

Then, there's another principle from data mesh to enable this. If you're trying to decentralize, if you're trying to treat data as a product, how do I avoid different teams building data in different ways, or building things that don't connect to each other? The answer is, we need a self-serve platform that helps you build those in a more consistent way, and also drives speed, because then you don't need every team solving the same problem over again. The platform needs to think about how it supports these different workloads, whether batch or streaming, analytical, AI, or operational things. The platform also enables effective data governance, because we start tracking those things I was talking about before: quality, lineage, access control. If the platform enables that, then it gives you a better means to implement your data governance policies.

It delivers more of this seamless experience of how data gets shared and used across the organization, for both producers and consumers of data. That might be engineering teams, and eventually even business users. If you're trying to drive that democratization of analytics, and you've got business people trained to use things, they could also be users in this world. How does the platform support these multiple workloads? Going back to the example I had before, I had self-service BI here as an example of a platform capability, maybe. There's actually more. In this case, I've picked a few examples.

I might have to pick different types of databases on the operational side. Maybe an event streaming platform to support how the events flow between the systems. Some kind of data warehouse technology here. I need a lot of the MLOps stack: how do I deploy the machine learning models? How do I manage all of that?

Another way to talk about it: if you read the Data Mesh book, Zhamak describes the data platform as actually multiplane. Oftentimes, when I go to clients, especially if they're not doing data products or data mesh, they will already have something in place. They will have the analytics consumption plane and the infrastructure plane. Analytics is where they're doing those visualizations or business reporting, or even, like I put here, data science and machine learning as a use case of data, built on some infrastructure that they've been building, which could be cloud or on-prem. Infrastructure provides the building blocks to actually store the data, process the data, and put some controls around that data.

The other thing that we also need, I think, is what she calls in the book the data mesh experience plane. If we're abstracting away from data mesh, what that is, is really a supervision plane: I'm trying to get a view across the entire landscape of data. The missing bit there is the data product experience plane. This is something that usually we don't find, because that is where we put capabilities around: how do I define the specification for the data product, the types of input and output ports that I will support? How do I support the onboarding and the lifecycle of managing these data products, and metadata management? There are a lot of things there; we've even got data product blueprints.

When I was talking about the different flavors: if the organization has data products that are similarly flavored, one thing you can do is extract those as blueprints. When you need another one that's very similar, it's even easier for the teams to create a new one. In the supervision plane, especially in data mesh, we talk about computational governance; that's the term that we use in the book. The idea is that we're informing that by connecting to the infrastructure plane, and building it into the data product onboarding and lifecycle management, so that we can actually extract those things around data lineage, data quality, and auditing.

For example, if you need access audits for who's doing what. This is a capability view, so I didn't put any product or vendor here, but it's a good way to think about that platform. You can see how it can quickly become very big. There are lots of things that fall under this space. Usually, we will have one or multiple teams that are actually managing that platform as well, and they build and operate the infrastructure. The key thing is a self-serve platform. If it's not self-serve, then it means the teams that need things will have to ask the platform teams to build them. Self-service enables you to remove that bottleneck, so when they need it, they can get it.

One way that I really like to think about how to structure the teams and how to split the ownership is using team topologies. If you haven't read it, this book is highly recommended; it's not just for data, it actually talks about software teams and infrastructure teams. The thing that I really like is they use the idea of cognitive load as a way to identify: is it too much for the team to understand? Maybe we can split things, split teams. How does value flow between these different teams? What are the interaction modes between them? Here's a quick, hypothetical example, using the team topologies terminology, if you're used to it.

Maybe the self-serve data platform team offers the platform as a service, so that's the interaction mode, and the data product teams are the stream-aligned teams, in team topologies terminology. Then we might have, let's say, governance as an enabling team that can facilitate things within the product teams, and they might collaborate with the platform team to implement those computational aspects of governance into the platform itself.

The last bit I'll talk about is governance, because when we talk about enterprise data usage, governance becomes important. How does the company deal with ownership, accessibility, security, and quality of data? Governance goes beyond just the technology. This gets into the people and the processes around it. An example here, tying back to the data product space: computational governance could be that we build a supervision dashboard where we can see, across the different domains, these are the data products that exist. I could see, do they meet the DATSIS principles or not? I might even have some fitness functions or tests to run against them to see if a data product actually meets them or not.
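A sketch of what such a fitness function could look like; the product record shape and the field names are assumptions, not a real catalog API.

```python
# Automated checks that a registered data product meets the DATSIS
# principles; the record shape below is invented for illustration.
def datsis_checks(product: dict) -> dict:
    return {
        "discoverable": bool(product.get("catalog_entry")),
        "addressable": bool(product.get("output_ports")),
        "trustworthy": all(s.get("met") for s in product.get("slos", [])),
        "self_describing": bool(product.get("owner") and product.get("schema")),
        "interoperable": product.get("format") in {"avro", "parquet", "json"},
        "secure": bool(product.get("access_policy")),
    }

product = {
    "catalog_entry": "audience-segments",
    "output_ports": ["segments"],
    "slos": [{"name": "freshness", "met": True}],
    "owner": "audience-team",
    "schema": {"segment_id": "string"},
    "format": "parquet",
    "access_policy": "marketing-analysts-only",
}
assert all(datsis_checks(product).values())  # would surface on a dashboard
```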

Summary

Thinking about data, there are a lot of things. I talked about data within a system, between systems, and at the enterprise level.

Questions and Answers

Participant: You mentioned the data platform, this self-service one, but that's not something that you just get completely out of the blue one day. It needs to be built. It needs to be maintained. How do you get to that point? Especially considering your data sources are mostly operational data, do you use your operational data teams to build that, or do you bring in data product teams, or a bit of mix and match? Because there are pros and cons to both of them.

Sato: I think if you've got the ambition to have lots of data products, then the platform becomes more important. When we're starting projects, usually what we say is, even if you don't start with all the teams that you envision having, you should start thinking about the platform parts of it almost separately. We might build that together with the same team to start with, but very quickly the platform team will have its own roadmap of things that they will need to evolve, and you will want to scale more data product teams that use it. Again, this is assuming you've got the scale and you've got the desire to do more with data. That's the caveat on that. I think it's fine to start with one team, but always keep in mind that those will diverge over time.

 


 

Recorded at:

Sep 19, 2024
