Transcript
So I'm Randy Shoup, your track host. I will introduce the next speaker, the VP of engineering at StitchFix.
Hi, I'm Randy Shoup, the VP of StitchFix.
Background
I want to talk about data in microservices. Thank you for coming. And my background, it informs a lot of the lessons I will impart. I'm VP of engineering at StitchFix- it is a clothing retailer in the United States and we use data science to help customers find the clothing they like. Before StitchFix, I was a roving CTO of services; I was helping companies discuss technologies and these situations. And on Thursday I will give a talk on how to scale your organizations and technology and all that.
Earlier in my career, I was director of engineering at Google for Google app engine, so that is Google's platform as a service, like Heroku, or something like that, and then earlier I was chief engineer at Ebay for six and a half years. We helped our team build multiple generations of search infrastructure. If you have gone to eBay and find something that you liked, good, my team did a good job, and if you didn't find it, you know where to put the blame.
Sticht Fix
A little bit of StitchFix, because that informs the lessons and the techniques as I talk about breaking monoliths into microservices. StitchFix is the reverse of the standard clothing retailer; rather than shopping online, or going to a store, what if you had an expert do it for you? So you fill out a profile about yourself, it is 60 questions, it might take you 20 to 30 minutes, we ask your size, height, weight, what styles do you like, flaunt your arms, hide your hips- very detailed and personal things that we ask, and why? Because if there is somebody in your life that knows how to choose clothes for you, what does that person know about you, we are using that explicitly and data science to make it happen. As a client, you have five hand-picked items we deliver to your door step, they are hand chosen for you by a human stylist, you keep the things that you like, you will pay us for those, and you will return the rest for free.
And so what goes on behind the scenes is a couple of things that includes humans and machines. So the machine side is that, every night, we look at every piece of inventory, times every client we have, and we compute a predicted probability of purchase; what is the conditional probability, if we send Randy this shirt, he will keep it. A 72 percent chance for this shirt, and 54 percent chance for the shoes and 47 percent chance for the shoes, and for each of you in the room the percentages are going to be different. Those are computed by machine learned models that we layer up in an ensemble way to compute that score, and that computes a set of algorithmic recommendations that go to the stylists.
So those algorithmic recommendations- as the stylist is essentially shopping for you, choosing those five items on your behalf, he or she is looking at those recommendations and figuring out what to put in the box. And so, what does the human do? He or she will put an outfit together, which the machines are not able to do, or the human will answer a request that you have; I'm going to Manhattan for a wedding, send me something appropriate. The machine doesn't know what to do that, but if you are a woman, you are like, slinky black dress. The humans know things that the machines don't. Cool.
Data at the Center
All of this requires a ton of data. And, interestingly, and I believe uniquely in the world, StitchFix has a one-to-one ratio between data science and engineering. So we have more than 100 software engineers in the team that I work on, and we have roughly 80 data scientists and algorithm developers that are doing all the data science. And, to my knowledge, this is a unique ratio in the industry. I don't know any other company on the planet that has this kind of near one-to-one ratio.
What do we do with all of those data scientists? It turns out, if you are smart, it pays. So we apply the same techniques to what clothes we're going to buy, so we make algorithmic recommendations to the buyers and they figure out that, okay, next season, we're going to buy more white denim, or cold shoulders are out, or capri pants are in next. We do the same for inventory management; what do we keep in what warehouses, and so on. We do the logistics optimization, so the carrier selection is on the door step when you want at minimal cost to us. The style recommendations, and then demand prediction. We are a physical business, we physically buy the clothes, you put them in warehouses, and we ship them through the mail to you. So, unlike Ebay and Google and a bunch of virtual businesses, if we guess wrong about demand, if demand is 2X what we expect, that is not a wonderful thing that we celebrate, that's a disaster because it means that we can only serve half of the people well. Does it make sense? If we have double the number of people, we should have double the number of warehouses, stylists, employees, and that kind of stuff. It is very important for us to get these things right. Cool. And again, the general model here is that we use humans for what the humans do best and machines for what the machines do best.
Design Goals
When you design a system at this scale, as I'm sure you hopefully do, there are a bunch of goals. So you want to make sure that the development teams can continue to move forward at a quick pace, that's what I call feature velocity; the teams are able to move forward rapidly and independently. We want scalability, like as our business grows, we want the infrastructure to grow with it, we want the components to scale to load, scale to the demands that we put on them. And also, we want those components to be resilient; we want the failures to be isolated and not cascaded through the infrastructure.
High-Performing Organizations
And so a talk about what high performing organizations that have these kinds of requirements have to do. So if people are familiar with the DevOps hand book, raise your hands. Maybe a third of the people, that is excellent. This is research from there from Kim and Nicole Grim and other people have done: what is the difference between high performing organizations and lower ones. Higher performing organizations both move faster and are more stable. So you don't have to make this choice between speed and stability; you can have both. So the higher-performing organizations are doing multiple deploys a day, versus maybe one per month, and higher-performing organizations have a latency between the time that I commit my code to the source control and to deployment of less than an hour, and other organizations might take a week for that to happen. That's the speed side.
Here is the stability side. These high-performing organizations recover from failure in an hour, versus it might take a day in a lower-performing organization, and the rate of failures is lower. So the percentage of time that a high-performing organization does a deploy, and it doesn't go well and has to be rolled back, is much lower, zero percent, and in slower organizations it is up to a half. This is a big difference, and a technique we will use is microservices that we will talk about for the rest of the session. But it is not just the speed and the stability, it is not just the technical metrics.
The higher-performing organizations are two and a half times more likely to exceed interesting business goals, like profitability, market share, and productivity. So this does not just matter to engineers, but to business people. So one of the things that I got asked a lot when I was doing my roving CTO as a service gig was this sort of thing. So I had people come to me, they would say, hey, Randy, you worked at Google and Ebay, tell us how you did it. I will say, I promise to tell you, and you have to promise not to do those things yet. Not because I wanted to hold the secrets together, but for a 15,000 person engineering team that Google has, that's a different set of problems than a 5-person start up that sit around a conference table. That is three orders of magnitude different, so there will be different solutions at different scales for different companies.
Evolution to Microservices
So I would love to tell you how the companies you have heard of have evolved to microservices, not just started with microservices, but evolved there over time. And Ebay is now on its 5th complete re-write of its infrastructure. It started out as a monolithic PERL application; the founder wanted to play with this thing called the web, and he spent a three-day weekend building this thing, Ebay. The next generation was a C++ application, at its worst, was 3.4 million lines of code in a single DLL. They were hitting compiler limits on the number of methods per class. Which is 16K. Yeah.
So I'm sure there are lots of people in the room that think they have a monolith; does anybody have a monolith that is worse than that? One person. We can talk afterward, we can cry over beer or tea, whatever your drink of choice is.
The third generation is a re-write in Java, you cannot call that microservices, it is miniapplications, they turned it into 220 different Java applications, one was for the search part, one for the buying part, dot dot dot, 220 applications. And then the current instance of Ebay is fairly characterized as a polyglot set of microservices.
Twitter has gone through a similar evolution; they are on roughly their third generation, they started as a Rails application, nicknamed the monorail, and the second generation was pulling the front-end out into JavaScript, the back-end out into services written in Scala, because Twitter was an early adapter. Now, we can characterize Twitter as a polyglot set of microservices.
Amazon has gone through a similar set of evolutions, not as clean in the generations, but it started out in PERL application that you can see evidence of in product details pages. If you go into the url Obidos, that was the code name of the original Amazon application. That is a city in Brazil on the Amazon, that's why it was named that way.
And the next generation basically took them from 2000 to 2005 to re-write everything in a service-oriented architecture, mostly the services were written in Java and SCala. And as a side-note, the genius of Jeff Bezos, he told Wall Street and everyone, and in 2000 to 2005, they were not doing particularly well as a business. But Jeff Bezos kept the faith and forced/encouraged strongly for everyone in Amazon to re-build in a service-oriented architecture, in retail. And now you can see it in a polyglot set of microservices.
Does this look like a pattern? Nobody that you heard of started with microservices and passed a certain scale, maybe only .1 percent of us is going to get to, and past a certain scale, everybody ends up getting to microservices. There is really something here. This is what I would like to say: if you don't end up regretting your early technology decisions, you probably over-engineered. Why do I say that? It is because, let's imagine there was an Ebay competitor in 1995, or an Amazon competitor in 1995; instead of finding a product market fit, finding a business model and things that people are going to pay for, instead they built a distributed system they are going to need in five years. There is a reason we have not heard of that company. Right? Yeah.
So, again, so think about the level and sort of where you are in your business, where you are in your team size. The solutions for the Amazon and Google and Netflix are not necessarily the solutions for you when you are small start-up. Okay, cool.
Microservices
And now, we will talk about microservices. So I like to define microservices in this way. The micro in microservices is not about the number of lines of code, it is about the scope of the interface. So a microservice is single-purpose; it has a simple, well-defined interface, and it is modular and independent. The critical thing that we will spend the rest of this time talking about and exploring the implications of is that effective microservices, as Amazon found, have isolated persistence. In other words, a microservice should not be sharing data, you know, behind their back with other services. Why is that true?
It is because, in order for a microservice to be able to execute business logic reliably and to be able to guarantee invariance, you cannot have people reading and writing the data behind their back. Does that make sense? Ebay discovered this on the other way. So Ebay did a bunch of effort with some very smart people to build a service layer in 2008; that was not a successful effort because, although the services were extremely well-built and the interfaces were quite good, very orthogonal, they spent a lot of time thinking about it. There was a sea of shared databases underneath them, which were also available to the applications. So nobody had to use the service layer in order to get their jobs done, and they didn't. Does that make sense?
Extracting Microservices
All right, cool, we will move forward. So what I'm going to talk about now is a journey that we are going on right now, actually, at Stitch Fix. So we did not build a monolithic application, so we never built the monorail, but our version of the problem is we built a monolithic database. I will sketch out how we are breaking up our monolithic database and extracting services from them, and then the subsequent parts of the talk will talk about how to deal with the fact that we don't have a monolithic database anymore. There are great things that we would like to retain, and we will talk about those things.
And this is our situation, there is way more than this number of apps, but there are only so many things that are readable on the slide. We have a shared database that includes everything that is interesting about Stitch Fix that matters at all. So clients and the boxes that we send, and the items that we put into the boxes, and metadata about the inventory, like styles and SKUs, information about the warehouses, times 100 different tables. And we have on the order of different services that use the same database for their work. And that is the problem- why is that a problem, that shared database is a coupling point for the teams to be able to be interdependent as opposed to independent, it is a single point of failure and a single performance bottleneck. So here is what we're going to do, we will decouple applications and services from the shared database, and there is a lot of work that is measured in engineer years.
So this is the starting point, and too many boxes and lines. We will imagine that there are only three tables and two applications. Here is how it is going to go. The first thing that we're going to do, we will build a service that represents client information. We will create the service, it is going to be one of the microservices, it has a well-defined interface, we thought about it with the consumers of that service, and we have, you know, we have negotiated it with them. We created the service. The next thing, we point the applications, instead of using the shared database to read from the table, they read from the service. There's a lot of work from moving the lines, that's the hardiest part here. I do not mean to trivialize, but it is hard to generalize in a slide how hard it is to do that. And they will not use it directly but go through the service, we will go from the shared database and put it in a private database that is associated with the microservice. Does that make sense?
There's a lot of hard work, but this is the pattern. And so the next pattern is we do the same thing for the item information, and so we create an item service, we have the applications instead of using the table, we use the service, and then we extract the table into a private database of the service. And then we do the same thing for SKUs or styles. And then we keep rinsing and repeating. This is where we end up, where, whoops, an individual microservice, the boundary of the microservice actually surrounds both that application box and the, you know, the database thing. Does that make sense, what I'm saying? That the boundary is, you know, client service and core client, you know, that is in and of itself the microservices boundary. This is what we end up with, we have the monolithic database, everything is all there and we have divided it up and microservices own their own persistence and we are happy and the birds sing and the unicorns play. But no, there are a lot of things that we like out of the monolithic database, I don't want to give them up. Things included the easy ability to share data between different services and applications, the ability to do joins across different tables in a very easy way, and also transactions. I want to be able to do multiple operations that, across multiple entities, as a single atomic unit. Are these all things you want to have? Yeah. Okay. So I will tell you what you can and cannot have, and what you can't have, I will give you partial solutions too.
Events as First-Class Construct
And before that, I will give you another tool in your toolbox that perhaps you know about but you don't appreciate as much as you should. So that architectural building block is events, and as defined by Wikipedia, it is a change in state, or a statement that something has occurred. In the traditional three-tier system that we cut our teeth building, there's the presentation tier that the users or the clients us, there's the application tier that represents stateless business logic, and then The Persistence tier that is backed by a relational database. We are missing something; as architects we are missing a building block that represents a state change, and that is what I will call an event. And because the events are asynchronous, what happens is that maybe I will produce an event and nobody will be listening to it yet, and maybe only one other consumer within the system will be listening to it, or maybe other consumers are going to be subscribing to my event. Does that make sense? Now we have, I hope, motivated slightly events as a first-class construct in our architecture, we will apply them now to microservices.
Microservices and Events
It is obvious that a microservices interface includes the front door, right? It is obvious that it includes the synchronous request response, you know, HTTP-typically, maybe HTTP JSON, maybe JRPC or something like that, but it clearly includes that. That is obvious. What is less obvious that I hope I can convince you is true it includes all of the events that the service produces, all of the events that the service consumes, and any other way to get data in and out of that service. If you are doing bulk reads out of the service for analytic purposes, or bulk writes into the service, or uploads, that is part of the interface of the service. I will assert that the interface of a service includes any mechanism that gets data into or out of it. Does that make sense? Cool. So now that we have events in our toolbox ish, you will start to use events as a tool in solving those problems of shared data, of joins, and of transactions.
Microservice Techniques: Shared Data
So here is the problem, such as it is, of shared data. In a monolithic database, it is very easy to leverage shared data. We point the applications at this shared table and we are all good. But where does shared data go in a microservices world? Well, we have a couple of different options, but again, I will give you one extra tool or maybe a phrase to use when you are discussing it with people.
So the principle, or that phrase, is single system of record. What do I mean by that? I mean, if there's a piece of data, like a customer, or an item, or a box, that is of interest in your system, there should be one and only one service that is the canonical system of record for that data. There should be one place in the system where that service owns the customer, that owns the box, that owns the item. And every other copy of that thing, because there are going to be lots of representations of customer around, they certainly are in Stitch Fix. Every other copy in the system is a read-only, non-authoritative cache of that system of record.
And I'm going to let that sink in. Read-only, non-authoritative. So don't modify the customer record and expect it to stick around in some other system. If you want to modify it, you need to go back to the system of record, and the system of record is the only place that can tell you the current view, you know, up to the millisecond of what the customer is doing. That's the idea of the system of record; we will use that in a couple of different possibilities that you can approach sharing data.
So the first is the obvious and the most simple one, simply synchronously look it up from that system of record. So in this example, we will imagine that at Stitch Fix, we have a fulfillment service, where we are actually going to ship a thing to the customer at their physical address, and there's a customer service that owns the customer data, one piece of data which is the customer's address. So one totally legitimate solution is the fulfillment center calls the customer in real time and looks it up; there is nothing wrong with this approach, this is a perfectly legit way to do it. Sometimes it is not right, maybe you do not want everything to be coupled on the customer service, or the fulfillment service or the equivalent of it is pounding the customer service so often that you don't want it to happen for performance reasons.
So there is another solution, and that solution involves the combination of an asynchronous event with a local cache. So still, the customer service is going to own that representation of the customer, so the customer service, when the customer data changes, is going to send, we will imagine an address updated event, when they change the address, the customer service will listen to that event and the fulfillment center will send the box on its merry way.
And the caching fulfillment has other nice properties. So imagine that the customer service does not retain a history of address changes; you can totally remember that in the the fulfillment service. Does that make sense? So this happens at scale, people order a box and move between the time that they started the order and the time that we shipped it out. We want to make sure that we send it to the right place. Does that make sense? Okay, cool. So that is shared data.
Microservice Techniques: Joins
And the next thing is joins. So, the problems such as it is, the problem statement here is, in a monolithic database, it is really easy to join tables together. How do I do that? I simply add another table to the front clause in MySQL statement and I'm all good. So select about from A, then we will select A from B to a join condition, and now I have this join. And this works great when everything is all in one, big, monolithic database. This does not work in a SQL statement if A and B are separate services, does that make sense? Okay, how do we do it? And the other thing is, once we split the data across microservices, or joins, or conceptually the joins are a lot harder to do.
So we always have architecture choices; there is more than one way to do it. So the first is join in the client, so whatever place is interested in, you know, the two -- the As and the Bs, just have it do the join. So in this particular example, let's imagine that we are producing an order history page, so the customer comes to our site, they would like to see the history of the boxes we sent them at Stitch Fix, and we might be able to do that in this way. So we might have the order history page go and call the customer service and get the current version of the customer's information, maybe her name and her address and how many, you know, how many things we have sent her, and then we go to the order service and we get the detail about all of her orders. We get a single customer from the customer service and we will query for the matching orders for that customer on the order service. Does this make sense? This is basically every web page in life, right? And this is every web page that does not get all of it's data from one service. And, again, this is a totally legitimate solution to this problem; we use it all the time at Stitch Fix, and I'm sure you do all over the place in your applications as well.
And so, here is the approach, too. So let's imagine that doesn't work, again, for reasons of performance, or reasons of reliability, something like that. Sometimes this doesn't work; maybe I'm querying the order service too much or something like that.
Here is approach number two; let's imagine that we create a service that does, motI what I like to call in database terminology, materializing the view. Does that term resonate with you? If not, I will explain what I mean. Here is how it works and I will explain the implications of it. It is a slightly different use case here, it is more evocative of this particular problem. So imagine we are trying to produce an item feedback service. At StitchFix, we send boxes out, sometimes people keep the things that we send, sometimes they will return them, we want to know why and we want to remember which, you know, which things are returned and which are kept. So let's imagine that we want to remember that in an item feedback service. So for this particular shirt, let's -- I'm making this up. Maybe we have 1,000 or 10,000 units of this shirt, and we want to remember all of the instances when this shirt was sent out, what people's feedback was about that shirt, and then obviously, you know, times the tens of thousands, or pieces of inventory that we might have. Cool.
So the way that this works is we are going to have an item service, which is going to represent the metadata about this shirt, and this item feedback service is going to listen to events from the item service, here are items, here are items that are gone, and changes to the metadata if that is interesting. It will also listen to events from the order service. So every time that we get some feedback from an order, we send five items in a box, that should generate an event or five, and if this item feedback service is listening to those events and then materializing the join, in other words, remembering for every item all the feedback that we got, and then we are remembering that in one sort of cached place. So a fancier way to say that is it maintains a denormalized join of items together in its own local storage. Does that make sense? And if it does not make sense, that is also cool. There are many common systems that do this all the time, and you don't even think that they are doing it. For example, any sort of enterprise-grade, ie you pay for it, system has a concept of it in a materialized view. Oracle and SQL server has it, a bunch of class databases has materializing a view.
Most non-SQL spaces do not work that way; if you use the dynamo-inspired stores, like dynamo DB from Amazon, Cassandra, react, or Voldemort from a no-SQL tradition, they are forcing you to do it up front. Relational databases are optimized from easy writes; you write to individual records, to individual tables, and on the read side you put it all together. Does that make sense? Most NoSQL systems are the other way. So the tables that you store are already the queries you wanted to ask. You wanted to know the queries that you wanted to ask and you have tables or column families or whatever you want to call them. And at write time, instead of writing a sub table, you are writing five times to all of the stored queries that you want to read from. So every NoSQL system is forcing you up front, along this tradition, is forcing you up front to do this materialized join.
And every search engine that you use, if you use Elasticsearch in your infrastructure, almost certainly you are doing some form of joining one particular entity with another particular entity. And then also, every analytical system on the planning is joining together lots of different pieces of data, because that is what analytical systems are about. Does this technique sound a little bit more familiar, yeah? Cool. All right, great. So that is joins. And now the hard stuff.
Microservice Techniques: Workflows and Sagas
And so, the wonderful thing about relational databases is you have this concept of a transaction. So what is a transaction? The properties that are common in relational databases transactions are asset properties, it is atomic, consistent, isolated, and durable. You can do that in a monolithic database, that's a wonderful thing about having THE database in your system. So it is really easy to do transaction to cross-multiple entities; you write the SQL conceptually or otherwise, you begin the transaction, you commit and that all happens or it doesn't happen at all. Does that make sense? Cool. And splitting data across services makes transactions very hard, I will replace hard with impossible. How do you know it is impossible? Even the techniques known in the database community for doing distributed transactions, like two-faced commit, nobody does them in practice. A good indicator of that is, raise your hand if you know a cloud service that implements that. Nope. No cloud service on the planet implements a distributed transaction. Why? It is a scalability killer. Yeah, great stuff. Cool.
So, you can't have a transaction, so sorry, but here is what you can have. So that transaction that you want it to do, as a unit, if you want it to update A-C, all together, or not at all, you will turn it into a saga. I will introduce this concept, and Chris Richardson will give you more detail about this. I encourage you to come back, but I'm talking in another track, so you can make your own choices. Chris is awesome. Cool.
And so, here is how you do it. You model the transaction as a state machine of individual atomic events. What do I mean? This is a picture that helps to understand it. So I'm going to re-implement that idea of updating A, updating B, updating C, as a workflow. So I update the A side, it produces an event that is consumed by the B service, that does its thing and it produces an event that is consumed by the C service. At the end of all of this, at the end of the state machine, we are in a terminal state where A, B, and C are all updated. Let's imagine something goes wrong.
We roll back by applying compensating operations in reverse, we do it in the reverse order. We undo the things we were doing in C, which produces one or several events, and then we undo the set of things that we did in the B service, which produces one or several events and then we undo the things that we did in A. Does it conceptually make sense? Yeah. So I'm -- you know, I'm leveling it up here. And so, this is the core concept. There's a lot of great detail behind it, some of which I will explore, and most of which Chris will explore in the last session.
So, again, as with the materializing the view, there are tons of systems that you use every day that work exactly like what we talked about. Who uses a payment processing system in your life? Yeah, everybody. And so if you wanted to pay me with a credit card, what I would like to happen, in one unit of work, the money gets sucked out of your account and magically ends up in my wallet. Is that what happens? No, it super does not! There are tons of things that involve payment processors and talking to the different banks and all of this financial magic, behind the scenes.
So, what you would think of, what you would -- I mean, the example of when we use transactions is debiting something from Justin's account and adding it into Randy's account. No financial system on the planet actually works like that. Every financial system, instead, implements it as a workflow. So first, you know, it gets taken out of Justin's account, it gives the bank for several days, why do you think banks have a lot of money? That's why, it is the float, baby. So it lives in the bank for, you know, way more time than I would like, and then ultimately it ends up in my pocket. Cool.
Expense approvals, who, when you get back from this conference, has to get expenses approved? Probably everybody. Does that happen immediately? No, it super does not! You submit your expenses and, like, okay, I went to this jazz club, please, let me declare the jazz club. That was really a work expense because, like, I was talking to Randy there. I hope that works, but I don't think. And then it goes to probably your manager, and she approves it, and it goes to her boss, and she approves it, dot dot dot, all the way up. And then it does a payment processing workflow, where ultimately the money goes into your account or however you do it.
So it is a wonderful thing, if it could happen, you would want it as a single unit and it happens as this workflow. Any multi-step workflow is like this. So who writes code for a living? Who, keep your hand up if, when you hit return on your IDE, that it is deployed to production immediately. One person. I hope that you run some CI and some tests. So I can see you laughing. So nobody, right? Again, that is not an atomic transaction, nor should it be. A continuous delivery pipeline, when I say commit, it does do a bunch of stuff, the end result of which is hopefully deployed to production. That's what the high-performing organizations are doing. But it does not happen atomically. It is a state machine, this step happens, this happens, this happens, and if something goes wrong along the way, we back it up. Is this sounding familiar? There is stuff you use every day that behaves like this, which means there is nothing wrong with using this technique in the services you build.
So a bit to wrap up, and because I want to give you opportunity to ask me questions, we have explored how to use events Aztec Niks, you -- as techniques as tools in the architecture toolbox, we shared on using different components in the system, we have figured out how to use events to implement joins, and we have figured out how to use events to help us do transactions.
So I want to thank you very much for this part. I have to point out self-servingly, if any of this stuff sounds interesting, I would encourage you to come and talk to me and live for us. We are hiring, I live and work here in San Francisco, our home base, but we have a majority of engineers that are remote all around the United States, we have people in Hawaii, the Midwest, Portland, Seattle, Texas, if there's a state with software developers, we probably have a few employees there. Great, please contact me about that. And I will ask you for questions.
Can I ask for a volunteer to run the mic around? I could do that, but I would rather be answering instead of running.
But I will start with the question.
There you go, that's the way to do it.
So you mentioned about events. What are some event-based fidelity to implement this model; like if it is lost, so if you have taken the complexity out of synchronous systems, you put them into the event layer, but you have not spoken on the challenges on that. Can you throw some more light on that, please.
Yes, people that are working for me are starting to laugh, I keep threatening to do the Master Class, which is a half day or multiple days of how to do that stuff with events. I will explore it in maybe 30 seconds for you now. So you point out, you raise the possibility of -- you raise delivery issues. So I might not get the event, or I might get it twice. Yeah. So there are two ways to build event-based systems; it is at least once delivery, or at most once delivery. So, you don't want at most once, if you care about the event. So at most once delivery is, like, it is going to be delivered zero times or one time, that is typically for logging systems. If it doesn't get delivered, on failure, you don't want to not deliver the event. So the majority of systems, and all event use cases I'm talking about here should be implemented in an at least once delivery system, and all messaging systems, you know, have that sort of configuration, or that fault.
And -- great. Now it is delivered at least once, on failure I will try to retry and retry and retry, so the failure mode or situation there is, I might receive the same event multiple times. Cool, how do you deal with that? Item potence. That's the mathematical property that if I apply a function multiple times, it has the same result as if I applied it one time. Item is Latin for equal, so if I do F of X, it is the same as F of F of X. And please have empathy for person do the transcript there. F, paren, I will not even look, I'm sure it is awesome. She has been doing great today. Cool.
So item potence. So that has strong implications. I love it. You get a hug later. All right, cool. So item potence. So that is -- so that is a property of what the consumer of the event does. Yeah? And so, the -- the consumer has to conceptualize what the consumer will do when, not if, when the consumer receives the event multiple times. The simplest case is when the things that you, you are just counting things, and that is a hard case. The simplest thing is when you are maybe writing it and you just write the same thing multiple times, like, that is cool, that is really easy to do.
The hard case, confusingly, is actually counters. And so f you are doing a counter in a distributed situation where you get multiple things, multiple deliveries of the same thing, you actually have to remember which events you have seen and only count the ones you have not yet seen. Does that conceptually make sense? So I think I have to stop here. There's a master class which I hope some day to deliver about this stuff. I'm not the expert in the world, but I -- I have a lot of gray hair/no hair. A lot of it is about event systems. So I will suggest that you look up CRDTs, conflict-free replicated data types, and they are all about solve, giving you data types that which you can actually compose together that allow you to handle these item potent situations. I have to leave it there. So it is not -- like, that is THE question, right? And then there's a separate thing about, how to deal with ordering of events, and that is a whole other thing. We will leave it for later. Next question.
Can you talk about the persistence of things that materialize a view? You don't want to have another canonical representation, what is the persistence of that cache?
That's a great question. And so, the answer is it depends. And it depends on- so, I will start with -- there's no reason why that persistence of the materialized view couldn't be another relational database. And now, it is not the canonical system of record. But that is a concept, that is a system-level concept, like, over here, the customer service, like, that owns customer data, the fulfillment service does not own customer data. Does that make sense? That's a metastatement that you make about the semantics of the service. So maybe you could or should, maybe you cannot ask the fulfillment service for customer information, but if you could, the fulfillment would say, I'm not the current place to ask about this, that's the other person. And in terms of the persistence mechanism, it could be in cache or in memory if you need it really fast, or stored on disk in a relational database form, or S3, or whatever.
And it is a great question, and also, it is a bit of an orthogonal question, does that make sense, because -- actually, watch Roopa's talk from the beginning of the track; she talks about five different use cases in Netflix, and architecture 201, thinking about the appropriate data storage for a particular business problem. Thank you. You can shout it out -- you have the mic, sorry.
With Kafka-distributed data logs, does it have joins?
Yes, I spoke at the Kafka summit and gave a lot of this talk at the keynote. I'm a big fan of what Jay and Neiha and that whole team has put together. I think Kafka, as a distributed log, is a really great thing. We are now only starting to use it at Stitch Fix, and there are still open questions for us if it meets our needs or whatever. But it is a great place to look. And it has, it has the wonderful properties, I mean, the people who don't know, Kafka is designed from the beginning to be a replayable log, right?
So the idea that I might have missed events along the way, I can re-set, I'm using the wrong terms. But I will re-set the pointer, where I am in the event stream, and go back and replay the events over for myself. That is hugely valuable, both for a service that existed for long time, consuming the events, and also when I want to spin up a new service that has never seen the stuff before. It is a great way to start the pointer at zero and replay the events into the new service. Does that make sense?
I do know, I have not experienced this in any direct way, I know that Confluent has some stuff around, like implementing joins and expressing those joins in SQL, they are smart, that's about all I can say. I have not used them, I cannot recommend them. But I don't anti-recommend them, if that makes sense. Yeah. Trying to be honest. Yes, next person, please.
Hi, you mentioned that Stitch Fix has a remarkable number of data scientists and it seems that remarkable that you never ended up with a monolith. But you do have the monolithic database; did that have a relationship with the process of how you ended up with this database?
No, that's a good question. No, we did that all ourselves. They are entirely blameness and wonderful. Yeah, I mean what -- if what you are getting at is did it serve the needs of the data scientists to have the data in one place? No, they actually have a really wonderful ETL framework that has been built by their platform team, in some sense, they don't care how many databases they have, they will put in ETL to do modeling off of. They are great on doing their own modeling and it is totally on us that we built the monnolithic database.
I have a question. I am thinking about the distributed transactions, when you do the two-face commit and multiple systems, and then you do the commit. In your case, if you are doing -- if I can call it a commit, right. And so, what happens when you are going through that workflow and you have, let's say, there are four or five different microservices involved, you have already altered the state of some of the objects and you are getting some read, and now the object status has changed and now you have to roll back.
Yes, that's a wonderful question. And so remember the acid properties: atomic consistent isolated durable. You do not get the A or the I. It is not atomic, it is not all at once or not at all. You can observe dirty reads and writes. I'm back to database terminology, because I used to work at Oracle, yeah. So, yes, you can -- you can observe the interim states in the workflow, but that's a legit thing in the payment process -- processing, that is a legit thing in the expense workflow. There are different ways of representing that to people. This is in the pending state, or the not-yet-completed state. So you see directly that you do though get the ACID or the isolated, but that is okay. I tell you that's okay. That's all you can get.
And also, like to the isolation, who is familiar with isolation levels in databases? Okay, if you are familiar with it, who runs in serialized? Nobody. There you go. If you don't know what that means, talk to the people that raised their hands afterward, or talk to me. Most people run in an isolation leve; if it is tunable and you know what I'm talking about, most people run -- most people run production systems not in a serializable, linearizable mode, they actually run in a lesser isolation mode, which actually means even in a relational database when doing these transactions, it is possible to observe these interim states. I can still do a bit more, maybe one more question. Yeah, I don't know where the mic is.
Hi. How do you deal with foreign key constraints, or is that dealt with sagas?
Well, you do not have referential integrity in the moment, because you don't have the atomic stuff. You have to, in order to do referential integrity across services, you do that at the application layer. Conceptually, it is something like, I am about to write a thing that depends on another thing, maybe check that that thing exists, or maybe deal with, you know, get comfortable with the fact that it might have gone away, that sort of thing. Yeah. That's how it goes.
I do not understand the counter. I think we are all done. But, yeah, so if you have more questions for me, or anybody else in the panel, please come to the next time slot here in this room, I believe. Because we're going to do a panel discussion of all of the speakers, and you can ask us anything you want. Thanks.
Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.
See more presentations with transcripts