Transcript
Nene: Let me start by quickly introducing ourselves. My name is Amit Nene, here with me is Eric Chen, and we work on Uber's Michelangelo machine learning platform. I've been working on the platform for a little less than a year; Eric has been a veteran. We're going to talk today about Michelangelo Palette, which is a system within the greater Michelangelo platform that enables feature engineering for various use cases at Uber. Feature engineering is the use of domain knowledge to prepare features, or data attributes, for use in machine learning.
Michelangelo at Uber
Before that, I want to give a small overview of the Michelangelo platform and what it does. Michelangelo is a proprietary platform that was built at Uber, and our mission is to enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale. Essentially, you can think of it as ML as a service within Uber, and we give you all the building blocks for creating your end-to-end machine learning workflows. The first part is about managing your data, mostly features; we give you the tools and the APIs to manage those. The second part is creating your workflows: APIs to create end-to-end workflows for your supervised, unsupervised, and deep learning models. We support a variety of different tools, because machine learning at Uber uses a wide variety of libraries and tools across teams, so we want to make sure our workflows are built in a heterogeneous way.
Having trained your models, we give you the ability to deploy them in production in various environments: a batch environment for offline predictions, real time online predictions, and mobile serving environments. More recently, we've been working on tools to detect drift in your features or your data, and also in your models and model performance, with both offline and online tools.
That was a quick overview of what Michelangelo is at Uber. It's not an open source platform, but the purpose of this talk is really to share ideas around how we approach feature engineering, which I'll get to in a bit, and also to give you an overview of the architecture of the subsystems involved.
Feature Engineering
Let's get started with the feature engineering part, with a real world example inspired by Uber Eats. Imagine you open the Uber Eats app, search for your favorite restaurant, find the dishes, and place the order. Let's say you're not satisfied with the ETAs that are reported, and you want to build a model to give you more accurate ETAs. The first thing that comes to mind is that you need additional data features. You start thinking of what you can make use of, and a few things come to mind: how large is the order, the number of dishes you ordered, how busy is the restaurant at that moment in time, how quick is the restaurant generally speaking, and what are the traffic conditions at that point in time?
Michelangelo Palette is essentially a platform that lets you easily create these features for your production models. It turns out that managing features, in our experience, is one of the biggest bottlenecks in productizing your ML models. Why is that? The first problem is starting with good features: there's abundant data everywhere, and data scientists or engineers just wish they had features ready to go so that they can use them for their models, so they struggle with that. Even if they end up finding or creating features for their experimental models, it is very difficult to make that data work in production and at scale, and also in real time environments where low latencies are expected.
The third problem, very relevant to ML, is that a lot of training is done in an offline way, while serving often happens, at least at Uber, in real time environments. It is absolutely important to make sure that the data that you're using in real time, at serving time, matches the data that was used at training; a previous presentation also hinted at this. A lot of times this is done in an ad hoc way, which can result in the so-called training/serving skew, which is extremely hard to debug in our experience and should be avoided.
The fourth area, which makes feature engineering generally difficult, and we're seeing more and more of this, is a shift towards real time features. Features are increasingly based on the latest state of the world, and traditional tools simply don't work. Data scientists and engineers are used to working with tables and SQL, and this world looks very different: there are Kafka streams, there are microservices.
Palette Feature Store
This is where Palette comes in. The first important component of Palette is the feature store. Our goal there is to be a centralized database in the company, to make your curated as well as crowd sourced features - by crowd sourced, I mean features contributed by various teams across Uber - available in one single place. You can think of it as a one stop shop where you can search for features - in our case, restaurant features, trip features, rider features - in one single place. When you don't find those features, we provide you the tools to create new, production-ready features. I'm going to talk a little bit about that in a second.
The other important goal is for teams to share features. Often we find various teams are building the same features again and again, like the number of trips a driver took daily; you would find 10 different versions of this feature. Not only is this an extremely redundant waste of resources, but it is also important to share the underlying data so that models are really acting on the same data consistently and producing better results.
And the fourth goal is to enable an ecosystem of tools around the feature store and the features. For instance, we're working on data drift detection, in real time and also offline, and we're also experimenting with automatic feature selection, so that not only will we help you search for features, but we'll also let you point your labels at our feature store and help you pick the features that are related to, or have an impact on, those labels. This is work in progress, but it is our eventual goal.
Feature Store Organization
A little bit about how the feature store is organized. At the highest level, it's organized around entities, so we're talking about the riders, the drivers, the restaurants, and so forth. The second level is feature groups; these are logical groups of features that are associated with each other, typically features or data originating from the same pipeline or job. The third is the feature name, the data attribute name, and the fourth is a thing called a join key. That's meta information in the feature store that tells the feature store how to look up that feature; if you're talking about a restaurant, the restaurant ID is an example of a join key. There's a Palette expression here, which is how you would uniquely identify every feature in the Palette feature store: the entity and feature group (restaurant, real time), the feature name (orders placed in the last 30 minutes), and the join key (restaurant ID).
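For illustration only, here is a minimal sketch of that four-part feature reference. The type and field names are assumptions made for this example; only the structure (entity, feature group, feature name, join key) comes from the talk.

```scala
// Hypothetical representation of a Palette feature reference.
case class PaletteFeatureRef(
  entity: String,       // e.g. "restaurant"
  featureGroup: String, // e.g. "realtime"
  featureName: String,  // e.g. "orders_placed_last_30_mins"
  joinKey: String       // e.g. "restaurant_id" -- tells the store how to look the feature up
)

val nMealRef =
  PaletteFeatureRef("restaurant", "realtime", "orders_placed_last_30_mins", "restaurant_id")
```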
A little bit about how it's implemented behind the scenes. It's supported by a dual data store system; you can think of it as a lambda architecture, but not quite. The offline store is based on the data warehouse used at Uber, which is Hive. It essentially saves daily snapshots of features, and it's primarily designed for consumption by training jobs. Training can consume features in bulk at any given point in time, so the concept of time is really important here. It's designed for bulk retrieval: you can access any number of features, and the system joins these features for you and stitches them into a common feature vector.
The offline store is paired with an online store. The purpose of the online store is to serve those same features in real time, in a low latency way; today we use a KV store there, Cassandra in particular. Unlike the offline store, which stores historic data, the goal of the online store is to serve the latest known values of these features. Behind the scenes, we carry out the data synchronization between the two stores, so if you have new features that you ingested into the offline store, they will automatically get copied to the online store. If you ingest real time features into the online store, they will get ETLed back to the offline store. That is fundamentally how we achieve the consistency of data online and offline - it addresses the training/serving skew I talked about earlier.
Moving on a little bit, going back to the example I started with, we want to build this model. It's a simple model, and we want four features. How large is the order? Is that something we can build in the feature store? It doesn't seem like it, because it's very specific to the context of what the user ordered, so it's probably an input feature - in our world, it's called a basis feature. It comes in during the request or it's present in the training table. How busy is the restaurant? That's something we should be able to look for in the feature store. How quick is the restaurant? Same thing here. How busy is the traffic? Same thing. These seem like generally good features applicable to a wide variety of models.
Creating Batch Features
Let's see how we can go about looking for these. You go to the feature store and you find they're not there, so now the feature store offers you options to create these features. Let's take a look at the options we have, starting with batch features. These are the most popularly used, just because they're easy to use: there's a warehouse, and there's a SQL query you can express the ETL with. The general idea is that you write a Hive query, or a custom Spark job, you put a little bit of meta information in the Palette feature store, and we automatically productionize the pipelines for you.
Let's walk through this example. There was a feature, "how quick is the restaurant?" This feature doesn't strictly need to be real time in nature, so what we did was we found some warehouse tables that had this information, and we decided to run a query to simply aggregate some of it. We went into the feature spec and wrote a small snippet of Hive query, and the Palette feature store, using Uber's pipeline workflow infrastructure known as Piper, automatically created the Piper job for you. Not only did it do that, it also automatically created the alerts and monitoring, so that if there's any data outage, the on-call is alerted - this is all built in, out of the box.
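As a hedged sketch of what such a batch feature spec might look like: the spec format, field names, and warehouse table are assumptions for illustration; the talk only establishes the idea of a Hive query snippet plus metadata from which Palette builds the Piper pipeline with monitoring.

```scala
// Hypothetical batch feature spec: a Hive query snippet plus metadata.
val prepTimeSpec =
  """
  entity: restaurant
  feature_group: batch_stats
  feature: avg_prep_time_1w
  join_key: restaurant_id
  schedule: daily
  query: |
    SELECT restaurant_id,
           AVG(prep_time_seconds) AS avg_prep_time_1w
    FROM eats.completed_orders                    -- hypothetical warehouse table
    WHERE datestr >= date_sub(current_date, 7)
    GROUP BY restaurant_id
  """
```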
The way it works is that the job produced from the feature spec starts ingesting the data, using the Palette APIs, into the offline feature store. From there, it is automatically copied to the online store, and then the whole feature store system is available for both training and scoring. Similarly, we have support for real time, or rather near real time, features. Here, we leverage Uber's near real time computation infrastructure; Flink, in particular, is a first class citizen. The way it works is that you can write Flink SQL that works on a real time stream, performs some transformations, and produces those features into the feature store. Going back to our example, how busy is the restaurant? It turns out that there is a Kafka topic which is publishing some of this data, so it might be interesting if I were to build a feature out of this. It turns out, however, that it's a one-minute aggregation and I need to turn it into a five-minute aggregate.
I use Flink, I perform some aggregation via Flink in real time using streaming SQL, and then I turn it into a feature. In this particular case, I'm calling the feature nMeal; it's a restaurant feature, and it can be looked up by restaurant ID. The way it works is very similar to the batch case, where I go into the feature spec. Note that I don't have to write a custom Flink job, although that's supported for more complex feature engineering; I can do this with simple Flink SQL, and Uber's Flink system offers a service model where you can create a job on demand. We use the Palette system to automatically produce the production pipeline, compute these features, and ingest them into the online store - so unlike the batch case, you first ingest into the online store. Symmetrically with the batch case, we automatically ETL this data behind the scenes to the offline store, so that it's available for training.
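A rough sketch of the kind of streaming SQL aggregation described, rolling one-minute counts from a Kafka-backed stream up into a five-minute window. The stream name, column names, and surrounding registration are hypothetical.

```scala
// Hypothetical Flink SQL for the nMeal feature, embedded as it might appear in a spec.
val nMealStreamingSql =
  """
  SELECT
    restaurant_id,
    SUM(order_count) AS nMeal,                            -- five-minute order count
    TUMBLE_END(event_time, INTERVAL '5' MINUTE) AS window_end
  FROM restaurant_orders_1min                             -- hypothetical Kafka-backed stream
  GROUP BY restaurant_id, TUMBLE(event_time, INTERVAL '5' MINUTE)
  """
```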
Whenever possible, we also support backfills, which means that if you just onboarded a real time feature, you can make it available offline for training; that is a capability we leverage from the underlying Flink support. For training and serving, the flows are similar: you specify the features that you want, they all get pulled in from the appropriate stores, and at training and serving time they get merged together and the feature vector is shipped for training or serving.
Bringing Your Own Features
Moving on a little bit, what if your features are so odd that you cannot use any of the Palette feature store tools? This is what we call bring your own features, or external features. The expectation is that the customer is managing their own online and offline data here, so that is the caveat. This is like using the Palette feature store in an unmanaged way, meaning that the features are still registered in the feature store, and the feature store still knows how to read these features, but the lookups are routed to external end points.
Going back to our example, how busy is the region? There was no warehouse table and no Kafka topic that had this information, but there was one microservice that talked to a system outside of Uber to get this information - here we're specifically talking about the traffic conditions - and maybe I can make use of that. What I decided to do is I went into the feature spec and put in an entry for a service endpoint which I know has the traffic information. With a small DSL that we give you, you can extract the features from the RPC result, and the features are then made available for serving.
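For illustration, a minimal sketch of the "bring your own features" registration described: an entry that points at an external endpoint plus an extraction expression over the RPC result. The type, service name, and expression syntax are assumptions, not Palette's actual API.

```scala
// Hypothetical registration of an externally-owned feature.
case class ExternalFeatureSpec(
  entity: String,
  featureName: String,
  joinKey: String,
  endpoint: String,     // RPC endpoint that owns the data
  extractExpr: String   // small DSL expression applied to the RPC result
)

val regionBusySpec = ExternalFeatureSpec(
  entity      = "region",
  featureName = "traffic_busy_score",
  joinKey     = "region_id",
  endpoint    = "traffic-conditions-service/getConditions", // hypothetical service
  extractExpr = "response.congestion.score"                  // hypothetical expression
)
```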
Remember that managing this data in this mode is my responsibility, which means that I decided to log the RPC results and ETL them into a custom store, so that the data is available for offline training. I ran it for a bit, a couple of weeks, until I had enough training data.
Palette Feature Joins
I mentioned a few times how the features are joined. At a very high level, the algorithm is simply that you start with your input features - recall from the example, there was an order, with the number of dishes, for example - and the various keys were supplied as part of the basis features in the training table. The algorithm takes this, joins it against all the Palette features, the three different features that we talked about, and produces the final table, which is used for your training. Scoring is very similar, but unlike the table format here, you're working on conceptually fewer rows at a time; logically, Palette does the same exact operation. This is a critical function of the Palette store, and we've invested a lot in getting it right from a scalability standpoint. Essentially, you want to take feature rows and join them not only on a given key, but also at a given point in time.
If there was a feature from a month ago, you want exactly the state from that point in time, so it's very important to get this right. Scalability is extremely important - you end up joining billions and billions of rows during training - and we've put a lot of engineering effort into optimizing this process. Serving is similar: features are scattered across multiple feature stores, and there's a lot of parallelism and caching that goes into ensuring that features can be retrieved in single digit P99 latencies, for example.
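A minimal Spark sketch of the point-in-time join idea described here: each training row picks up the latest feature snapshot at or before the row's own timestamp, so training sees the feature value as it was then. Table and column names are hypothetical, and the real implementation is certainly more optimized than this.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.appName("pit-join-sketch").getOrCreate()

val basis    = spark.table("training_basis")        // order_id, restaurant_id, order_ts, label, ...
val snapshot = spark.table("palette_offline.prep")  // restaurant_id, snapshot_ts, avg_prep_time

// For each training row, keep only the most recent snapshot known at order time.
val w = Window.partitionBy("order_id").orderBy(col("snapshot_ts").desc)

val joined = basis
  .join(snapshot, Seq("restaurant_id"))
  .where(col("snapshot_ts") <= col("order_ts"))     // only feature states known at that point in time
  .withColumn("rank", row_number().over(w))
  .where(col("rank") === 1)                         // latest snapshot at or before the order
  .drop("rank")
```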
That was a quick example of how we put together these features: the order size was an input feature, nMeal came from a near real time feature, the prep time feature came from batch, and for busy we did bring your own features. But the question at this point is, are you done with the end-to-end feature engineering? It turns out that the answer is no, because there are still customizations you need to do before those features can be consumed in the machine learning models. There is model specific feature engineering, and there can be chains of feature engineering that you may apply; for example, you may look up one feature from Palette and use that to look up another feature. So we've built a framework called Feature Transformers, which lets you do that last part of feature engineering. I'm going to hand it over to Eric [Chen], who's going to talk all about that.
Feature Consumption
Chen: As Amit [Nene] just mentioned, the ways to consume those features are actually pretty arbitrary, so let's really look into the problem and see what we have. We have four features here. One is the number of orders; as we know, it's coming from your input, so we call that a basis feature - not too many troubles over there. The other three are all very troublesome. The number of meals, nMeal, represents how busy this restaurant is, so let's look into the whole query and feature transformation chain for it. For the input, we have the restaurant ID, so it's not too hard: you can make a one-level query and get your nMeal based on our syntax. For prep time, usually for a new restaurant there's no prep time, so for this particular case you actually need two steps. In the first step, you can still go to the feature store and get the prep time, but it could actually be null. You need to introduce some feature imputation; one possible way is to take all the restaurants in the same training set and use the average as the prep time for this restaurant. So you need some feature imputation after you grab the feature from the store.
The busyness of the region is actually even more complicated, because that is not a feature associated with the restaurant; instead, it's associated with the region. If you look into this, based on the restaurant ID we can find the location of the restaurant, which is a lat/long. Then we need to do a conversion into the region - not too complex, you can imagine this is a geohash. Based on the lat/long, I can get a geohash, and the geohash is the entity. Based on this entity, I need to do another round of querying to figure out how busy this particular region is. So even though we're only trying to transform three features, these features are taking somewhat arbitrary paths to make the final feature available to the model. How do we make these things happen?
Michelangelo Transformers
In order to answer this question, let's take one step back and try to understand what a model is. This concept is not really new, and it was not invented by Michelangelo; we actually built on top of the Spark ML framework. It has this thing called the Transformer/Estimator pattern. A transformer is one stage in your whole pipeline; it takes your input and modifies, adds, or removes some fields in your record, so it's doing some manipulation of your data.
What really is a model? A model is basically just a sequence of those transformations. Those transformations could fetch some feature from the Palette feature store, do some feature manipulation, or do model inferencing. What Michelangelo extends over standard Spark is to make them servable in both online and offline settings. In the offline case, it's exactly the same - it is the Spark transformer - but in the online case they behave differently, and we introduce some extensions to make that happen. The estimator is also part of this framework - usually it's called the Estimator/Transformer pattern. The estimator is used during training time: during the fit, it will generate a transformer, and you end up with a sequence of transformers, which is what's used at serving time.
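A small example of the standard Spark ML Estimator/Transformer pattern the talk builds on: an Estimator (StringIndexer) is fit on training data to produce a Transformer, and the fitted pipeline is a sequence of Transformers reused at serving time. Column names here are hypothetical.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

val cityIndexer = new StringIndexer()             // Estimator: fit() learns the index mapping
  .setInputCol("city").setOutputCol("city_idx")

val assembler = new VectorAssembler()             // Transformer: pure record manipulation
  .setInputCols(Array("city_idx", "n_dishes")).setOutputCol("features")

val lr = new LinearRegression().setLabelCol("eta_seconds")

val pipeline = new Pipeline().setStages(Array(cityIndexer, assembler, lr))
// val model  = pipeline.fit(trainingDf)          // model = sequence of fitted Transformers
// val scored = model.transform(servingDf)
```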
Let's see what is really there. For this particular talk, we're trying to focus more on feature engineering. Just on the feature engineering and consumption side, there are already several types of transformers. Feature extraction - that's the Palette feature store we talked about. There's also feature manipulation; for example, I have the lat/long and I need to translate it into the geohash - that is a UDF function, something I define myself. Feature imputation you can also think of as feature manipulation; if you write it as a line of Scala, it's more like saying, "If null, use this value, otherwise use the other one," so it's a code snippet in some sense.
The other two things are more centered on the modeling side, and I'm not going to go too deep into them. If you have a categorical feature, you need to convert it into numerical values; if you have linear regression, you may want to turn on a one-hot encoder; if you have binary classification, you may want to define your decision threshold. All these things can be seen as transformers as well.
Here is a real example where we extend the standard Spark transformers. On the estimator side, it's actually exactly the same, nothing really special; the only special thing happens on the transformer side. On the transformer side we use standard Spark ML model traits like MLReadable and MLWritable; the only difference is here: it is an MA Transformer, where MA Transformer stands for Michelangelo Transformer. What it does is introduce one function called scoreInstance, which is used in the online case. The input is nothing related to a DataFrame at all; it's purely just a map, and the output is also a map.
Then the question is, because we need to serve the same thing in the online and offline systems, how do we guarantee the consistency? Usually, the way to write the transform function is to define the logic as a UDF and register that UDF in the map function here. That's how you can guarantee online and offline consistency: because under the hood, they're using the same lines of code all the time, and then we might have [inaudible 00:26:20] on top of it to make it happen.
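A hedged sketch of that idea: the same scoring logic is written once as a plain Scala function, exposed as a Spark UDF for the offline DataFrame path and as a map-in/map-out scoreInstance for online serving. The trait name, signatures, and the placeholder geohash logic are illustrative, not Michelangelo's actual API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

trait MATransformerLike {
  // Online path: one record in, one record out, no DataFrame involved.
  def scoreInstance(instance: Map[String, Any]): Map[String, Any]
}

class GeohashStage extends MATransformerLike {
  // Single source of truth shared by both paths.
  private def toRegionId(lat: Double, lng: Double): String =
    f"$lat%.2f:$lng%.2f"  // placeholder for a real geohash call

  private val regionUdf = udf((lat: Double, lng: Double) => toRegionId(lat, lng))

  // Offline path: in the real system this would be the Spark Transformer's transform().
  def transform(df: DataFrame): DataFrame =
    df.withColumn("region_id", regionUdf(df("lat"), df("lng")))

  // Online path: same underlying function, applied to a single instance.
  override def scoreInstance(instance: Map[String, Any]): Map[String, Any] =
    instance + ("region_id" -> toRegionId(
      instance("lat").asInstanceOf[Double],
      instance("lng").asInstanceOf[Double]))
}
```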
Palette Retrieval as a Transformer
Let's get deeper into two examples we want to understand. We talked about feature retrieval: we have that particular syntax, and we want to get the feature out of the feature store. We have feature manipulation: for example, convert lat/long into the geohash, or figure out the feature imputation. Let's do them one by one. First of all, let's try to understand what the Palette feature transformer is. More or less, this is the syntax we already introduced on the previous page; it describes which features we are interested in for this particular transformation stage.
Here is what we're trying to do: in order to do things in the transformer pattern, we need to figure out for each transformer what type of thing we want it to do. The idea here is that we want to [inaudible 00:27:22] between Palette feature retrieval and feature manipulation. Here are some examples. For the first feature, using the restaurant ID, I'm trying to get the number of meals. For the second feature, I'm using the restaurant ID to get the prep time. But for the third feature, how busy the region is, I cannot finish it in one step. The first step is just to get the lat/long for this restaurant. Imagine there's an interleaved step; after that, I'm going to do another round of feature retrieval, and then I have the region ID. I'm going to query the feature store by region ID and figure out how busy this region is. So this one needs to be interleaved with a feature manipulation. Feature manipulation in our system is called a DSL, because it's a type of domain specific language. The idea here is that you can say how you want to transform a feature; remember, we said we get a lat/long and we want to transform it into the region ID, so that in the next step we can use the region ID to do another round of querying. This transformation can be written as this.
I'm trying to get the region ID, and the way to compute it is with a built in function we provide in our system; it's really just a function that takes the two features you specified in the previous stage and translates them into this feature. Similarly, after we grab all the features in the system, we want to do a bunch of imputation and feature manipulation. We clean up the prep time: we use it as a numerical value, and if it doesn't exist, we fill it with the average prep time. The number of meals we just use as a numerical value, so we convert it to a number; the number of orders, similar thing; then the busy [inaudible 00:29:30], similar thing.
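Purely as illustration, here is what DSL expressions of this kind might look like. The function names and syntax are invented for the example; only the shape (a named output computed from features of the previous stage, plus impute/cast expressions) comes from the talk.

```scala
// Hypothetical DSL expressions, stage by stage.
val dslStage1 = Map(
  "region_id" -> "geohash(restaurant_lat, restaurant_lng, 6)"   // lat/long -> region entity
)

val dslStage2 = Map(
  "prep_time" -> "nFill(prep_time, mean(prep_time))",  // impute nulls with the training-set mean
  "nMeal"     -> "toNumeric(nMeal)",
  "nOrder"    -> "toNumeric(nOrder)",
  "busy"      -> "toNumeric(busy)"
)
```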
This is called an estimator - why is that? It's really because of this fallback logic: when you need to do the fallback, where are your fallback values coming from? They're actually coming from your training dataset. You want your fallback values to be the same at training and serving time, so that's why it's considered an estimator. In the estimation stage, it will fit on your data, understand all the stats, and that becomes the transformer to use. Behind the scenes, this one actually goes through some code gen: all the things you see here [inaudible 00:30:10] will be converted into some code snippet, go through a compiler, and become Java byte code at serving time. That's how we can guarantee the speed - we don't do the parsing for every single query you have.
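The same fallback idea exists in standard Spark ML as the Imputer Estimator, which makes a useful analogy here: fit() computes the mean from the training set, and the resulting ImputerModel Transformer applies that same value at serving time. Column names are hypothetical.

```scala
import org.apache.spark.ml.feature.Imputer

val prepTimeImputer = new Imputer()
  .setInputCols(Array("prep_time"))
  .setOutputCols(Array("prep_time_filled"))
  .setStrategy("mean")              // fallback value is learned from the training data

// val imputerModel = prepTimeImputer.fit(trainingDf)   // Estimator -> Transformer
// imputerModel.transform(servingDf)                    // applies the training-time mean
```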
Let's now try to pull the picture together. We have the number of meals, we have the prep time, we have the busy scale, and we're trying to interleave them with two Palette feature retrievals and two DSL transformer feature manipulations. In the first stage, we're going to query the number of meals, the prep time, and the restaurant lat/long from the restaurant ID. In the first manipulation, we're going to change the lat/long into a region ID. In the second round, we're going to, based on the region ID, [inaudible 00:30:58] busyness, and in the final stage we do some cleanup of all the features. After this is done, what comes after is actually the standard training part, so you start talking about string indexers and one-hot encoders. If it's binary classification, you may have one estimator at the end trying to fit your score and your decision threshold based on [inaudible 00:31:21]. That's more or less the whole picture of how we can make this pipeline ready.
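To make the shape of that pipeline concrete, here is a hedged sketch of the stage ordering only. The Palette and DSL stages are hypothetical stand-ins (left as placeholders), shown purely to illustrate how retrieval and manipulation interleave before the standard modeling stages.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Hypothetical stand-ins for Michelangelo's Palette and DSL stages.
val paletteStage1: PipelineStage = ???  // restaurant_id -> nMeal, prep_time, lat/lng
val dslStage1:     PipelineStage = ???  // lat/lng -> region_id
val paletteStage2: PipelineStage = ???  // region_id -> busy
val dslStage2:     PipelineStage = ???  // impute prep_time, cast remaining features to numeric

val etaPipeline = new Pipeline().setStages(Array(
  paletteStage1, dslStage1, paletteStage2, dslStage2
  // ...followed by the standard modeling stages: StringIndexer, OneHotEncoder,
  // VectorAssembler, and the regression/classification estimator.
))
// val etaModel = etaPipeline.fit(trainingDf)  // fit yields the servable sequence of transformers
```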
Dev Tools: Authoring and Debugging a Pipeline
But remember, we are a platform, not a single engineering team, so we cannot just code this one up and say, "Hey, done," because we have to handle all of these things daily from different customers. How can we provide a tool for them, so that we give this power back to them and they can do all this work by themselves, without help from us? We already talked about the parts coming from a Hive query or a Flink query. Here it's about how you can represent this as a customized pipeline in an interactive way.
We actually leverage this directly from Spark. Spark has this MLReadable, MLWritable thing, so it already gives us serialization and deserialization. We gave users a tool where they can describe the pipeline they want - in this particular case, interleaved between Palette retrieval and the DSL estimator - and then the rest is the regular linear regression part. You can fit the model, use the fitted model to try it on your DataFrame, and then Michelangelo provides you a way to upload and store it. The interesting part is that you can upload both: the model you trained for your one-off test, and also the pipeline you want to train - that's how you can do re-training through the system. We have a workflow system behind the scenes which does all the optimization behind this simple line of code to make it scalable and re-trainable.
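For reference, the standard Spark ML persistence this leans on looks roughly like the following: both the unfitted Pipeline (for re-training) and the fitted PipelineModel (for one-off serving) can be saved and loaded via MLWritable/MLReadable. The paths and variable names are hypothetical, and Michelangelo's own upload/store API sits on top of this.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}

// etaPipeline.write.overwrite().save("/models/eats_eta/pipeline")   // re-trainable spec
// etaModel.write.overwrite().save("/models/eats_eta/model_v1")      // fitted, servable model

// val reloadedPipeline = Pipeline.load("/models/eats_eta/pipeline")
// val reloadedModel    = PipelineModel.load("/models/eats_eta/model_v1")
```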
Takeaways
That's more or less all we have; let me try to highlight the takeaways. We talked about a particular feature engineering platform. There are two pieces. One is, how do you manage and organize all these features, especially across the two stores, online and offline? The other is how we manipulate those features - for the offline case, it's many joins. Both online and offline are about speed, so online and offline each have a different way of optimizing the queries to make it happen, and they also need different stores to make it happen as well.
About the transformers: the bigger concept is the MA Transformer, where we introduce the function scoreInstance; that's the place we guarantee online and offline consistency. I actually would phrase the sentence another way: "It's a guaranteed, controlled inconsistency." What do I mean? In the feature store, we said, the online store implies that I'm interested in the feature right now. In the offline case, I'm saying, "I'm interested in the feature when this particular record happened." That's indeed not online/offline consistency; it's online/offline controlled inconsistency. How can we make all these things happen? It's actually all modeled by our transformer framework. We also build all these pipelines behind the scenes, so we provide all the out of the box reliability and monitoring, putting the operation back in the hands of our users.
Michelangelo, as a whole, is not only about feature engineering; if you go there, you can hear other stories. For example, how do we do [inaudible 00:35:07]? How do we do model understanding? How do we do other things, like Python-related modeling? There are a lot of interesting stories there.
Questions and Answers
Participant 1: A quick question is, what if teams use TensorFlow to build a model? How do they leverage this platform?
Chen: TensorFlow is actually hiding behind this transformer framework, so your TensorFlow model can be hooked up with any other transformation stage you build yourself; that's what happens behind the scenes. I think you also probably want to say that TensorFlow has some feature engineering stuff as well. TensorFlow feature engineering usually is about [inaudible 00:36:12] coding. It's a different kind of feature engineering from the transformer engineering I talked about; it's a different scope, fused inside the TensorFlow part.
Participant 2: I noticed that you had different tools for the feature generation part - you had Hive QL, you had Flink SQL. As we go further down, you had Spark for the transformers and the estimators. Have you guys thought about having a unified API, something like Spark Structured Streaming, which sort of works across the board? That way you have data frames to work with from start to finish.
Chen: The unified API - the way we see that is actually not as a unified API; we see it as a package of APIs. The way we're trying to open this up is through IPython Notebook, because we are trying to unify the experience, mainly the terms. When you say Palette feature, it's supposed to mean exactly the same thing when you're trying to do a Hive query or when you're trying to join features. We're actually at the stage of trying to unify all the different Python packages under the same naming patterns. We're exposing them all through this IPython Notebook provided by Uber [inaudible 00:37:38], so that's how we are unifying everything.
Nene: I also want to add that these systems that you mentioned all have their own sweet spots. Hive will thrive at queries at large scale, Flink will thrive at processing Kafka streams in near real time, and then the feature [inaudible 00:37:57] here has a slightly different goal. These systems have matured and they all have their optimal place. We try to leverage what is available at Uber and make it available via a common and sort of central place.
Participant 3: I have actually three questions. The first question I wanted to ask is, how often does it happen that a feature is being used across teams at Uber? I think the whole purpose of this platform is feature sharing, and in my personal experience, outside a team, there is very little feature sharing that happens. That's my question one. My second question is, when you build a platform like this, how do you prevent the proliferation of features created by different teams when they are exactly the same features but with a different name? Sometimes they could be exactly the same, sometimes they could have partial correlation - that's question number two. My third question is, you said that there are a bunch of applications for which accessing the feature online cannot be done with a key value store, so you let people have their own custom solution. I want you to describe what the use cases are where you have to use custom solutions, because I can think of three where key values will not work.
Nene: You've got a lot of questions there; I will very quickly answer in just a minute. I think the first part was about feature sharing. That is indeed very much the goal, and although there are some cases where teams contribute to the feature store but only use their own features - that is true, we see that - there are teams who actually do use the same features, so in our case it has actually been true. Not to the extent that we'd like, but that is definitely a goal. Now, this is a complex problem; part of it is technical, part of it is process and how teams work with each other, but one of the things we have found is that there's a trust element there.
Once you build an ecosystem where you can actually trust the features - where you actually build the tools to show the data distribution, the anomalies, how reliable a feature is - the confidence in using that feature actually increases; that is what we see. If you just throw some key values or features there and you don't know anything about them, that is where the problem originates, in my experience. So we are focusing on building the tooling so that users start trusting the features more.
Another example is the auto feature selection tools that we've been building - not our team, but a related team. We are actually finding that unrelated teams are discovering features built by other teams through this particular tool, which shows, "Hey, this is a feature that is correlated to your labels," so they find it interesting and then come to us and say, "How can we use this? Who owns the pipelines?" Some of it is really related to just having poor tooling, and once you invest in the tooling, that ought to change, in my view. The third question was pretty complex; maybe we should just discuss it offline.
The tool we're building for automatic feature selection actually shows redundancies across features, so it's a work in progress, but that is one [inaudible 00:41:32] step. Without such tools, you're at the mercy of reviews: every check-in to the Palette feature store actually goes through a review. We try our best there, and we try to enforce things like entities and rules, but because it's purely a human process, things can sometimes slip through the cracks.