Michelangelo - Machine Learning @Uber

Summary

Jeremy Hermann talks about Michelangelo - the ML Platform that powers most of the ML solutions at Uber. The early goal was to enable teams to deploy and operate ML solutions at Uber scale. Now, their focus has shifted towards developer velocity and empowering the individual model owners to be fully self-sufficient from early prototyping through full production deployment & operationalization.

Bio

Jeremy Hermann is Head of Machine Learning Platform at Uber.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

I'm going to talk today about Machine Learning at Uber. There are three phases of the talk. The first one is to go over some of the interesting use cases of ML at Uber. Second piece is around looking at the first version of the platform that we built to support those use cases and many more. And then, the final section is more tailored to this track, which is around developer experience, and how we're working right now to accelerate Machine Learning usage, and adoption, and innovation at Uber through better tooling and better experience.

ML at Uber

First thing, ML at Uber. Uber, I think, is one of the most exciting places right now to do Machine Learning, for a bunch of reasons. First is that there's not one or two big use cases that consume all of the ML attention and horsepower; there's a wide array of projects of more equal weight across the whole company. We'll go through a bunch of those. The second one is the interestingness of the data. Uber operates in the physical world. All the rider and driver-partner apps have GPS, they have accelerometers. We collect a lot of interesting data about the physical world. And, of course, the cars move around the physical world. We're not just dealing with people clicking on web pages, we're dealing with things out there in the world.

The third thing is that Uber is a younger company. In a lot of cases we're applying Machine Learning in areas for the very first time, so you're not trying to grind out a few extra fractions of a cent of accuracy; you're actually seeing giant swings the first time you get a new model deployed in production for some use case. And the final one is that ML is really central and strategic to Uber at this point. The decisions and features that we can base on the data we collect are very hard to copy. Also, ML is one of the things that helps Uber run the whole machine more efficiently. It's applied in lots of places to make the product and a lot of the internal operations run much more efficiently.

Data at Uber. Uber has grown a lot in the last bunch of years. We have 75 million riders now, 3 million drivers. We completed 4 billion trips last year, so it's even bigger this year. We operate in 600 cities, and we're completing more than 15 million trips every single day. To give you a sense, again, this is from several years ago, so things have grown a lot since then, but these are GPS traces of the driver phones in London over the course of, I think, six hours. You can see how the cars very quickly cover a lot of the city, and this is all data that we can use for Machine Learning.

There are, I think, over 100 ML use cases or problems being solved right now. This is a small sampling of them. But you can see it really cuts across the whole company, from Uber Eats, and we'll talk about that one in more depth, to self-driving cars, to customer support, pricing, forecasting. And then even things more removed from the product, like doing anomaly detection on system metrics in the back end services, and even capacity planning for data centers to make sure we have adequate hardware capacity for both the long term, as well as shorter-term spikes that we get on big holidays like New Year's Eve and Halloween.

Here we'll walk through a few interesting ones. Uber Eats: every time you open the Uber Eats app, we score, I think, hundreds of different models to generate the homepage for you. We use models to try to figure out which restaurants you're most likely interested in ordering from. We do ranking of restaurants within some of the screens. There are also meals, and so we rank the meals as well, trying to see which ones you're more likely to want. All of the delivery times you can see below the restaurants are ML models trying to predict how long it will take, once you place the order, for the order to get prepared, for the driver-partner to be notified, to drive to the restaurant, get out of the car, walk into the restaurant, pick up the meal, walk back to the car, and then drive it to your house. ML is used to model that whole problem and give pretty good ETAs for delivery time for the meals.

Then finally, search ranking. When you search for a meal it will again not just do prefix-based searching, but also try to predict your intent and what you're looking for. This is an A/B test, and it makes a big difference for Uber's business. Self-driving cars: the cars have LIDAR and cameras, and they try to understand the world around them. They also use ML for that, for object detection, trying to find where the streets go, looking out for pedestrians and other cars, and then also in other parts of the process, for planning and route finding and so forth. The cars are mostly deep learning these days.

ETAs. In the app, when you request a ride, or are about to request a ride, it tells you how far away the driver is. This is super important both for the product experience and for our users, because if the ETA is incorrect it's quite frustrating and it may affect how you use the product, but it's also fed into lots and lots of other internal systems that drive pricing and routing and a bunch of other things. So having accurate ETAs is super, super important to Uber, and it's a hard problem. Uber for a long time has had a route-based ETA predictor that will look at the segments of road you're going to travel over and the average speeds over those segments in the past, and it will use that to predict a base ETA.

But what we found is that those ETAs are usually wrong to some degree, but they're wrong in consistent, predictable ways. We can fit models to the error, and then use the predicted error to correct the base ETA and give dramatically more accurate ETAs across the board.

Map making. Uber used to use Google Maps, and now we're building out our own mapping infrastructure. As part of the map-making process, there's a layering of evidence collection. You start with a base street map, and then you layer on evidence to make it more and more accurate.

One of the things we do is drive cars around with cameras on top and take pictures of all of the buildings and street signs, tag those with the GPS coordinates of where the picture was taken from, and then use ML models to try to find addresses and street signs so that we can add them to the database and help make the map itself more accurate and consistent. You get a base map, and you layer on evidence that we collect with sensors and cars and Machine Learning. First, we figure out the objects we're interested in; in this case, you can see street signs and addresses. Then we apply text extraction algorithms to actually pull the text out of the image, and then the actual text, where there's an address or a street sign or a restaurant name, can be fed into the database.

Destination prediction. When you open the app and you start to search for where you want to go, ML again is used, like in the Eats case, to try to help you find the place you want to go. Forecasting and the marketplace. Uber is a marketplace: we try to connect riders and drivers for rides, and for the thing to work, it's very important that the riders and drivers be close to each other in both space and time. If you request a ride and the driver is very, very far away across the city, it doesn't work, because it takes too long to drive across the city to get to you. If you request a ride and there are no drivers available, even ones that are close, it doesn't work either. So the proximity in space and time of supply and demand is quite important. You can contrast that with a business like eBay, which is also a marketplace, but you can order a futon today from LA and they can ship it next week, and that all works out even though the distance and time are spread out. But for Uber, the spatial-temporal aspect is quite important.

In Uber's maps, you can see little hexagons; we divide the map up into hexagons, which is a more efficient way than a grid to organize maps, and we use deep learning models to predict a variety of marketplace metrics at various points of time in the future: drivers who will be available, riders who will want rides. We can then identify gaps between supply and demand in the future, and use that to help encourage drivers to go where there will be demand, to help keep ETAs low and utilization high.
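
To make the idea concrete, here is a rough illustration, not Uber's actual system: a forecasting model emits per-hexagon supply and demand predictions for some future time slice, and the gap between them is what drives incentives. The hexagon IDs and the stub model below are made up for the sketch.

```python
# Rough illustration (not Uber's system) of turning per-hexagon supply and
# demand forecasts into a ranked list of gaps. The hexagon IDs and the
# forecast model are made up; Uber's real forecasts come from deep learning
# models over its hexagonal map index.

class StubForecastModel:
    """Stand-in for a learned model that forecasts marketplace metrics."""
    def predict_demand(self, hex_id, minutes_ahead):
        return {"hex_a": 40, "hex_b": 12, "hex_c": 25}[hex_id]

    def predict_supply(self, hex_id, minutes_ahead):
        return {"hex_a": 18, "hex_b": 15, "hex_c": 24}[hex_id]

def supply_demand_gaps(model, hexagons, minutes_ahead=30):
    """Rank hexagons by predicted unmet demand at some future time."""
    gaps = {h: model.predict_demand(h, minutes_ahead) - model.predict_supply(h, minutes_ahead)
            for h in hexagons}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

# The top of this list could feed driver incentives or repositioning nudges.
print(supply_demand_gaps(StubForecastModel(), ["hex_a", "hex_b", "hex_c"]))
```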

Customer support. There are 15 million rides a day. People leave phones and backpacks in the back of cars, and they file customer support tickets and those get routed to call centers. Uber spends lots and lots of money for people to answer support tickets. What happens when a ticket comes in is the person has to read the ticket, figure out what the problem is, and then pick from a big menu of responses for the proper response for Lost and Found or whatever else. We can use deep learning models looking at the text of the message, to try to predict what the actual problem was, and then reduce the menu options from, I think, 30 down to three, like the three most likely response templates. That gave, I think, initially a 10% boost. But I think we have another model which gave another 6% boost in the speed at which these people can answer tickets, which is 16% off of the cost of the thing, which is huge for us.
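
For a flavor of that pattern, here is a minimal sketch, not Uber's model (which is deep-learning-based), of scoring a ticket against response templates and surfacing the top three. The tickets, labels, and helper function are made up for illustration.

```python
# Sketch: suggest the 3 most likely response templates for a support ticket.
# Uber's production model is a deep learning NLP model; this stand-in uses
# TF-IDF + logistic regression, and the training data below is made up.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tickets = [
    "I left my phone in the back seat",
    "My backpack is still in the car",
    "The driver took a much longer route",
    "I was charged more than the quoted fare",
]
templates = ["lost_item", "lost_item", "fare_review", "fare_review"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tickets, templates)

def top_templates(ticket_text, k=3):
    """Return the k most likely response templates with their probabilities."""
    probs = clf.predict_proba([ticket_text])[0]
    top = np.argsort(probs)[::-1][:k]
    return [(clf.classes_[i], float(probs[i])) for i in top]

print(top_templates("I think I lost my wallet in the car"))
```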

Actually, another one that's quite similar, although a different application, is this new one-click chat thing that we released recently. The idea here is that when a car is coming to pick you up, you often want to communicate with the driver to tell him or her exactly where you're standing, or that you're running down the block, but it's hard for drivers to type, and it's way easier to chat this way. We have an NLP model that basically predicts the next response in a conversation, so you can very quickly communicate with the driver, and vice versa, by just picking responses out of menus as opposed to typing. I forget the exact accuracy rate, but it was quite high, and you're able to carry on pretty good conversations without actually typing any text, which is pretty cool.

ML Platform

That was like 10 or something out of close to 100 different use cases around Uber where ML is being used. Over the last three years, we built a platform called Michelangelo, which supports the majority of those use cases. We'll talk a bit now about the philosophy of the platform, and the first version and what it covers.

So the overall mission of my team is to build software and tools that enable data scientists and engineers around the company to own the end-to-end process of deploying and operating these ML solutions that we just saw, and to do it at full Uber scale. There's a big developer experience component to that, because you want to empower the same person to own the end-to-end, from the idea and the prototyping of the model all the way through deployment and production. The more you can have one person own that process, the faster you can move through it, and with modeling work being very iterative, being able to move through things faster has a compounding effect, because there are lots of cycles as you experiment with new models.

In addition to technology, there have been a lot of organizational and process aspects to ML at Uber, and they've been quite important in making it work well at scale, in terms of both system scale and organizational scale. There's a blog post we put out recently that describes a bunch of this.

V1 of ML at Uber was really just to enable people to do it. That's been quite successful and powerful. But it wasn't always the easiest thing. V2 is more around how do we improve developer productivity and experience, and increase the velocity of modeling work and deployment work. Again, to facilitate innovation.

Enable ML at Scale

This is a walk-through of the first version of the platform, and then we'll talk about the things we're doing now to make it better and faster. One of the early hypotheses that we had, or a vision around the platform, was that Machine Learning is much more than just training models; there's a whole end-to-end workflow that you have to support to make it actually work well. It starts with managing data, and this actually, in most cases, ends up being the most complicated part of the process. You have to manage the data sets used for training the model, which are the features and the labels, and they have to be accurate. You have to manage that data for training and retraining. And then when you deploy the model, you have to get that same data to the model in production.

At Uber, most models are deployed into a real-time prediction service for request-response-based predictions. Many times, the data that you need for the model is sitting in Hadoop somewhere. You have to wire up pipelines that run analytical queries against historical data and then deliver the results into a key-value store where the model can read them. There are a lot of complicated pipelines for getting the right data delivered at the right time and place for the model to use at scoring time.

Training models. Obviously, you have to actually train the models, and we do a bunch there. Model evaluation: modeling work is very iterative, so you want good tools for comparing models and finding out which ones are good or not. Deployment: once you have a model that you like, you want to be able to click a button or call an API and have it deployed out across your serving infrastructure. And then making predictions, that's the obvious part. Finally, monitoring is interesting in that you train models against historical data and you evaluate against historical data, and then when you deploy the model in production, you don't actually know if it's doing the right thing anymore, because the model is seeing new data. So being able to monitor the accuracy of predictions going forward in time becomes quite important.

What we found is that the same workflow applies across most of the ML problems we've seen: from traditional trees and linear models to deep learning, supervised and unsupervised, online learning where you're learning more continuously, and whether you're deploying a model in a batch pipeline, online, or on a mobile phone. And then, even as we saw in that marketplace case, it works for classification and regression, but also for time series forecasting. For all these things, the same basic workflow holds true, and so we spent time building out a platform to support them.

Managing data. I hit a bit of this before already, but in most cases data is the hardest part of ML. We've built a variety of things, including a centralized feature store, where teams can register and curate and share features that are used across different models. That facilitates modeling work, because rather than having to write new queries to find new features, you can just pick and choose them from a feature store. As importantly, once your model goes into production, we can automatically wire up the pipeline to deliver those features to the model at prediction time.
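
As a rough sketch of what a feature store looks like from the modeler's side, here is a minimal in-memory stand-in. The class and method names are invented for illustration; they are not Michelangelo's actual API, and in the real system the storage role is played by Cassandra (online) and Hive (training).

```python
# Hypothetical feature store interface, sketched to show the idea described
# above: teams register curated features once, other models reuse them, and
# the platform delivers the values at prediction time. Not the real API.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class FeatureStore:
    # (feature_name, entity_id) -> latest value; a dict stands in for
    # Cassandra/Hive here.
    _values: Dict[Tuple[str, str], float] = field(default_factory=dict)
    _registry: Dict[str, str] = field(default_factory=dict)  # name -> description

    def register(self, name: str, description: str) -> None:
        """Teams register and document a feature once so others can reuse it."""
        self._registry[name] = description

    def write(self, name: str, entity_id: str, value: float) -> None:
        """Batch or streaming jobs publish computed feature values."""
        self._values[(name, entity_id)] = value

    def get_online(self, name: str, entity_id: str) -> float:
        """Prediction services read precomputed values by entity key."""
        return self._values[(name, entity_id)]

# Usage sketch
store = FeatureStore()
store.register("restaurant.avg_meal_prep_time_2w", "2-week average prep time (min)")
store.write("restaurant.avg_meal_prep_time_2w", "restaurant_42", 11.5)
print(store.get_online("restaurant.avg_meal_prep_time_2w", "restaurant_42"))
```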

Training models. We run large-scale distributed training, both on CPU clusters for trees and linear models, and on GPU clusters for deep learning models. In the case of deep learning, we base a lot of it around TensorFlow and PyTorch, but we built our own distributed training infrastructure called Horovod. I won't go into too much detail here, and we'll actually come back to this in the experience section. But Horovod has two interesting aspects. One is that it makes distributed training more efficient by getting rid of the parameter server and using a different technique, involving MPI and ring allreduce, to more efficiently shuffle data around during distributed training. But it also makes the APIs for managing the distributed training jobs much, much easier for the modeling developers, and we'll come back to that later. It's quite strong in terms of scale and speed, but also much, much easier to use.

Managing and evaluating models. Again, after you train models, you often train tens or hundreds of models before you find one that's good enough for your use case. Being able to keep a rigorous record of all the models you've trained, the training data, who trained them, as well as a lot of metrics and reports around the accuracy of the model, and even debug reports, helps the modelers iterate and eventually find the model that they want to use in production. We invested a lot of work here in collecting metadata about the models, and then exposing it in ways that are very easy for developers to make sense of, to move the modeling process forward.

This is for a regression model, and these are the standard error metrics, as well as reporting to show the accuracy of the model, very standard things that data scientists are used to working with. For a classification model, there's a different set of metrics, but again, the things that people need to hone in on the best model for their use case. For all of the different features that go into the model, we look at the importance of the feature to the model, as well as statistics about that data: the mean, the min, the max, the standard deviation, as well as the distribution. Again, all things that help you understand the data in the model and accelerate the work here.
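
For a flavor of the kinds of numbers such reports are built from, here is a minimal sketch computing standard regression and classification metrics plus per-feature statistics with scikit-learn and pandas. The data is made up for illustration.

```python
# Sketch: the kinds of accuracy metrics and feature statistics these model
# reports are built from. Data below is made up.
import numpy as np
import pandas as pd
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             confusion_matrix, roc_auc_score)

# Regression-style report (e.g. an ETA model)
y_true = np.array([12.0, 30.0, 22.0, 8.0])
y_pred = np.array([14.0, 27.0, 25.0, 9.0])
print("MAE:", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))

# Classification-style report (e.g. a support-routing model)
labels = np.array([0, 1, 1, 0, 1])
scores = np.array([0.1, 0.8, 0.6, 0.3, 0.9])
print("AUC:", roc_auc_score(labels, scores))
print(confusion_matrix(labels, scores > 0.5))

# Per-feature statistics: mean, min, max, standard deviation, distribution
features = pd.DataFrame({"trip_distance_km": [1.2, 3.4, 0.8, 7.5],
                         "hour_of_day": [8, 18, 23, 12]})
print(features.describe())
```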

For tree models, we expose a tool that lets you actually dig into the structure of the learned trees, to help understand how the model works, and to help explain why a certain set of input features generates a certain prediction. Across the top, you can see- so this is a boosted tree. Each column in that top grid is one tree in the forest. Each row is a feature, and then as you click on a tree, you'll see the tree at the bottom with all the split points and distributions, and then you can actually fill in data on the left there and it will light up the path through the tree, so you can see how the tree handles that feature vector. Again, if a model's not behaving correctly, you can pull up this screen and figure out exactly why the model is generating a certain prediction for a certain input set of features.

Then deployment and serving. Once you've found the model that you want, it's important to be able to deploy it. Uber does both batch predictions, meaning you run a job once a day or once an hour and generate lots and lots of predictions, and online predictions, where you deploy a model into, essentially, a web service: a container that will receive network requests and then return predictions. Most models at Uber, and a lot of the ones that I showed before, are of that nature. You open your Uber Eats app, it calls the backend services, and they score a bunch of models to render your homepage in, whatever it is, a hundred milliseconds. We built and operate the production clusters that scale out and are used across the company.

A quick architectural diagram. The idea is the client sends the feature vector in, we have a routing infrastructure, and then within the Prediction Service you can have multiple models loaded. Based on a header, it will find the right model, send the feature vector to that model, get the prediction back, in some cases load more data from Cassandra, which is the feature store we talked about, and then return the prediction back to the client. I think right now we're running close to a million predictions a second across all the different use cases at Uber, which is quite a bit.

We're at 1 million plus and then for trees and linear models, the scoring time is quite fast. Typically we're less than five milliseconds for P95, I think it is, if there's no Cassandra in the path for the online features. And then when you have to call Cassandra to get features, that adds another 5 or 10 or 20 milliseconds. All in all, still quite fast for predictions, which is good. We're starting to work on more deep learning models. Those are trickier because depending on the complexity of the model, the inference time can actually go up quite a bit. But for trees, it's usually very, very fast.

The final bit I talked about is monitoring. You've trained against historical data, so when you evaluated against historical data, you knew your model was good for last week's data. But now that it's running in production, you want to make sure it's actually good for the data that you're seeing right now.

So what you can do, and we'll come back to a newer piece of this at the end, is monitor your predictions in a few different ways. The ideal way is where you can actually log the predictions that you make, then join them back to the outcomes that you observe later on as part of the running of the system, and then see whether you got the prediction right or wrong. You can imagine, for the Uber Eats case, we predict the ETA for a certain restaurant, then you order the meal, and 20 minutes later it's delivered, so we know the actual arrival time for that meal. That's collected in one of our backend systems. If we log the prediction that we made when you viewed the screen and then join that back to the actual delivery time, we can see how right or wrong that prediction was. You collect those in aggregate, and then you can generate ongoing accuracy reports for your model in production. Because you have to wait for batch processes to run to collect the outcomes, you can get good monitoring, but there's, I think, an hour delay before you can know how correct the prediction was.
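
A minimal sketch of that join, assuming a prediction log table and a delivered-orders table in Hive; the table and column names here are made up, not Uber's actual schema.

```python
# Sketch: join logged predictions back to observed outcomes to track accuracy.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

predictions = spark.table("ml_logs.eta_predictions")   # order_id, predicted_eta_min, predicted_at
outcomes = spark.table("eats.delivered_orders")        # order_id, actual_eta_min

accuracy = (
    predictions.join(outcomes, on="order_id", how="inner")
    .withColumn("abs_error", F.abs(F.col("predicted_eta_min") - F.col("actual_eta_min")))
    .groupBy(F.window("predicted_at", "1 hour"))
    .agg(F.avg("abs_error").alias("mae"),
         F.count("*").alias("num_predictions"))
)
# These hourly error numbers can then be pushed to metrics/alerting systems.
accuracy.show()
```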

So from an architectural perspective, walking through it, along the top you have the different workflow steps, and then you have both our offline batch systems at the bottom and our online systems at the top. We'll just walk through the stages of the architecture. In the offline world, we start in the lower left with our data lake. All of Uber's data funnels into Hadoop and Hive tables, and that's the starting point for most batch data work. As part of the ML platform, we let developers write either Spark or SQL jobs to do the querying, joining, and aggregation, and the collection of feature data and outcome data, and then those are fed back into tables that are used for training and batch prediction.

In cases where you want those features available online at prediction time, the values that were calculated in those batch jobs can be copied into Cassandra for online serving. For example, in the Uber Eats delivery time case, one of the features is something like: what's the average meal prep time for a restaurant over the last two weeks? That's computed via a Spark job. Because it's a two-week average, it's okay if it only gets refreshed in Cassandra once or twice a day, because two weeks plus or minus 12 hours doesn't make that much difference for that kind of metric. That one is fine flowing through the bottom batch path; it gets computed once a day, loaded into Cassandra, and then we can use that same value for every single prediction.
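
A sketch of what such a batch feature job might look like in Spark; the table and column names are made up for illustration.

```python
# Sketch: batch-computed feature, e.g. a restaurant's average meal prep time
# over the last two weeks. Table/column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.table("eats.completed_orders")  # restaurant_id, prep_time_min, order_ts

two_week_prep_time = (
    orders.where(F.col("order_ts") >= F.date_sub(F.current_date(), 14))
    .groupBy("restaurant_id")
    .agg(F.avg("prep_time_min").alias("avg_meal_prep_time_2w"))
)
# In the real pipeline this result is written to Hive for training and copied
# into Cassandra so the online prediction service can read it by restaurant_id.
two_week_prep_time.write.mode("overwrite").saveAsTable("features.restaurant_prep_time_2w")
```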

However, there are cases where you want the features to be a lot fresher. In addition to the two-week meal prep time for the restaurant, which gives you a sense of how fast the restaurant is in general, you may also want to know: how busy is the restaurant right now? What's the meal prep time over the last hour, or the last five minutes? Obviously, if you're computing things with that freshness, you can't afford to go run offline jobs. We have a streaming path across the top where we can get metrics coming out of Kafka, run a Flink job to aggregate across the stream of data, write those numbers into Cassandra, and then double-write them back to Hive, so you have the exact same numbers available later on for training. Parity between online and offline is super important to get right. The way we've solved that, generally, is by only computing each feature once and then double-writing it to the other store.

So then batch training pulls data from these Hive tables and runs it through the algorithm, which could be a tree or linear model or a deep learning model, and then writes the output, not into Cassandra, but into a model database that stores all the metadata about the model that we talked about before: who trained it, when it was trained, on what data set, as well as all of the actual learned parameters, the artifacts of the model. If it's a tree model, it's all the split points we saw before. If it's a deep learning model, it's all of the learned weights in the network. We capture all the metadata and configuration plus the actual parameters of the model, and store that in the database.

At deployment time, you can click a button, or go through an API, to take one of those models you've trained and push it out into either an online serving container, which we talked about before, that will do network-based request-response predictions, or into a batch job that will run on a cadence, generate lots and lots of predictions, and then send them somewhere else.

And then finally, if you look at how the predictions actually happen: along the top, again, is the real-time case. Imagine you open the Uber Eats app and you want to see your meal delivery time estimates. The features coming from the phone would be your location, the time of day, a bunch of things that are relevant to the current context. Those go to the model. The model, as part of its configuration, knows that in addition to the features that come as part of the current request, it has to get a bunch of other ones that are waiting for it in the feature store. So the one-hour meal prep time and the two-week meal prep time, and probably a bunch of others, are pulled out of Cassandra and joined to the features you sent from the phone. That whole feature vector is then sent to the model for scoring. You can see we're blending the request context features with a bunch that are computed either via streaming jobs or via batch jobs.
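
A minimal sketch of that blending step inside a prediction service; the feature names, request fields, and the feature store client are hypothetical, not Michelangelo's actual interfaces.

```python
# Sketch: blend request-context features with precomputed features from the
# feature store before scoring. Names and the store client are hypothetical.
def score_delivery_eta(model, feature_store, request):
    # Features sent from the phone with the current request
    features = {
        "rider_lat": request["lat"],
        "rider_lng": request["lng"],
        "hour_of_day": request["hour_of_day"],
    }
    # Features waiting in the key-value store (computed by batch/streaming jobs)
    restaurant_id = request["restaurant_id"]
    features["avg_meal_prep_time_2w"] = feature_store.get_online(
        "restaurant.avg_meal_prep_time_2w", restaurant_id)
    features["avg_meal_prep_time_1h"] = feature_store.get_online(
        "restaurant.avg_meal_prep_time_1h", restaurant_id)
    # The full blended feature vector goes to the model for scoring
    return model.predict(features)
```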

Again, a lot of the challenge here is getting the system set up in a way where it's very easy for developers to wire up all these pipelines and not have to do it one-off each time, because without this, that's where most of the work in ML goes: getting these data pipelines set up. For the monitoring case, for either real-time or batch predictions, we can log the predictions back to Hadoop and then join them to the outcomes once we learn about them as part of the regular processing of data, and then we can push those out to metrics systems for alerting and monitoring. Again, because these are batch jobs, I think we run these things once an hour, so it's not super real-time yet, but we'll come back to that later.

Zooming out, we have a management plane that we use for monitoring. We pump this into central monitoring systems that drive dashboards. We have an API tier that orchestrates and is sort of the brains of the system, and it also acts as the public API surface for the web UI that's used to do a lot of the workflow management and deployment. And then you can write Python or Java automation or integration code to drive the system from the outside.

We have a quick little video here showing the UI. It's organized around projects; these are all kind of dummy names. A project is a container for a modeling problem. You can go connect to a Hive table to train your model on, and look at all of the models you've trained. We talked about this before; it's a bunch of boosted tree models, and you can drill into one of these.

Projects. Go grab a Hive table. Drill in. I think we're going to drill in and see some of the visualizations and reports on one of these models. This one's already deployed. We click in, and you can see that this is a classification model. You can see the confusion matrix and a bunch of the different metrics used to assess accuracy.

This is the tree tool we saw before. This model has, whatever, 162 features and a lot of trees. You can see the actual split points in the trees. And then here's the feature report for all the features in this model, with the distributions and so forth. I think we're going to deploy a model here, and you can see how fast it goes out. You click in, it says this model is not deployed. You click Deploy, click OK. It spins for a few minutes, packages up the model, and pushes it out to the serving infrastructure. And then, boom, it's ready to go. Then here you can see the history of all the different models you've deployed over time, with logs of who deployed what and when.

Accelerate ML

That's the V1 of the platform. This is what we built over the last few years to support ML use cases at scale. It's worked well, though in some cases things weren't as fast or as easy as they could be. The next wave of our efforts on the platform builds on the foundation that we have now: how do you make it faster and easier for people to go from idea through prototyping to a first model, and then deploy and scale that model up into production? We'll go through a few recent projects that we've either finished building or are building right now to address those problems.

So, on the right side, we are now working on accelerating ML. We have a new Python project, PyML, that brings the toolset to the data scientists who prefer working in Python over web UIs or Scala. Horovod is our distributed deep learning system that has a really elegant API. AutoTune is our first piece of AutoML, allowing the system to help you train good models, as opposed to having the data scientist or engineer figure out all the right settings themselves. There are some new visualization tools to help understand why models are working well or not, and then some newer features around understanding, more in real time, how the model is behaving in production. The thing I showed you earlier was refreshed once an hour; now we have more real-time monitoring of the model in production.

As we've started to look at how to accelerate model development and address the developer experience problem with Machine Learning, we've looked at a few things. One is that ML is this long workflow from getting data to training models all the way through to production, and there are friction points in every single step. We've been quite rigorous about trying to identify where those friction points are, grinding off the rough edges, and making the workflow faster. One of the guiding principles or philosophies, and this goes back in many ways to the DevOps philosophy, is that if you let the engineer own the code from prototype through hardening, through QA, through production, you can accelerate the loop of trying something out and getting it into production, and you also build better systems, because the engineers are on the hook to support the thing in production. We found the same thing applies to Machine Learning. If you can empower the data scientists to own more and more of the workflow, ideally the whole thing, they're able to traverse the workflow faster, and they also have more ownership of the problem end-to-end.

Bringing the tools to developers: we made a few mistakes early on around not embracing the tools that the data scientists were already very familiar with, i.e. Python, so we're bringing that back, plus making more investments in visual tools to help understand and debug models. PyML. The general problem here is that Michelangelo initially targeted super high-scale use cases: high-scale training on giant data sets and high-scale predictions at very low latency. That was great for the first couple of years of use cases, and a lot of the highest-value ones. However, we found that the system is not as easy to use and not as flexible as many data scientists would like, and as is required by a long tail of more unique problems across Uber.

The solution was: how can we support plain Python and the rich ecosystem of Python tools throughout this end-to-end workflow, at somewhat limited scale, because you're dealing with Python in a non-distributed environment, but make it scale and work as well as it possibly can? The basic idea is to allow people to build models using essentially any Python code and any Python libraries, implement a serving interface in Python, and then have packaging and deployment tools that will treat it like any other model that we have and be able to push it out to our serving infrastructure.

I'll go through a quick example, but the trade-off between PyML and the other system is really one between flexibility on one side and resource efficiency, scale, and latency on the other. The general idea: this is a pretty simple example, I think it's a Kaggle dataset. We build a Pandas DataFrame, train a logistic regression model, and then run some test predictions at the very bottom. It's a very simple, standard scikit-learn model.

And this is actually all happening in a Jupyter notebook. I didn't show the whole context, but this is all happening in Jupyter. You can have a requirements file that specifies all of your dependencies, and you can then import your Python libraries. You save the model file back out to your directory. This is the serving interface: you implement an interface that knows how to load that model back from the file, and that implements a predict method that can do some feature transformations and then feed the data through the model for scoring. You can see how these pieces give you an interface to score the model. Through the API, we can test the model, and then at the bottom, we can actually call upload model. This will package up the model and all its dependencies and send it up to the Michelangelo backend, such that it can be managed in our UI and then deployed the same way other models can be deployed.
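
A rough sketch of that pattern: train a simple scikit-learn model, then wrap it in a load/predict serving interface. The DeliveryEtaModel class, method names, and data below are invented for illustration; they are not Michelangelo's actual PyML API.

```python
# Sketch of the train-then-serve pattern described above. Class/method names
# and data are hypothetical, not the real PyML interface.
import pickle
import pandas as pd
from sklearn.linear_model import LogisticRegression

# --- training (e.g. in a Jupyter notebook) ---
df = pd.DataFrame({"distance_km": [1.0, 5.0, 2.0, 8.0],
                   "hour_of_day": [12, 18, 9, 20],
                   "late": [0, 1, 0, 1]})        # made-up data
clf = LogisticRegression().fit(df[["distance_km", "hour_of_day"]], df["late"])
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# --- serving interface (model.py) ---
class DeliveryEtaModel:
    def __init__(self, model_dir="."):
        with open(f"{model_dir}/model.pkl", "rb") as f:
            self._clf = pickle.load(f)

    def predict(self, features: pd.DataFrame) -> pd.Series:
        # any feature transformations would go here before scoring
        return pd.Series(self._clf.predict_proba(features)[:, 1], name="p_late")

# Local test before packaging/uploading through the platform's tooling:
print(DeliveryEtaModel().predict(df[["distance_km", "hour_of_day"]]))
```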

This model has been uploaded. Now you can see it in the UI, the same way you saw the other models that were all trained on the high-scale system. Then, either through the UI or through the API, you can deploy the model out to the exact same serving infrastructure to do real-time request-response scoring. I don't have an example here, but you can also deploy it out to a Spark job to do batch scoring. Again, this is an attempt to embrace the flexibility of Python and the tools the data scientists already like to use, and then provide the infrastructure and scaffolding to make it work at as much scale as those tools can support.

Architecturally, on the left side is the environment where you're working, whether it's Jupyter or any other Python environment. You basically train your model and save it locally using whatever techniques you want. You build your model.py file, which is the one that had the serving predict interface in it. Then you have your typical requirements and package files that tell the system which Python libraries and system libraries you need. Then there is basically a packaging and build step that builds a Docker container that includes your model and all the dependencies. That can be pushed out to our offline system on the top or online system on the bottom, for doing either batch predictions via a Spark job or online predictions via the request-response service we saw before.

Looking a little closer: on the left is the online serving of the high-scale models, which is the picture we saw before. For PyML models, you can see that we actually deploy a nested Docker container containing all of the Python resources. Our existing Prediction Service acts as a proxy or gateway and routes to a local Docker container that contains all the Python code. You get all of the same monitoring and networking support of our high-scale production container, but we can route the request to the nested Python service that's used for the actual scoring.

You can use scikit-learn, you can use deep learning, you can write custom algorithms; it's super flexible. The trade-offs are that you'll have slightly higher latency, it doesn't scale as cheaply because it's running Python, and if you're using scikit-learn, you can't train on giant data sets because it's not distributed. But in terms of developer friendliness and speed, it's great. I think the way people are approaching it is that they can use this to very quickly get out a model, say for one city, and then once they've proved that the model matters, it's an easier sell to go rebuild it on the high-scale system.

Horovod is our distributed deep learning system and it has, as we mentioned before, two interesting facets. One is that it scales more efficiently than the other distributed deep learning approaches, but also the API for it is much, much simpler and much easier to set up; it's much easier to go from a single-node training job to a distributed training job. In this case, we've pulled an example from, I think, the TensorFlow documentation for how to set up a distributed training job in TensorFlow using a parameter server, and you can see there's one little method in the middle there which is training the model, and everything else is setting up the distributed training environment, which is not stuff that a modeler should have to care about.

In the Horovod case, we're able to do better distributed training with a lot less work. We have the train method up there in the middle, and then around it are a few API calls to set up Horovod to do the training. There's an initialization and then a few other calls to set up the environment. So it's much, much easier and friendlier than some of the other approaches, and it's been quite popular in the community as well, for both of those reasons.
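
For a sense of how small that surface is, here is a minimal Horovod-with-Keras sketch following the pattern from Horovod's public documentation; the model and data are placeholders, not the example shown on the slide.

```python
# Minimal Horovod + tf.keras sketch, following the pattern from Horovod's
# public docs; the model and data here are placeholders.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                   # one-line initialization
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:                                     # pin each process to one GPU
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

x = np.random.rand(1024, 10).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                             tf.keras.layers.Dense(1)])
opt = hvd.DistributedOptimizer(              # wraps the optimizer with ring allreduce
    tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

model.fit(x, y, batch_size=64, epochs=2,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
# Launched across workers with e.g. `horovodrun -np 4 python train.py`.
```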

Manifold. One of the challenges with the visualization reports we showed before is that when you train models you tend to get a global accuracy metric: what's the AUC or mean squared error for the whole model across the whole dataset. That's a good starting point, but often different segments of the data will have very different characteristics, and the model will treat them very, very differently. We've seen cases where the model on the whole works great, but there's one slice of data where it behaves very, very poorly, and it may be a very important slice of data. With Manifold, we're building visualization tools that let you dive deeper into the data to understand how a model works on smaller segments of that data, and to help you identify segments that are anomalous, look different, or are problematic.

AutoTune. This is a system for, once you've figured out the features for your model, the often considerable work of figuring out the right combination of hyperparameters that gives you the best accuracy. For a tree model, you have the number of trees, the depth of the trees, and probably 6 or 10 different hyperparameters that you can tweak. It's impossible to know up front what the right combination is, so it's often a brute-force process of finding it. A common approach is a brute-force grid search, where you generate, essentially, a hypercube of all the different options and try every one. You can do a random search, where you do the same thing but then just try random points and find a pretty good one. Or you can use what's called black-box optimization, where you more efficiently and experimentally try different combinations, learn the shape of the space, and then traverse more directly to a more optimal set of parameters.
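
As a concrete contrast of the first two approaches, a grid search and a random search over a small boosted-tree hyperparameter space might look like the sketch below; a Bayesian black-box optimizer would replace the exhaustive or random sampling with a model of the search space. The parameter ranges and data are arbitrary.

```python
# Sketch: grid vs. random hyperparameter search for a boosted tree model.
# Parameter ranges and data are arbitrary; a Bayesian/black-box optimizer
# would replace these strategies with a learned model of the search space.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_space = {"n_estimators": [50, 100, 200],
               "max_depth": [2, 3, 5],
               "learning_rate": [0.01, 0.1, 0.3]}

# Brute force: try every combination in the hypercube
grid = GridSearchCV(GradientBoostingClassifier(), param_space, cv=3).fit(X, y)

# Random: sample a handful of points from the same space
rand = RandomizedSearchCV(GradientBoostingClassifier(), param_space,
                          n_iter=8, cv=3, random_state=0).fit(X, y)

print("grid best:", grid.best_params_, grid.best_score_)
print("random best:", rand.best_params_, rand.best_score_)
```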

We collaborated with the research team at Uber to build this, and as you can see on the right, the line at the bottom is the Bayesian black-box optimized hyperparameter search, which gets to a better model in many fewer iterations than a grid search does, because it learns as it goes and can search much more efficiently. Once again, this is getting into AutoML: how do you help developers build models faster in a more automated fashion, and deploy the human intelligence where it's really needed, not where it can be automated away?

The final one is around, once you have the model that you like and you deploy it, how do you make sure the model's behaving correctly in production. We talked about being able to join predictions back to outcomes to know your model's behaving well. There are a couple of problems with that. One is that we run that as a batch job, so there is a delay. In the case of credit card fraud, it can take 90 days for the bank to send you the outcomes, so you often can't join to the outcomes as quickly as you'd want to. The other approach is less accurate in a sense, less precise, but it's much, much quicker: just looking at the distributions of the feature data, the input data, as well as the predictions coming out, over time.

For most models, there should be a pretty regular distribution of features and predictions over time. Maybe there's a seasonality to it as the day and the week flow by, but it's often easy to identify big anomalies, which often cause problems. Usually they're caused by bad data coming in, either from a broken service or a broken pipeline. In this case, there's a classification model. At the top, we're just looking at the distribution of true versus false, and you can see the distribution is pretty steady over time. Then we have a few other slices looking at the actual class probability over time, and then a few histograms cast as time series, so you can see how the bucketing of data works over time. Here, things look okay.
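
A minimal sketch of that kind of distribution check, comparing a recent window of a feature (or of the predictions) against a reference window with a two-sample Kolmogorov-Smirnov test; the threshold, window sizes, and data are arbitrary choices for illustration, not Uber's actual method.

```python
# Sketch: flag a shift in the distribution of a feature or of the predictions
# by comparing a recent window against a reference window. Threshold and
# window sizes are arbitrary.
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, recent: np.ndarray, p_threshold=0.01) -> bool:
    """Two-sample KS test; a tiny p-value suggests the recent values no
    longer look like the reference distribution."""
    _, p_value = ks_2samp(reference, recent)
    return p_value < p_threshold

rng = np.random.default_rng(0)
last_week = rng.normal(10, 2, size=5000)       # e.g. last week's feature values
today_ok = rng.normal(10, 2, size=500)
today_broken = rng.normal(0, 2, size=500)      # e.g. a broken upstream pipeline

print(drifted(last_week, today_ok))      # False: distribution looks the same
print(drifted(last_week, today_broken))  # True: raise an alert
```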

In this case, we're looking at the top at the prediction result, but then also at the distribution of features in the model. You can see at the top that the predictions themselves deviated from what looks more normal, and at the bottom, you can see there's actually a feature that had some bad data that was triggering the abnormal predictions. Again, as with software engineering, when you're running a production system, having all the monitoring set up automatically for you is super important. For ML, that matters at the system level, but it also matters at the data and the model level. You want to make sure not only that the system isn't erroring out, but that the predictions and data are correct and that there aren't breakages in data pipelines, which is a common failure mode for ML models.

Key Lessons Learned

Key lessons learned recently. One I mentioned before, around ML productivity, is to bring the tools to developers. We focused very early on high scale, and that was the right first choice for Uber. But now, as we focus on velocity, we're bringing the tools closer to developers, even compromising on scale to make things easier and faster, and then providing a path to scale up as the problem is more deeply understood. Data is generally the hardest part of ML, so having really good infrastructure, tooling, and automation around data management lets the modelers focus on the modeling problem and not on the plumbing. On the systems side, we've leveraged a lot of open source, and it's taken a lot longer in many cases to make things actually work well at scale. Nothing's free.

The last one is that real-time ML is quite challenging, and hard to get right and hard to empower modelers to own the end-to-end, and we're investing a lot at Uber to make those systems run themselves so developers can focus on the modeling work and not worry about the systems themselves.

 


 

Recorded at: Mar 23, 2019
