Transcript
Tingley: I'm Michael Tingley. I'm the Engineering Manager with Facebook's probabilistic programming languages team. I'm super excited to introduce you to the topic. I'll assume that for most of you, this is your first experience with it. My goal for today's talk is really to give you a sense of what probabilistic programming is and why you should care. I don't necessarily want you to come away with all of the nitty-gritty details, but I do have one big message. Looking toward the future, we've spent a ton of time, research, and effort in the space of neural networks, and they've really taken the world by storm at this point. I want to take a step back here and look at some other technologies and other ways of thinking that maybe we should spend some time reflecting on. My hope for today's talk is that you'll come away with the impression that, "If I want to do more precise analysis, where I want finer control over things, maybe probabilistic programming is the tool I should reach for."
Big Data
We will start today with big data. You've got a lot of big data and you want to make sense of it. What do you do with it? Take it. Put it in the funnel. What's at the bottom of the funnel? It's a neural network. It's always a neural network today. Hit the button, stir it. My legal team made me take out the necessary XKCD comic here. You know the one I'm talking about; it goes right here. Done. You've got your results. You're happy. What if you don't have big data? What if you try to put it in this funnel and what comes out the other side doesn't plug into your network, or maybe your network isn't able to capture it? You can't stir that network, in the proverbial sense. What if what comes out the other side explodes?
I will say, without going into too much detail, we just last week had a big problem of this form at Facebook where we deployed a very important network. Somebody shipped a new change, passed all of the tests, everything looked beautiful. We launched to production, and then fire. The shit hit the fan. These problems are not trivial at all. We have to address them. They're a big deal. My hopes today are to convince you that probabilistic programming can help.
We'll go through a few things here. I'll try to keep it as high level as I can to make it accessible to everybody, though I do want to get a little bit into the nitty-gritty. We're going to start out by talking about Bayesian thinking: this sense of distributions and uncertainty. I'll walk you through what I mean there. Then we'll get into how probabilistic programming is the tool for putting that into reality, putting that into production. Last off, we'll talk about some of the ways that we're using this at Facebook, to give you a sense of what this could do for your respective companies.
Unlimited Data? Use a Neural Network
If you have unlimited data, use a neural network. I'm not some crazy, diehard probabilistic programming guy. If it already works for you, fantastic. If it's doing what you need, you don't need me; I don't know why you're sitting here. Really, I love this quote. This is Zoubin Ghahramani, one of the chief scientists at Uber. The idea is that big datasets are actually big collections of small datasets. Coming in, I saw a number of the Netflix talks today about different recommender systems and ways to think about microservices. This is exactly the essence of what we're getting at here: when we're talking about personalization and recommendation, you don't want to be cookie cutter, "I'll just apply the same formula to everything." You really want models that are personalized for individuals, for users. That's the essence here. Once you've got that mentality, it's super easy to see how that big dataset is really customized for each user, or maybe for each population that you're dealing with. Then the most important thing is that when you're dealing with these small datasets, you don't have that power, you don't have that crutch of data to lean on. You need a new way of thinking.
Bayesian Thinking
Let's talk about Bayesian thinking. I don't want to get too much into what Bayesian methods are. My goal here is not to impress you with cool models and stuff, but rather to motivate you. Let's talk about the Hello World of Bayesian models: the coin flipping example. If you've taken a stats course, you've seen this. You've got a coin. You don't know its heads-rate. You want to learn that through observations and data. Importantly, you don't want to just say, "I observed 50 heads and 50 tails, so it must be one half." That would tell you the same thing as if you had observed a million heads and a million tails. I'm not making this up: one of my friends showed me this article where they actually designed a physical arm that flipped coins in a vacuum, and they showed with very high precision that the U.S. quarter actually has a 51% tails-rate. If you don't get anything else out of today's talk, that's a really cool little tidbit.
You see these flips. You've got some prior belief. That's the setup here. If you're a stats nerd like me, maybe you know how to talk about this in terms of distributions. You've got a beta distribution to describe your prior belief here. You've got this binomial distribution, which I'll use to describe the data that you've observed after you see a lot of coin flips. Importantly, the key is that the output here is what you want to have at the end of the day. It's not a number. It's not a specific, "This is the heads-rate." It's a distribution. This is my confidence.
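To make this concrete, here is a minimal sketch of the coin example (not from the talk, with made-up numbers): a Beta prior combined with binomial coin-flip data gives a Beta posterior in closed form, and what you get back is a whole distribution over the heads-rate rather than a single number.

```python
from scipy import stats

# Prior belief about the heads-rate: Beta(2, 2), gently centered on 0.5 (illustrative choice).
prior = stats.beta(2, 2)

# Suppose we observe 56 heads out of 100 flips; conjugacy gives the posterior in closed form.
heads, tails = 56, 44
posterior = stats.beta(2 + heads, 2 + tails)

print(posterior.mean())          # ~0.56, a point summary if you need one
print(posterior.interval(0.95))  # but the real output is the whole distribution
```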
This process of taking your data, having some hypothesis about how it relates to each other, and getting useful insights out of that is called inference. We'll talk about Bayesian inference a lot. This is Bayes' theorem; maybe you've seen it before. For the purposes of this talk, it's really important to realize that you're basically taking some prior belief and combining it with some model. That's the likelihood term here; your model is a lot crammed into a tiny little bit of notation. What you want to get out of all of this is a distribution, not a point estimate. It's this probability. Why do we use H here? I like thinking about it as a hypothesis space. If you have a model, you have lots of different ways, potentially an infinite number of ways, that you could explain the world. Before you observe any data, you have some notion in your head of what this hypothesis space looks like. Then you mix in the data, and what you're doing is constraining your hypothesis space. If I'm going to be a little facetious about it, this is all we're doing: we're taking this infinite hypothesis space and crushing it down into a smaller infinite hypothesis space. We've done this in a way where you can now think in terms of the uncertainties that underlie particular events, or particular ways of explaining the world. This is in very abstract terms, but you can probably imagine how this maps to any regression or classification problem you might have encountered in your day-to-day experience. Then this is super important: what we get out is a distribution, because that's what helps us make decisions.
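For reference, the theorem on the slide, written in the hypothesis-space notation used here: P(H) is the prior over hypotheses, P(D | H) is the likelihood (the model), P(D) is a normalizer, and the posterior P(H | D) is the distribution we want out.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
```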
Honestly, who cares? Nobody wants to see slides full of math. This is not a math talk. This is QCon, a bunch of practitioners. We're here to build products, me too. I feel really strongly about this, which is why I want to show you this math. I don't care about the theorem itself. What I care about is what it can do for me and what it can do for my team. It gives you three things. It probably gives you a lot more, but there are three that I really care about, and we'll go through them.
Uncertainty-quantified Predictions
Let's start with uncertainty-quantified predictions. This is probably the biggest, most glamorous thing that probabilistic programming can offer you. Honestly, this is the reason Facebook invested in the space initially. The idea here is that we have huge amounts of data, terabytes at minimum on pretty much any vertical that you look at. When we throw all of this into a model, much of what we've tossed in we actually end up not using. There is a lot of uncertainty in our datasets, and there are different kinds of uncertainty: uncertainty in the data itself, which, if you're familiar with the term, is called aleatoric uncertainty, and uncertainty in the model that you've written, called epistemic uncertainty. If you're not familiar with these terms, it doesn't matter. The point is that you really want what you get out to deal with this. You want to see confidence intervals, and you want to understand when you don't understand. Your systems right now, probably like the ones we've been using at Facebook, don't know this. You just plug them in, and they give you back an answer. This is the crux of so many problems that we're encountering. Instead, they really should be able to tell you a spectrum of answers. We have the data to make that happen. We just need the tools.
Moving on, this part here is about priors. When I've been working with different teams at Facebook, there's this common pattern where people deploy a system and it works great for their big dataset. Then they'll come across a particular day, or a particular vertical, or somebody will slice the data in such a way that it flops. It no longer predicts what you expect it to. It triggers monitoring alarms, all of these things. People usually solve this by patching on an IF statement: if my dataset size is less than 100, or my prediction is 0 or 1, in some sense, then fall back or do something like that. That's fine, but I'm a very principled guy. I want my system to do that for me. This is what priors give you. They let you gracefully degrade from a very precise insight based on lots of data down to one that is a safe fallback, and they naturally sway between one and the other. It doesn't super matter what the plot here is. This is just something we worked on at Facebook, and you can see that we have the large-data case on the right, but we still don't have enough data to get to the goal. We can use priors in such a way as to stabilize these signals. You never want a prior that overwhelms your data, but rather one that makes sure your system does the right thing, even in the edge cases.
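As a hedged illustration of that graceful degradation, reusing the coin example with made-up numbers: a prior worth 100 imaginary 50/50 flips barely budges for a handful of observations, but is overwhelmed once real data piles up, with no IF statement in sight.

```python
def posterior_heads_rate(heads, tails, prior_heads=50, prior_tails=50):
    # Posterior mean under a Beta(prior_heads, prior_tails) prior; the prior acts like
    # pseudo-observations that real data gradually outweighs.
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)

print(posterior_heads_rate(3, 0))        # ~0.51: three heads in a row barely move the estimate
print(posterior_heads_rate(3000, 1000))  # ~0.74: with plenty of data, the prior stops mattering
```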
Explainable Models
Then lastly, the thing that I'm personally most excited about is probabilistic programming offers to you explainable models. This is the biggest thing. I don't know if you guys have done any ML engineering. Probably, the least fun part of my job as an ML engineer is really taking out the wrench and the hammer and going at the neural network, and, "Will this fix it?" "I don't know." In probabilistic programming, you have a model. It doesn't matter what this is. The idea is you're writing something basically like this, but in code. You can see physically in a very visible way how your model fits together.
The goal here is that once you have this, you can make the right intuitive or principled steps to improve your model. Then not only that, not only growing your model, you can also query it in a much more nuanced way, where your model is no longer just painting a picture about this single thing that you asked for but all of the intermediate components that went into producing that answer. The specific charts don't matter. The point is that, I can say, if I got a particular output. If I made a particular prediction, why did I make that prediction? What are the uncertainties that are going into it? What are the hypotheses that can explain that? We're done. I've written you some math. I told you why you should care.
Neural Network and Bayesian Model
Let's talk a little bit about what it takes to get to probabilistic programming, and in particular, Bayesian thinking. In today's world, we have neural networks. This is a very facetious reduction of what a neural network is: it's the network itself, it's big data, and it's the hardware that you use to execute that model. It's also got a bunch of other things, like distributed systems problems, and backpropagation, and lots of architecture and support. You can imagine, in a vague sense, this is what you need in order to get the results you're looking for. What about a Bayesian model? We talked a little bit about what a Bayesian model might be. The input here is data. For now, we need a grad student to derive this model, to take this math and massage it into something that a system can run. You can imagine the next slide here: we can't scale this. You can't hire a whole swath of grad students and labs. Universities probably can, but Facebook is not willing to do that. We need to find another way to solve this problem.
We've got a model. This is a very basic one. The goal here is to bridge the gap between somebody who's a well-versed statistician or mathematician and the people we actually want to reach: ML engineers, data scientists, software engineers, people who think in code and think in data. This is a very basic probabilistic program, and I want to give you a sense of what's going on here. We can do this. There are three components that come together here; let's actually start from the bottom. I mentioned that the outputs we want are distributions: curves, stuff that looks like this. Then you have a bunch of inputs. The inputs are your data, and also some notion of prior beliefs. This is domain specific, but the general advice is that you shouldn't have very strong priors; they're there to help you smoothly degrade from one world to the other. If you haven't observed any coin flips, maybe you have some belief that it's 50/50, and you can calibrate your prior distribution to look like that. It scales up as well.
Then the most important part here is the probabilistic program itself. All you really need to see is that there are just two concepts that go into probabilistic programming. There's a way of teasing out uncertainty from your program; that's this sample statement here. Then there's another statement that's involved in constraining your hypothesis space. Hopefully, this doesn't sound new; I've been talking about these things this whole time: uncertainty and hypothesis spaces. Once you have these, this is the syntax that you should be thinking in. This is the way you should be thinking. We call this a generative model; if you're super versed in stats, it's not exactly a generative model. Basically, the idea is that I have a process in my mind for how the world works, and all I'm trying to do is write that down in code. Then, if we've done our jobs as a platform team, what we can do behind the scenes is take this thing and make it run for you.
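The language used at Facebook isn't shown in the talk, but here is a hedged sketch of those same two concepts in PyMC3, an open-source PPL: declaring a random variable plays the role of the sample statement, and the observed argument is the statement that constrains the hypothesis space.

```python
import pymc3 as pm

with pm.Model() as coin_model:
    # "Sample": introduce an uncertain quantity drawn from a prior belief.
    heads_rate = pm.Beta("heads_rate", alpha=2.0, beta=2.0)

    # "Observe": constrain the hypothesis space with the data we actually saw.
    pm.Binomial("flips", n=100, p=heads_rate, observed=56)

    # The platform handles turning model plus data into a posterior distribution.
    trace = pm.sample(2000, tune=1000)
```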
In practice, these models have tens of thousands of variables. They need tensorized, compute-oriented architectures. They are very complex and have lots of external dependencies, all of these things. Once you have these components, there are tons of models. The background of this slide shows the many different models that you could express, and these are all just the classical ones. In practice, virtually every model that we've built is very bespoke and addresses a pointed problem that the company is facing.
If we want to remove the grad student, what do we need to do here? We need to take this process of turning data plus model into insights, and we need to automate it. Make it so that all you need to do as a user is write that front-end code, write that model. What you get out is completely handled by the system for you. I want to walk through what this looks like in the next few slides. If a probabilistic program had feelings, I want those feelings to flow into you here.
Prior and Target Distribution Sampling
We have this prior distribution, something like your prior beliefs, and you have this target distribution. The target is not something that you can examine directly. The game that we play is that I give you one operation, which lets you ask about one particular point on this graph. You could say, what is the value at this point? You can't ask for the entire curve, but you can ask for specific points with this operation. You can also look at the whole green curve if you want. You want to figure out what that red curve should be. I'm simplifying things a little bit. There should probably be another curve here, which represents the probability implied just by your data. To simplify things, we're going to say that there's a starting curve, which is your prior beliefs, and the target, and we want to move from one to the other.
We need some strategies. What's our first strategy? What if we just sample from the green curve? Obviously, this isn't going to work; you'll just stay where the green curve is. Instead, one of the games that we can play is: step one, sample from the green curve. Then step two, ask, how likely is that point under the red curve? For more likely points, we will tend to accept them and add them to our list of points that represent the red curve. For points that have a low probability, we'll tend to reject them. This is all probabilistic. You can see in this little animation that first we sample from the green curve randomly, then we figure out the probability under the red curve, and we accept with that particular acceptance rate. If you do this enough times, you will get something that looks a little bit like this. You can use your imagination: if we'd run it a bunch more, you'd get a lot of green points around the center here, fewer green points as you move out, and eventually zero green points at the extremities of the target. That's exactly what we're looking for: a discrete approximation to this red curve. It's very inefficient, because the further apart your green and red curves are, the worse this process becomes, and it's super inflexible. Can we do better? I wouldn't be here if the answer were no.
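Here is a hedged sketch of that first strategy: draw from the green (prior) curve, then keep each draw with a probability proportional to its density under the red (target) curve. The specific curves and the bound M are assumptions for illustration.

```python
import numpy as np
from scipy import stats

prior = stats.norm(0.0, 3.0)    # the green curve: the one we know how to sample from
target = stats.norm(2.0, 1.0)   # the red curve: we can only evaluate its density pointwise

M = 6.0  # assumed bound on target.pdf(x) / prior.pdf(x) over the region of interest

samples = []
while len(samples) < 5000:
    x = prior.rvs()                                 # step 1: draw from the green curve
    accept_prob = target.pdf(x) / (M * prior.pdf(x))
    if np.random.rand() < accept_prob:              # step 2: keep it in proportion to the red density
        samples.append(x)
```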
The first improvement that we'll make is that, instead of directly sampling from the green curve every time, let's find a starting place based on sampling from that green curve. That's this green circle here. Then, what about the next point that we want to sample? Instead of just going back to the green curve, let's sample around that green dot, the accepted point. If you want to think about this as the proposal distribution, you can imagine it's a pretty simple distribution that we know how to sample from. Then we can draw our next point based on our current place. Using locality like this helps you tend to find better points. We'll see in the animation for this one that it doesn't accept every point that it guesses. Over time, it tends to guess points that are close to the red curve, and then it will accept them. Then, because you accepted a point, you've just moved closer to the red curve. You repeat this process enough times, and you can see that it will tend to accept a lot of points around the central mass of your target distribution. The actual math of what we're doing, we could talk about it, but it's not super interesting; it's really interesting to people like me. The impression to leave with is that once you find better points, you tend to stay around those points and estimate this curve.
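A hedged sketch of that improvement, a random-walk Metropolis step: the proposal is centered on the current accepted point, and moves are accepted probabilistically so the chain still represents the whole target. The step size and the example target are illustrative assumptions.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_steps=10_000, step_size=0.5):
    # Propose near the current point; always accept better points, and accept worse
    # points only sometimes, so the samples trace out the full target distribution.
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * np.random.randn()
        if np.log(np.random.rand()) < log_target(proposal) - log_target(x):
            x = proposal                     # accepted: stay local around the better point
        samples.append(x)
    return np.array(samples)

# Example: a standard normal target, passed in as its log density.
samples = random_walk_metropolis(lambda x: -0.5 * x**2, x0=-5.0)
```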
This is great. You've sped things up again, and it starts to look like what you're going for, eventually. The problem here is that if you start really far away from the red curve, this will take quite a long time. Also, really importantly, there's a lot of information that we're leaving on the floor. If you're familiar with how neural networks work, you might have heard of backpropagation. We can use some of those same tools in probabilistic programming to help us find the right places.
Using Gradient Information
Instead of doing what we just showed, let's use some gradient information. You can imagine we're currently at this green point here, and we want to look at the derivative of our target distribution. I told you before that you could query a particular point on that red curve. Maybe I lied: you can query that point, but you can also ask about its derivative information. From this, we can get a sense of which direction we should be going, rather than just moving randomly. Let's adjust our proposal so that it accounts for some of that information. Instead of centering our proposal around our current point, let's center our proposal around where the gradient is pointing us. You'll notice that throughout all of this, it's never, do this single deterministic operation. It's never, climb this hill and move to this new point. It's always, try this thing, and accept the new point with some probability if it's better. This is one of the key differences that allows us to capture the full distributional information, instead of just pointwise pieces of information about your target.
Actually, with gradient information, you can start to see that this lets you get into higher dimensions. If we had tried to do this with the previous algorithm, it would take quite a while to explore the space. What we're seeing here is the point that you're located at, which is where all of those arrows originate from. Then you've got this blue arrow, which represents the gradient. Then you've got the proposal distribution, represented by those concentric gray circles: the contours of the proposal distribution in two-dimensional space. Whenever you see a green arrow, it has decided to accept the point, and you'll tend to see that these are closer to the mass of the target distribution, which is this blue curve now. When you see the red ones, it has decided to reject.
We run this a bunch of times and you get all of these points. It's looking pretty good. You'll see that the points themselves tend to follow the curve that you're looking to build up. You'll also notice that we could do a lot better here. I think this is after 10,000 samples or so. You're getting a good shape for the joint distribution up here, but the marginals still leave a lot to be desired. Can we do even better? This, by the way, is called the MALA algorithm, the Metropolis-adjusted Langevin algorithm.
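Here is a hedged sketch of that gradient-informed step, MALA: the proposal is centered one gradient step away from the current point, and the accept/reject ratio includes the usual correction for the asymmetric proposal so the target distribution is preserved. The step size and targets are assumptions.

```python
import numpy as np

def mala(log_target, grad_log_target, x0, n_steps=10_000, eps=0.1):
    def log_q(x_to, x_from):
        # Log density (up to a constant) of proposing x_to from x_from:
        # a Gaussian centered half a gradient step away from x_from.
        mean = x_from + 0.5 * eps**2 * grad_log_target(x_from)
        return -np.sum((x_to - mean) ** 2) / (2 * eps**2)

    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = []
    for _ in range(n_steps):
        mean = x + 0.5 * eps**2 * grad_log_target(x)
        prop = mean + eps * np.random.randn(*x.shape)   # gradient-informed proposal
        log_alpha = (log_target(prop) + log_q(x, prop)) - (log_target(x) + log_q(prop, x))
        if np.log(np.random.rand()) < log_alpha:        # still a probabilistic accept, never a pure hill climb
            x = prop
        samples.append(x.copy())
    return np.array(samples)
```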
Let's talk about another approach that we can use here. Previously, one of the problems we encountered is that we basically took one step along the gradient and used that to propose a new point. This is good, but it means that once we've found a bad spot, it's really tricky for us to get out of it. Instead, what you can do is imagine taking multiple steps along this gradient. This is super useful for helping get out of those areas that you get stuck in. You can't just take multiple steps along the gradient and stop there; you need to add in some stochasticity that helps you preserve the whole distribution. You can imagine this thing as a hockey puck that tends to flow uphill for some reason, and you give it a kick randomly in one direction. If you do this, and you get the math right, and you follow several steps of the gradient, then you can decide whether to accept or reject your point. It turns out you will end up approximating your target distribution. This is called Hamiltonian Monte Carlo. We're getting to the verge of pretty complex inference at this point.
If we, again, visualize it here, you've got your starting point. Then you give it a random kick, that's this gray arrow. Then it follows along the path. It's following the gradient of your target distribution. That's the 20 steps that you're seeing in the little black curve here. Then lastly, after 20 steps, we decide to stop it, and say, should we accept or reject this point? We will decide to accept or reject that point based on how probable the particular point is in the target distribution. Green and red arrows indicate acceptance and rejection.
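And a hedged sketch of Hamiltonian Monte Carlo itself: a random momentum kick, a number of leapfrog steps that follow the gradient, then a single accept/reject on the endpoint based on the change in total energy. The step size and the 20 leapfrog steps mirror the animation but are otherwise arbitrary.

```python
import numpy as np

def hmc(log_target, grad_log_target, x0, n_samples=10_000, eps=0.05, n_leapfrog=20):
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = []
    for _ in range(n_samples):
        p = np.random.randn(*x.shape)                   # the random kick
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * eps * grad_log_target(x_new)     # leapfrog integration: follow the gradient
        for _ in range(n_leapfrog - 1):
            x_new += eps * p_new
            p_new += eps * grad_log_target(x_new)
        x_new += eps * p_new
        p_new += 0.5 * eps * grad_log_target(x_new)
        # Accept or reject the endpoint based on the change in total "energy".
        h_current = -log_target(x) + 0.5 * np.sum(p ** 2)
        h_proposed = -log_target(x_new) + 0.5 * np.sum(p_new ** 2)
        if np.log(np.random.rand()) < h_current - h_proposed:
            x = x_new
        samples.append(x.copy())
    return np.array(samples)
```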
Run this for the same number of steps as before. We see that we're actually getting a whole lot better in terms of our marginals. Our distribution itself also looks really nice. We've done a lot better here. We've explored the space a lot more accurately. Just to put them side by side, you can get a real sense of the differences here. This is one of the bigger algorithms in the space. There's a bunch of other exciting ones as well. I don't want to get too much into it. It really gets bogged down here.
Our goal here is to remove this statistician in two steps. First off, we want to make modeling accessible to ordinary software engineers: give you a language where you can write code you're familiar with and describe processes that you know a lot about, but in a probabilistic way. I think of this personally as giving you a wildcard, or a blank, that you can plug into your program, and it will fill that in for you based on the data that you've provided. Second off, once you've given engineers that language to write in, can we automate the heavy lifting of turning your data and your model into insights for you? We've walked through some cool inference algorithms here, very useful in the space, and there are a lot of others to explore as well. They do require differentiable models. I would be happy to share some of the exciting work that we're doing to relax some of these restrictions, but I don't want to get too deep into the details for this talk.
Applications - Measuring Ads Lift
For this part, I'd love to spend the rest of the time talking about a few of the different applications that we've been exploring with probabilistic programming at Facebook. We'll start off super basic. In the ad space, there's this concept called ads lift, which is basically trying to answer questions of the form: how effective is this ad compared to other ads, or no ad, or competitors, these things? This is really influential for our advertisers to be able to decide what's working well and what's not. Importantly, they need very precise information. If we give them just a best estimate that we can't offer any certainty for, then the advertiser is not going to know what to do: whether they should trust the data, or whether they should run their experiments for a lot longer. Uncertainty is a really key metric here for them to be able to make decisions.
Starting off, I'll describe the process that was used previously at Facebook; it's actually surprisingly interesting. Basically, the way this worked prior to our involvement is that we would generate data by running a particular ad campaign for a while. This gives us information about revenue, conversions, sales, profit, all of these metrics that are directly important to the advertiser. Then we would want to generate uncertainty estimates for these properties. What would happen is that we would bootstrap them, which means that we would take lots of different subsets and recompute average revenue, or average profit, or average conversions. Then we would assume that all of those bootstrapped values represented the full corpus, the full body of uncertainty in the dataset. The team, in order to make sure there was enough data, would also have to do some banding, some discretization of these bootstraps. It introduced a lot of limitations on the insights that the advertiser could draw, and it added a lot of lossiness to the data. We wanted to see if we could address this problem with probabilistic programming.
Just as a very basic starter model here, it's super easy. Basically, what am I thinking in my head? We have some probability of showing an ad to the user. Then, conditional on us deciding to show it to them, we want to model whether or not they converted. Then, conditional on them converting, we want to model the sales, the actual spend that the user made from that ad. Then, conditional on that, we want to model what the profit looks like for the advertiser. The nice thing about having a probabilistic model that relates all of these properties is that, in addition to getting direct distributional answers for these important metrics, you ensure that the metrics are consistent. Previously, we had separate models to predict revenue, separate models to predict profit and conversions, all of these metrics of interest. They tended to work well, but there were lots of cases where they produced different answers just because of the process being used.
Because probabilistic programming is all about simulating hypothesis spaces, we'll never give you an inconsistent answer. We might give you an answer that has a low probability, but it's certainly a valid answer for explaining how you arrived at a particular configuration of profit and revenue. If we do this well enough, we will tend to give you the right set of hypotheses to explain the data you've observed. At the end of the day, the advertiser has full distributional information over these quantities of interest.
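As a hedged sketch of what such a chained model could look like (this is not Facebook's actual model; the distribution choices, priors, and synthetic data below are assumptions), conversion and spend are tied to the same latent variables, so a derived metric like revenue per user is automatically consistent with them.

```python
import numpy as np
import pymc3 as pm

# Synthetic stand-in data; in reality this would come from the ad campaign logs.
rng = np.random.default_rng(0)
converted = rng.binomial(1, 0.1, size=5000)                   # did each exposed user convert?
spend = np.exp(rng.normal(3.0, 0.8, size=converted.sum()))    # spend for converters only

with pm.Model() as lift_model:
    p_convert = pm.Beta("p_convert", alpha=1.0, beta=1.0)     # conversion rate given ad exposure
    pm.Bernoulli("converted", p=p_convert, observed=converted)

    # Spend for converters; the LogNormal is my assumption, not Facebook's choice.
    mu = pm.Normal("mu", mu=0.0, sigma=5.0)
    sigma = pm.HalfNormal("sigma", sigma=2.0)
    pm.Lognormal("spend", mu=mu, sigma=sigma, observed=spend)

    # Revenue per exposed user is a deterministic function of the same latents,
    # so its posterior can never contradict the conversion-rate posterior.
    pm.Deterministic("revenue_per_user", p_convert * pm.math.exp(mu + 0.5 * sigma**2))

    trace = pm.sample(2000, tune=1000)
```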
Fake Account Prevalence (Labelers)
I want to keep moving on and talk about another interesting model. This is one that I'm pretty passionate about. At Facebook, we care a lot about this space called integrity. Basically, the idea behind integrity is that we want to diminish the influence of malicious actors on the site. Some other companies maybe call it abuse, or whatever. For us, one of the really important problems is answering: where are the fake actors? How prevalent are they across the company? This is not an easy problem. It's not like each account has some bit that says whether it's fake or not. We actually need human labelers to come in and say, this account looks fake to me. In this sense, there is really no ground truth dataset for us to work with. It's all based on human opinion. Facebook has this really awesome team of tens of thousands of human labelers that can help out in making these decisions for the company. Since this is an opinion, we need several human labelers in order to come to a full conclusion. For a particular account, we'll have lots of people evaluate it and answer, "Does this account look fake to you or not?" There's lots of guidance and there are metrics, but it's still a stochastic process at the end of the day.
In this particular instance, let's say we've got an account. The true label, which you actually can't observe in practice, is that the account is fake. You've sent this off to three labelers. Two of them say that the account is fake, and the third one says it's not fake. As a result, the existing solution would conclude that, overall, this is a fake account, which is great. We're happy; we have correctly labeled the account. Of course, you can imagine the problem with this: your labelers may not all agree. They clearly don't agree here, and sometimes the majority will get it wrong. You'll have a couple of labelers that say the account is fake, and a third labeler that says the account is not fake, and it's actually that third labeler who got it right. The other two got it wrong.
You might say, "Michael, what can we do here? All we have to work with are the opinions of humans. There's nothing we can do better." If this were all the data we had, I would totally agree with you. The thing is, these labelers are working over thousands of accounts per day, many days in a row. There's a lot of information that we can tease out of this. In particular, we can see how correlated the different labelers are. You can imagine that each of these labelers is making lots of votes of this form, where three or more labelers are all voting on content. You can think of it as crowdsourcing, in a sense. We want to understand how often these labelers are making the same decisions, and try to identify the discrepancies.
For instance, just to motivate this, imagine that we have three labelers, and one of the labelers is always, consistently agreeing with all the other labelers that they're working with. Then for some reason, for one particular piece of content, you see that they disagree. One hypothesis could be that this labeler got it wrong. Another one could be: this labeler almost always agrees with the others, so maybe in this particular case we have a situation where it's the other two that are wrong. It's really hard to write a system that does this in a deterministic way, because there's a lot of hypothesis-space manipulation, a lot of uncertainty flowing through your system. This is exactly where probabilistic programs shine. I want to motivate the system that's going on here. What we do in the model is assume that each labeler has some confusion rates: the rate at which he or she confuses fake accounts for not fake, and a separate rate at which he or she confuses not-fake accounts for fake.
Information in One Central Place
Once you have this assumption, you can write a pretty straightforward model that combines all of this information in one central place. Basically, all you need to do is write a generative process that says: a labeler sees a particular account, and we assume a true, ground-truth label for that account. We don't ever need to observe it, but we assume it exists. Then, based on that true label and the particular labeler's confusion rates, we model what they might have voted. For one individual piece of content, this is not very useful to us; we don't actually get much information. But when you do this across millions of different accounts, you get a lot of information. You can see when particular labelers tend to agree and when they tend to disagree. You can use this information to determine how confident we are that a true label is fake or not fake, based on the votes that were cast on it. We have a little time chart here; I can't, unfortunately, show you the real numbers for all of this. What we're able to do is put confidence bounds on the prevalence rate for these pieces of content.
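Here is a hedged sketch of that kind of model in PyMC3 (a Dawid-Skene-style crowdsourcing model; the priors, shapes, and synthetic votes are assumptions, not Facebook's actual setup): a latent true label per account, per-labeler confusion rates, and the observed votes tying them together.

```python
import numpy as np
import pymc3 as pm

# votes[i, j] = 1 if labeler j marked account i as fake (synthetic stand-in data).
n_accounts, n_labelers = 200, 3
rng = np.random.default_rng(1)
votes = rng.binomial(1, 0.5, size=(n_accounts, n_labelers))

with pm.Model() as labeler_model:
    prevalence = pm.Beta("prevalence", alpha=1.0, beta=1.0)                # overall fake-account rate
    true_fake = pm.Bernoulli("true_fake", p=prevalence, shape=n_accounts)  # latent ground truth, never observed

    # Per-labeler confusion rates: P(vote fake | not fake) and P(vote not fake | fake).
    fp_rate = pm.Beta("fp_rate", alpha=1.0, beta=10.0, shape=n_labelers)
    fn_rate = pm.Beta("fn_rate", alpha=1.0, beta=10.0, shape=n_labelers)

    # Probability that labeler j votes "fake" on account i, given the latent truth.
    p_vote_fake = (true_fake[:, None] * (1 - fn_rate[None, :])
                   + (1 - true_fake[:, None]) * fp_rate[None, :])
    pm.Bernoulli("votes", p=p_vote_fake, observed=votes)

    trace = pm.sample(2000, tune=1000)
```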
To me, this is pretty magical. We took something that didn't have any ground truth information, and we wrote a model that really fits on one slide. It's just saying: sample a ground truth label, sample a confusion rate, and from those, draw what you think the labeler would have voted for a particular item. Do this for all of your labelers and all of your items, and there you go. This is all you, as a user, need to write. The rest of it is pushed into the framework, and it will automatically produce these distributions for you. It actually goes one step further, even. Instead of just giving us these prevalence rates, it also gives us the error rates for individual labelers. We can ask: which of our labelers do we think are not as trustworthy as we'd like? The people who tend to get things wrong. We can use this information to go and issue additional training for individuals. We can also, on the other hand, use this for particular items and say, "This particular item looks particularly difficult to classify. Maybe we should add more community guidelines or community standards around it." These all turn into business decisions that we're able to make, just by having one model that's consistent throughout.
Network Outage Detection
I want to go into one last problem here that I think is pretty illuminating. This is the network outage problem that we're working on. Basically, the setup is that Facebook has tons of data centers and different pieces of hardware that make up its network internally. We're sending terabytes of data every hour between different endpoints, and we want to understand when information is getting dropped. You can imagine this is a very small reduction of the situation that's going on. Each piece of hardware that we own can suffer an outage. For a particular piece of hardware, the outage rate is extremely low; it's far less than a one-in-a-million likelihood that a piece of hardware is currently in an outage state. But when you scale this up to all of the pieces of physical infrastructure that we own, it becomes a serious problem. You can get outages on the data centers, maybe on the hub. You can get outages on all of the links that go between all these things. There's a whole lot of stuff I haven't shown here as well. This is the highest-level view you can imagine; you can zoom all the way in, past the servers, into the actual physical components making them up. We want to understand where these outages are happening, for obvious reasons.
Let's imagine a situation where there's an outage on this link. Maybe it's dropping packets with some probability, maybe 30% here. The existing way that we would go about understanding what's happening is that we send a lot of packets between known endpoints. We'll send packets from A to C, or B to A, all of these things. We'll send a known number of packets and wait to hear back the responses. If there are any instances where we've received fewer packets than we've sent, it means that something somewhere in the network is dropping packets, or maybe it's just due to random noise.
The existing solution is basically to hand all of this information, these giant logs, over to an analyst and have them figure out: where is the outage? Is there an outage? Questions of this form. This is a big problem, because we're dealing with hundreds of thousands of nodes in this graph, and tens of millions of packets transmitted every five minutes just for this diagnosis purpose; you can imagine how many packets are sent across the network in general. Outage rates far less than 30%, say a 0.1% outage rate, would be really bad for us as well. There can be many outages happening simultaneously in certain situations. And in really critical situations, even an outage that lasts for a minute is a huge loss to Facebook's bottom line. The existing solution is, to put it shortly, quite unsatisfactory.
PPL Solution: Model the Graph
Let's expand this thing out; the graph is much bigger. I want to give you the intuition here: how do we develop a PPL model for this? This is actually really cool, because graph problems are something that PPLs really shine at. You can imagine attributing a state to each node, whether it's in a passing, healthy state or an unhealthy state. You can model this as a distribution over its failure rate. Initially, this failure rate is very small, based on information that we have observed historically. Then we want to model how packets travel through this network. In this little example, we see a packet traveling between five nodes, five vertices in this graph. Really, what you want to model is: what is the success rate for that particular packet?
In this model, the packet basically needs to succeed at passing through each one of those nodes, and it also needs to succeed on the backwards pass as well. All it is, is you take the healthy rate of each node and you multiply them together. There's hardly any modeling here. Really, if you think about it, I'm just writing out a process. This is the experience that I'm hoping software engineers will have: they just think about how the system behaves in the real world, and they write that out in code. The only special thing I need to do is sample from the healthy rate rather than just computing it directly. That's really the only difference.
Once you've written out that model, that simple thing, you just plug in the observed behavior that you've seen. Now you've seen the behavior of trillions of packets across your network, and you can condition on what you've observed. The probabilistic program will automatically constrain the assignments of healthiness to your different vertices to the ones that best explain your data. Really, this is all you need to do; the probabilistic program will take care of the rest for you. You do need to model the network topology, but we have awesome network engineers who are very versed in understanding how the different nodes fit together and how packets flow through the network.
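A hedged sketch of that kind of model in PyMC3 (toy topology, made-up probe counts, and an assumed Beta prior; squaring the product stands in for the return trip over the same route): per-node drop rates are latent, and the observed probe responses constrain which assignments of health best explain the data.

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Toy stand-in: 6 network nodes and 4 probe routes (the real graph has hundreds of thousands of nodes).
routes = [[0, 1, 2], [0, 1, 3], [4, 1, 2], [4, 5, 3]]   # node indices each probe traverses
sent = np.array([10_000, 10_000, 10_000, 10_000])
received = np.array([9_995, 9_991, 7_032, 7_014])       # responses heard back per route

with pm.Model() as outage_model:
    # Prior: per-node drop rates are tiny almost all of the time (hyperparameters assumed).
    drop_rate = pm.Beta("drop_rate", alpha=1.0, beta=1000.0, shape=6)
    healthy = 1.0 - drop_rate

    # A probe must pass every node on its route, out and back again, so its
    # success rate is just the product of the healthy rates along the way.
    p_success = tt.stack([tt.prod(healthy[r]) ** 2 for r in routes])
    pm.Binomial("received", n=sent, p=p_success, observed=received)

    trace = pm.sample(2000, tune=1000)
```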
Then what we can do from this is trace out the individual nodes that look like they're problematic. Even for very small failure rates, we can be quite confident. Here, we're not sure exactly what the failure rate is, but we are very sure, far more than 95% sure, that there's at least a 1% failure rate on this node, which means it's responsible for dropping 1% of packets in our network. That's a serious problem that we should fix. Because we have this distributional information, we can avoid sending spurious alarms to our systems engineers. We can make sure that what we're sending them is highly actionable, because we're very confident about the failures in the network.
Then the last thing, which we haven't done yet but I'm actually super excited about, is that you can imagine assembling these models in a hierarchical fashion, where it's not just the failure rate of a particular node. Each node has some topological interpretation to it. Maybe you have a data center; within that data center, you have switches; within those switches, you have racks; racks have servers; servers have components, and all of this stuff. You can ask: what is the failure rate of a conglomeration, the top-level node? We can produce those estimates for you, which in a sense aggregate the failure rates of everything beneath it. This is very helpful for us when there are widespread outages and we really quickly need to quarantine parts of the network. We can ask: what are the right parts to quarantine? Maybe we can't fix the hardware immediately, but we can reroute traffic around it.
Basically, the takeaway that I had from working on this model was that we took a problem that would have taken a long time to solve by designing some custom embedding to fit it into a neural network model, and in probabilistic programming it's just a handful of lines to write out what this looks like. I walked you through the math here, and the code for it maps directly to what we saw.
Takeaways
The takeaways here are a few things. First off, having interpretable uncertainty estimates is key for letting you subdivide your datasets and get down to a smaller scale. Importantly, they give you unambiguous actions that you can take: you can be certain that what you're doing is the right thing to do. Or, conversely, you might say, "I'm not really sure about the true state of the world. I need to gather more information." Because you have these interpretable models, you're able to go deeper and say, "I'm not sure about this particular state of the world. Let me zoom in and see what's going on in this particular part of my model." You can also use this latent information, how these nodes fit together, to ask other questions about your models that are consistent with the global inferences that you're making. That's a really tricky thing to do otherwise; you get model compositionality for free, in a sense. Also, I didn't mention this much, but just encoding models in this way, with priors, tends to produce better estimates overall anyway. If you have the capability of building a model like this, you'll just tend to get better results. At least, that's been my experience.
What about Facebook?
I haven't talked at all about the innovations that Facebook is making in this space. We're super interested in exploring probabilistic programming languages; it's a very new field to hit industry. I'm not, unfortunately, able to share a ton about what we're doing just yet, but you can stay tuned, probably around mid-2020, for more information. The main thing to understand is that we're trying to exploit the actual structure in the model that the user has written. You can think of that as interpreting the model as a piece of data that we can use to tease out more information from the dataset that we have.