Lessons Learned from Building LinkedIn’s AI Data Platform


Summary

Felix GV provides an overview of LinkedIn’s AI ecosystem, then discusses the data platform underneath it: an open source database called Venice.

Bio

Felix GV joined LinkedIn's data infrastructure team in 2014, first working on Voldemort, the predecessor of Venice. Over the years, GV participated in all phases of the development lifecycle of Venice, from requirements gathering and architecture, to implementation, testing, rollout, integration, stabilization, scaling, and maintenance.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

GV: This is a session on lessons learned from building LinkedIn's AI data platform. My name is Felix. I'm a committer on the open source Venice project, which I'll talk about in the second half of the talk. I've been doing data infra in various forms since 2011.

I'd like to start with this quote from an influential paper from 2015, which said that only a small fraction of real-world machine learning systems is composed of the machine learning code; the required surrounding infrastructure is vast and complex. I think it's been picked up a lot since then, because a lot of people probably feel it is true. In this talk, I'm going to talk about the surrounding infrastructure, everything except that blue box. That's a lie. I can't really talk about everything, because there's just too much. I'll give as broad a shallow overview as I can. Then I'll zoom in on the serving infrastructure and adjacent pieces.

Outline

I want to present a couple of AI use cases we've got at LinkedIn. Then I want to talk about the AI ecosystem. By that I mean the tooling and components that make up our AI infra. Then I'll talk briefly about some of the solutions in our data infrastructure, which is essentially the underpinning of the AI infra, so going one layer below. Then one of those data infrastructure components in particular is Venice. I'll spend some more time on that. Sections 2 and 4 are the big sections here. I'll try to sprinkle some lessons learned throughout the talk. I'll keep a few for the end, quick hitters there to hopefully give you something actionable to walk away with.

AI at LinkedIn

What kind of AI do we do at LinkedIn? One major use case is something we affectionately call PYMK. What does that mean? People You May Know. LinkedIn invented the People You May Know concept back in 2006, before all the other social networks copied it. It looks like this. It's a list of second-degree connections that you might know. Sometimes we call it triangle closing in the literature, because in the graph of connections we have an open triangle between connections, and we try to close that. I talked about that in more detail at QCon AI in San Francisco in a talk called, "People You May Know: Fast Recommendations over Massive Data." You can check that out if you want to learn more about that use case in particular.

Another one is the feed. That's the main thing you see when you open the app or the website. You'll see a bunch of posts like this one. Both the feed and PYMK are what we call recommender use cases. That means there's a ton of entities that we could present to you: it could be members, it could be posts or other things. There are way too many. We can't show everything. Essentially, a recommender system is going to score the relevancy of all of the entities, then rank them and return to you the Top K, hopefully the most relevant entities. That's the pattern we have. We have, I don't know how many, probably more than 1000 use cases like this, permeating the whole site, every part of the application. We also have some non-recommender AI use cases, but recommender is a really big use case.
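As a rough illustration of the recommender pattern just described (score every candidate, rank, return the Top K), here is a minimal sketch. The candidate IDs and scoring function are hypothetical placeholders for illustration, not LinkedIn's actual code.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

def top_k(candidates: Iterable[str],
          score: Callable[[str], float],
          k: int = 10) -> List[Tuple[str, float]]:
    """Score every candidate entity and keep only the k most relevant."""
    # heapq.nlargest avoids sorting the full candidate list when k is small.
    return heapq.nlargest(k, ((c, score(c)) for c in candidates), key=lambda x: x[1])

# Hypothetical usage: rank second-degree connections for a PYMK-style list.
candidates = ["member:42", "member:77", "member:91"]
# A stand-in scoring function; a real system would run a trained model here.
scores = {"member:42": 0.81, "member:77": 0.93, "member:91": 0.12}
print(top_k(candidates, scores.get, k=2))
```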

AI Ecosystem

What are the things that make up the AI ecosystem at LinkedIn? We have an AI platform. We didn't always have an AI platform. A few years ago, we had a collection of tools that were developed by a collection of teams that didn't really talk to each other. Probably a lot of companies are in that stage as well, and that works fine, up to a certain point. The challenge we had there was that each tool had a certain scope that it served. Sometimes there was overlap in scope between different tools. Then it was confusing: should I do this part of the workflow with this tool or with that tool? Then there were parts of the necessary scope that were not covered by any tool. That meant AI engineers needed to bridge those gaps themselves with boilerplate: every team would have to write its own glue code, put these things together, fill in the blanks. A few years ago, we decided to embark on a journey towards platformizing this, or sometimes we call it industrializing this, going from artisanship to industrial-scale AI; that was kind of the vision.

To that end, we developed an opinionated platform. There are still a bunch of tools, but the tools now have coherency between them. There's a holistic view, or at least we attempt to have a holistic view, of how they fit together. Importantly, we try to make this platform cater to both AI researchers and AI engineers, which for a given project might be the same person wearing both hats, or two different people, or even two different teams, depending on the circumstances. They are essentially two different personas that have different needs. The way we think about it is that there are five steps in the workflow. You could break it up in other ways. This is not a hard truth here. It's just one way to look at it. We've got feature management, model creation, which is that blue box, in a sense, in the first diagram, then model deployment, model serving, and model maintenance. For each of these activities, we have a bunch of components that we use. This is not a strict categorization.

Some of these tools actually cover a few different steps of the lifecycle. It's a best-effort categorization there. I am not going to be able to cover all of this. There's actually even more; these are just some of the tools we have. The field is evolving so fast that there's a new need every other day. We try to keep up. For the purpose of this talk, I'm going to focus on a few of them.

I'm going to flash this diagram here briefly, just so you have it in the back of your mind, but I'm not going to explain it yet. I'll come back to it. First, I want to give you a few more details about some of the ML infra components. The first one I want to talk about is Frame. Frame is our feature store. In particular, it is a virtual feature store. What does that mean? It means that Frame is not vertically integrated into a single storage system. Rather, it abstracts over a bunch of different storage types. We have feature definitions, which we sometimes call anchors, so I might use those terms interchangeably. A feature definition can span multiple environments. We've got offline, streaming, and online features. In the offline context, the feature may take the form of an Iceberg table, or a SeaTable, or something like that. In the streaming context, the feature might be available inside a Kafka topic. In the online context, it may be inside a Venice store or a Pinot table. Actually, all or almost all of our online data systems are available through Frame. We've got features everywhere. It happens to be the case that Venice is the most popular choice that AI engineers pick for their online feature storage. That is why I'll focus on it in a later part of the talk. A fraction of the scope of Frame is available in open source; we open sourced it under the name of Feathr. It's roughly the offline part of Frame that is open sourced. At LinkedIn, we run a bigger-scope version of that system, but there's a part of it that you could leverage if you feel like it.
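To make the "virtual feature store" idea concrete, here is a hypothetical sketch of a feature anchor that maps one logical feature to offline, streaming, and online sources. The class and field names are illustrative assumptions, not the actual Frame or Feathr API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureAnchor:
    """One logical feature definition spanning several physical environments."""
    name: str
    offline_source: Optional[str] = None    # e.g. an Iceberg table
    streaming_source: Optional[str] = None  # e.g. a Kafka topic
    online_source: Optional[str] = None     # e.g. a Venice store or Pinot table

# Hypothetical anchor: the same feature resolvable in all three environments.
member_embedding = FeatureAnchor(
    name="member_embedding_v3",
    offline_source="iceberg://warehouse.member_embeddings",
    streaming_source="kafka://member-embedding-updates",
    online_source="venice://member_embeddings",
)
```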

Another system we've got is King Kong. This is where we do our deep learning training on Kubernetes. It supports many open source tools like TensorFlow, PyTorch, and Ray. This one is not open source per se, but it's a thin layer on top of a bunch of open source things. For example, for TensorFlow, we use the Kubeflow operator. You can use that directly if you would like.

FedEx is an interesting one, I think. It's our feature productionization pipeline. What we noticed, essentially, is that the AI engineer has some steps they need to do whenever they want to take some feature that is available in the offline environment and make it available online for user-facing inference. These steps include joining a bunch of features, preparing them, massaging them in certain ways. Then, if there is not already a Frame-Online anchor for that feature, they need to register that. The data is typically going to be pushed into a Venice store, like I was saying, although sometimes it goes somewhere else depending on the need; but oftentimes it's a Venice store, so they need to create that store and potentially configure it in a certain way. Then they populate that store using the Venice Push Job. There is some other metadata that needs to be registered for tracking and so on. FedEx provides a declarative Python DSL to do all of that (there's a rough sketch of what such a declaration could look like a bit further down). It tries to automate as much as possible of the grunt work of taking a feature from offline to online. The first time they run this, some things are created on the fly, like the Frame-Online anchor and the Venice store. Then if they rerun it to refresh the data, only a subset of the steps runs. That's something that gives our AI engineers more productivity.

The last one I want to deep dive into in this section is Model Cloud. This is our inference platform, essentially, where our AI engineers can submit their models to be served. We used to have every AI engineer stand up their own service, then use a bunch of tools and glue them together to make it work. Now this is a standard recipe that they can use instead, again, for greater productivity. It takes the Frame-Online anchors that they're going to need and the model, and wires those together. Then there's some boilerplate stuff like alerts and benchmarking capability, the stuff you would need on a production system. This is also the gateway through which we make GPUs available to the company. If a user needs a GPU for their model, then the preferred option is that they onboard to Model Cloud to get it. This and all the other tools I described before are self-service. That's very important for us: not needing to wait on an email or a ticket or something like that to get the stuff they need; either through some script or through some clicks, they get their work done.
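Coming back to FedEx: the talk only says it is a declarative Python DSL that automates the offline-to-online steps, so the snippet below is a hypothetical sketch of what such a declaration and driver could look like. None of these names are the actual FedEx API.

```python
# Hypothetical declarative spec for productionizing offline features.
# On the first run, a pipeline like this would create the Frame-Online anchor
# and the Venice store; on reruns, it would only refresh the data.
PRODUCTIONIZE_SPEC = {
    "features": ["member_embedding_v3", "member_industry"],
    "join_key": "member_id",
    "offline_source": "iceberg://warehouse.member_features",
    "online_store": "venice://member_features",   # created on first run if absent
    "push": {"schedule": "daily", "job": "venice-push-job"},
    "metadata": {"owner": "ai-team", "ttl_days": 30},
}

def productionize(spec: dict) -> None:
    """Drive the grunt work from the declaration (sketch only)."""
    for step in ("join_features", "register_online_anchor",
                 "ensure_store_exists", "run_push_job", "register_metadata"):
        print(f"running step: {step} for {spec['online_store']}")

productionize(PRODUCTIONIZE_SPEC)
```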

Let's come back to this diagram, now that we can understand it a bit better. There are a bunch of arrows here. The red arrows are the batch stuff, the periodic stuff. That's going to happen maybe once a day, maybe several times a day. Here, you've got the Frame-Offline features that feed into both model training as well as feature productionization. The top part, FedEx, is the feature productionization that pushes the data into Venice. Then Open Connect, which I didn't describe in depth but is basically the orchestrator for training, also takes in those features and trains the model using King Kong. The model goes through some analysis to make sure it checks out. Then it gets deployed into Model Cloud. Ultimately, all of that stuff lands inside Model Cloud. Model Cloud is an online-facing service that can accept requests for all of the entities that need to be scored. It's going to retrieve the features it needs from Frame, which itself typically uses the Venice client. Not always, like I said; it could plug some other storage in there as well. But the typical case is that it will query Venice and get the features from there. It passes those features into the model and runs the inference. Then the output of the model is oftentimes a Top K, though it could take some other form in some cases. Then that response is given back. That's the way it works. The right-hand side, where all the green arrows are, is the hot path, in a sense. That's the low latency request-response usage pattern.
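Putting the green-arrow hot path together: a minimal sketch of an inference handler that batch-fetches features for the candidate entities, runs the model, and returns the Top K. The feature client and model interfaces here are assumptions for illustration, not Model Cloud's or Frame's actual APIs.

```python
from typing import Dict, List, Protocol

class FeatureClient(Protocol):
    # Assumed interface: in production this would be backed by the feature
    # store, which in turn typically queries the online key-value storage.
    def batch_get(self, keys: List[str]) -> Dict[str, List[float]]: ...

class Model(Protocol):
    def score(self, features: List[float]) -> float: ...

def serve_request(candidates: List[str],
                  feature_client: FeatureClient,
                  model: Model,
                  k: int = 10) -> List[str]:
    """Hot path: one batched feature fetch, then inference, then ranking."""
    # A single batched lookup keeps most of the latency budget for the model itself.
    features = feature_client.batch_get(candidates)
    scored = {c: model.score(features[c]) for c in candidates if c in features}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```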

Data Infra

Here, I just want to name-drop a few data infra components we use that happen to be open source. These are things that, if you wanted to, you could leverage. There's way more than just these in our infrastructure, but I just want to mention a few. Spark, we use heavily for batch processing. King Kong is built on top of Kubeflow and some other things. OpenHouse is one that we recently open sourced, a month or two ago; it is our operational catalog. This one essentially allows us to keep metadata about all of our offline datasets, which are typically Iceberg tables, and we can set policies on them: compliance policies, retention policies, replication policies. OpenHouse will make sure that those are followed and enforced. Again, it's about making all of these activities more productive. You just declare them and the infra takes care of it. In terms of grid, we still use Hadoop. We're old school like that, and maybe we're the last users of Hadoop. We have a multi-exabyte grid at this point, and still scaling. It still works for us, although there have certainly been a lot of growing pains to get there. Azkaban has been our workflow orchestrator for many years, and I'll touch on that a bit later as well.

In terms of streaming, we do a whole bunch of streaming, including in the AI space, to try to make our features fresher. Samza is our stream processor. Nowadays, we often wrap it under Beam, which is a new API that is starting to become popular. Kafka, of course, is our Pub/Sub. We open sourced that many years ago. We're starting to leverage Flink SQL. Brooklin is our change capture system, which we also open sourced a while back.

Finally, online storage. We use a variety of things here. The last three are things we open sourced as well. Ambry is our blob store, which we use to transport our model binaries and things like that. Pinot is our online analytical store, which we sometimes use for AI as well. For example, one use case is impression discounting: let's say a recommender system is going to present some entities that it recommends, but then it's going to keep track of which ones it showed. Then for the ones it has already shown, it's going to discount their relevance, because if you've already seen something and you haven't chosen to act on it, then maybe it's not actually relevant to you. Impression discounting is a use case we use Pinot for. Venice is our most frequently chosen feature storage for online inference use cases.
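A tiny sketch of the impression-discounting idea just described: relevance scores get decayed by how many times an entity has already been shown. The decay factor and the source of the counts (e.g. an analytical store) are assumptions for illustration.

```python
from typing import Dict

def discount_by_impressions(scores: Dict[str, float],
                            impressions: Dict[str, int],
                            decay: float = 0.8) -> Dict[str, float]:
    """Multiply each score by decay**impressions, so entities that were shown
    repeatedly but never acted on sink in the ranking."""
    return {entity: score * (decay ** impressions.get(entity, 0))
            for entity, score in scores.items()}

# Example: an item shown 3 times without engagement loses about half its score.
print(discount_by_impressions({"post:1": 0.9, "post:2": 0.7},
                              {"post:1": 3}))
```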

Venice

Venice is our derived data platform. What does derived data mean? Essentially, derived data means data that's been derived off of other data: data that's been computed, joined, massaged, or inferred out of other data. I have a whole talk on what derived data is from QCon London last year. You can check that out if you want to learn more. Venice's mission is to be the default storage for online AI use cases. In order to cater to these use cases, we support a few types of capabilities. One of them comes from noticing that feature data wants to change very frequently. There is a very high need for refreshing it continuously. We support high throughput ingestion from batch sources and streaming sources. We can also combine batch and streams together, as I'll show. The other thing that online inference very often needs is very low latency. Imagine that you want to score 5000 entities. For each of these entities, you want to retrieve 100 features. That means you want to retrieve 500,000 features, and you want to spend most of your latency budget on running the model; you don't want to spend all of your time fetching the features that you'll put into the model. That means users come to us and tell us, I want all 100 features for these 5000 entities, and I want them in 10 milliseconds, something like that. We offer a few different options for low latency to try to serve our users' needs. The last point here is, again, self-service. We want to get out of the way as much as possible. We want users to be able to focus on their business needs. That means the infra, in our view, should be invisible. The infra should be there when you need it, but you shouldn't have to work around its limitations or exert a lot of effort to give the infra what it needs. These are the characteristics that we think make Venice well suited for that.

We open sourced Venice in September 2022. I'm going to use the opportunity of this talk to do a little bit of a retro on how the scale of the platform has changed since then, to give you a perspective on that. Since then, there have been 800 new commits to the project. I can't cover the changelog of everything that happened since then. For a full overview of Venice, there is the Open Sourcing Venice talk from 2022, which is still accurate. You could check that out. Here, I'm going to give a much more summarized version.

First, I want to present how it works when you want to pump data into Venice, which is one of the main things AI engineers need to do. They want to take their offline data or their streaming data and make it queryable online. Here, I'm going to present what we call the hybrid store use case. Hybrid because it takes data from batch and it takes data from streaming. We've got a few hundred use cases doing that. We also have a bunch of batch-only use cases, and some stream-only use cases. The hybrid use case is just one of many different ways of using Venice, but I'm using it here as the example because it allows me to cover all the bases. I'm going to walk you through a four-step process of how Venice orchestrates the ingestion of streaming data and batch data. All of this is fully automated. The user doesn't need to do anything special here. It's all handled by the platform.

The first step is the steady state, or the initial state. We've got a stream processor that is cranking out data that lands into what we call the real time buffer. This real time buffer is a Kafka topic which is fully owned by Venice. The user doesn't need to create that topic and doesn't need to configure it. The system just takes care of everything. You'll see, in a couple of slides, why having a buffer is useful for us. The Venice server is ingesting data out of that buffer. Here, an important concept is that datasets in Venice are versioned. In this initial state, there's only one version of the dataset, which is the current version. Given that it is the current version, it is the one serving the queries of the client. Then, a full push comes in. The user has a push which is maybe scheduled daily, or several times a day, or could come from data triggers: as soon as something becomes available, this runs. At that point, Venice is going to dynamically provision version 2 of the dataset. This is tagged as future. The data is going to be loaded in the background. It's actually running on the same fleet of servers, but we have some checks and balances to make sure the client is not impacted by a latency blip while this is occurring in the background. Importantly, the client doesn't see a mix of old and new data before it's ready.

When the full push is determined to have completed, at that point the real time buffer replay phase kicks in. That's why we have a buffer. The point is that, by the time the data from the full push lands inside the online system, that data is likely to be several hours old already. It depends on the use case, but it's for sure going to be at least a little bit old, because the data comes from the grid and has been processed in however many ways over there; maybe it's the result of a batch inference, or maybe not. The point is, it's taken a while to produce that snapshot of data. By the time it's ready to go and has been fully ingested, there's been water under the bridge, and a bunch of data has been written through the real time path during that time. We want to replay those recent real time writes. Only once the real time buffer replay has caught up do we start serving traffic off of the new version. At this point, you see that v2 has gone from future to current, and v1 has gone from current to backup. We'll keep the backup for a while in case the AI engineer notices that maybe there's been a dip in business metrics, maybe there's a regression in the way they've refreshed the dataset. If for any reason they want to roll back, they can. Importantly, this backup is not a very stale backup; it keeps getting the updates from the real time buffer. The snapshot part of that dataset is the old one, but the real time part of it keeps getting refreshed. That's the way it works.

When we open sourced Venice, I shared numbers at the time on the scale that we support. I want to give an update on that. We've had a moderate increase in the number of Push Jobs per day, a 20% increase or so. This is fully automated. Nobody is clicking a button. Nobody is checking that the data made it. It just works. We used to write over a petabyte of data per day; now we write over 2 petabytes of data per day. We went from 3 trillion to 5 trillion rows per day updated within this system. The peak has also increased. Now we ingest up to 80 gigabytes per second of mostly feature data, and upward of 150 million rows per second at peak.
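To recap the hybrid-store lifecycle described above as a minimal sketch: a full push creates a future version, the real time buffer is replayed into it, and only then does it become current while the old current becomes a backup. This illustrates the state transitions only; it is not Venice's internal implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StoreVersion:
    number: int
    state: str = "future"           # future -> current -> backup
    replay_caught_up: bool = False  # has the real time buffer replay caught up?

@dataclass
class HybridStore:
    current: Optional[StoreVersion] = None
    backups: List[StoreVersion] = field(default_factory=list)

    def full_push(self, number: int) -> StoreVersion:
        # The future version loads in the background; clients keep reading current.
        return StoreVersion(number=number)

    def swap(self, future: StoreVersion) -> None:
        # Only swap after the real time buffer replay has caught up, so clients
        # never see a stale snapshot that is missing the recent real time writes.
        assert future.replay_caught_up, "replay must catch up before serving"
        if self.current is not None:
            self.current.state = "backup"   # kept around to allow rollback
            self.backups.append(self.current)
        future.state = "current"
        self.current = future

store = HybridStore(current=StoreVersion(number=1, state="current"))
v2 = store.full_push(2)
v2.replay_caught_up = True  # buffer replay finished
store.swap(v2)
print(store.current.number, [b.number for b in store.backups])  # 2 [1]
```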

Let's talk about reading data now. Let's switch gears. We talked about writing data into Venice; now, how do we read data out of it? We have a few different options in terms of client libraries to access Venice data. This here is the thin client, which is the oldest. It's the first library we shipped with Venice, from the end of 2016 when we landed in production. It's very simple, very well battle-tested, very stable. The way it works is that the client talks to our stateless router, which then routes the query to the correct shard in the server fleet. The router is stateless, the server is stateful. The client has almost no logic. This is a two-hop architecture. Latency is what it is with two network hops, but we still managed to bring it down to single-digit milliseconds at p99 for a single-element retrieval.

In the past couple of years, we've been working on the next gen of this client, which we call the Fast Client. This one has the routing logic built in, skips the routing tier, and goes straight to the correct server, so one network hop. For this one, we currently promise our users less than 2 milliseconds at p99. In practice, it's a bit faster than that, but we keep some padding in terms of the promises. This one has been running in production for more than a year now. We've been giving it out only to our power users, those that are willing to pick up the library, upgrade quickly if there's a need, and work closely with us. We are slowly checking all the boxes to make this generally available internally. It's already available in the open source project, if you want to use it yourself. It's still experimental, but we're getting there to make it the default client.

The last one is what we call the Da Vinci Client. This one is interesting, because it has zero network hops. How is that possible? The trick is that it preloads the data. You see here, it doesn't even talk to the server at all. It goes straight to the Kafka data, loads it from there, and stores it locally in RocksDB, just like the server would. We leverage RocksDB in both the server and Da Vinci. Then, depending on the configuration, we can leverage either a local SSD or local RAM. If we use an SSD, we get roughly 200 microseconds of latency. If we leverage RAM, the all-in-RAM configuration, then we are usually at single-digit microseconds, so pretty fast. This is what, for example, is integrated in our search infrastructure. They need super-fast access to the data, so they need to have it local.

The overall strategy for our clients is that all of these libraries have the same API. There's the key-value style API, which is the most popular: Single Get, Batch Get, very simple. We also have a more power-user-oriented API that we call Read Compute, which gives the option of doing some vector math, like dot product or cosine similarity, which some AI engineers are able to leverage. The point is that we want it to be easy to swap one client for another. Most people enter the platform through the thin client, because that's the easiest to use and the most stable. Then, if their requirements evolve in terms of performance, or the cost that they're willing to pay for it in terms of the resources they'll provision, then they have other options they can pick from. That brings me to my first tactical tidbit, which is: if possible, let users defer decisions. We've seen cases where users agonize over their application architecture: are we going to use this client or that client? We have to shake them back to reality and tell them, "It doesn't matter. Pick whichever one you want. You'll be able to change later. No code changes. Just swap one library for another one, and they're compatible." That has worked well for us.

These are the numbers I had shared back in 2022 in terms of read scale. The read scale has increased considerably. This is over the past 18 months; it's not year over year, it's a bit more than that. A 69% increase in keys looked up per second. We're now at 167 million keys per second at peak being queried out of Venice. Each of those keys is a record, which contains one or many features, typically many features. A record might contain 100 features, depending on the case. That's the scale.
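Going back to the "same API, swap later" point, it can be illustrated with a minimal sketch: if all clients implement the same small interface, swapping the thin client for the fast client or Da Vinci later is a dependency change rather than a code change. The interface and in-memory stand-in below are illustrative assumptions, not the actual Venice client API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional

class KeyValueFeatureClient(ABC):
    """Minimal key-value surface shared by all client implementations."""

    @abstractmethod
    def single_get(self, key: str) -> Optional[dict]: ...

    @abstractmethod
    def batch_get(self, keys: List[str]) -> Dict[str, dict]: ...

class InMemoryClient(KeyValueFeatureClient):
    """Stand-in implementation; a real deployment would back this with a
    two-hop, one-hop, or locally preloaded client."""

    def __init__(self, data: Dict[str, dict]):
        self._data = data

    def single_get(self, key: str) -> Optional[dict]:
        return self._data.get(key)

    def batch_get(self, keys: List[str]) -> Dict[str, dict]:
        return {k: self._data[k] for k in keys if k in self._data}

# Application code depends only on KeyValueFeatureClient, so the concrete
# client can be swapped later without any code changes.
client: KeyValueFeatureClient = InMemoryClient({"member:42": {"emb": [0.1, 0.2]}})
print(client.batch_get(["member:42", "member:7"]))
```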

Just to summarize, these are the numbers I already shared, so I'm not going to repeat them. I want to layer on top of that. Our dataset count has also been increasing, by 38%. This does not really stress the hot path, but it stresses the control plane. In terms of metadata load, the system needs to keep up along that dimension as well, and it has. What I find most interesting, personally, as an infra engineer, is that the number of hosts we run this system on has only increased 10% over the past 18 months. Of course, we want our infrastructure to be linearly scalable. That's a given. In this case, it has actually demonstrated superlinear scalability. That is because we are continuously working to optimize the system: make it more efficient, faster, higher throughput. These little things add up, even if it's just 1% here, half a percent there, or one narrow use case getting optimized here and there. It adds up. It allows us to squeeze more out of roughly the same hardware. That's important because, as operators, the more hardware you operate, the more failures you've got to deal with. I say hardware, but even if it were containers or VMs, those things still fail. Actually, part of our infrastructure runs on containers, part of it is bare metal. We've got both experiences. We don't want that to grow unboundedly, for operational reasons. If you want to learn more about Venice, we've got the documentation up at venicedb.org. From there, you can get the code on GitHub. There's our community Slack. We do some online meetups every other week, if anybody in the community has questions.

Challenges and Lessons Learned

In this last section, I want to do a few quick hitters on interesting challenges we've had and how we've solved them. One thing that I think is very important for any infrastructure system is to get some grasp of what the bounds of the system are. Sometimes there isn't just a single answer. For us, we run a bunch of clusters, and each dataset is assigned to one of the clusters. Most of our clusters are storage bound. That means we're limited by how many SSDs we can cram into these clusters. The AI datasets are ever larger. Yesterday people were happy with embeddings of 100 elements, today they want 200, and tomorrow they want 400. It just keeps growing. Everybody wants more. Most of our use cases are storage bound, but we have some fraction that is what I call traffic bound, and we tune them differently. The storage bound clusters host the data on SSD. For those, replication is used primarily for reliability purposes, and we think that three replicas per region is enough to be highly available. For the traffic bound use cases, we put the data in RAM, so that we get better latency and better throughput. We also use replication for read scaling. We've got some clusters with a replication factor of 4, some 6, and we've benchmarked the system up to 12, and it works fine.

Importantly, the assignment of which cluster a dataset lands on is operator driven. That's a very important design choice for us. We don't let the user configure the cluster; it's not anywhere in the config. They only give us their dataset name; they never give us a cluster name or cluster address. They just can't configure it. The client automatically negotiates with the backend to figure out where it needs to go. It discovers that. Moreover, the operator can decide to migrate a dataset from one cluster to another, and the clients will just self-adjust and start shifting their traffic over. That allows us to shepherd users from, say, the mass-market storage bound clusters that most people run on: if their use case evolves into eating up all of the serving capacity of those clusters, then we can migrate them into one of the traffic bound clusters, which is tuned for that. That brings me to this takeaway, which is: don't let users do the operator's job. Ideally, design your system so that the interesting levers are owned by the operator team. Those can be pulled when you need to, either by a human pulling that lever, or from a script, or from automation; it doesn't matter. Design-wise, you shouldn't have to reach out and say, there are these 100 users on the cluster, can you please reconfigure? That doesn't really scale if you need that level of tight coordination with users. Ideally, you keep that to yourself. Keep the complexity on the infra side.
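A minimal sketch of the "client only knows the dataset name" idea: the client asks an operator-owned discovery component which cluster currently owns a store, so operators can migrate datasets between clusters without users changing any configuration. The discovery mechanism shown here is an assumption for illustration, not Venice's actual protocol.

```python
from typing import Dict

class ClusterDiscovery:
    """Operator-owned mapping from store name to cluster; users never see it."""

    def __init__(self, assignments: Dict[str, str]):
        self._assignments = assignments

    def resolve(self, store_name: str) -> str:
        # Clients pass only the store name and get routed to the right cluster.
        return self._assignments[store_name]

    def migrate(self, store_name: str, target_cluster: str) -> None:
        # Operators (or automation) pull this lever; clients simply re-resolve
        # and shift their traffic to the new cluster.
        self._assignments[store_name] = target_cluster

discovery = ClusterDiscovery({"member_features": "storage-bound-cluster-3"})
print(discovery.resolve("member_features"))   # storage-bound-cluster-3
discovery.migrate("member_features", "traffic-bound-cluster-1")
print(discovery.resolve("member_features"))   # traffic-bound-cluster-1
```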

Since I mentioned that we're mostly storage bound, compression is something we've looked at quite a bit. Compression is not a panacea; not every dataset compresses well. Many years ago, we used to have some pretty precise schema design guidelines, like: don't store your enum as a string, that's going to be the same thing repeated across all your records, and so on. What we found out over time is that users don't care. They just want to get their stuff done. They don't care about schema design. That's fine. We made peace with that. We have compression. We've had GZIP compression for many years, which is fine. It's very simple to integrate with. It doesn't work that well in terms of compressibility, because the simplicity comes from the fact that the payloads are self-contained, which is great: you get the payload, you can decompress it. But that means that if the same symbol repeats across all the payloads, you don't get to compress that away in an online serving system that needs to query just one record at a time, as opposed to a batch system, which can do cross-record compression easily because it's going to scan.

We stumbled upon Zstandard, which is a great open source library that you can pick up and use if you'd like. Zstandard has a mode where you can train a dictionary and keep it outside of the payload. In our case, since a large fraction of our data comes from Push Jobs, we have the perfect opportunity within our Push Jobs to train the dictionary. That increases efficiency tremendously. The most extreme case we've seen is a 225x compression ratio. That's an outlier; it's not usually that good. It's just an example. So far, we have never seen a case where GZIP outperforms Zstandard. Zstandard has always been better so far. Interestingly, last year, I think, we introduced a mechanism for gathering compressibility on all pushes. Regardless of whether the dataset has compression enabled or not, we still run the dictionary training part of the job, and we gather the compressibility. Then, periodically, we can mine those heuristics and essentially do a query along the lines of: multiply data size by compressibility, filter to those that are not compressed yet, and rank by the biggest numbers. That gives us the list of the Top K datasets that would be most useful to compress. Then we can consider whether we're going to turn that on, and we tune it behind the scenes. That's another lever that we keep for ourselves. Actually, the user can also request compression on their own if they would like, but even if they don't care about it, we can still do it for them and make the platform more efficient. There is a latency cost for the decompression, but in most cases we found it to be negligible. There could be some edge cases where latency is so important that they cannot afford to pay that cost. It is still a judgment call. It's been working fine for us.
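For the dictionary-trained compression described above, the open source zstandard Python bindings expose dictionary training directly. The sketch below trains a dictionary on sample records, measures compressibility, and then applies roughly the ranking heuristic mentioned in the talk (data size times compressibility). The record contents, sizes, and dataset numbers are made up for illustration.

```python
import zstandard as zstd  # pip install zstandard

# Made-up sample records; in a Push Job these would be real serialized rows.
samples = [f'{{"industry":"software","country":"us","id":{i}}}'.encode()
           for i in range(10_000)]

# Train a shared dictionary kept outside the payloads, so cross-record
# redundancy (repeated field names, enum-like strings) can be exploited even
# though each record is compressed and served individually.
dict_data = zstd.train_dictionary(4096, samples)
cctx = zstd.ZstdCompressor(level=3, dict_data=dict_data)

original = sum(len(s) for s in samples)
compressed = sum(len(cctx.compress(s)) for s in samples)
print(f"compression ratio ~{original / compressed:.1f}x")

# Heuristic sketch: rank uncompressed datasets by data size * compressibility
# to find the Top K candidates most worth enabling compression for.
datasets = [("store_a", 5_000_000_000, 8.0), ("store_b", 90_000_000_000, 1.1)]
ranked = sorted(datasets, key=lambda d: d[1] * d[2], reverse=True)
print([name for name, _, _ in ranked])
```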

Another type of data which is not very compressible is embeddings. Actually, you can compress embeddings, but you need to lose precision, for example by going from 32 bits to 16 bits, or to 8 bits. An embedding is basically just an array of floating-point values, or it could be a multi-dimensional array. We encode our data in the Avro format, and we use an open source library called fastavro that makes it even faster than the default implementation. In particular, we've put in a bunch of optimizations for floating-point decoding in that library. It was already pretty fast, but last year we made it even faster by leveraging Java 9 VarHandles, which improved our memory allocation and compute latency; that was one of the bottlenecks in some of our embedding-heavy use cases. That brings me to the saying that performance is the best feature. If you make your system extremely fast, users will come back, because they have a need for speed, at least in our experience. If you want to learn more performance-related tips, we have a talk from QCon San Francisco last fall called "The Journey to 1 Million Operations Per Second Per Node in Venice," given by two of my colleagues. You could check that out.
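Coming back to the precision trade-off mentioned above, here is a small numpy example quantizing a float32 embedding to float16 and to int8. The simple max-absolute-value scaling shown is a common approach and just one of many options; it is not a description of LinkedIn's specific encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.standard_normal(256).astype(np.float32)   # 256 * 4 = 1024 bytes

# Half precision: 2x smaller, small loss of precision.
fp16 = embedding.astype(np.float16)                        # 512 bytes

# int8 quantization: 4x smaller, needs a scale factor to dequantize.
scale = np.abs(embedding).max() / 127.0
int8 = np.round(embedding / scale).astype(np.int8)         # 256 bytes
restored = int8.astype(np.float32) * scale

print(embedding.nbytes, fp16.nbytes, int8.nbytes)
print("max abs error (fp16):", np.abs(embedding - fp16.astype(np.float32)).max())
print("max abs error (int8):", np.abs(embedding - restored).max())
```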

The last tactical tip I want to give is about versioning. Dealing with multiple software versions is a pain, in our experience. If you have client libraries spread across a large range of versions, and you need to debug something, then you're always hitting questions like: did we actually fix that in a newer version already? It's tedious. What we do internally is deprecate old versions aggressively. We have some tooling internally to push dependency upgrades to our users. We try to do that as much as possible, so that dependencies stay fresh. That's for our client libraries. For the Push Job, we have an even better story, which is that the Push Job runs inside our Azkaban scheduler environment. We leverage the Azkaban plugin architecture, which means that when the user configures their workflow, with the last part of the DAG being the Push Job, they don't even have the option of specifying the version they are going to depend on. They just say, I want to use the push plugin. That acts as some sort of symlink. We control the destination of the symlink, so we can roll out an upgrade to the Push Job to all users, or we can ramp it gradually. We have a few knobs we can turn there. That has allowed us to remain nimble while having more than 1000 Push Jobs per day. We've been able to increase the performance and the reliability, and so on, and do things like instrument compressibility across all Push Jobs, for example, and be sure that we don't have blind spots, because we're actually upgrading everybody every time. That's been really useful. This only works if you're diligent about backward and forward compatibility. If you're in the habit of breaking your API all the time, then you can't push your upgrades. It's very important as an infrastructure developer to be mindful of compatibility. Think carefully about your APIs before putting them out there, because once they are out there, you're married to them. You've got to get it right. That's one of our core tenets in terms of development within the team.

 


 

Recorded at:

Jun 13, 2024
