
Chronon - Airbnb’s End-to-End Feature Platform


Summary

Nikhil Simha discusses Airbnb's Feature Platform, focusing on the recent efforts to solve the challenges, specifically covering: core APIs, training data generation, feature serving and observability.

Bio

Nikhil Simha is a Staff Software Engineer on the Machine Learning infrastructure team at Airbnb. He is currently working on Chronon, an end-to-end feature engineering platform. Prior to Airbnb, he was a founding engineer on the stream processing team at Facebook, where he built a scheduler (Turbine, ICDE '20) and a stream processing framework (RealTime Data @ FB, SIGMOD '16).

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Simha: We'll start with a story. The story is of a civilization, a civilization that was really good at building things. It was expanding. It found a piece of land and said, let's build a huge tower that stretches into the sky. They started building, and it looked like they were about to finish it, about to reach the sky. The gods in the sky said, "We don't like this. We're going to prevent these people from reaching the sky."

This is where it gets interesting. The gods say, the way to prevent them from doing this is by confusing their language. If we destroy the unity of their language, we will prevent them from building large, complicated things. This story is 2,000 years old. There is a variant of this story that is 2,000 years older still, a Sumerian myth from about 4,000 years ago. The lesson remains the same: if you want to build large, complex things and run civilizations that span the globe, you need linguistic unity, a similar way of communicating with systems and people.

ML's Direction (Supervised Learning)

Let me address the elephant in the room first. There is a lot of talk about generative AI. This is a slide I borrowed from Andrew Ng's presentation at Stanford, called "Opportunities in AI." It shows a prediction of where different kinds of machine learning will go. There is generative AI, your ChatGPT type of stuff. There is traditional machine learning, called supervised learning, the big green blob, and there are unsupervised and reinforcement learning.

The prediction says that supervised learning is going to be pretty big, and is going to grow. Why is that? Why would supervised learning have such staying power? There are two reasons. This is the very first artifact on which records were kept. It has a tally of some sort, we don't know of what, on a bone found in France, which dates back to about 17,000 to 12,000 years ago. The point is that ever since we have known how to count, we have been counting and keeping records of it. It is as primitive as language itself.

Since then, we have been counting by different names and keeping records by different names. Today we call it data warehousing. As long as we count and we speak, we're going to keep recording data, maybe in different forms. This is very fundamental to our civilization. This is why supervised learning has such high staying power: it allows us to incorporate labels and make predictions on numerical data. This is also why feature engineering is going to be relevant for a long time to come.

There are certain things that make feature engineering hard. The first is that features can come in different forms. The raw data that powers these features could be in a data stream, in a data lake (data at rest), or behind services. Usually, there are hundreds of features per model, which means that to power a model we end up creating pipeline jungles. There are far more pipelines creating features for a model than there are pipelines to train or serve it. This is what makes iterating on a model hard: whenever you need to add new sets of features, you need to create many more pipelines that span the data stream, the data lake, or the microservices.

Essentially, features are the bottleneck. Surveys at Airbnb and at other companies say that 80% of the time is spent doing feature engineering and building data pipelines. What adds one more dimension of difficulty is the high fanout: there is one prediction but hundreds of features, which means we need to fetch them very quickly, and the latency budgets are usually pretty tight.

Chronon - Feature Platform

This is where Chronon comes in. Chronon is a feature platform that addresses all of these problems. We'll start with an example. It's a scenario where we're trying to predict if a transaction is fraudulent. You're a bank, and you're trying to identify if a particular transaction of a user is unusual. We're going to use only one feature. The feature essentially measures how far away the current transaction amount is from the mean. This is called a z-score: you take the current amount, subtract the mean, and divide by the standard deviation. It tells you how far outside the distribution the current transaction is. Here, the user is Leela. We have timestamps of this user's transactions, and an amount. The z-score will tell us how far out of the distribution each transaction is.

A negative number tells us how unusually low it is, and a positive number tells us how unusually high it is. The higher the number, the more likely it is to be fraudulent. A model will take in the z-score and tell us whether it's fraud or not. This is an example; in the real world, there will be hundreds of such features. Assuming the distribution is a normal distribution, a bell curve, the distance from the mean, measured in standard deviations, tells us how far out of the distribution it is. If it is three standard deviations away, fewer than 0.3% of transactions would be that far out, so it's a very unusual transaction. How do we keep this up to date in real time? How do we measure the z-score in real time?

We talked about two terms, mean and variance. Mean is sum divided by count. To update the mean in real time, you need to maintain the sum and the count. Similarly, to update the variance, you need to keep track of three things: count, sum, and sum of squares. Whenever a new transaction comes in, you bump the count, you bump the sum, and you square the amount and bump the sum of squares. There is a problem with this approach: the sum of squares is very prone to overflow. There are ways of dealing with this, such as Welford's algorithm. The point is, we manage this in the framework for you. Chronon figures out what state needs to be maintained, updates it in real time, and computes the means and variances for you. When you need the z-score for an amount, you can run this formula.
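
This is not how Chronon implements it, but to make the update concrete, here is a minimal Python sketch of the idea: keep a small running state per user, update it on every transaction with Welford's algorithm (which avoids the overflow-prone sum of squares), and derive the z-score from that state. The RunningStats class and the sample amounts are illustrative.

```python
import math


class RunningStats:
    """Welford's online algorithm: numerically stable running mean/variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, amount: float) -> None:
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 0 else 0.0

    def z_score(self, amount: float) -> float:
        std = math.sqrt(self.variance)
        return (amount - self.mean) / std if std > 0 else 0.0


# One RunningStats per user ("Leela"), updated as each transaction arrives.
leela = RunningStats()
for amount in [12.0, 15.0, 11.0, 14.0]:
    leela.update(amount)
print(round(leela.z_score(500.0), 1))  # large positive z-score: very unusual
```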

Just to put this in familiar terms, I'm going to show pseudo-SQL. Pseudo, because this kind of SQL is not very easy to write. If you look at it, we are maintaining mean and variance in real time, and it's really simple: it's a GROUP BY in SQL. You maintain the mean and variance for a particular user. The data source is what makes it interesting: it is the transaction history table combined with the current transaction stream, grouped by the user.

When you need to compute the z-score, we read the amount, subtract the mean, normalize by the standard deviation, and return that as the z-score for a particular user. Because it's grouped by user, it's going to be fast when you query with user equal to Leela. Behind the scenes, we look at the transaction history source, which is data at rest, a table in the warehouse such as Hive, Redshift, or BigQuery. We bootstrap the triple, the state that we need to maintain, and store it in the key-value store. Then, when new transactions come in, we update the triple again: increment the count, increment the sum, increment the sum of squares.

This is not exactly what happens, we talked about Welford's algorithm, but this gives you the intuition. When a z-score is requested, we pull the mean and variance, normalize the amount based on them, and compute the z-score. Notice that we are splitting the computation across three systems, three systems with different ways of interacting with them. The first is batch processing, the second is stream processing, and the third is serving. This is where most of the power of Chronon comes from: people write these kinds of statements, and behind the scenes we create the pipelines necessary to maintain these features and serve them at low latency.
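
To give a feel for that split, here is a toy sketch, with a plain dict standing in for the key-value store, of the three pieces: a batch bootstrap over the warehouse table, a streaming update per new transaction, and a serving-time read that derives the z-score. The naive count/sum/sum-of-squares triple and the function names are illustrative, not Chronon's actual implementation.

```python
import math

kv_store = {}  # user -> (count, sum, sum_of_squares)


def bump(user, amount):
    count, total, sq = kv_store.get(user, (0, 0.0, 0.0))
    kv_store[user] = (count + 1, total + amount, sq + amount * amount)


def batch_bootstrap(history):
    """Batch side: scan the warehouse table once and seed the triples."""
    for user, amount in history:
        bump(user, amount)


def stream_update(user, amount):
    """Streaming side: bump the triple for every new transaction."""
    bump(user, amount)


def serve_z_score(user, amount):
    """Serving side: read the triple, derive mean/variance, return the z-score."""
    count, total, sq = kv_store[user]
    mean = total / count
    variance = max(sq / count - mean * mean, 0.0)  # naive form; Welford avoids it
    std = math.sqrt(variance)
    return (amount - mean) / std if std > 0 else 0.0


batch_bootstrap([("leela", 12.0), ("leela", 15.0), ("leela", 11.0)])
stream_update("leela", 14.0)
print(round(serve_z_score("leela", 500.0), 1))
```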

First, we will talk about the inputs to the code example. We have the user, the merchant, the amount in dollars, the status of the transaction, and a timestamp. This is the schema of a transaction. So far we talked about only one z-score, over the user's own distribution. To make the example a bit more interesting, we are going to compute three z-scores. The model is going to look at three distribution distances instead of one: how far the transaction is from the user's usual transactions, how far it is from the merchant's usual transactions, and how far it is from the user-merchant pair's usual transactions. You spend $5 at Starbucks, probably, and $100 at Costco; knowing how much you typically spend at a particular merchant makes more sense.

We're going to filter the data and select only the successful transactions, not the cancelled, refunded, or pending ones. This is what a Chronon GroupBy looks like. I'll walk you through it slowly. The first thing to note is the key columns. This is a Python function that takes the key columns as arguments, and groups and computes mean and variance by those key columns. If you look below, the keys are user, merchant, and the user-merchant pair. The other interesting aspect is the source.
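
The slide with the actual code is not reproduced in the transcript, so here is a plain-Python stand-in, with made-up class names and table names rather than the real Chronon API, that shows the shape being described: a GroupBy is a composite source plus key columns plus aggregations, and a small Python function stamps out one feature group per key set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Aggregation:
    input_column: str
    operation: str                # e.g. "mean" or "variance"


@dataclass
class GroupBy:
    sources: List[str]            # composite source: table plus mutation stream
    keys: List[str]
    aggregations: List[Aggregation]


def txn_group_by(keys):
    """Mean and variance of successful transaction amounts, for one key set."""
    return GroupBy(
        sources=["payments.transactions", "payments.transaction_mutations"],
        keys=keys,
        aggregations=[
            Aggregation("amount", "mean"),
            Aggregation("amount", "variance"),
        ],
    )


# One feature group per key set, as in the talk: user, merchant, user-merchant.
txn_by_user = txn_group_by(["user"])
txn_by_merchant = txn_group_by(["merchant"])
txn_by_user_merchant = txn_group_by(["user", "merchant"])
```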

If you look at the source, it is composite, which is the main deviation from SQL. We allow you to specify Hive tables, the mutation streams, or the mutation logs of the table. Transactions can mutate, so we allow you to specify all three. We will see what exactly an entity source is in the coming slides. The idea is that a source is composite; it's not just its manifestation in the warehouse. The interesting thing is that we run this computation across these sources, and ship the computation to stream processing pipelines and serving behind the scenes.

Usually, there are hundreds of such features. The way to put them together is with a join. In this join, we are putting together the three features that we created. Usually, there might be other features coming from other sources for the merchant, for the user, or for that interaction. We express the z-score as a derivation. We'll unpack all of this in the coming slides. The idea is that we GroupBy, then we use a join to get all the features together in one place, and then apply derivations which compute the z-scores.

The interesting thing about Python, where it adds power compared to SQL, is that in SQL you would have had to repeat a lot of queries, again and again, for each of the key sets. In Python, you can use the framework to write functions, abstract things away, and reuse code better. There are other reasons why we use Python: we want to extend what's possible in SQL in a big way. How do you use this? You define that join. Let's call it payments, you're trying to detect transaction fraud, and it's the first version. Then you query: for Leela, for this merchant, Chewy, and this amount, 35, what are the z-scores? The response would be the user_z_score, the merchant_z_score, and the interaction_z_score. The model would look at these and tell us whether it's fraudulent or not.
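
As a rough illustration of the derivation step, here is a sketch, with hypothetical feature names and placeholder fetched values, of how the three z-scores could be derived from the contextual amount together with each feature group's mean and variance.

```python
import math


def z(amount, mean, variance):
    std = math.sqrt(variance)
    return (amount - mean) / std if std > 0 else 0.0


def derive_z_scores(amount, f):
    """Turn fetched means/variances plus the contextual amount into z-scores."""
    return {
        "user_z_score": z(amount, f["user_amount_mean"], f["user_amount_variance"]),
        "merchant_z_score": z(amount, f["merchant_amount_mean"], f["merchant_amount_variance"]),
        "interaction_z_score": z(amount, f["user_merchant_amount_mean"], f["user_merchant_amount_variance"]),
    }


# Query for user Leela at merchant Chewy with amount 35, as in the talk.
# The fetched values below are placeholders standing in for the feature index.
fetched = {
    "user_amount_mean": 30.0, "user_amount_variance": 100.0,
    "merchant_amount_mean": 40.0, "merchant_amount_variance": 225.0,
    "user_merchant_amount_mean": 32.0, "user_merchant_amount_variance": 16.0,
}
print(derive_z_scores(35.0, fetched))
```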

The next thing to talk about is what makes the aggregation engine in Chronon powerful. In Chronon, time is a first-class citizen. If you make time a first-class citizen, you are able to express time windows. What's the point of time windowing? Instead of computing the transaction z-score of Leela over the last 20 years, you're able to say, compute the distribution z-scores over the last year. Maybe if you get promoted or you start your first job, your distribution of transactions will shift up, and if you lose money, it will shift down. Windows allow us to account for such changes. They also allow us to separate natural patterns: maybe every week, every month, or every year there is a period of time where you spend a lot more money.

Like I said, recent data is the most important data when it comes to fraud detection or personalization. Last but not least, whenever a new user registers with your bank or your platform, you want to be able to do fraud detection for them without having to wait for a day. Again, why windows? If you have just lifetime aggregates, just sums and counts, they are monotonically increasing. The issue with that is that the feature distributions keep shifting, and you will need to retrain the model. Windows make these values stable, and stable distributions mean stable predictions.

Those are the benefits of windows. Windows are not easy or cheap to build. What makes them hard? The first thing is tail eviction. You need to remove events from the tail of the window, which is a very unusual operation. Whenever a new event comes in, you can trigger some computation, but how do you trigger a computation for eviction? We're going to see how Chronon manages this.

Take a minute to think about how you would remove old events from a window. Because you need to keep a set of events in memory all the time to slide the window forward, windows are very memory hungry. In Chronon, we use a variant called sawtooth windows. This is a timeline, and e1 through e6 are the transactions. We are computing some aggregation in a 1-hour window. If a query comes in at 2:27, we go back to 1:27, look at all the events in that range, and do the aggregation. This is called a sliding window, where the scanning is exact. The problem is that a sliding window is very memory hungry: you need to store all the raw events. If the windows are large, like 14 days or a year, it becomes even more expensive.

There is another kind of window called the hopping window, where instead of being very precise about time, if you're allowed to be a bit stale, you can partially aggregate all the data into hops. Hops, in this example, are the 10-minute buckets in the 1-hour window. What we do is called a sawtooth window, which is a hybrid of these two: we use the hops, and we also account for the most recent data. Essentially, the window goes from 2:27 back to 1:20. It's slightly more than an hour, and that's the tradeoff: we get freshness and cheapness.

We take into account the most recent data, but at the same time we are able to compress old data into hops. Just to summarize: why sawtooth? Hopping windows are stale, so they're not good for fraud detection or personalization. Sliding windows are memory hungry, so they're really hard to scale. Sawtooth windows get the benefit of both worlds, but they sacrifice tail precision, meaning the window is not exactly 1 hour; it's 1 hour plus up to 10 minutes. The advantage is that it's fresh, the most recent data is accounted for, and it's cheap to compute.
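
Here is a toy sketch of the sawtooth idea with a 1-hour window and 10-minute hops: older events are collapsed into hop buckets, events after the last hop boundary are counted exactly, and the effective window runs from the hop-aligned tail boundary (1:20 for a 2:27 query) up to the query time. The simple count aggregation and the event times are illustrative.

```python
HOP = 10 * 60      # 10-minute hops, in seconds
WINDOW = 60 * 60   # 1-hour window


def hop_start(ts):
    return ts - ts % HOP


def sawtooth_count(events, query_ts):
    """Count events in a sawtooth 1h window ending at query_ts.

    events is a list of timestamps; in reality the hops would be
    pre-aggregated by batch/stream jobs rather than recomputed here.
    """
    tail_boundary = hop_start(query_ts - WINDOW)  # hop-aligned: 1:20 for 2:27
    head_boundary = hop_start(query_ts)           # raw events from here onward
    hop_counts = {}
    head = 0
    for ts in events:
        if ts < tail_boundary or ts > query_ts:
            continue
        if ts >= head_boundary:
            head += 1                             # exact, fresh head events
        else:
            hop = hop_start(ts)
            hop_counts[hop] = hop_counts.get(hop, 0) + 1  # compressed into hops
    return sum(hop_counts.values()) + head


# e1..e6 at 1:15, 1:25, 1:35, 2:10, 2:22, 2:26; query at 2:27.
events = [t * 60 for t in (75, 85, 95, 130, 142, 146)]
print(sawtooth_count(events, query_ts=147 * 60))  # 1:15 falls before the 1:20 boundary
```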

How do we maintain this window in Chronon? The red boxes are the raw events. We break the window into three parts: the head of the window, which is the most recent events; the mid portion; and the tail portion, which is the oldest events. If we are computing a 7-day window, we first compress the tail into hops, like the 10-minute hops in the 1-hour window earlier. We compress the tail events into these green boxes. Fewer boxes mean less data to fetch and less data to store. We aggregate all the data in the mid portion of the window into a single value, a single partial aggregate. We use batch processing, Spark in particular, to maintain these aggregates.

Instead of storing all the raw events, all the red boxes, we store fewer boxes, the green ones. The head events are maintained using stream processing. The idea is that instead of fetching and storing a lot of data, we fetch fewer boxes. The reader then reads and aggregates this into a result. The Chronon client does this aggregation behind the scenes: we partially aggregate first, and then merge all the partial aggregates on read. Whenever a mean or a variance is requested, we look at all these tiles or hops, or the raw events, pull them, and re-aggregate for you.
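
A small sketch of that read-time merge, reusing the count/sum/sum-of-squares triple from earlier: the tail hops and the mid portion arrive as partial aggregates, the head arrives as raw events from the stream, and the client merges everything into a single mean and variance. The numbers are placeholders.

```python
import math


def merge(a, b):
    """Combine two (count, sum, sum_of_squares) partial aggregates."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(agg):
    count, total, sq = agg
    mean = total / count
    variance = max(sq / count - mean * mean, 0.0)
    return mean, variance


tail_hops = [(3, 90.0, 2750.0), (2, 55.0, 1525.0)]  # batch-maintained hop aggregates
mid = (40, 1200.0, 36600.0)                         # batch-maintained mid aggregate
head_events = [28.0, 31.0]                          # fresh raw events from the stream

result = mid
for hop in tail_hops:
    result = merge(result, hop)
for amount in head_events:
    result = merge(result, (1, amount, amount * amount))

mean, variance = finalize(result)
print(round(mean, 2), round(variance, 2))
```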

The next thing about GroupBys, what makes them a bit more interesting than the SQL version of GROUP BY, is bucketing. Usually, the distribution varies depending on the category of the transaction, so maybe you want to bucket by category. If you spend a couple thousand dollars on groceries, it's probably fraudulent, but it's probably not if you're buying home supplies. What we are doing here is bucketing by the category, so mean by category: for groceries and utilities, we have different means. The next thing I'm going to talk about is called auto-explode. A lot of the time you have itemized bills.

There are different scenarios where you get lists of things in a column, and you still want to be able to do your aggregation, mean and variance, over these lists. In SQL, you would write an explode statement and then do the GROUP BY. In Chronon, this happens automatically: you just specify mean, and we unpack the list and compute the mean. You can also use bucketing along with explode. Explode is implicit, but you can still bucket over nested data or lists. In fact, you can specify windows, buckets, and explode in any combination you want: you can have all three, or any subset of them.
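
A toy illustration of bucketing combined with the implicit explode: each row carries a list of line items (an itemized bill), and we compute the mean amount per category without writing an explicit explode step. The row shape and field names are made up for the example.

```python
from collections import defaultdict

rows = [
    {"user": "leela", "items": [{"category": "groceries", "amount": 42.0},
                                {"category": "utilities", "amount": 60.0}]},
    {"user": "leela", "items": [{"category": "groceries", "amount": 38.0}]},
]


def mean_by_bucket(rows, bucket_key="category", value_key="amount"):
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        for item in row["items"]:          # the implicit "explode" over the list
            sums[item[bucket_key]] += item[value_key]
            counts[item[bucket_key]] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}


print(mean_by_bucket(rows))  # {'groceries': 40.0, 'utilities': 60.0}
```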

We talked about GroupBy, which is an aggregation primitive. The point of this is that it allows you to specify feature views that are maintained by Chronon in real time. We talked about windowing, and sawtooth windows in particular, and why they're more powerful than other kinds of windows, and more specific to machine learning. We talked about bucketing and auto-explode.

These GroupBys also map to feature groups, and they are a unit of reuse. These transaction features get reused across many different models. The other thing is that, because they are reused, they are immutable once they are online. Once you say this thing is online, and many models are using it, the only way to change it or add a new version is to create a duplicate.

Sources

The next, and probably more important, thing we are going to talk about is sources, or data sources. We talked about how data is available in different kinds of sources: data streams, data at rest in the warehouse, and microservices. There are many patterns in which this data is stored. If you have heard terms like CDC, events, fact tables, dim tables, or star schema, they all fall into these patterns of representing data. All of this can be very complex to deal with and to create feature pipelines out of. Moreover, you are taking what is a simple computation and splitting it across these different patterns and sources. Because we want to unify compute, we need a good representation of these diverse sources of data. Just to take a short detour and talk about the high-level view of data sources: in a web company, or a digital company, you usually have a service that is serving users, backed by a production database that serves data to those services.

Then you collect the data from these two sources into your data streams and into the data lake. Streams are more real time, and the data lake is historical. Whenever events happen, like clicks or impressions, they usually go into an event stream, and they end up in an event log that stores historical versions of these events. The production databases also get snapshotted: every day or every hour, you take snapshots of the database and put them in your data lake. The point of doing all of this is that you're recording history, and you're trying to compute features at historical points in time, or use this data for other kinds of analysis beyond machine learning.

The other thing this database produces is changes. If you're changing rows in the database, those changes create mutations. This is conventionally called change data capture: you have a change capture stream and a change capture log. We will see examples of each of these in the subsequent slides. Then there is derived data: people might take all of this data, or some external data, and derive or compute things over it. This is roughly the landscape. I simplified it significantly, and there are other variations on top of this, but this is what the data landscape looks like at a high level.

If you look at the event stream and the event log, they represent the same data. Similarly, the change capture data and the database snapshots represent the same data. We're just storing it in different ways. This is where Chronon unifies the idea of a source: you are able to specify all of these things as a single source. The first kind of source is events. These are your clicks and impressions. In the warehouse, they are stored as partitions of events: each day, we store only the events that happened during that day, in that day's partition. The other name for these, in star schema terms, is fact sources. The important thing is that with these sources, if you have the timestamp, you can reconstruct what a feature value would have been at any point in history.

We'll see why that is important, why reconstructing feature values is very important. The other thing you can specify as part of the source is the data stream, which is used to update and maintain features for feature serving. The next slide is just the representation of the data we saw earlier as event tables; I removed the user and merchant columns for simplicity. These are transactions with their IDs and amounts, at different timestamps on a given day. On the next day, you will have new transactions. The issue with modeling transactions as events is that it does not allow for mutations. Let's say you go to a restaurant and add a tip; it registers in your bank as a mutation to a previous transaction, where the tip gets added. That's why we'll talk about entities.

Entities allow us to model these mutations to transactions. In the example you saw an entity source, and this is what it looks like in reality. Each partition contains data for all transactions, so it can be a bit repetitive. We also have table snapshots, so the transaction table is snapshotted every day or every hour, and we have mutations, which tell us about the changes to the rows in these tables.

Just to show the example in terms of data: let's say you have a transaction table on a particular day, and in the next snapshot some of the rows have mutated. If you look, transaction ID 3 is missing, 4 is new, and transaction 2 somehow went from 40 to 400. There is another table, or another stream, called the mutation stream, the change data stream, or the binlog; it has many names.

Essentially, it tells us how these rows have changed, and at what times. If you look, it shows that 3 is deleted: there is only a before value, and no after value, for transaction ID 3. Transaction ID 4 is inserted, so there is only an after value. Transaction ID 2 is updated, so it went from 40 to 400: there is a before value of 40 and an after value of 400, and there is an updated row in the new partition. Some of this is not yet an industry standard, but systems like Debezium and Kafka are making it easier. What Chronon is able to do is use this set of data sources and compute features at millisecond-precise timestamps.
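
To make the before/after mechanics concrete, here is a sketch of how a subtractable aggregate, the count/sum/sum-of-squares triple again, is kept correct from a mutation stream: subtract the before value, add the after value; a delete carries only a before value and an insert only an after value. The starting triple is a placeholder.

```python
def apply_amount(agg, amount, sign):
    count, total, sq = agg
    return (count + sign, total + sign * amount, sq + sign * amount * amount)


def apply_mutation(agg, before=None, after=None):
    if before is not None:   # update or delete: retract the old row
        agg = apply_amount(agg, before, -1)
    if after is not None:    # update or insert: add the new row
        agg = apply_amount(agg, after, +1)
    return agg


agg = (3, 70.0, 2100.0)                              # running (count, sum, sum_of_squares)
agg = apply_mutation(agg, before=40.0, after=400.0)  # transaction 2 updated: 40 -> 400
agg = apply_mutation(agg, before=30.0)               # transaction 3 deleted
agg = apply_mutation(agg, after=25.0)                # transaction 4 inserted
print(agg)
```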

Imagine you didn't have the mutations and just had these two tables. You would only be able to compute features as of midnight of 9/20 and midnight of 9/21, but not in between. Having the mutations allows us to recreate the state of the table at any point in time. A related approach is to use services as sources of features. This is an extremely bad idea, and we'll see why. A model requests features from a service, and the service reaches back into a database and probably does a range scan or an aggregation. This requires significant disk bandwidth, and it can slow down other queries to the database and slow down your feature request. Amplified across hundreds of features, this makes the read latency very bad.

The other interesting thing is that the model owner is usually not the service owner. There is a payments team maintaining this table, and there is a payments risk team trying to predict the validity of a transaction. These are different teams. You would need to go and convince someone on the payments team to do the wrong thing, and that's usually impossible. This is where entities really help. Instead of talking to the table or the service directly, you can look at the change data stream and the snapshot. You can use them to compute features before the query even arrives, and maintain a feature index. The model simply fetches the features from the feature index. Notice that we are no longer required to talk to the service owner, we no longer need to hit the transaction table directly with slow scans, and we no longer put the other data flows that update the table at risk. We have our own isolated system that maintains the features now. This is what makes entities extremely powerful.

Just to recap: change data gives us agility. You don't need to cut across organizational boundaries and convince other people to burden their live systems with range scans. It also gives us isolation, because we're no longer impacting the producer of that data. Then there's efficiency: because we are not doing range scans and most of it is pre-computed, the features are fetched at much lower latency. Last is scalability: because the features are pre-aggregated, you can look at much larger time ranges to construct them, instead of limiting yourself to smaller ranges. This comes with some limitations. There are two classes of aggregations. The first class, with examples like sum, count, mean, and variance, can be subtracted from.

When there is an update or a delete, you can subtract events from the aggregation. That's not the case for other aggregations, like first, last, min, and max; those don't allow you to subtract. In Chronon, the batch job, which as we discussed updates the mid and tail portions of the window, corrects away the inconsistencies that are created by these subtractions. To step back, the variation between the offline and the online system is a huge reason why a lot of ML use cases don't succeed when they go online. Having this batch job that regularly corrects inconsistencies helps us keep the features consistent.

The last source is what's called a cumulative source. These are insert-only tables, which means you don't need to look at mutations at all; you know that rows are only ever inserted. The snapshot itself is enough: each new partition is a superset of any old partition, the latest partition is enough, and there are no deletes or updates. Just to show the example, this is how you would represent transactions if you're using slowly changing dimensions or cumulative events. You would have the same data in the next partition with a few additional rows. If you look, you really don't need the previous partition; the latest one is enough to recompute the feature at any date, or any particular point in time. You don't need separate mutation tables to understand what a feature value was at historical points in time.

We simply add what's called a surrogate column that says between what times this row was valid. Whenever there is a new row, it just doesn't have an end time, which means it is current. The benefit is that if you store only the most recent partition, you reduce repetition: you don't need to store transaction ID 1 again and again. You can simply delete the older partition and keep just the latest. This also makes computing over this data very efficient.
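
A small sketch of that validity-range idea: each row carries the times between which it was valid, a missing end time means the row is current, and a point-in-time read is just a filter on that range. The rows and dates are illustrative.

```python
rows = [
    {"txn_id": 1, "amount": 10.0, "valid_from": "2023-09-20", "valid_to": "2023-09-21"},
    {"txn_id": 1, "amount": 12.0, "valid_from": "2023-09-21", "valid_to": None},  # current
    {"txn_id": 2, "amount": 40.0, "valid_from": "2023-09-20", "valid_to": None},  # current
]


def as_of(rows, day):
    """Rows that were valid on a given day (string compare works for ISO dates)."""
    return [r for r in rows
            if r["valid_from"] <= day and (r["valid_to"] is None or day < r["valid_to"])]


print([r["amount"] for r in as_of(rows, "2023-09-20")])  # [10.0, 40.0]
print([r["amount"] for r in as_of(rows, "2023-09-22")])  # [12.0, 40.0]
```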

Those are the three sources. We talked about events, which can be real time, using event streams and logs, or fact tables. We have entities, which are table snapshots, or dim tables, plus mutations. We have cumulative sources, which are the slowly changing dimensions we talked about. In Chronon, users just tell us the type of the source. They don't tell us which ranges to scan, and they don't tell us how to optimize the query; they just tell us the data model, or the pattern, of the data. We figure out, behind the scenes, the right partition range to scan, the minimum required partition range. This was added as part of a rewrite.

It was added because we realized that manually specifying these ranges is a huge source of errors in feature computation. If people just tell us what the pattern of the data is, and don't specify the ranges, it's much easier to do the right thing and avoid errors. We talked about the different source modes as well, stream and batch, and we talked about services. The problem with services is that you cannot backfill, or compute features at historical points in time. They are also slow, very high effort, very hard to isolate, and it's very hard to convince people to serve features from their services. We recommend using CDC instead of talking to services directly.

Join

Next, we'll talk about the join. Join is a primitive that allows you to put together hundreds of features into a feature vector. It also allows you to define labels, and to pull from external services to build that feature vector. Before we talk about the details behind the join, we should talk about something called log-and-wait. When there is a real-time feature, and we need to serve it, what we usually do is take the raw data, build it into a serving system, begin logging, and wait for those logs to be enough to train a model.

Only then do we know if this particular feature is useful. To tell if your idea is going to work, you first need to build the serving layer, then wait until you have enough logs, and only then will you be able to tell if it's useful or not. This process takes months. What you can do instead is simply backfill: compute the features at historical points in time and simulate them.

Going back to our example, we have this table of queries. These are the points in time at which we need to produce features, and we are producing the z-score feature. We talked about how we would do this with a serving system. What Chronon can do is backfill them without creating the streaming pipeline or keeping an index: it just looks at the raw historical logs and creates these z-scores for you. You can train a model and immediately know whether the z-score is useful, without waiting six months and without doing all that additional work. This speeds up your iteration time, or experimentation time, considerably. This is the same example we showed before, but with the source.
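
A toy sketch of what a backfill does: for each (user, query time, amount) row in the query table, aggregate only the transactions that happened before that time and compute the z-score as it would have looked then, with no serving system in the loop. The history and query rows are made up.

```python
import math

history = [  # (user, timestamp, amount) from the raw warehouse logs
    ("leela", 100, 12.0), ("leela", 200, 15.0),
    ("leela", 300, 11.0), ("leela", 400, 500.0),
]

queries = [("leela", 350, 14.0), ("leela", 450, 20.0)]  # points in time to backfill


def z_score_as_of(user, ts, amount):
    amounts = [a for u, t, a in history if u == user and t < ts]
    mean = sum(amounts) / len(amounts)
    variance = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    std = math.sqrt(variance)
    return (amount - mean) / std if std > 0 else 0.0


for user, ts, amount in queries:
    print(user, ts, round(z_score_as_of(user, ts, amount), 2))
```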

Join is an enrichment primitive. We take this transaction table, which has all the chargebacks, the fraudulent transactions, and we enrich it with new features. If you have another feature idea, you would just add a line here. We are able to compute the entire training set, train a model, and judge whether that feature is useful or not.

Just to visualize, a join allows you to put together features from different feature groups, or GroupBys, and from external services, and also to incorporate contextual features. Amount, for example, was a contextual feature on which we computed the z-score. You can also specify labels. This allows you to create a wide feature vector, or wide training dataset. The interesting thing about GroupBys is that they can be both served online and backfilled. If you use services, you have to rely on logs and log-and-wait. Labels are only used offline, to generate training sets. This is what a Chronon join does: it maintains all of these moving parts and evolves the training set as you update it. It's also used for serving.

Just to summarize, it's an enrichment primitive. It can enrich tables with features, it can reply to requests with features, and it can enrich topics or streams with features. The interesting thing is that each of the join parts has an implicit key. We looked at different keys: user, merchant, and the user-merchant pair. Each of these join parts is already keyed, and the keys are implicit; the join knows how to fetch the right data and fan out to the right data source. We also generate the names of these features automatically for you. The name is based on a few things: the name of the feature group, the name of the input column, the name of the operation (mean or variance), and the name of the bucket.

If you're bucketing by the transaction category, the bucket would be category. The window is the amount of time used for aggregating the feature. We get names that look like this: transaction features where we're computing the mean of the amount, bucketed by the category, over the last 7 days. It's long, but the idea is that reading the name gives you a quick sense of what the feature is doing, and where to look, or which files to look at, if there is an issue with it. One thing we did not talk about here is how we actually scale the backfills. That requires custom optimizations, and there are other talks we have given in the past about optimizing these backfills and computing features in history.
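
The exact separator and ordering Chronon uses are not shown in the transcript, but the scheme could be sketched roughly like this, with illustrative group and column names:

```python
def feature_name(group, column, operation, bucket=None, window=None):
    """Concatenate feature group, input column, operation, bucket, and window."""
    parts = [group, column, operation]
    if bucket:
        parts.append("by_" + bucket)
    if window:
        parts.append(window)
    return "_".join(parts)


print(feature_name("txn_by_merchant", "amount", "mean",
                   bucket="category", window="7d"))
# txn_by_merchant_amount_mean_by_category_7d
```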

Recap

We talked about GroupBy, which is the aggregation primitive and the unit of reuse. We talked about sources: there are many different kinds, they can be batch or streaming, and they can be events, entities, or cumulative. Users just tell us the type of the source, and we take care of doing the right thing for aggregations. We talked about join, which is an enrichment primitive that produces training data and serves features. This is mostly what users interact with: they run a join to generate training data, and they prepare that join to serve features. There is a lot we didn't talk about.

We didn't talk about the algorithms to scale backfills, or about observability, or metadata, which is also autogenerated based on the joins. We generate dashboards and alerts. We also generate logs and maintain the schema of these logs as joins evolve. We didn't talk about how to compose these different things, creating chains of GroupBys that feed into a join, which feeds into another GroupBy. These are all things that are possible with Chronon.

Questions and Answers

Participant 1: You mentioned that if you give Chronon the entities, it will derive the data. If you give it an incremental event stream, it can construct the point-in-time values?

Simha: Exactly.

Participant 1: It does feature generation, if you have the primitives of aggregations and all that. What about feature selection and elimination after that?

Simha: That is something users do themselves. They want to test their ideas and remove the ideas that don't work. The framework helps you evolve your data when you do that; it's like removing feature groups from a join, or adding new feature groups, essentially.

Luu: Is this open source or about to be? What's the current state?

Simha: The current state is that it's in a closed beta. We are working with a few other companies to deploy it there, see what happens, and incorporate some early feedback before open sourcing.

Luu: At some point in the near future, it looks like this framework is going to be open source. I think that's going to be very useful for a lot of companies that are doing feature engineering at scale, with all these very interesting features.

 


 

Recorded at:

Aug 09, 2024
