To make feature creation for machine learning models more productive and scalable, Airbnb has built Chronon, a solution that creates the infrastructure required to turn raw data into features for training and inference.
Going from raw data to features for training ML models is a complex and time-consuming task, explains Airbnb engineer and Chronon creator Nikhil Simha. It requires engineers to extract data from Airbnb's data warehouse and write complex ETL logic to convert it into features. An additional stumbling block is the need to ensure that the logic produces the same feature distribution for inference as for training.
Chronon attempts to address those issues, says Simha, by allowing ML engineers to define features and centralize data computation in a replicable way across training and inference.
As a user, you need to declare your computation only once, and Chronon will generate all the infrastructure needed to continuously turn raw data into features for both training and serving. ML practitioners at Airbnb no longer spend months trying to manually implement complex pipelines and feature indexes. They typically spend less than a week to generate new sets of features for their models.
A first component of Chronon enables ingesting data from a variety of sources, including event data sources, entity data sources, and cumulative event sources, each of which collects different types of data.
Once ingested, data can be transformed using SQL-like operations and aggregations, which produce low-latency endpoints for serving models online and Hive tables for offline training. Under the hood, Chronon builds pipelines using Kafka, Spark/Spark Streaming, Hive, and Airflow. SQL-like operations include GroupBy, Join, and StagingQuery, the last of which runs Spark SQL queries computed offline once a day. Aggregations include windows, buckets, and time-based aggregations.
Finally, a Python API is also available, which provides SQL-like primitives and treats time-based aggregation and windowing as first-class concepts. For example, using the Python API, you can define a feature such as the number of times a user viewed an item in the last five hours, applying filters and transformations to the underlying events.
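To make the five-hour example concrete, here is a minimal sketch in plain Python of what such a windowed aggregation computes. It deliberately does not use Chronon's own API: the function name `count_views_in_window`, the event fields (`user`, `item`, `event_type`, `ts`), and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_views_in_window(events, as_of, window=timedelta(hours=5)):
    """Count "view" events per (user, item) in the window (as_of - window, as_of].

    Hypothetical helper for illustration only; this is not Chronon's API.
    """
    start = as_of - window
    counts = defaultdict(int)
    for event in events:
        # Filter step: keep only view events.
        if event["event_type"] != "view":
            continue
        # Window step: keep only events inside the trailing window.
        if start < event["ts"] <= as_of:
            counts[(event["user"], event["item"])] += 1
    return dict(counts)


# Hypothetical sample events, expressed relative to a fixed "now".
now = datetime(2023, 7, 1, 12, 0)
events = [
    {"user": "u1", "item": "i1", "event_type": "view", "ts": now - timedelta(hours=1)},
    {"user": "u1", "item": "i1", "event_type": "view", "ts": now - timedelta(hours=4)},
    {"user": "u1", "item": "i1", "event_type": "view", "ts": now - timedelta(hours=6)},
    {"user": "u1", "item": "i2", "event_type": "purchase", "ts": now - timedelta(hours=2)},
]

features = count_views_in_window(events, as_of=now)
print(features)
```

In this sketch the six-hour-old view falls outside the window and the purchase is filtered out, so only two views of `i1` by `u1` are counted. A system like Chronon would run the same logic continuously over streams and batch tables rather than over an in-memory list.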
One important concept in Chronon is that of accuracy, that is, how frequently feature values are updated: in real time or at fixed intervals. The right accuracy depends on the specific use case, so Chronon lets users specify the accuracy of a computation as either temporal or snapshot.
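The difference between the two accuracy modes can be illustrated with a small plain-Python sketch. It rests on stated assumptions rather than on Chronon's implementation: temporal accuracy is modeled as "use everything up to the query time", and snapshot accuracy as "use everything up to the most recent midnight", with a daily interval assumed purely for illustration. The `Accuracy` enum and both functions are hypothetical names.

```python
from datetime import datetime
from enum import Enum


class Accuracy(Enum):
    """Hypothetical stand-in for the two accuracy modes described above."""
    TEMPORAL = "temporal"   # feature reflects events up to the query time
    SNAPSHOT = "snapshot"   # feature reflects events up to the last refresh


def effective_cutoff(query_time, accuracy):
    """Return the latest event time a feature value is allowed to reflect."""
    if accuracy is Accuracy.TEMPORAL:
        return query_time
    # Assumption: snapshots are refreshed once a day at midnight.
    return query_time.replace(hour=0, minute=0, second=0, microsecond=0)


def view_count(view_times, query_time, accuracy):
    """Count views visible to the feature under the chosen accuracy."""
    cutoff = effective_cutoff(query_time, accuracy)
    return sum(1 for ts in view_times if ts <= cutoff)


# Hypothetical data: one view late yesterday, one view this morning.
views = [datetime(2023, 7, 1, 23, 0), datetime(2023, 7, 2, 9, 0)]
query_time = datetime(2023, 7, 2, 10, 0)

temporal_count = view_count(views, query_time, Accuracy.TEMPORAL)
snapshot_count = view_count(views, query_time, Accuracy.SNAPSHOT)
print(temporal_count, snapshot_count)
```

Under these assumptions the temporal feature sees both views, while the snapshot feature sees only the view recorded before midnight. The trade-off sketched here is freshness against the cost of keeping a value continuously up to date.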
At the time of writing, it is not clear whether Airbnb will make Chronon available on GitHub, but the discussion in the original article is an interesting read if you want to set up your own feature engineering pipeline.