Key Takeaways
- Data scientists spend most of their time building out architecture and preparing data, NOT building models
- Feature stores enable features to be registered, discovered, used, and shared for ML pipelines throughout a company
- A combined OLTP/OLAP RDBMS overcomes the latency and complexity in the traditional online (real-time NoSQL) and offline (batch SQL Data Warehouse) feature store architecture
- By streamlining the process of operationalizing machine learning, ML models become far easier to manage, implement, and operate
Machine learning has now entered its business heyday. Almost half of CIOs were predicted to have implemented AI by 2020, a number that is expected to grow significantly in the next five years. Despite this, seven out of ten executives whose companies had made investments in artificial intelligence reported minimal or no impact from them, according to a 2019 research report from MIT. Why? Because creating a machine learning model and putting it into operation in an enterprise environment are two very different things. The biggest challenge for companies looking to use AI is operationalizing machine learning, the same way DevOps operationalized software development in the 2000s. Simplifying the data science workflow by providing the necessary architecture and automating feature serving with feature stores are two of the most important ways to make machine learning easy, accurate, and fast at scale.
Data Science Workflow
In a typical data science silo, a new project follows a pipeline:
- collecting relevant data from data sources,
- cleaning and organizing the data,
- transforming the data into useful features,
- building out the architecture to run a model,
- training a model on the features, and
- deploying a model.
Even though training and deploying the machine learning model may seem like the most important part of this work, only 20% of a data scientist’s time is actually spent training and deploying models; the remaining 80% is spent on data preparation.
Considering how scarce and expensive data scientists are, this inefficiency is far from ideal. The good news is that there are many approaches to MLOps that streamline ML architecture and the data-to-feature process.
Data Science Architecture
The days of simple models whose training data fits in a spreadsheet on a laptop are over. Datasets now routinely run to tens of terabytes, and often to petabytes. Building the distributed computing infrastructure needed to execute feature pipelines and model training at this scale is a major headache for data scientists.
Containers have become necessary for these large-scale deployments of machine learning. However, in order to deploy a model, a data scientist has to work with machine learning engineers to build each container and manually wrap it in a RESTful API so the model can be called from applications. This is unnecessary coding, and, frustratingly, scaling a model requires creating new containers. Kubernetes has become the industry standard for automating container orchestration so that data scientists can scale their models automatically, but executing a Kubernetes integration requires data scientists to become Kubernetes experts, or to collaborate with a machine learning engineer who is.
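To give a sense of the boilerplate involved, here is a minimal sketch of the kind of hand-written REST wrapper each containerized model needs; Flask, the model file name, and the endpoint are illustrative choices, not prescribed by any particular platform.

```python
# Minimal sketch of the hand-written REST wrapper a containerized model
# needs. The model file, input format, and port are illustrative.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model baked into the container image.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [0.2, 1.7, 3.1]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Every model needs a variant of this file, plus a container image and deployment manifests, before an application can call it.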
Database deployment is a newer approach that makes deploying a model as easy as possible. The container-based endpoint models discussed above require excessive software engineering; with database deployment, it takes only one line of code to deploy a model. The database-deployment system automatically generates a table and a trigger that embody the model's execution environment. No more messing with containers. All a data scientist has to do is insert records of features into the system-generated predictions table, and the system automatically executes a trigger that runs the model on the new records. This saves time for future retraining too, since the prediction table holds all the new examples to add to the training set. Predictions stay continuously up to date, easily, with little to no manual code.
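As a rough sketch of what that one line can look like, the snippet below is modeled on Splice Machine's MLflow-based deployment API; the schema, table, run ID, and parameter values are illustrative assumptions, not a verbatim API reference.

```python
# Sketch of one-line database deployment, modeled on Splice Machine's
# MLflow integration; schema, table, and run ID are illustrative.
from splicemachine.mlflow_support import *  # attaches deploy_db() to mlflow

run_id = "abc123def456"  # hypothetical MLflow run that logged the model

# One call: the system generates a predictions table plus a trigger that
# runs the model on every newly inserted feature record.
mlflow.deploy_db(
    db_schema_name="retail",
    db_table_name="recommendations",
    run_id=run_id,
    primary_key={"CUSTOMER_ID": "INT"},
    create_model_table=True,
)
```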
Feature Stores
The other major bottleneck in the ML pipeline happens during the data transformation process: manually transforming data into features and serving those features to the ML model is time-intensive and monotonous work.
A Feature Store is a shareable repository of features made to automate the input, tracking, and governance of data into machine learning models. Feature stores compute and store features, enabling them to be registered, discovered, used, and shared across a company. A feature store makes sure features are always up to date for predictions and maintains the history of each feature’s values in a consistent manner, so that models can be trained and re-trained.
Feature Store Architecture
Figure 1: Feature Store Architecture
Feature stores are fed by pipelines that transform raw data into features. These features can then be defined, declared into groups, and assigned meta-data that makes them easier to search for. Once the features are in the store, they are used to create:
1. Training Views. These are arbitrary SQL statements that join labels with feature sets, allowing data scientists to ensure that features and labels are point-in-time consistent. Training Views can be used to create many different training sets that vary in time window and in which subset of features they include.
Creating a training view looks like the sketch below, modeled on the Splice Machine feature store's Python SDK (the view name, SQL, and column names are illustrative):
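```python
# Sketch based on the Splice Machine feature store Python SDK; the view
# name, SQL, and column names are illustrative.
from splicemachine.features import FeatureStore

fs = FeatureStore()  # connection setup omitted for brevity

fs.create_training_view(
    "customer_purchases",
    sql="""
        SELECT customer_id, purchase_ts, purchased_flag AS label
        FROM retail.purchases
    """,
    primary_keys=["customer_id"],  # entity key that features join on
    ts_col="purchase_ts",          # event time for point-in-time joins
    label_col="label",             # the prediction target
)
```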
2. Training Sets. These are collections of designated features over a specific time window, sometimes combined with labels (in supervised learning), that are used to train models. It's a simple call to retrieve a training set with point-in-time historical feature values from the feature store.
Figure 2: Training models in Feature Store
Creating a Spark dataframe from a training set looks like this sketch (again modeled on the Splice Machine Python SDK, with illustrative view and feature names):
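```python
# Sketch: retrieve a point-in-time consistent training set as a Spark
# DataFrame. View and feature names are illustrative.
from splicemachine.features import FeatureStore

fs = FeatureStore()  # connection setup omitted for brevity

train_df = fs.get_training_set_from_view(
    "customer_purchases",
    features=["avg_spend_30d", "avg_spend_60d", "avg_spend_90d"],
)
train_df.show(5)
```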
3. Feature Serving. Features are served to models directly from the feature store. It’s a straightforward call to retrieve the feature vector for an entity as an input to a model for predictions.
Figure 3: Feature serving in Feature Store based ML solutions
Getting the most up-to-date features looks like this sketch (method and argument names again follow the Splice Machine Python SDK and are illustrative):
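```python
# Sketch: fetch the freshest feature values for a single entity as the
# input vector for a prediction. Names are illustrative.
from splicemachine.features import FeatureStore

fs = FeatureStore()  # connection setup omitted for brevity

vector = fs.get_feature_vector(
    features=["avg_spend_30d", "total_spend_today"],
    join_key_values={"customer_id": 42},  # the entity to look up
)
```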
These mechanisms allow feature stores to enable:
- Automated Data Transformation. Feature Stores manage data pipelines that transform raw data into feature values. These can be scheduled pipelines that aggregate petabytes of data at a time (like calculating the average 30-, 60-, and 90-day spending amounts of each customer of a large retailer), or real-time pipelines that are triggered by events and update feature values instantly (like updating the sum total of today's spending for a particular customer every time they swipe their credit card).
- Real-Time Feature Serving. Feature stores serve a single vector made up of the freshest feature values to machine learning models. For example, if an application wants to recommend a particular product to a user, the model may need to know the average amount that user has spent in a particular spending category as well as the total length of time they have spent shopping in the last 48 hours. The Feature Store has the most up-to-date values for those metrics immediately available and can serve those features in milliseconds, instead of having to run the data pipeline to calculate them on demand.
- Feature Registry. A feature registry is a central interface for cataloguing feature definitions within an organization. A feature registry contains standardized feature definitions and associated metadata to act as a single source of information for an organization. The Feature Store makes searching through available Features and Feature definitions simple and straightforward. It exposes APIs and UIs to the data scientist to see currently available features, pipelines, and training datasets that are either being used in production models or under development. Data scientists can then pick and choose the features needed for their use case, and incorporate them into models without any extra code.
- Model Training and Retraining. A feature store organizes older features into a time-series database so that when models are trained, the examples all have features aligned at the same time. Because all historical feature values are stored along with their timestamps, the Feature Store can generate entire training datasets for features, and align them properly with labels for training. As those Features are updated, the Feature Store can generate updated training datasets for model retraining in exactly the same way.
- Model Monitoring. When all previous predictions from models are stored along with the inputs to the model at that time, monitoring the model is as simple as executing SQL queries, as the sketch below shows. This allows users to monitor the model's performance and keep track of any feature drift, model prediction drift, and model accuracy (when labels become available). Because the Feature Store keeps all feature values up to date and retains all historical values in a time-consistent manner, monitoring models with it is easy.
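As an illustration, the query below computes the daily average of one input feature and of the model's predictions, the raw material for drift monitoring. The table, columns, and the PySpliceContext-style splice.df() helper are assumptions for the sketch.

```python
# Sketch: with predictions and their input features in ordinary tables,
# drift monitoring reduces to SQL. Table and column names are hypothetical.
# `splice` is assumed to be a PySpliceContext created elsewhere.
drift_df = splice.df("""
    SELECT CAST(prediction_ts AS DATE) AS day,
           AVG(avg_spend_30d)          AS mean_spend_feature,
           AVG(prediction)             AS mean_prediction
    FROM retail.recommendations
    GROUP BY CAST(prediction_ts AS DATE)
    ORDER BY day
""")
drift_df.show()
```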
Benefits of a Feature Store
Figure 4: Feature Store architecture components
- Improved Data Science Productivity. Feature stores take the mundane, tedious, and time-intensive data tasks out of the equation so data scientists can shift their focus from rote data plumbing to model building and experimentation.
- Enhanced Model Accuracy. Feature stores enable more accurate models by taking data freshness and consistency to a whole new level. By separating the data pipeline from the ML model, large aggregation-based features that may take hours to compute can be retrieved immediately when needed. This gives real-time models access to feature values they wouldn’t have otherwise. By having access to real-time data, models can predict more accurately based on what’s happening in the real world, instead of being stuck on yesterday’s data. Additionally, maintaining historical features in a time-series automatically guarantees consistent training, which is often difficult with bespoke training sets.
- Model Transparency. When a regulator is auditing a company's practices that are powered by ML models, a feature store offers transparency to the lineage of the predictions. You know what features went into the model. You know what data populated those features in a training set. And in some feature stores, you can even return to the database state when the model was trained. We call this time-travel.
Figure 5: Data flow in a Feature Store based solution
Data Types
Currently, all the feature stores on the market serve two different types of data separately. "Online" data includes streaming key values that are updated in real-time. "Offline" data includes large volumes of analytical data that are often processed in batches. In traditional feature stores, these are processed in separate data pipelines.
- Real-time data pipelines are needed to drive ML models that are used to react directly to end user interactions in real-time. In such cases, the source of the feature needs to be connected in real-time to the feature store either through streaming, direct database insert, or update operations. As these transactions are processed, they generate new feature values just in time for a subsequent inference that reads them. This kind of operation typically affects a small number of rows at a time but has potential for very high concurrency. This low latency and high concurrency processing is usually characteristic of OLTP databases.
- Batch data pipelines occur periodically (typically once a day or weekly). They process large amounts of source data by extracting, loading, cleansing, aggregating and otherwise curating data into usable features. Transforming large amounts of data usually requires parallel processing that can scale. This high volume data processing found in massively parallel processing database engines is usually referred to as OLAP.
However, having separate data pipelines for the two kinds of data has significant drawbacks, the most obvious of which is having to pay for and manage two separate data engines. In addition, moving data between two different pipelines increases latency, as it’s hard to keep two separate pipelines in sync.
There is another way. Using an OLTP/OLAP combined database allows you to store both types of data in the same platform, which will:
- Decrease latency between the two pipelines
- Eliminate the need to pay for and manage two data engines
- Eliminate the complexity of keeping two separate pipelines in sync, making the data highly available
Few databases support both kinds of workloads while also providing horizontal scalability, but an ACID-compliant engine whose OLTP and OLAP sides can scale independently would seamlessly deliver both real-time and batch inference directly on the Feature Store.
Example - Recommendation Engine
To help conceptualize how a feature store can really revolutionize the data science workflow, let’s take the example of an e-commerce business. This business wants to build an ML model that will provide customers with personalized item recommendations.
In order to suggest what product a customer should look at next, the model has to be fed data that can help it predict a customer’s buying behavior. This could include the last item the customer looked at on the site, an online feature, which would come from an operational or transactional database like MongoDB, DB2, or Oracle. Another input for the model could be a customer’s average monthly spending, a pre-calculated offline or batch feature, which would come from a data warehouse like Redshift or Snowflake.
Figure 6: Recommendation engine data model
Without a feature store, a data engineer has to manually create bespoke ETL pipelines to feed data to the ML model. They have to write the code that moves data from the warehouse to the training sets, and from the database to Jupyter notebooks to model deployment, and they have to manually engineer the data into usable features.
Figure 7: Traditional ETL pipelines in ML models
This is not easy to scale, as a new ETL pipeline has to be created for each data source for each different model. Though this works, pipelines are not always used consistently, and scaling an ML operation this way is incredibly time-intensive.
With a feature store, data engineers only have to build one data pipeline from each data source to the feature store, and as many machine learning models as necessary can pull their features from that central place. Architecturally, building twenty machine learning models becomes the same amount of work as building two, which makes the model pipeline incredibly scalable.
Figure 8: ETL pipeline in Feature Store based ML models
For this company, building out the feature store is as simple as the following sketch (modeled on the Splice Machine Python SDK; schema, table, and feature names are illustrative):
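```python
# Sketch of registering the recommendation engine's features, loosely
# following the Splice Machine feature store Python SDK. All schema,
# table, and feature names are illustrative.
from splicemachine.features import FeatureStore

fs = FeatureStore()  # connection setup omitted for brevity

# One feature set backs both the online values and the offline history.
fs.create_feature_set(
    schema_name="retail_fs",
    table_name="customer_features",
    primary_keys={"customer_id": "INTEGER"},
)

# An online feature, updated in real time from the transactional database.
fs.create_feature(
    schema_name="retail_fs",
    table_name="customer_features",
    name="last_item_viewed",
    feature_data_type="VARCHAR(64)",
    feature_type="categorical",  # feature_type simplified to a string here
    desc="Most recent item the customer viewed on the site",
)

# An offline (batch) feature, aggregated periodically from the warehouse.
fs.create_feature(
    schema_name="retail_fs",
    table_name="customer_features",
    name="avg_monthly_spend",
    feature_data_type="DOUBLE",
    feature_type="continuous",
    desc="The customer's average monthly spending",
)
```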
Online / Offline Feature Stores
As explained above, most feature stores have two separate databases for online and offline features. In this case, the user has to manually manage the two kinds of features, which can become inconsistent. A great deal of time and energy goes into keeping the two databases in sync, and that work can be undone as soon as one of them goes down for one reason or another.
Figure 9: Online and offline Feature Store databases
Data parity is much easier to achieve when both online and offline features are stored in a single system. Linking the feature store to an ACID-compliant hybrid database makes a discrepancy between the two stores impossible: whenever a new feature value arrives, database triggers keep the online live feature values and the offline feature histories consistent. This is more efficient, requires less code, and is less prone to consistency problems.
Figure 10: Single database for Feature Store online and offline use cases
Feature Stores on the Market
While there are a number of technology companies who have built their own feature stores for proprietary use (including Uber, Airbnb, and Netflix), there are also a few open-source options on the market for companies who want to use a feature store but don’t want to build it themselves. Feast, Splice Machine, and Hopsworks are the three leading open-source solutions.
Feast is an open-source feature store with a very active community. Feast has great documentation, offers SDKs for Python, Go, Java, and others along with a command-line interface, and is currently adding integrations with Kubeflow. Feast recently added Spark and Kubernetes support for more scalable solutions. Currently, Feast's only online store is Redis.
Feast's major weakness is that it doesn't yet offer an offline store, though one is planned. This means it won't automate the input of historical feature values, which is a huge part of what a feature store does. If you were to use Feast right now, you would need to provide the historical values of features yourself, and it would be your responsibility to ensure full lineage of your data.
Hopsworks by Logical Clocks is another open-source option. Their feature store is a modular component of the Logical Clocks offering, so you aren’t forced to use the entire system. They also have a completely unified UI that covers end-to-end MLOps. The Feature Store UI is pretty sophisticated, allowing the user to search for features, feature groups, training sets, etc. You can link features and groups to the pipelines that created them, using Apache Airflow for pipeline integration. Security is also built into the system.
Hopsworks is still working toward unity across the entire pipeline. Currently, users need to write scripts outside of the working environment to deploy models and create pipelines. In addition, Hopsworks depends on external systems for its store: its offline feature store requires Hive.
The Splice Machine Feature Store contains all the functionality of the other stores, with a key difference: its single engine stores online and offline data in the same place, allowing it to be ACID-compliant and singularly consistent. Even after Feast adds an offline store, it cannot guarantee consistency between the two stores, because they are distinct engines. Splice Machine offers the only feature store that can 100% guarantee that your offline and online data will be consistent, because it's the only one that holds all the data in the same store.
Conclusion
In the vast majority of data science silos, data scientists aren't able to focus on what they do best. They spend most of their time and energy building ML architecture and preparing data for models. By providing the necessary architecture for easy model deployment and streamlining the feature process, building a feature store on top of a combined OLTP/OLAP database can increase data science productivity one hundred times over.
About the Author
Monte Zweben is the co-founder and CEO of Splice Machine, a scalable SQL database that makes data science and Machine Learning easy. A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the deputy chief of the artificial intelligence branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then transitioned to the entrepreneurial world, founding the industry-leading Blue Martini and Red Pepper Software startups. Monte has published articles in the Harvard Business Review, various computer science journals, and conference proceedings. He was Chairman of Rocket Fuel Inc. and serves on the Dean’s Advisory Board for Carnegie Mellon University’s School of Computer Science.