A Feature Store is a core component of next-generation ML platforms that empowers data scientists to accelerate the delivery of ML applications. It enables teams to track and share versioned features, and to serve those features for model training, batch scoring, and real-time predictions. Mike Del Balso from Tecton.ai and Geoff Sims from Atlassian recently spoke at the Spark + AI Summit 2020 conference about feature-store-driven ML development.
Del Balso talked about machine learning process shortfalls like limited predictive data, long development cycles, and a painful path to production that typically involves multiple teams, significant resources, and divergent implementations. He spoke about operational ML, which consists of applied ML solutions that drive user experiences in use cases like fraud detection, click-through rate (CTR) prediction, recommendation, and search. Building operational ML applications is very complex, and data is at the core of that complexity. The actual ML code makes up a small portion of the overall effort compared to tasks like configuration, data collection, feature engineering, and resource management.
Features are major building blocks of any ML application but the current tooling for managing features is not where it needs to be. There is a need to automate the process of deploying and operating the feature pipelines in production, including feature engineering and feature serving.
Del Balso discussed Tecton, a data platform for machine learning applications that automates the full operational lifecycle to make it easy for data science teams to manage features throughout a typical ML process. It extracts data from batch or real-time data sources, transforms that data through feature pipelines, and organizes the resulting feature values in a Feature Store. Data platforms for ML solve critical problems like managing sprawling and disconnected feature transformation logic, building quality training sets from messy data, and deploying to production.
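The extract-transform-store flow described above can be sketched in plain Python. This is not Tecton's API; the event schema, the aggregation, and the dictionary standing in for the store are all illustrative assumptions.

```python
from datetime import datetime, timezone

import pandas as pd

# Raw transaction events: a stand-in for a batch data source.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [20.0, 35.0, 12.5],
    "ts": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-01"]),
})

def transaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Feature pipeline: per-user aggregates computed from raw events."""
    return (df.groupby("user_id")
              .agg(txn_count=("amount", "size"),
                   txn_total=("amount", "sum"))
              .reset_index())

# "Feature store": feature values keyed by entity, with a computation
# timestamp, so the same rows can back training sets and online lookups.
feature_store = {
    row.user_id: {"txn_count": row.txn_count,
                  "txn_total": row.txn_total,
                  "computed_at": datetime.now(timezone.utc)}
    for row in transaction_features(events).itertuples()
}

print(feature_store[1]["txn_count"])  # → 2
```

A real platform would run the transformation on a schedule or a stream and persist the values to offline and online stores, but the division of labor is the same: sources in, pipelines in the middle, keyed feature values out.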
ML features are among the most highly curated data in a business, yet they are also among its most poorly managed assets. Since each ML model typically has hundreds, if not thousands, of features to manage, this challenge makes it difficult to scale ML efforts across an organization. He recommended that features be managed as both feature data and the feature transformation code used to generate them.
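One way to read that recommendation is that a feature's transformation code should be registered alongside versioned metadata, so consumers always know which code produced which values. The registry shape and feature name below are hypothetical, a minimal sketch of the idea:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class FeatureDefinition:
    """Couples a feature's transformation code with its metadata, so the
    data and the code that produced it are versioned together."""
    name: str
    version: int
    transform: Callable[[dict], float]
    description: str = ""

registry: Dict[Tuple[str, int], FeatureDefinition] = {}

def register(feature: FeatureDefinition) -> None:
    registry[(feature.name, feature.version)] = feature

register(FeatureDefinition(
    name="avg_transaction_amount",
    version=1,
    transform=lambda user: user["total_spend"] / max(user["txn_count"], 1),
    description="Mean spend per transaction for a user.",
))

# Looking up a feature by (name, version) retrieves the exact code used.
feat = registry[("avg_transaction_amount", 1)]
print(feat.transform({"total_spend": 90.0, "txn_count": 3}))  # → 30.0
```

Publishing version 2 of a feature then means registering a new definition rather than silently editing the old one, which keeps historical feature values reproducible.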
He discussed some common challenges with assembling training data, like stitching multiple data pipelines together, data leakage, and delivering training data to training jobs. Data science and engineering teams also face problems when deploying their models to production and moving from a batch environment to real-time; some of these challenges relate to infrastructure provisioning and to drift and data-quality monitoring. An enterprise-grade Feature Store can manage both the training and the serving of features.
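The data-leakage problem when stitching training data together is concrete: each training example must only see feature values that were available at its prediction time, never future ones. A point-in-time join expresses this; the frames below are illustrative, using pandas `merge_asof` as one way to implement it:

```python
import pandas as pd

# Feature values, with the time each value became available.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2020-06-01", "2020-06-10", "2020-06-05"]),
    "txn_count_30d": [3, 7, 2],
}).sort_values("ts")

# Training labels, with the prediction timestamp of each example.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "ts": pd.to_datetime(["2020-06-05", "2020-06-06"]),
    "label": [0, 1],
}).sort_values("ts")

# For each label row, merge_asof picks the latest feature value at or
# before the label's timestamp -- never a value from the future.
training = pd.merge_asof(labels, features, on="ts", by="user_id")
print(training["txn_count_30d"].tolist())  # → [3, 2]
```

Note that user 1's label at 2020-06-05 gets the value 3 computed on 2020-06-01, not the leaked value 7 from 2020-06-10; a naive join on `user_id` alone could pull in that future value.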
Geoff Sims from Atlassian talked about how they used a Feature Store solution to automate content categorization in one of their popular products, Jira, by automatically generating labels for every issue tracked in Jira. They used the feature store to collect a large volume of events, store the features per model and update them in real time, and generate the features and predictions.