Amazon SageMaker Feature Store

A fully managed service for machine learning features

Why Amazon SageMaker Feature Store?

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used repeatedly by multiple teams, and feature quality is critical to model accuracy. In addition, when features used to train models offline in batch are also served for real-time inference, it is hard to keep the offline and online copies synchronized. SageMaker Feature Store provides a secure, unified store to process, standardize, and use features at scale across the ML lifecycle.

How it works

How it works: Amazon SageMaker Feature Store

Benefits of SageMaker Feature Store

Ingest features from any data source, including streaming and batch sources such as application logs, service logs, clickstreams, sensors, and tabular data from AWS or third-party data sources
Transform data into ML features and build feature pipelines that support MLOps practices and speed time to model deployment
Store, share, and manage ML model features for training and inference to promote feature reuse across ML applications

Feature Management

Feature processing and ingestion

You can ingest data into SageMaker Feature Store from a variety of sources, such as application and service logs, clickstreams, sensors, and tabular data from Amazon S3, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake. Using feature processing, you can specify your batch data source and feature transformation function (for example, count of product views or time window aggregates), and SageMaker Feature Store transforms the data into ML features at ingestion time. With Amazon SageMaker Data Wrangler, you can publish features directly into SageMaker Feature Store. With the Apache Spark connector, you can batch ingest a high volume of data with a single line of code.
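
As an illustration of the kind of feature transformation function mentioned above, the following is a minimal sketch in plain Python that computes a count of product views per user over a time window. It does not call any SageMaker API; the function name, the event fields (user_id, event_type, event_time), and the sample data are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_recent_views(events, as_of, window):
    """Count "view" events per user in the window (as_of - window, as_of].

    Hypothetical transformation function: field names are illustrative.
    """
    counts = defaultdict(int)
    start = as_of - window
    for event in events:
        in_window = start < event["event_time"] <= as_of
        if event["event_type"] == "view" and in_window:
            counts[event["user_id"]] += 1
    return dict(counts)


# Made-up clickstream records standing in for a batch data source.
events = [
    {"user_id": "u1", "event_type": "view", "event_time": datetime(2024, 1, 1, 9, 0)},
    {"user_id": "u1", "event_type": "view", "event_time": datetime(2024, 1, 1, 11, 30)},
    {"user_id": "u1", "event_type": "purchase", "event_time": datetime(2024, 1, 1, 11, 45)},
    {"user_id": "u2", "event_type": "view", "event_time": datetime(2024, 1, 1, 11, 50)},
]

features = count_recent_views(
    events,
    as_of=datetime(2024, 1, 1, 12, 0),
    window=timedelta(hours=1),
)
print(features)
```

In this sketch only views from the final hour are counted, so the 09:00 view and the purchase event do not contribute to the feature value.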


Feature storage, catalog, search, and reuse

SageMaker Feature Store tags and indexes feature groups so they are easily discoverable through the visual interface of Amazon SageMaker Studio. Browsing the feature catalog allows teams to discover existing features they can confidently reuse and avoid duplication of pipelines. SageMaker Feature Store uses the AWS Glue Data Catalog by default, but allows you to use a different catalog if desired. You can also query features using familiar SQL with Amazon Athena or another query tool of your choice.
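
To give a sense of what querying features with SQL can look like, the sketch below runs a "latest value per entity" query. It uses Python's built-in sqlite3 module as a local stand-in so that it runs anywhere; it is not an Athena call, and the table name (song_features) and its columns are hypothetical rather than the actual layout of an offline store table.

```python
import sqlite3

# In-memory database standing in for a queryable feature table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE song_features (song_id TEXT, avg_rating REAL, event_time TEXT)"
)
conn.executemany(
    "INSERT INTO song_features VALUES (?, ?, ?)",
    [
        ("s1", 4.2, "2024-01-01T00:00:00Z"),
        ("s1", 4.6, "2024-02-01T00:00:00Z"),
        ("s2", 3.9, "2024-01-15T00:00:00Z"),
    ],
)

# For each song, keep only the row with the most recent event_time.
query = """
SELECT f.song_id, f.avg_rating
FROM song_features AS f
JOIN (
    SELECT song_id, MAX(event_time) AS latest
    FROM song_features
    GROUP BY song_id
) AS m
  ON f.song_id = m.song_id AND f.event_time = m.latest
ORDER BY f.song_id
"""
latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```

The timestamps are ISO 8601 strings in the same time zone, so comparing them as text orders them chronologically.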

The image depicts the feature group catalog

Feature consistency

SageMaker Feature Store supports offline storage for training and online storage for real-time inference. Training and inference are very different use cases, and the storage requirements are different for each. During training, models often use the complete data set and can take hours to complete, while inference needs to happen in milliseconds and usually uses a subset of the data. When the offline and online stores are used together, SageMaker Feature Store keeps the two datasets in sync. This is critical because model accuracy can suffer if they diverge.
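
The relationship between the two stores can be sketched as a single write path that feeds both: an append-only history for training and a latest-value lookup for inference. The class below is a toy model written for this page, not the service's implementation; the class name, method names, and record fields are all hypothetical.

```python
class TinyFeatureStore:
    """Toy model of one write path feeding an offline and an online store."""

    def __init__(self):
        self.offline = []  # append-only history, used for training
        self.online = {}   # entity_id -> latest record, used for inference

    def put_record(self, record):
        # Every record is appended to the offline history.
        self.offline.append(record)
        # The online store keeps only the record with the newest event_time,
        # so a late-arriving older record does not overwrite a newer one.
        current = self.online.get(record["entity_id"])
        if current is None or record["event_time"] > current["event_time"]:
            self.online[record["entity_id"]] = record

    def latest_from_offline(self, entity_id):
        rows = [r for r in self.offline if r["entity_id"] == entity_id]
        return max(rows, key=lambda r: r["event_time"]) if rows else None


store = TinyFeatureStore()
store.put_record({"entity_id": "u1", "event_time": 100.0, "avg_rating": 3.0})
store.put_record({"entity_id": "u1", "event_time": 200.0, "avg_rating": 4.5})
# Late-arriving record with an older event time.
store.put_record({"entity_id": "u1", "event_time": 150.0, "avg_rating": 4.0})

online_row = store.online["u1"]
offline_row = store.latest_from_offline("u1")
print(online_row == offline_row)
```

Because both stores are populated from the same records, the value served online matches the most recent value a training job would read from the history.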

The image depicts the creation of a feature group

Time travel

Data scientists may need to train models with the exact set of feature values from a specific time in the past, without the risk of including data from beyond that time (also referred to as feature leakage). An example is using only the patient medical data that was available before a diagnosis. The SageMaker Feature Store Offline API supports point-in-time queries to retrieve the state of each feature at the historical time of interest.
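
The idea behind a point-in-time query can be sketched in a few lines of plain Python: for a given entity and cutoff, return the most recent record whose event time is not later than the cutoff. This is an illustration of the concept only, not the Offline API; the function name, record fields, and sample values are hypothetical.

```python
from datetime import datetime


def point_in_time_lookup(history, entity_id, cutoff):
    """Return the latest record for entity_id with event_time <= cutoff.

    Returns None when no record exists at or before the cutoff.
    """
    candidates = [
        r for r in history
        if r["entity_id"] == entity_id and r["event_time"] <= cutoff
    ]
    return max(candidates, key=lambda r: r["event_time"], default=None)


# Made-up feature history for one patient.
history = [
    {"entity_id": "p1", "event_time": datetime(2023, 1, 10), "systolic_bp": 120},
    {"entity_id": "p1", "event_time": datetime(2023, 3, 5), "systolic_bp": 135},
    {"entity_id": "p1", "event_time": datetime(2023, 6, 1), "systolic_bp": 150},
]

diagnosis_date = datetime(2023, 4, 1)
row = point_in_time_lookup(history, "p1", diagnosis_date)
print(row["systolic_bp"])
```

The June reading is recorded after the diagnosis date, so it is excluded and cannot leak into the training example.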

The image shows the flow of Feature Store Offline API queries to retrieve the state of each feature at the historical time of interest

Security and Governance

Lineage tracking

To enable feature reuse with confidence, data scientists need to know how features were built and which models and endpoints are using them. SageMaker Feature Store allows data scientists to track their features in Amazon SageMaker Studio with SageMaker Lineage. SageMaker Lineage lets you track scheduled pipeline executions, visualize upstream lineage to trace features back to their data sources, and view feature processing code, all in one environment.

The image shows the lineage of a feature group in SageMaker Studio

ML operations

Feature stores are a key component in the MLOps lifecycle. They manage datasets and feature pipelines, speeding up data science tasks and eliminating the duplicate work of creating the same features multiple times. SageMaker Feature Store can be used as a standalone service or together with other SageMaker services in an integrated manner across the MLOps lifecycle.

Security and compliance

To support security and compliance needs, you may need granular control over how shared ML features are accessed. These needs often go beyond table and column-level access control to individual row-level access control. For example, you may want to let account representatives see rows from a sales table for only their accounts and mask the prefix of sensitive data like credit card numbers. SageMaker Feature Store together with AWS Lake Formation can be used to implement fine-grained access controls to protect feature store data and grant access based on role.
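
The effect of such a policy can be sketched in plain Python: each account representative sees only their own rows, and all but the last digits of the card number are masked. This shows the intended outcome only. It is not how AWS Lake Formation is configured or enforced, and the function names, column names, and sample rows are hypothetical.

```python
def mask_prefix(value, visible=4, mask_char="*"):
    """Mask everything except the last `visible` characters of a string."""
    if len(value) <= visible:
        return value
    return mask_char * (len(value) - visible) + value[-visible:]


def rows_for_rep(rows, rep_id):
    """Row-level filter plus column masking for one account representative."""
    return [
        {**row, "card_number": mask_prefix(row["card_number"])}
        for row in rows
        if row["account_rep"] == rep_id
    ]


# Made-up sales rows; the card numbers are common test values.
sales = [
    {"account": "acme", "account_rep": "rep_a", "card_number": "4111111111111111"},
    {"account": "globex", "account_rep": "rep_b", "card_number": "5500000000000004"},
]

visible_rows = rows_for_rep(sales, "rep_a")
print(visible_rows)
```

The original rows are left unchanged; the masked copies are what the representative would be allowed to read.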

Image shows how SageMaker Feature Store and AWS Lake Formation can be used to implement fine-grained access controls