Store, share, and manage ML model features for training and inference to promote feature reuse across ML applications
Ingest features from any streaming or batch data source, such as application logs, service logs, clickstreams, sensors, and tabular data from AWS or third-party data sources
Transform data into ML features and build feature pipelines that support MLOps practices and speed time to model deployment
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used repeatedly by multiple teams, and feature quality is critical to model accuracy. In addition, when features used to train models offline in batch are also made available for real-time inference, it is hard to keep the two feature stores synchronized. SageMaker Feature Store provides a secure, unified store to process, standardize, and use features at scale across the ML lifecycle.
How it works
Feature processing and ingestion
You can ingest data into SageMaker Feature Store from a variety of sources, such as application and service logs, clickstreams, sensors, and tabular data from Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake. Using feature processing, you can specify your batch data source and a feature transformation function (for example, a count of product views or time window aggregates), and SageMaker Feature Store transforms the data into ML features at ingestion time. With Amazon SageMaker Data Wrangler, you can publish features directly into SageMaker Feature Store. With the Apache Spark connector, you can batch ingest a high volume of data with a single line of code.
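As a rough illustration of what a feature transformation function might compute, the sketch below counts product views per customer inside a time window. It is plain Python and does not call SageMaker Feature Store; the function name `count_views_in_window`, the event tuple layout, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_views_in_window(events, as_of, window):
    """Count view events per customer in the half-open window (as_of - window, as_of]."""
    counts = defaultdict(int)
    start = as_of - window
    for customer_id, event_time in events:
        if start < event_time <= as_of:
            counts[customer_id] += 1
    return dict(counts)


# Hypothetical raw clickstream events: (customer_id, event_time).
events = [
    ("c1", datetime(2024, 1, 1, 9, 15)),
    ("c1", datetime(2024, 1, 1, 9, 30)),
    ("c2", datetime(2024, 1, 1, 9, 45)),
    ("c1", datetime(2024, 1, 1, 7, 0)),  # older than the one-hour window
]

features = count_views_in_window(
    events, as_of=datetime(2024, 1, 1, 10, 0), window=timedelta(hours=1)
)
print(features)
```

In a real pipeline, the resulting per-customer counts would be the feature values written to a feature group along with a record identifier and an event time.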
Feature storage, catalog, search, and reuse
SageMaker Feature Store tags and indexes feature groups so they are easily discoverable through the visual interface of Amazon SageMaker Studio. Browsing the feature catalog allows teams to discover existing features they can confidently reuse and avoid duplication of pipelines. SageMaker Feature Store uses the AWS Glue Data Catalog by default, but allows you to use a different catalog if desired. You can also query features using familiar SQL with Amazon Athena or another query tool of your choice.
SageMaker Feature Store supports offline storage for training and online storage for real-time inference. Training and inference are very different use cases, and each has different storage requirements. Training often uses the complete dataset and can take hours to complete, while inference needs to happen in milliseconds and usually uses a subset of the data. When the two stores are used together, SageMaker Feature Store keeps the offline and online datasets in sync, which is critical because divergence between them can reduce model accuracy.
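The difference between the two stores can be pictured with a small in-memory model. This is a conceptual sketch only, not the SageMaker API: the class `TinyFeatureStore` and its methods are hypothetical, and the rule that the online view keeps the record with the latest event time is an assumption of this sketch.

```python
class TinyFeatureStore:
    """Toy model: one write path feeding an offline history and an online latest-value view."""

    def __init__(self):
        self.offline = []  # append-only history, suited to training on the full dataset
        self.online = {}   # record_id -> most recent record, suited to low-latency lookups

    def put_record(self, record_id, event_time, features):
        record = {"record_id": record_id, "event_time": event_time, **features}
        self.offline.append(record)
        current = self.online.get(record_id)
        # Only replace the online value if this record is at least as recent.
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = record

    def get_record(self, record_id):
        return self.online.get(record_id)


store = TinyFeatureStore()
store.put_record("song-1", event_time=1, features={"rating": 4.0})
store.put_record("song-1", event_time=3, features={"rating": 4.5})
store.put_record("song-1", event_time=2, features={"rating": 3.0})  # late-arriving record

print(store.get_record("song-1"))  # latest value by event time
print(len(store.offline))          # full history is retained
```

Because both views are fed by the same write, a training job that reads the history and an endpoint that reads the latest value are working from the same underlying records.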
To enable feature reuse with confidence, data scientists need to know how features were built and which models and endpoints are using them. SageMaker Feature Store allows data scientists to track their features in Amazon SageMaker Studio with SageMaker Lineage. SageMaker Lineage lets you track scheduled pipeline executions, visualize upstream lineage to trace features back to their data sources, and view feature processing code, all in one environment.
Data scientists may need to train models with the exact set of feature values from a specific time in the past without the risk of including data from beyond that time (also referred to as feature leakage), such as patient medical data before a diagnosis. SageMaker Feature Store Offline API supports point-in-time queries to retrieve the state of each feature at the historical time of interest.
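The idea behind a point-in-time query can be sketched in a few lines of plain Python. This is not the SageMaker Feature Store Offline API; the function `point_in_time_value`, the record layout, and the sample values are hypothetical and serve only to show how data from after the time of interest is excluded.

```python
def point_in_time_value(history, record_id, as_of):
    """Return the most recent record for record_id with event_time <= as_of, or None."""
    candidates = [
        r for r in history
        if r["record_id"] == record_id and r["event_time"] <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["event_time"])


# Hypothetical feature history; event_time is a simple integer timestamp.
history = [
    {"record_id": "patient-1", "event_time": 1, "blood_pressure": 120},
    {"record_id": "patient-1", "event_time": 5, "blood_pressure": 130},
    {"record_id": "patient-1", "event_time": 9, "blood_pressure": 150},  # after the diagnosis
]

diagnosis_time = 6
snapshot = point_in_time_value(history, "patient-1", as_of=diagnosis_time)
print(snapshot)
```

Only records at or before the cutoff are considered, so the value recorded at time 9 cannot leak into a training example built for time 6.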
Feature stores are a key component in the MLOps lifecycle. They manage datasets and feature pipelines, speeding up data science tasks and eliminating the duplicate work of creating the same features multiple times. SageMaker Feature Store can be used as a standalone service or together with other SageMaker services in an integrated manner across the MLOps lifecycle.
Security and compliance
To support security and compliance needs, you may need granular control over how shared ML features are accessed. These needs often go beyond table and column-level access control to individual row-level access control. For example, you may want to let account representatives see rows from a sales table for only their accounts and mask the prefix of sensitive data like credit card numbers. SageMaker Feature Store together with AWS Lake Formation can be used to implement fine-grained access controls to protect feature store data and grant access based on role.
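To make the row-level and masking example concrete, here is a minimal plain-Python sketch of the effect such a policy has on the data a user sees. It does not use AWS Lake Formation, where these rules would be defined as permissions rather than application code; the function names, column names, and sample rows are assumptions for illustration.

```python
def mask_prefix(card_number, visible=4):
    """Replace all but the last `visible` digits with asterisks."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - visible) + digits[-visible:]


def rows_for_representative(rows, rep_id):
    """Keep only rows owned by rep_id, with the card number prefix masked."""
    return [
        {**row, "card_number": mask_prefix(row["card_number"])}
        for row in rows
        if row["account_rep"] == rep_id
    ]


# Hypothetical sales table.
rows = [
    {"account_rep": "rep-a", "account": "acme", "card_number": "4111 1111 1111 1234"},
    {"account_rep": "rep-b", "account": "globex", "card_number": "5500-0000-0000-0004"},
]

visible_rows = rows_for_representative(rows, "rep-a")
print(visible_rows)
```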
“At Climate, we believe in providing the world’s farmers with accurate information to make data driven decisions and maximize their return on every acre. To achieve this, we have invested in technologies such as machine learning tools to build models using measurable entities known as features, such as yield for a grower’s field. With Amazon SageMaker Feature Store, we can accelerate the development of ML models with a central feature store to access and reuse features across multiple teams easily. SageMaker Feature Store makes it easy to access features in real-time using the online store or run features on a schedule using the offline store for different use cases. With the SageMaker Feature Store, we can develop ML models faster.”
Daniel McCaffrey, Vice President, Data and Analytics, Climate
“We chose to build Intuit’s new machine learning platform on AWS in 2017, combining Amazon SageMaker’s powerful capabilities for model development, training, and hosting with Intuit’s own capabilities in orchestration and feature engineering. As a result, we cut our model development lifecycle dramatically. What used to take six full months now takes less than a week, making it possible for us to push AI capabilities into our TurboTax, QuickBooks, and Mint products at a greatly accelerated rate. We have worked closely with AWS in the lead up to the release of Amazon SageMaker Feature Store, and we are excited by the prospect of a fully managed feature store so that we no longer have to maintain multiple feature repositories across our organization. Our data scientists will be able to use existing features from a central store and drive both standardization and reuse of features across teams and models.”
Mammad Zadeh, Intuit Vice President of Engineering, Data Platform
“At Experian, we believe it is our responsibility to empower consumers to understand and use credit in their financial lives, and assist lenders in managing credit risk. As we continue to implement best practices to build our financial models, we are looking at solutions that accelerate the production of products that leverage machine learning. Amazon SageMaker Feature Store provides us with a secure way to store and reuse features for our ML applications. The ability to maintain consistency for both real-time and batch applications across multiple accounts is a key requirement for our business. Using the new capabilities of Amazon SageMaker Feature Store enables us to empower our customers to take control of their credit and reduce costs in the new economy.”
Geoff Dzhafarov, Chief Enterprise Architect, Experian Consumer Services
“At DeNA, our mission is to deliver impact and delight using the internet and AI/ML. Providing value-based services is our primary goal and we want to ensure our businesses and services are ready to achieve that goal. We would like to discover and reuse features across the organization and Amazon SageMaker Feature Store helps us with an easy and efficient way to reuse features for different applications. Amazon SageMaker Feature Store also helps us in maintaining standard feature definitions and helps us with a consistent methodology as we train models and deploy them to production. With these new capabilities of Amazon SageMaker, we can train and deploy ML models faster, keeping us on our path to delight our customers with the best services.”
Kenshin Yamada, General Manager / AI System Dept System Unit, DeNA
“A strong care industry where supply matches demand is essential for economic growth from the individual family up to the nation’s GDP. We’re excited about Amazon SageMaker Feature Store as we believe it will help us scale better across our data science and development teams, by using a consistent set of curated data. With the newly announced capabilities of Amazon SageMaker, we can accelerate development and deployment of our ML models for different applications, helping our customers make better informed decisions through faster real-time recommendations.”
Clemens Tummeltshammer, Data Science Manager, Care.com
“Using ML, 3M is improving tried-and-tested products, like sandpaper, and driving innovation in several other spaces, including healthcare. As we plan to scale machine learning to more areas of 3M, we see the amount of data and models growing rapidly – doubling every year. We are enthusiastic about the new SageMaker features because they will help us scale. Amazon SageMaker Data Wrangler makes it much easier to prepare data for model training, and Amazon SageMaker Feature Store will eliminate the need to create the same model features over and over. Finally, Amazon SageMaker Pipelines will help us automate data prep, model building, and model deployment into an end-to-end workflow so we can speed time to market for our models. Our researchers are looking forward to taking advantage of the new speed of science at 3M.”
David Frazee, Technical Director at 3M Corporate Systems Research Lab