Posted On: Jan 10, 2022

Amazon SageMaker Feature Store now offers a connector for Apache Spark that makes batch data ingestion easier for customers. Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning (ML) model features. There are several ways to ingest data into SageMaker Feature Store, including the PutRecord API, the SageMaker Python SDK's FeatureGroup.ingest functionality, and SageMaker Processing jobs.

For batch ingestion, customers can ingest data from Spark sources such as Amazon EMR and SageMaker Processing jobs. Without the connector, this requires iterating through Spark DataFrame records and configuring the PutRecord API with the Feature Group and feature names for each record, which can be time consuming. With the new release, customers can use the SageMaker Feature Store connector for Apache Spark, which simplifies and automates these steps. The connector makes all of Spark's libraries available, and customers can add simple API calls to their existing feature engineering pipeline on Amazon EMR to batch ingest data into SageMaker Feature Store. The connector also allows direct ingestion into the SageMaker Feature Store offline store, which simplifies the backfilling process.
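
To illustrate the manual, per-record approach described above, here is a minimal Python sketch. It is not taken from the SageMaker documentation. The helper names `to_feature_record` and `ingest_rows` are hypothetical, and the sketch assumes each row has already been converted to a plain dictionary and that `client` is an object exposing a `put_record` method with the `FeatureGroupName` and `Record` parameters of the PutRecord API.

```python
from typing import Any, Dict, Iterable, List


def to_feature_record(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert one row dictionary into the list-of-features shape used by PutRecord.

    Hypothetical helper: every value is sent as a string, and features
    whose value is None are left out of the record.
    """
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None
    ]


def ingest_rows(
    client: Any,
    feature_group_name: str,
    rows: Iterable[Dict[str, Any]],
) -> int:
    """Call put_record once per row and return the number of rows sent.

    `client` is assumed to behave like the boto3
    "sagemaker-featurestore-runtime" client.
    """
    count = 0
    for row in rows:
        client.put_record(
            FeatureGroupName=feature_group_name,
            Record=to_feature_record(row),
        )
        count += 1
    return count
```

In a real pipeline, `client` would typically be `boto3.client("sagemaker-featurestore-runtime")`, and the rows could come from something like `(r.asDict() for r in df.toLocalIterator())` on a Spark DataFrame. This repeated per-record loop is the work that the connector is intended to automate.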

To learn more, please view the documentation. To get started, log into the Amazon SageMaker console.