Posted On: Jan 21, 2022

Amazon SageMaker Pipelines is a fully managed service that allows customers to define and orchestrate their model building steps as workflows. Today, we are happy to introduce a new step type that allows machine learning engineers to run data processing applications using open source frameworks such as Apache Spark, Presto, and Hive on Amazon EMR clusters.

SageMaker Pipelines already provides a variety of steps (e.g. processing, training, register model, and callback) that allow customers to flexibly define their model building workflow. Customers often want to run data processing tasks, such as feature engineering, on an EMR cluster as part of the model building process, using open source frameworks such as Spark, Hive, and Presto. With the newly launched SageMaker Pipelines EMR step, customers can submit these tasks as EMR jobs on an EMR cluster. The EMR step requires customers to provide the cluster ID of the EMR cluster and the execution properties of the EMR job to be run on that cluster. SageMaker Pipelines takes care of establishing a secure connection, submitting the EMR workloads, and actively tracking them to completion. Once created, the EMR step can be integrated into an ML model building workflow alongside other SageMaker Pipelines steps.
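As a rough illustration of the two inputs described above (a cluster ID and the job's execution properties), the sketch below builds a plain Python dictionary shaped like an EMR step entry in a pipeline definition. It uses only the standard library and does not call the SageMaker SDK. The helper name `emr_step_definition`, the cluster ID, the step name, and the S3 path are hypothetical, and the field names (`Type`, `ClusterId`, `StepConfig`, `HadoopJarStep`) are assumptions about the definition format that should be checked against the documentation.

```python
import json


def emr_step_definition(name, cluster_id, jar, args):
    """Build a dict shaped like an EMR step entry in a pipeline definition.

    Hypothetical helper for illustration; field names are assumptions.
    """
    if not cluster_id:
        raise ValueError("cluster_id is required: the step targets an existing EMR cluster")
    return {
        "Name": name,
        "Type": "EMR",
        "Arguments": {
            # ID of the EMR cluster that will run the job
            "ClusterId": cluster_id,
            # Execution properties of the EMR job
            "StepConfig": {
                "HadoopJarStep": {"Jar": jar, "Args": list(args)},
            },
        },
    }


step = emr_step_definition(
    name="FeatureEngineering",      # hypothetical step name
    cluster_id="j-XXXXXXXXXXXXX",   # hypothetical placeholder cluster ID
    jar="command-runner.jar",       # EMR's generic command runner
    args=[
        "spark-submit",
        "--deploy-mode", "cluster",
        "s3://example-bucket/jobs/featurize.py",  # hypothetical job script
    ],
)

print(json.dumps(step, indent=2))
```

In practice the SageMaker Python SDK would typically generate this structure from a step object, so a hand-built dictionary like this is mainly useful for seeing which values the step needs.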

This feature is available in all AWS regions where Amazon SageMaker is available. To get started, create a new SageMaker Pipeline that uses the EMR step from SageMaker Studio or the command-line interface. To learn more, visit our documentation page.