Posted On: Oct 1, 2021

You can now use open source frameworks such as Apache Spark, Apache Hive, and Presto running on Amazon EMR clusters directly from Amazon SageMaker Studio notebooks to run petabyte-scale data analytics and machine learning. Amazon EMR automatically installs and configures open source frameworks and provides a performance-optimized runtime that is compatible with, and faster than, the standard open source versions. For example, Spark 3.0 on Amazon EMR is 1.7x faster than its open source equivalent. Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. This release makes it simple to use popular frameworks such as Apache Spark, Hive, and Presto running on EMR clusters directly from SageMaker Studio, which helps simplify data science and ML workflows.

With this release, you can visually browse a list of EMR clusters directly from SageMaker Studio and connect to them in a few clicks. Once connected to an EMR cluster, you can use Spark SQL, Scala, Python, and HiveQL to interactively query, explore, and visualize data, and run Apache Spark, Hive, and Presto jobs to process data. Jobs run fast because they use EMR's performance-optimized versions of Spark, Hive, and Presto. Further, clusters can automatically scale up or down based on the workload and integrate with Spot Instances and Graviton2-based processors to lower costs. Finally, SageMaker Studio users can authenticate when they connect to Amazon EMR clusters using LDAP-based credentials or Kerberos.

These features are supported on EMR 5.9.0 and above, and are generally available in all AWS Regions where SageMaker Studio is available. To learn more, watch the demo "Interactive data processing on Amazon EMR from Amazon SageMaker", read the blog "Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks", or see the SageMaker Studio documentation.
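
As a rough illustration of the version requirement above, the following is a minimal sketch of how you might check whether a cluster's release meets the EMR 5.9.0 minimum before trying to connect from SageMaker Studio. It assumes the release is given as an EMR release label of the form `emr-X.Y.Z`. The function names `parse_release_label` and `meets_minimum_release` are hypothetical helpers written for this example and are not part of any AWS SDK.

```python
import re

# Minimum EMR release stated in this announcement.
MIN_RELEASE = (5, 9, 0)


def parse_release_label(label: str) -> tuple:
    """Turn a label such as 'emr-5.9.0' into a tuple of ints, e.g. (5, 9, 0).

    Assumes the 'emr-X.Y.Z' label format; anything else raises ValueError.
    """
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label.strip())
    if match is None:
        raise ValueError(f"Unrecognized EMR release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum_release(label: str, minimum: tuple = MIN_RELEASE) -> bool:
    """Return True if the release label is at or above the minimum release.

    Comparing integer tuples avoids the string-comparison pitfall where
    '5.10.0' would sort before '5.9.0'.
    """
    return parse_release_label(label) >= minimum


if __name__ == "__main__":
    for release in ("emr-5.8.2", "emr-5.9.0", "emr-5.10.0", "emr-6.4.0"):
        print(release, meets_minimum_release(release))
```

In practice the release label would come from the cluster's own metadata. The check only mirrors the minimum version stated here and does not verify any other prerequisite, such as networking or authentication setup.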