Posted On: Dec 1, 2021

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps required to prepare data and to build, train, and deploy models. We recently introduced the ability to visually browse and connect to Amazon EMR clusters right from the SageMaker Studio notebook. Starting today, you can monitor and debug your Apache Spark jobs running on EMR right from SageMaker Studio notebooks with just a click. Additionally, you can discover, connect to, create, terminate, and manage EMR clusters directly from SageMaker Studio. The built-in integration with EMR enables you to do interactive data preparation and machine learning at petabyte scale right within the single universal SageMaker Studio notebook.

Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Data workers such as data scientists and data engineers leverage Apache Spark, Hive, and Presto running on EMR for fast data preparation. Until today, these data workers could easily connect to EMR clusters from Studio notebooks in the same account. However, they had to set up complex security rules and web proxies to connect across accounts or to monitor and debug their Apache Spark jobs running on EMR. Furthermore, when these data workers needed EMR clusters tailored to their specific workloads, they had to either ask their administrator to create them or switch to other tools and apply detailed technical knowledge of network, compute, and cluster configuration to create the clusters themselves. This process was not only challenging and disruptive to their workflow but also distracted them from their data preparation tasks. Consequently, although it was uneconomical, many customers kept persistent clusters running in anticipation of incoming workloads, regardless of active usage.

Starting today, data workers can easily discover and connect to EMR clusters in single-account and cross-account configurations directly from SageMaker Studio. Further, data workers now have one-click access to the Apache Spark UI to monitor and debug Apache Spark jobs running on EMR right from SageMaker Studio notebooks, greatly simplifying their debugging workflow. Customers can also use AWS Service Catalog to define and roll out pre-configured templates to selected data workers, enabling them to create EMR clusters right from SageMaker Studio. Customers retain full control over the organizational, security, compute, and networking guardrails that apply when data workers use these templates. Data workers can visually browse the templates made available to them, customize them for their specific workloads, create EMR clusters on demand, and terminate them with just a few clicks right from SageMaker Studio. Customers can use these features to simplify their data preparation workflow and use EMR clusters more efficiently for interactive workloads from SageMaker Studio.
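For readers who want to see what cluster discovery looks like outside the Studio interface, the following is a minimal sketch in Python. It is not how SageMaker Studio implements discovery. It assumes the caller passes in a boto3 EMR client that already has permission to list clusters in a single account; the function name list_active_clusters and the choice of the RUNNING and WAITING states are illustrative assumptions, and cross-account access is out of scope here.

```python
from typing import Any, Dict, List

# Assumption: clusters in these states are the ones worth connecting to.
ACTIVE_STATES = ["RUNNING", "WAITING"]


def list_active_clusters(emr_client: Any) -> List[Dict[str, str]]:
    """Return the id, name, and state of every active EMR cluster.

    `emr_client` is assumed to be a boto3 EMR client, for example
    boto3.client("emr", region_name="us-east-1"). The helper itself is
    hypothetical and exists only for illustration.
    """
    clusters: List[Dict[str, str]] = []
    # The ListClusters API is paginated, so walk every page of results.
    paginator = emr_client.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=ACTIVE_STATES):
        for cluster in page.get("Clusters", []):
            clusters.append(
                {
                    "id": cluster["Id"],
                    "name": cluster["Name"],
                    "state": cluster["Status"]["State"],
                }
            )
    return clusters
```

Passing the client in as an argument, rather than creating it inside the function, keeps the Region and credentials under the caller's control and makes the helper easy to exercise with a stub client.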

These features are generally available at no additional charge in the following AWS Regions: US East (N. Virginia and Ohio), US West (N. California and Oregon), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Stockholm), Europe (Paris), Europe (London), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and South America (São Paulo). To learn more, see this blog post and the SageMaker Studio Notebooks user guide.