Now launch Amazon SageMaker Studio Notebooks backed by Spark in Amazon EMR

Posted on: Dec 21, 2020

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning. With a single click, data scientists and developers can quickly spin up SageMaker Studio notebooks to explore and prepare datasets to build, train and deploy machine learning models in a single pane of glass. Amazon EMR is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Starting today, customers can use Studio notebooks to easily and securely connect to Amazon EMR clusters and prepare vast amounts of data for analysis and reporting, model training, or inference. 

Data preparation is a critical step in the machine learning workflow. SageMaker Studio gives you a range of data preparation tools to match your preference. If you prefer a visual interface, you can use Amazon SageMaker Data Wrangler to connect to Amazon S3, Amazon Redshift, or Amazon Athena and access, visualize, and analyze data from within SageMaker Studio. If you prefer to write code, you can use SageMaker Studio notebooks to prepare data interactively using libraries and SDKs, or process large amounts of data in batch using Amazon SageMaker Processing with a built-in Spark container. However, if you wanted to connect Studio notebooks to existing EMR clusters to access and process data, until now you had to set up the environment manually: bring your own Sparkmagic kernel, configure target cluster information, and install tools such as Kerberos for authentication before you could run Spark jobs or query Hive tables.
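
To give a sense of the manual "configure target cluster information" step, here is a minimal Python sketch that builds a Sparkmagic configuration pointing at a Livy endpoint on an EMR cluster. It is an illustration under stated assumptions, not the setup AWS prescribes: the host name is a placeholder, the helper names build_sparkmagic_config and write_config are hypothetical, the port is Livy's default (8998), and the key names follow Sparkmagic's example config.json.

```python
import json
from pathlib import Path


def build_sparkmagic_config(livy_host, port=8998, auth="Kerberos"):
    """Return a Sparkmagic-style config dict for the PySpark kernel.

    livy_host is a placeholder for the EMR master node address.
    auth uses Sparkmagic's documented values: "None", "Basic_Access",
    or "Kerberos".
    """
    if auth not in ("None", "Basic_Access", "Kerberos"):
        raise ValueError(f"unsupported auth mode: {auth}")
    return {
        "kernel_python_credentials": {
            "username": "",
            "password": "",
            "url": f"http://{livy_host}:{port}",
            "auth": auth,
        }
    }


def write_config(config, path):
    """Write the config as JSON to the given path, creating parent dirs."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


if __name__ == "__main__":
    # Hypothetical host name; replace with your cluster's master node DNS.
    cfg = build_sparkmagic_config("emr-master.example.internal")
    # Sparkmagic conventionally reads ~/.sparkmagic/config.json.
    print(write_config(cfg, Path.home() / ".sparkmagic" / "config.json"))
```

With Kerberos authentication, a valid ticket (for example, obtained with kinit) would also be needed in the notebook environment before the kernel can reach the cluster.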

Amazon SageMaker Studio now comes with built-in tools that make it quick and easy to securely connect your notebook to an EMR cluster for processing large amounts of data. You can create a Studio notebook from a built-in SageMaker image with a PySpark kernel, use built-in commands to connect to an EMR cluster, and start querying, analyzing, and processing data in a few steps. For added security, you can connect to EMR clusters using Kerberos authentication. The feature is now available in all AWS Regions where Amazon SageMaker Studio is available. For more information, see the Amazon SageMaker Studio documentation.