Amazon EMR

Amazon EMR Studio

Why EMR Studio?

EMR Studio is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.

EMR Studio provides fully managed Jupyter Notebooks and tools such as Spark UI and YARN Timeline Service to simplify debugging. Data scientists and analysts can install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and Bitbucket, or execute parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow or Amazon Managed Workflows for Apache Airflow.

EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance-optimized Amazon EMR runtime for Apache Spark. Administrators can set up EMR Studio so that analysts can run their applications on existing EMR clusters or create new clusters using predefined AWS CloudFormation templates for EMR.

Simple to use

EMR Studio makes it simple to interact with applications on an EMR cluster. You can access EMR Studio either from the AWS Management Console using AWS IAM authentication, or without logging in to the console by enabling federated access from your identity provider (IdP) via AWS IAM Identity Center (successor to AWS SSO). You can interactively explore, process, and visualize data using notebooks, build and schedule pipelines, and debug applications without logging in to EMR clusters.

Fully managed Jupyter Notebooks

With EMR Studio, you can start notebooks in seconds, get onboarded with sample notebooks, and perform your data exploration. You can collaborate with peers via built-in real-time collaboration and track changes across notebook versions via Git repositories. You can also customize your environment by loading custom kernels and Python libraries from notebooks.

Screenshot of an EMR Notebooks demo in AWS EMR Studio. The displayed Jupyter notebook explains how to install notebook-scoped Python libraries on a running cluster, visualize Spark dataframes, and describes the benefits of notebook-scoped libraries such as runtime installation, dependency isolation, and portability.

Easy to build applications

EMR Studio makes it easy for you to move from prototyping to production. You can trigger pipelines from code repositories, run notebooks as pipelines using orchestration tools like Apache Airflow or Amazon Managed Workflows for Apache Airflow, or attach notebooks to a bigger cluster with a single click.

Screenshot of the Apache Airflow interface in AWS EMR Studio showing the DAG (Directed Acyclic Graph) tree view for a custom cluster execution sensor DAG, with workflow steps and task status indicators.

Simplified debugging

With EMR Studio, you can debug jobs and access logs for both active and terminated clusters without logging in to the cluster. You can use native application interfaces such as Spark UI and YARN Timeline Service directly from EMR Studio. EMR Studio also allows you to quickly locate the cluster or job to debug by using filters such as cluster state, creation time, and cluster ID.
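The same kind of filtering can also be approximated outside the Studio UI. The following is a minimal Python sketch, assuming a boto3 EMR client is passed in and using the `ListClusters` API with its `ClusterStates`, `CreatedAfter`, and `Marker` parameters. The helper names `find_clusters` and `created_since` are hypothetical, and this is not how EMR Studio itself implements its filters.

```python
from datetime import datetime, timedelta, timezone


def created_since(days):
    """Return a timezone-aware timestamp `days` days before now."""
    return datetime.now(timezone.utc) - timedelta(days=days)


def find_clusters(emr_client, states, created_after, cluster_id=None):
    """Return (id, name, state) tuples for clusters matching the filters.

    `emr_client` is expected to behave like boto3.client("emr").
    State and creation time are filtered by the service; the optional
    cluster ID is matched locally on each returned page.
    """
    matches = []
    kwargs = {"ClusterStates": states, "CreatedAfter": created_after}
    while True:
        page = emr_client.list_clusters(**kwargs)
        for cluster in page.get("Clusters", []):
            if cluster_id is None or cluster["Id"] == cluster_id:
                matches.append(
                    (cluster["Id"], cluster["Name"], cluster["Status"]["State"])
                )
        marker = page.get("Marker")
        if not marker:
            return matches
        kwargs["Marker"] = marker  # request the next page of results


# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# emr = boto3.client("emr")
# print(find_clusters(emr, ["TERMINATED"], created_since(7)))
```

Passing the client as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.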

Screenshot of the AWS EMR Studio interface showing EC2 cluster management and debugging features within a Jupyter notebook environment. The interface lists various EMR clusters, their IDs, states, elapsed times, and launching options for application UIs such as Spark History Server, YARN Timeline Server, and Tez UI.

Real-time collaborative notebooks

With EMR Studio, data scientists, engineers, and analysts can collaborate across teams in real time. You can invite your colleagues to view and edit notebooks, which enables real-time co-authoring, code debugging, and code reviews of Jupyter notebooks.

SQL Explorer

EMR Studio comes with SQL Explorer, a feature in your Workspace that allows you to browse the data catalog and run SQL queries on EMR clusters directly from EMR Studio. In SQL Explorer, you can connect to Amazon EMR on EC2 clusters with Presto to view and browse the data catalog. SQL Explorer also provides an editor to run SQL queries, view query results in a table, and download them in CSV format.

Multi-language notebooks

EMR Studio enables you to use multiple languages within a single Jupyter notebook. You can switch between Python, Scala, SparkSQL, and R within the same Jupyter notebook and share data between cells via temporary tables. With this feature, you can write code in languages best suited to different components of your workflow.

Screenshot of Amazon EMR Studio showing a multi-language Jupyter Notebook interface with cells using SparkR, SQL, PySpark, and ScalaSpark code, displayed on a gradient background.

Use cases

In EMR Studio, you can use a code repository to trigger pipelines. You can also parameterize and chain notebooks to build pipelines, and integrate notebooks into scheduled workflows using workflow orchestration services such as Apache Airflow or Amazon Managed Workflows for Apache Airflow. EMR Studio also allows you to re-attach notebooks to a bigger cluster to run a job.
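As a rough illustration of a parameterized notebook run, the Python sketch below builds a request for the Amazon EMR `StartNotebookExecution` API and submits it through boto3. The Workspace ID, notebook path, cluster ID, role ARN, and parameter names are hypothetical placeholders, and the helper names `build_execution_request` and `start_run` are assumptions made for this example. A scheduler such as an Airflow task could call `start_run` with different parameters on each run.

```python
import json


def build_execution_request(editor_id, notebook_path, cluster_id, service_role, params):
    """Build keyword arguments for the EMR StartNotebookExecution API."""
    return {
        "EditorId": editor_id,              # ID of the Workspace that holds the notebook
        "RelativePath": notebook_path,      # notebook path relative to the Workspace
        "NotebookExecutionName": "run-" + notebook_path.rsplit("/", 1)[-1],
        "NotebookParams": json.dumps(params, sort_keys=True),  # JSON string of inputs
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,
    }


def start_run(request, region="us-east-1"):
    """Submit the request; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.start_notebook_execution(**request)
    return response["NotebookExecutionId"]


# All identifiers below are placeholders, not real resources.
request = build_execution_request(
    editor_id="e-EXAMPLEWORKSPACEID",
    notebook_path="reports/daily_report.ipynb",
    cluster_id="j-EXAMPLECLUSTERID",
    service_role="arn:aws:iam::111122223333:role/ExampleNotebookRole",
    params={"run_date": "2024-01-01", "region": "emea"},
)
print(request["NotebookExecutionName"])
```

Keeping the request builder separate from the API call makes it possible to inspect or log the exact parameters a scheduled run would use before anything is submitted.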

In EMR Studio, you can debug notebook applications from the notebook UI. You can also debug pipelines by first narrowing down clusters using filters such as cluster state, and then diagnose jobs on both active and terminated clusters by opening native debugging UIs such as Spark UI, Tez UI, and YARN Timeline Service in a few clicks.
