AWS Big Data Blog
Introducing AWS Glue interactive sessions for Jupyter
Interactive Sessions for Jupyter is a new notebook interface in the AWS Glue serverless Spark environment. Starting in seconds and automatically stopping compute when idle, interactive sessions provide an on-demand, highly scalable, serverless Spark backend to Jupyter notebooks and Jupyter-based IDEs such as Jupyter Lab, Microsoft Visual Studio Code, JetBrains PyCharm, and more. Interactive sessions replace AWS Glue development endpoints for interactive job development with AWS Glue and offer the following benefits:
- No clusters to provision or manage
- No idle clusters to pay for
- No up-front configuration required
- No resource contention for the same development environment
- Easy installation and usage
- The exact same serverless Spark runtime and platform as AWS Glue extract, transform, and load (ETL) jobs
Getting started with interactive sessions for Jupyter
Installing interactive sessions is simple and only takes a few terminal commands. After you install it, you can start an interactive session within seconds whenever you need one. In the following sections, we walk you through installation on macOS and getting started in Jupyter.
To get started with interactive sessions for Jupyter on Windows, follow the instructions in Getting started with AWS Glue interactive sessions.
These instructions assume you’re running Python 3.6 or later and have the AWS Command Line Interface (AWS CLI) properly running and configured. You use the AWS CLI to make API calls to AWS Glue. For more information on installing the AWS CLI, refer to Installing or updating the latest version of the AWS CLI.
Install AWS Glue interactive sessions on macOS and Linux
To install AWS Glue interactive sessions, complete the following steps:
- Open a terminal and run the following to install and upgrade Jupyter, Boto3, and AWS Glue interactive sessions from PyPI. If desired, you can install Jupyter Lab instead of Jupyter.
- Run the following commands to identify the package install location and install the AWS Glue PySpark and AWS Glue Spark Jupyter kernels with Jupyter:
- To validate your install, run the following command:
In the output, you should see both the AWS Glue PySpark and the AWS Glue Spark kernels listed alongside the default Python3 kernel. It should look something like the following:
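To check the kernel list programmatically rather than by eye, the following is a minimal sketch that inspects the JSON form of the kernel list (the text printed by `jupyter kernelspec list --json`). The kernel names `glue_pyspark` and `glue_spark` are assumptions about how the two AWS Glue kernels are registered, and the sample output is hypothetical; confirm both against your own installation.

```python
import json

# Assumed kernel names for the AWS Glue PySpark and AWS Glue Spark kernels;
# confirm them against your own `jupyter kernelspec list` output.
EXPECTED_KERNELS = ("glue_pyspark", "glue_spark")


def missing_kernels(kernelspec_json, expected=EXPECTED_KERNELS):
    """Return the expected kernel names that are absent from the kernel list JSON."""
    installed = json.loads(kernelspec_json).get("kernelspecs", {})
    return [name for name in expected if name not in installed]


# Hypothetical output, trimmed to the fields this check reads.
sample_output = json.dumps({
    "kernelspecs": {
        "python3": {"resource_dir": "/usr/local/share/jupyter/kernels/python3"},
        "glue_pyspark": {"resource_dir": "/usr/local/share/jupyter/kernels/glue_pyspark"},
    }
})

print(missing_kernels(sample_output))  # ['glue_spark']
```

An empty list means every expected kernel is registered.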
Choose and prepare IAM principals
Interactive sessions use two AWS Identity and Access Management (IAM) principals (user or role) to function. The first is used to call the interactive sessions APIs and is likely the same user or role that you use with the AWS CLI. The second is GlueServiceRole, the role that AWS Glue assumes to run your session. This is the same role that AWS Glue jobs use; if you’re developing a job with your notebook, you should use the same role for both interactive sessions and the job you create.
Prepare the client user or role
In the case of local development, the first role is already configured if you can run the AWS CLI. If you can’t run the AWS CLI, follow these steps for setting up. If you often use the AWS CLI or Boto3 to interact with AWS Glue and have full AWS Glue permissions, you can likely skip this step.
- To validate this first user or role is set up, open a new terminal window and run the following code:
You should see a response like the following. If not, you may not have permissions to call AWS Security Token Service (AWS STS), or you don’t have the AWS CLI set up properly. If you simply get access denied calling AWS STS, you may continue if you know your user or role and its needed permissions.
- Ensure your IAM user or role can call the AWS Glue interactive sessions APIs by attaching the AWSGlueConsoleFullAccess managed IAM policy to your user or role.
If your caller identity returned a user, run the following:
If your caller identity returned a role, run the following:
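The choice between the user case and the role case can be sketched in Python. This is a minimal illustration, not the post's original commands: it takes the ARN reported for your caller identity and builds the matching `aws iam attach-user-policy` or `aws iam attach-role-policy` command as an argument list. The account ID, user name, role name, and session name in the examples are hypothetical.

```python
MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"


def attach_policy_command(caller_arn, policy_arn=MANAGED_POLICY_ARN):
    """Build the `aws iam` command that attaches the policy to the caller's principal."""
    # ARN layout: arn:partition:service:region:account-id:resource
    resource = caller_arn.split(":", 5)[5]
    kind, _, rest = resource.partition("/")
    if kind == "user":
        # IAM users may carry a path; the user name is the last segment.
        return ["aws", "iam", "attach-user-policy",
                "--user-name", rest.rsplit("/", 1)[-1],
                "--policy-arn", policy_arn]
    if kind == "assumed-role":
        # Resource form: assumed-role/<role-name>/<session-name>
        return ["aws", "iam", "attach-role-policy",
                "--role-name", rest.split("/", 1)[0],
                "--policy-arn", policy_arn]
    raise ValueError(f"Unrecognized principal type in ARN: {caller_arn}")


# Hypothetical ARNs for illustration only.
user_arn = "arn:aws:iam::123456789012:user/example-user"
role_arn = "arn:aws:sts::123456789012:assumed-role/ExampleDevRole/example-session"

print(" ".join(attach_policy_command(user_arn)))
print(" ".join(attach_policy_command(role_arn)))
```

Returning the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting once you have reviewed it.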
Prepare the AWS Glue service role for interactive sessions
You can specify the second principal, GlueServiceRole, either in the notebook itself by using the %iam_role magic or in a setting stored alongside the AWS CLI config. If you have a role that you typically use with AWS Glue jobs, this will be that role. If you don’t have a role you use for AWS Glue jobs, refer to Setting up IAM permissions for AWS Glue to set one up.
To set this role as the default role for interactive sessions, edit the AWS CLI credentials file and add glue_role_arn to the profile you intend to use.
- With a text editor, open
On Windows, use
- Look for the profile you use for AWS Glue; if you don’t use a profile, you’re looking for [default].
- Add a line in the profile for the role you intend to use like,
- I recommend adding a default Region to your profile if one is not specified already. You can do so by adding the line region=us-east-1, replacing us-east-1 with your desired Region.
If you don’t add a Region to your profile, you’re required to specify the Region at the top of each notebook with the %region magic.
When finished, your config should look something like the following:
- Save the config.
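The same edit can be scripted. The following is a minimal sketch using Python's standard `configparser`, assuming the file is in the INI format the AWS CLI uses. The helper name `set_glue_defaults`, the role ARN, and the access key value are hypothetical, and the demo writes to a temporary file rather than your real credentials file. Note that `configparser` drops comments when it rewrites a file, so back up the original first.

```python
import configparser
import os
import tempfile


def set_glue_defaults(path, profile, role_arn, region=None):
    """Add glue_role_arn (and optionally a Region) to one profile of an INI file."""
    # interpolation=None keeps any '%' characters in existing values untouched.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    if not config.has_section(profile):
        config.add_section(profile)
    config.set(profile, "glue_role_arn", role_arn)
    # Keep an existing Region; only fill one in when the profile has none.
    if region and not config.has_option(profile, "region"):
        config.set(profile, "region", region)
    with open(path, "w") as handle:
        config.write(handle)


# Demo against a throwaway file with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credentials")
    with open(demo_path, "w") as handle:
        handle.write("[default]\naws_access_key_id = EXAMPLEKEYID\n")
    set_glue_defaults(
        demo_path,
        "default",
        "arn:aws:iam::123456789012:role/ExampleGlueServiceRole",
        region="us-east-1",
    )
    with open(demo_path) as handle:
        print(handle.read())
```

The function leaves a Region that is already present alone, which matches the advice above to add one only if none is specified.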
Start Jupyter and an AWS Glue PySpark notebook
To start Jupyter and your notebook, complete the following steps:
- Run the following command in your terminal to open the Jupyter notebook in your browser:
Your browser should open and you’re presented with a page that looks like the following screenshot.
- On the New menu, choose Glue PySpark.
A new tab opens with a blank Jupyter notebook using the AWS Glue PySpark kernel.
Configure your notebook with magics
AWS Glue interactive sessions are configured with Jupyter magics. Magics are small commands prefixed with % at the start of Jupyter cells that provide shortcuts to control the environment. In AWS Glue interactive sessions, magics are used for all configuration needs, including:
- %region – Region
- %profile – AWS CLI profile
- %iam_role – IAM role for the AWS Glue service role
- %worker_type – Worker type
- %number_of_workers – Number of workers
- %idle_timeout – How long to allow a session to idle before stopping it
- %additional_python_modules – Python libraries to install from pip
Magics are placed at the beginning of your first cell, before your code, to configure AWS Glue. To discover all the magics of interactive sessions, run %help in a cell and a full list is printed. With the exception of %%sql, running a cell of only magics doesn’t start a session, but sets the configuration for the session that starts next when you run your first cell of code. For this post, we use three magics to configure AWS Glue with version 2.0 and two G.2X workers. Let’s enter the following magics into our first cell and run it:
When you run magics, the output shows the values you’re changing along with their previous settings. Explicitly setting all your configuration in magics helps ensure consistent runs of your notebook every time and is recommended for production workloads.
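If you keep notebook settings in code or in a shared config, a magics-only cell can be generated from them. The following is a minimal sketch: `render_magics` is a hypothetical helper, and the magic names `%glue_version`, `%number_of_workers`, and `%worker_type` are assumed to be the three magics that match the configuration described above (version 2.0 with two G.2X workers). Run %help in a cell to confirm the names your kernel supports.

```python
def render_magics(settings):
    """Render a {magic_name: value} mapping as the text of a magics-only cell."""
    return "\n".join(f"%{name} {value}" for name, value in settings.items())


# Assumed magic names; values follow the configuration used in this post.
first_cell = render_magics({
    "glue_version": "2.0",
    "number_of_workers": 2,
    "worker_type": "G.2X",
})

print(first_cell)
# %glue_version 2.0
# %number_of_workers 2
# %worker_type G.2X
```

Paste the printed lines into the first cell of the notebook, ahead of any code.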
Run your first code cell and author your AWS Glue notebook
Next, we run our first code cell. This is when a session is provisioned for use with this notebook. When interactive sessions are properly configured within an account, the session is completely isolated to this notebook. If you open another notebook in a new tab, it gets its own session on its own isolated compute. Run your code cell as follows:
When you ran the first cell containing code, Jupyter invoked interactive sessions, provisioned an AWS Glue cluster, and sent the code to AWS Glue Spark. The notebook was given a session ID, as shown in the preceding code. We can also see the properties used to provision AWS Glue, including the IAM role that AWS Glue used to create the session, the number of workers and their type, and any other options that were passed as part of the creation.
Interactive sessions automatically initialize a Spark session as sc; having Spark ready to go saves a lot of boilerplate code. However, if you want to convert your notebook to a job, sc must be initialized and declared explicitly.
Work in the notebook
Now that we have a session up, let’s do some work. In this exercise, we look at population estimates from the AWS COVID-19 dataset, clean them up, and write the results to a table.
This walkthrough uses data from the COVID-19 data lake.
To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create an AWS CloudFormation stack using the following template.
If you’re signed in to your AWS account, deploy the CloudFormation stack by clicking the following Launch stack button:
It fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get started.
When I’m working on a new data integration process, the first thing I often do is identify and preview the datasets I’m going to work on. If I don’t recall the exact location or table name, I typically open the AWS Glue console and search or browse for the table, then return to my notebook to preview it. With interactive sessions, there is a quicker way to browse the Data Catalog. We can use the %%sql magic to show databases and tables without leaving the notebook. For this example, the population table I want is in the COVID-19 dataset but I don’t recall its exact name, so I use the %%sql magic to look it up:
Looking through the returned list, we see a table named county_populations. Let’s select from this table, sorting for the largest counties by population:
Our query returned data but in an unexpected order. It looks like population estimate 2018 sorted lexicographically, as if the values were strings. Let’s use an AWS Glue DynamicFrame to get the schema of the table and verify the issue:
The schema shows population estimate 2018 to be a string, which is why our column isn’t sorting properly. We can use the apply_mapping transform in our next cell to correct the column type. In the same transform, we also clean up the column names and other column types: clarifying the distinction between id and id2, removing spaces from population estimate 2018 (conforming to Hive’s standards), and casting id2 as an integer for proper sorting. After validating the schema, we show the data with the new schema:
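The string-versus-number sorting problem described above can be reproduced outside Spark. The following is a minimal sketch in plain Python; the population values are made up for illustration and are not taken from the dataset.

```python
# Made-up population values stored as strings, mirroring a string-typed column.
as_strings = ["10105518", "5180493", "987654", "4698619"]

# String comparison goes character by character, so "987654" outranks "10105518".
sorted_as_strings = sorted(as_strings, reverse=True)
print(sorted_as_strings)
# ['987654', '5180493', '4698619', '10105518']

# Casting to int first restores numeric order.
sorted_as_numbers = sorted((int(value) for value in as_strings), reverse=True)
print(sorted_as_numbers)
# [10105518, 5180493, 4698619, 987654]
```

Casting the column to a numeric type in the mapping has the same effect on the query's ORDER BY as the `int` conversion has here.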
With the data sorting correctly, we can write it to Amazon Simple Storage Service (Amazon S3) as a new table in the AWS Glue Data Catalog. We use the mapped DynamicFrame for this write because we didn’t modify any data past that transform:
Finally, we run a query against our new table to show that our table was created successfully and to validate our work:
Convert notebooks to AWS Glue jobs with nbconvert
Jupyter notebooks are saved as .ipynb files. AWS Glue doesn’t currently run .ipynb files directly, so they need to be converted to Python scripts before they can be uploaded to Amazon S3 as jobs. Use the jupyter nbconvert command from a terminal to convert the notebook.
- Open a new terminal or PowerShell tab or window.
- cd to the working directory where your notebook is.
This is likely the same directory where you ran jupyter notebook at the beginning of this post.
- Run the following bash command to convert the notebook, providing the correct file name for your notebook:
- Run cat <Untitled-1>.py to view your new file.
- Upload the .py file to Amazon S3 using the following command, replacing the bucket, path, and file name as needed:
- Create your AWS Glue job with the following command.
Note that the magics aren’t automatically converted to job parameters when converting notebooks locally. You need to put in your job arguments correctly, or import your notebook to AWS Glue Studio and complete the following steps to keep your magic settings.
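Because the magics are not carried over automatically, one way to avoid retyping them is to translate them into job options yourself. The following is a minimal sketch: the mapping from magic names to `aws glue create-job` options (`--glue-version`, `--worker-type`, `--number-of-workers`, `--role`) is an assumption about which settings correspond, and `magics_to_cli_options` is a hypothetical helper. Magics with no entry in the mapping, such as %idle_timeout, are ignored.

```python
# Assumed correspondence between session magics and `aws glue create-job` options.
MAGIC_TO_CLI_OPTION = {
    "glue_version": "--glue-version",
    "worker_type": "--worker-type",
    "number_of_workers": "--number-of-workers",
    "iam_role": "--role",
}


def magics_to_cli_options(cell_source):
    """Translate recognized line magics in a cell into create-job CLI options."""
    options = []
    for line in cell_source.splitlines():
        line = line.strip()
        # Skip code, blank lines, and cell magics such as %%sql.
        if not line.startswith("%") or line.startswith("%%"):
            continue
        name, _, value = line[1:].partition(" ")
        option = MAGIC_TO_CLI_OPTION.get(name)
        if option and value.strip():
            options += [option, value.strip()]
    return options


cell = "%glue_version 2.0\n%number_of_workers 2\n%worker_type G.2X\n%idle_timeout 60"
job_options = magics_to_cli_options(cell)
print(job_options)
# ['--glue-version', '2.0', '--number-of-workers', '2', '--worker-type', 'G.2X']
```

The resulting list can be appended to the rest of your create-job arguments (job name, role, and script location), which this sketch does not produce.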
Run the job
After you have authored the notebook, converted it to a Python file, uploaded it to Amazon S3, and finally made it into an AWS Glue job, the only thing left to do is run it. Do so with the following terminal command:
AWS Glue interactive sessions offer a new way to interact with the AWS Glue serverless Spark environment. Set it up in minutes, start sessions in seconds, and only pay for what you use. You can use interactive sessions for AWS Glue job development, ad hoc data integration and exploration, or for large queries and audits. AWS Glue interactive sessions are generally available in all Regions that support AWS Glue.
To learn more and get started using AWS Glue interactive sessions, visit our developer guide and begin coding in seconds.
About the author
Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop data lakes and other data solutions on AWS analytics services.