AWS Big Data Blog
Introducing enhanced support for tagging, cross-account access, and network security in AWS Glue interactive sessions
AWS Glue interactive sessions allow you to run interactive AWS Glue workloads on demand, which enables rapid development by issuing blocks of code on a cluster and getting prompt results. This technology is enabled by the use of notebook IDEs, such as the AWS Glue Studio notebook, Amazon SageMaker Studio, or your own Jupyter notebooks.
In this post, we discuss the following recently added management features and how they can give you more control over the configuration and security of your AWS Glue interactive sessions:
- Tags magic – You can use this new cell magic to tag the session for administration or billing purposes. For example, you can tag each session with the name of the billable department and later run a search to find all spending associated with this department on the AWS Billing console.
- Assume role magic – Now you can create a session in an account different than the one you’re connected with by assuming an AWS Identity and Access Management (IAM) role owned by the other account. You can designate a dedicated role with permissions to create sessions and have other users assume it when they use sessions.
- IAM VPC rules – You can require your users to use (or restrict them from using) certain VPCs or subnets for the sessions, to comply with your corporate policies and have control over how your data travels in the network. This feature existed for AWS Glue jobs and is now available for interactive sessions.
Solution overview
For our use case, we’re building a highly secured app and want to have users (developers, analysts, data scientists) running AWS Glue interactive sessions on specific VPCs to control how the data travels through the network.
In addition, users are not allowed to log in directly to the production account, which has the data and the connections they need; instead, users will run their own notebooks via their individual accounts and get permission to assume a specific role enabled on the production account to run their sessions. Users can run AWS Glue interactive sessions by using either AWS Glue Studio notebooks via the AWS Glue console or Jupyter notebooks that run on their local machine.
Lastly, all new resources must be tagged with the name of the department for proper billing allocation and cost control.
The following architecture diagram highlights the different roles and accounts involved:
- Account A – The individual user account. The user ISBlogUser has permissions to create AWS Glue notebook servers via the AWSGlueServiceRole-notebooks role and assume a role in account B (directly or indirectly).
- Account B – The production account that owns the GlueSessionCreationRole role, which users assume to create AWS Glue interactive sessions in this account.
Prerequisites
In this section, we walk through the steps to set up the prerequisite resources and security configurations.
Install AWS CLI and Python library
Install and configure the AWS Command Line Interface (AWS CLI) if you don’t have it already set up. For instructions, refer to Install or update the latest version of the AWS CLI.
Optionally, if you want to run a local notebook from your computer, install Python 3.7 or later and then install Jupyter and the AWS Glue interactive sessions kernels. For instructions, refer to Getting started with AWS Glue interactive sessions. You can then run Jupyter directly from the command line with the command jupyter notebook, or via an IDE like VSCode or PyCharm.
Get access to two AWS accounts
If you have access to two accounts, you can reproduce the use case described in this post. The instructions refer to account A as the user account that runs the notebook and account B as the account that runs the sessions (the production account in the use case). This post assumes you have enough administration permissions to create the different components and manage the account security roles.
If you have access to only one account, you can still follow this post and perform all the steps on that single account.
Create a VPC and subnet
We want to allow users to run AWS Glue interactive sessions only inside a specific VPC. First, let’s create a new VPC in account B using Amazon Virtual Private Cloud (Amazon VPC). We use this VPC later to enforce the network restrictions.
- Sign in to the AWS Management Console with account B.
- On the Amazon VPC console, choose Your VPCs in the navigation pane.
- Choose Create VPC.
- Enter 10.0.0.0/24 as the IP CIDR.
- Leave the remaining parameters as default and create your VPC.
- Make a note of the VPC ID (starting with vpc-) to use later.
For more information about creating VPCs, refer to Create a VPC.
- In the navigation pane, choose Subnets.
- Choose Create subnet.
- Select the VPC you created, enter the same CIDR (10.0.0.0/24), and create your subnet.
- In the navigation pane, choose Endpoints.
- Choose Create endpoint.
- For Service category, select AWS services.
- Search for the option that ends in s3, such as com.amazonaws.{region}.s3.
- In the search results, select the Gateway type option.
- Choose your VPC on the drop-down menu.
- For Route tables, select the route table associated with the subnet you created.
- Complete the endpoint creation.
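If you prefer to script these console steps, the following is a minimal sketch using boto3 (the AWS SDK for Python), which this post does not otherwise use. The function names are hypothetical, and the sketch assumes your credentials for account B are already configured and that the new VPC's route tables are the ones the subnet uses.

```python
def s3_gateway_service_name(region: str) -> str:
    """Return the S3 endpoint service name for a Region, e.g. com.amazonaws.us-east-1.s3."""
    return f"com.amazonaws.{region}.s3"


def create_session_network(region: str, cidr: str = "10.0.0.0/24") -> dict:
    """Create a VPC, a subnet with the same CIDR, and an S3 gateway endpoint."""
    import boto3  # imported lazily so the helper above works without the SDK installed

    ec2 = boto3.client("ec2", region_name=region)

    vpc_id = ec2.create_vpc(CidrBlock=cidr)["Vpc"]["VpcId"]
    # Wait until the VPC is available before creating resources inside it.
    ec2.get_waiter("vpc_available").wait(VpcIds=[vpc_id])

    subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr)["Subnet"]["SubnetId"]

    # Assumption: the subnet uses the route table(s) that belong to this new VPC.
    route_tables = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["RouteTables"]
    route_table_ids = [rt["RouteTableId"] for rt in route_tables]

    endpoint_id = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName=s3_gateway_service_name(region),
        RouteTableIds=route_table_ids,
    )["VpcEndpoint"]["VpcEndpointId"]

    return {"vpc_id": vpc_id, "subnet_id": subnet_id, "endpoint_id": endpoint_id}
```

Calling create_session_network with your Region would return the IDs you need to note for the later steps; the console walkthrough above remains the reference procedure.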
Create an AWS Glue network connection
You now need to create an AWS Glue connection that uses the VPC, so sessions created with it can meet the VPC requirement.
- Sign in to the console with account B.
- On the AWS Glue console, choose Data connections in the navigation pane.
- Choose Create connection.
- For Name, enter session_vpc.
- For Connection type, choose Network.
- In the Network options section, choose the VPC you created, a subnet, and a security group.
- Choose Create connection.
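As an alternative to the console, a network connection like this one could also be created with boto3. This is a sketch under assumptions: the helper names are hypothetical, the subnet, security group, and Availability Zone values in the usage lines are placeholders, and an empty ConnectionProperties map is assumed to be sufficient for a connection of type NETWORK.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput structure for an AWS Glue NETWORK connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},  # assumption: no extra properties are needed
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(region, connection_input):
    """Create the connection in the given Region (requires boto3 and credentials)."""
    import boto3  # imported lazily; only needed when actually calling AWS

    glue = boto3.client("glue", region_name=region)
    glue.create_connection(ConnectionInput=connection_input)


# Placeholder IDs for illustration only; replace them with your own values.
example_input = network_connection_input(
    "session_vpc", "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"], "us-east-1a"
)
print(example_input["Name"])
```

Keeping the input builder separate from the API call makes it easy to inspect the request before anything is created in the account.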
Account A security setup
Account A is the development account for your users (developers, analysts, data scientists, and so on). They are provided IAM users to access this account programmatically or via the console.
Create the assume role policy
The assume role policy allows users and roles in account A to assume roles in account B (the role in account B also has to allow it). Complete the following steps to create the policy:
- On the IAM console, choose Policies in the navigation pane.
- Choose Create policy.
- Switch to the JSON tab in the policy editor and enter the following policy (provide the account B number):
- Name the policy AssumeRoleAccountBPolicy and complete the creation.
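A policy that meets the description above could look like the output of the following sketch. It is an assumption about the policy's shape rather than the exact text of the original: it allows sts:AssumeRole on any role in account B, and the account number 111122223333 is a placeholder.

```python
import json


def assume_role_policy(account_b_id: str) -> dict:
    """Build an identity policy that allows assuming any role in account B."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                # Assumption: all roles in account B; narrow this to one role if preferred.
                "Resource": f"arn:aws:iam::{account_b_id}:role/*",
            }
        ],
    }


# Print the JSON to paste into the policy editor (placeholder account number).
print(json.dumps(assume_role_policy("111122223333"), indent=2))
```

The role in account B still has to trust the caller, so this policy on its own does not grant access.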
Create an IAM user
Now you create an IAM user for account A that you can use to run AWS Glue interactive sessions locally or on the console.
- On the IAM console, choose Users in the navigation pane.
- Choose Create user.
- Name the user ISBlogUser.
- Select Provide user access to the AWS Management Console.
- Select I want to create an IAM user and choose a password.
- Attach the policies AWSGlueConsoleFullAccess and AssumeRoleAccountBPolicy.
- Review the settings and complete the user creation.
Create an AWS Glue Studio notebook role
To start an AWS Glue Studio notebook, a role is required. Usually, the same role is used both to start a notebook and run a session. In this use case, users of account A only need permissions to run a notebook, because they will create sessions via the assumed role in account B.
- On the IAM console, choose Roles in the navigation pane.
- Choose Create role.
- Select Glue as the use case.
- Attach the policies AWSGlueServiceNotebookRole and AssumeRoleAccountBPolicy.
- Name the role AWSGlueServiceRole-notebooks (because the name starts with AWSGlueServiceRole, the user doesn’t need explicit PassRole permission), then complete the creation.
Optionally, you can allow Amazon CodeWhisperer to provide code suggestions on the notebook by adding the permission to the role. To do so, navigate to the role AWSGlueServiceRole-notebooks on the IAM console. On the Add permissions menu, choose Create inline policy. Use the following JSON policy and name it CodeWhispererPolicy:
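As a sketch of what such an inline policy might contain, the following builds a policy that grants the codewhisperer:GenerateRecommendations action on all resources. Treat the action name and the Sid as assumptions to verify against the current AWS Glue documentation before use.

```python
import json

# Assumption: a single action is enough for code suggestions in the notebook.
CODEWHISPERER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CodeWhispererPermissions",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": ["codewhisperer:GenerateRecommendations"],
            "Resource": "*",
        }
    ],
}

# Print the JSON to paste into the inline policy editor.
print(json.dumps(CODEWHISPERER_POLICY, indent=2))
```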
Account B security setup
Account B is considered the production account that contains the data and connections, and runs the AWS Glue data integration pipelines (using either AWS Glue sessions or jobs). Users don’t have direct access to it; they use it by assuming the role created for this purpose.
To follow this post, you need two roles: one that the AWS Glue service assumes to run sessions, and another that creates sessions while enforcing the VPC restriction.
Create an AWS Glue service role
To create an AWS Glue service role, complete the following steps:
- On the IAM console, choose Roles in the navigation pane.
- Choose Create role.
- Choose Glue for the use case.
- Attach the policy AWSGlueServiceRole.
- Name the role AWSGlueServiceRole-blog and complete the creation.
Create an AWS Glue interactive session role
This role will be used to create sessions following the VPC requirements. Complete the following steps to create the role:
- On the IAM console, choose Policies in the navigation pane.
- Choose Create policy.
- Switch to the JSON tab in the policy editor and enter the following code (provide your VPC ID). You can also replace the * in the policy with the full ARN of the role AWSGlueServiceRole-blog you just created, to force the notebook to use only that role when creating sessions.
This policy complements the AWSGlueServiceRole policy you attached before and restricts session creation based on the VPC. You could also restrict the subnet and security group in a similar way, using the condition keys glue:SubnetIds and glue:SecurityGroupIds, respectively.
In this case, session creation requires a VPC, and its ID has to be in the list of allowed IDs. If you only need to require that some valid VPC is used, you can remove the first statement and leave the one that denies the creation when the VPC is null.
- Name the policy CustomCreateSessionPolicy and complete the creation.
- Choose Roles in the navigation pane.
- Choose Create role.
- Select Custom trust policy.
- Replace the trust policy template with the following code (provide your account A number):
This allows the role to be assumed directly by the user when using a local notebook and also when using an AWS Glue Studio notebook with a role.
- Attach the policies AWSGlueServiceRole and CustomCreateSessionPolicy (you created the latter in the previous steps, so you might need to refresh the list for it to appear).
- Name the role GlueSessionCreationRole and complete the role creation.
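The two JSON documents used in the steps above could be generated as in the following sketch. The statement layout is an assumption reconstructed from the description in this section: one statement denies session creation outside the listed VPCs, one denies it when no VPC is given, and one allows passing a role to AWS Glue (the * the text suggests narrowing). The trust policy names the user and the notebook role from account A. The VPC ID and account number are placeholders.

```python
import json

SESSION_RESOURCE = "arn:aws:glue:*:*:session/*"


def create_session_policy(allowed_vpc_ids, passable_role_arn="*"):
    """Build a sketch of CustomCreateSessionPolicy for the given VPC IDs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Deny session creation when the VPC is not in the allowed list.
                "Effect": "Deny",
                "Action": ["glue:CreateSession"],
                "Resource": [SESSION_RESOURCE],
                "Condition": {
                    "ForAnyValue:StringNotEquals": {"glue:VpcIds": list(allowed_vpc_ids)}
                },
            },
            {   # Deny session creation when no VPC is specified at all.
                "Effect": "Deny",
                "Action": ["glue:CreateSession"],
                "Resource": [SESSION_RESOURCE],
                "Condition": {"Null": {"glue:VpcIds": "true"}},
            },
            {   # Allow passing a role to AWS Glue; replace "*" with one role ARN to restrict it.
                "Effect": "Allow",
                "Action": ["iam:PassRole"],
                "Resource": passable_role_arn,
                "Condition": {"StringLike": {"iam:PassedToService": ["glue.amazonaws.com"]}},
            },
        ],
    }


def session_role_trust_policy(account_a_id):
    """Build a trust policy letting the account A user and notebook role assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        f"arn:aws:iam::{account_a_id}:user/ISBlogUser",
                        f"arn:aws:iam::{account_a_id}:role/AWSGlueServiceRole-notebooks",
                    ]
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder values for illustration only.
print(json.dumps(create_session_policy(["vpc-0123456789abcdef0"]), indent=2))
print(json.dumps(session_role_trust_policy("111122223333"), indent=2))
```

Because explicit denies override allows, the two Deny statements take effect even though the AWSGlueServiceRole managed policy is attached to the same role.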
Create the Glue interactive session in the VPC, with assumed role and tags
Now that you have the accounts, roles, VPC, and connection ready, you use them to meet the requirements. You start a new notebook using account A, which assumes the role of account B to create a session in the VPC, and tag it with the department and billing area.
Start a new notebook
Using account A, start a new notebook. You may use either of the following options.
Option 1: Create an AWS Glue Studio notebook
The first option is to create an AWS Glue Studio notebook:
- Sign in to the console with account A and the ISBlogUser user.
- On the AWS Glue console, choose Notebooks in the navigation pane under ETL jobs.
- Select Jupyter Notebook and choose Create.
- Enter a name for your notebook.
- Specify the role AWSGlueServiceRole-notebooks.
- Choose Start notebook.
Option 2: Create a local notebook
Alternatively, you can create a local notebook. Before you start the process that runs Jupyter (or, if you run it indirectly, the IDE that runs it), set the access key ID and secret access key for the user ISBlogUser, either by running aws configure on the command line or by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively. Then create a new Jupyter notebook and select the Glue PySpark kernel.
Kernel version
The %assume_role magic is available on kernel 1.0 or later. To check your kernel version, you can add a notebook cell and run %help in an AWS Glue Studio notebook or !pip show aws_glue_sessions in a local notebook.
To upgrade the notebook kernel version, run !pip install --upgrade aws-glue-sessions in a notebook cell. Once the upgrade completes, restart the kernel with the restart kernel button on the toolbar or by using the shortcut 00 (typing zero twice without any cell textbox selected). A popup will ask you to confirm.
Note: In AWS Glue Studio notebooks, this kernel upgrade won’t persist once you start a new notebook.
Start a session from the notebook
After you start the notebook, select the first cell and add four new empty code cells. If you are using an AWS Glue Studio notebook, the notebook already contains some prepopulated cells as examples; we don’t use those sample cells in this post.
- In the first cell, enter the following magic configuration with the session creation role ARN, using the ID of account B:
- Run the cell to set up that configuration, either by choosing the button on the toolbar or pressing Shift + Enter.
The output should confirm that the role was assumed correctly. From now on, when the session is launched, it is created by this role. This lets you use a role from a different account to run a session on that account.
- In the second cell, enter sample tags like the following and run the cell in the same way:
- In the third cell, enter the following sample configuration (provide the role ARN with account B) and run the cell to set up the configuration:
- In the fourth empty cell, enter the following code to set up the objects required to work with AWS Glue and run the cell:
The cell should fail with a permission error saying that an explicit deny policy applies. This is the VPC condition you set before: by default, the session doesn’t use a VPC, so the creation is denied.
You can solve the error by assigning the connection you created before, so the session runs inside the authorized VPC.
- In the third cell, add the %connections magic with the value session_vpc.
The session needs to run in the same Region in which the connection is defined. If that’s not the same as the notebook Region, you can explicitly configure the session Region using the %region magic.
- After you have added the new config settings, run the cell again so the magics take effect.
- Run the fourth cell again (the one with the code).
This time, it should start the session and after a brief period confirm it has been created correctly.
- Add a new cell with the following content and run it:
%status
This will display the configuration and other information about the session that the notebook is using, including the tags set before.
You started a notebook in account A and used a role from account B to create a session, which uses the network connection so it runs in the required VPC. You also tagged the session to be able to easily identify it later.
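For reference, the four cells described in the steps above might end up looking like the following sketch. The bracketed labels only separate the cells and are not part of their content. The tag keys team and billing come from this post, while the tag values are hypothetical; {account B} is the placeholder for your production account number, and the third cell is shown in its final form, after adding the connection.

```
[Cell 1]
%assume_role arn:aws:iam::{account B}:role/GlueSessionCreationRole

[Cell 2]
%%tags
{"team": "analytics", "billing": "data-platform"}

[Cell 3]
%iam_role arn:aws:iam::{account B}:role/AWSGlueServiceRole-blog
%connections session_vpc

[Cell 4]
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
```

These are cells for the Glue PySpark kernel rather than a standalone script: the magics only configure the session, and the session itself is created when the first code cell (the fourth one here) runs.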
In the next section, we discuss more ways to monitor sessions using tags.
Interactive session tags
Before tags were supported, if you wanted to identify the purpose of the sessions running in the account, you had to use the magic %session_id_prefix to name your session with something meaningful.
Now, with the new tags magic, you can use more sophisticated ways to categorize your sessions.
In the previous section, you tagged the session with a team and billing department. Let’s imagine now you are an administrator checking the sessions that different teams run in an account and Region.
Explore tags via the AWS CLI
On the command line where you have the AWS CLI installed, run the following command to list the sessions running in the account and Region configured (use the Region and max results parameters if needed):
You also have the option to just list sessions that have a specific tag:
You can also list all the tags associated with a specific session with the following command. Provide the Region, account, and session ID (you can get it from the list-sessions command):
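The same three lookups can be sketched in Python with boto3 instead of the AWS CLI; the comments show the CLI commands they are assumed to correspond to. The function names are hypothetical, the tag value analytics is an example, and only the first page of results is read.

```python
def session_arn(region, account_id, session_id):
    """Build the ARN of an AWS Glue interactive session."""
    return f"arn:aws:glue:{region}:{account_id}:session/{session_id}"


def list_session_ids(region, tags=None):
    """List session IDs, optionally only those carrying the given tags.

    Assumed CLI equivalents:
      aws glue list-sessions
      aws glue list-sessions --tags team=analytics
    """
    import boto3  # imported lazily; only needed when actually calling AWS

    glue = boto3.client("glue", region_name=region)
    kwargs = {"Tags": tags} if tags else {}
    return glue.list_sessions(**kwargs)["Ids"]  # first page of results only


def get_session_tags(region, account_id, session_id):
    """Return the tags of one session.

    Assumed CLI equivalent:
      aws glue get-tags --resource-arn <session ARN>
    """
    import boto3

    glue = boto3.client("glue", region_name=region)
    arn = session_arn(region, account_id, session_id)
    return glue.get_tags(ResourceArn=arn)["Tags"]
```

For example, list_session_ids("us-east-1", {"team": "analytics"}) would return the IDs of the sessions tagged for that team, and each ID can then be passed to get_session_tags.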
Explore tags via the AWS Billing console
You can also use tags to keep track of cost and do more accurate cost assignment in your company. After you have used a tag in your session, the tag will become available for billing purposes (it can take up to 24 hours to be detected).
- On the AWS Billing console, choose Cost allocation tags under Billing in the navigation pane.
- Search for and select the tags you used in the session: “team” and “billing”.
- Choose Activate.
The activation can take up to an additional 24 hours before the tag is applied for billing purposes. You only have to do this one time when you start using a new tag on an account.
- After the tags have been correctly activated and applied, choose Cost explorer under Cost Management in the navigation pane.
- In the Report parameters pane, for Tag, choose one of the tags you activated.
This adds a drop-down menu for this tag, where you can choose some or all of the tag values to use.
- Make your selection and choose Apply to use the filter on the report.
Clean up
Run the %stop_session magic in a cell to stop the session and avoid further charges. If you no longer need the notebook, VPC, or roles you created, you can delete them as well.
Conclusion
In this post, we showed how to use these new features in AWS Glue to have more control over your interactive sessions for management and security. You can enforce network restrictions, allow users from other accounts to run sessions in your account, and use tags to help you keep track of session usage and cost reports. These new features are already available, so you can start using them now.