Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration
If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment with a default lifecycle configuration.
Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, explore, transform, featurize, and process data for ML in a visual, low-code experience. It lets you easily and quickly connect to AWS components like Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation, and external sources like Snowflake and DataBricks DeltaLake. Data Wrangler supports standard data types such as CSV, JSON, ORC, and Parquet.
Studio apps are interactive applications that enable Studio’s visual interface, code authoring, and run experience. App types can be either Jupyter Server or Kernel Gateway:
- Jupyter Server – Enables access to the visual interface for Studio. Every user in Studio gets their own Jupyter Server app.
- Kernel Gateway – Enables access to the code run environment and kernels for your Studio notebooks and terminals. For more information, see Jupyter Kernel Gateway.
Lifecycle configurations (LCCs) are shell scripts to automate customization for your Studio environments, such as installing JupyterLab extensions, preloading datasets, and setting up source code repositories. LCC scripts are triggered by Studio lifecycle events, such as starting a new Studio notebook. To set a lifecycle configuration as the default for your domain or user profile programmatically, you can create a new resource or update an existing resource. To associate a lifecycle configuration as a default, you first need to create a lifecycle configuration following the steps in Creating and Associating a Lifecycle Configuration
Note: Default lifecycle configurations set up at the domain level are inherited by all users, whereas those set up at the user level are scoped to a specific user. If you apply both domain-level and user profile-level lifecycle configurations at the same time, the user profile-level lifecycle configuration takes precedence and is applied to the application irrespective of what lifecycle configuration is applied at the domain level. For more information, see Setting Default Lifecycle Configurations.
Data Wrangler accepts the default Kernel Gateway lifecycle configuration, but some of the commands defined in the default Kernel Gateway lifecycle configuration aren’t applicable to Data Wrangler, which can cause Data Wrangler to fail to start. The following screenshot shows an example of an error message you might get when launching the Data Wrangler flow. This may happen only with default lifecycle configurations and not with lifecycle configurations.
Solution overview
Customers using the default lifecycle configuration in Studio can follow this post and use the supplied code block within the lifecycle configuration script to launch a Data Wrangler app without any errors.
Set up the default lifecycle configuration
To set up a default lifecycle configuration, you must add it to the DefaultResourceSpec
of the appropriate app type. The behavior of your lifecycle configuration depends on whether it’s added to the DefaultResourceSpec
of a Jupyter Server or Kernel Gateway app:
- Jupyter Server apps – When added to the
DefaultResourceSpec
of a Jupyter Server app, the default lifecycle configuration script runs automatically when the user logs in to Studio for the first time or restarts Studio. You can use this to automate one-time setup actions for the Studio developer environment, such as installing notebook extensions or setting up a GitHub repo. For an example of this, see Customize Amazon SageMaker Studio using Lifecycle Configurations. - Kernel Gateway apps – When added to the
DefaultResourceSpec
of a Kernel Gateway app, Studio defaults to selecting the lifecycle configuration script from the Studio launcher. You can launch a notebook or terminal with the default script or choose a different one from the list of lifecycle configurations.
A default Kernel Gateway lifecycle configuration specified in DefaultResourceSpec
applies to all Kernel Gateway images in the Studio domain unless you choose a different script from the list presented in the Studio launcher.
When you work with lifecycle configurations for Studio, you create a lifecycle configuration and attach it to either your Studio domain or user profile. You can then launch a Jupyter Server or Kernel Gateway application to use the lifecycle configuration.
The following table summarizes these errors you may encounter when launching a Data Wrangler application with default lifecycle configurations.
Level at Which the Lifecycle Configuration Is Applied | Create Data Wrangler Flow Works (or) Error |
Workaround |
Domain | Bad Request Error | Apply the script (see below) |
User Profile | Bad Request Error | Apply the script (see below) |
Application | Works—No issue | Not required |
When you use the default lifecycle configuration associated with Studio and Data Wrangler (Kernel Gateway app), you might encounter Kernel Gateway app failure. In this post, we demonstrate how to set the default lifecycle configuration properly to exclude running commands in a Data Wrangler application so you don’t encounter Kernel Gateway app failure.
Let’s say you want to install a git-clone-repo script as the default lifecycle configuration that checks out a Git repository under the user’s home folder automatically when the Jupyter server starts. Let’s look at each scenario of applying a lifecycle configuration (Studio domain, user profile, or application level).
Apply lifecycle configuration at the Studio domain or user profile level
To apply the default Kernel Gateway lifecycle configuration at the Studio domain or user profile level, complete the steps in this section. We start with instructions for the user profile level.
In your lifecycle configuration script, you have to include the following code block that checks and skips the Data Wrangler Kernel Gateway app:
For example, let’s use the following script as our original (note that the folder to clone the repo is changed to /root from /home/sagemaker-user
):
The new modified script looks like the following:
You can save this script as git_command_test.sh
.
Now you run a series of commands in your terminal or command prompt. You should configure the AWS Command Line Interface (AWS CLI) to interact with AWS. If you haven’t set up the AWS CLI, refer to Configuring the AWS CLI.
- Convert your
git_command_test.sh
file into Base64 format. This requirement prevents errors due to the encoding of spacing and line breaks. - Create a Studio lifecycle configuration. The following command creates a lifecycle configuration that runs on launch of an associated Kernel Gateway app:
- Use the following API call to create a new user profile with an associated lifecycle configuration:
Alternatively, if you want to create a Studio domain to associate your lifecycle configuration at the domain level, or update the user profile or domain, you can follow the steps in Setting Default Lifecycle Configurations.
- Now you can launch your Studio app from the SageMaker Control Panel.
- In your Studio environment, on the File menu, choose New and Data Wrangler Flow.The new Data Wrangler flow should open without any issues.
- To validate the Git clone, you can open a new Launcher in Studio.
- Under Notebooks and compute resources, choose the Python 3 notebook and the Data Science SageMaker image to start your script as your default lifecycle configuration script.
You can see the Git cloned to /root
in the following screenshot.
We have successfully applied the default Kernel lifecycle configuration at the user profile level and created a Data Wrangler flow. To configure at the Studio domain level, the only change is instead of creating a user profile, you pass the ARN of the lifecycle configuration in a create-domain call.
Apply lifecycle configuration at the application level
If you apply the default Kernel Gateway lifecycle configuration at the application level, you won’t have any issues because Data Wrangler skips the lifecycle configuration applied at the application level.
Conclusion
In this post, we showed how to configure your default lifecycle configuration properly for Studio when you use Data Wrangler for data preparation and visualization requirements.
To summarize, if you need to use the default lifecycle configuration for Studio to automate customization for your Studio environments and use Data Wrangler for data preparation, you can apply the default Kernel Gateway lifecycle configuration at the user profile or Studio domain level with the appropriate code block included in your lifecycle configuration so that the default lifecycle configuration checks it and skips the Data Wrangler Kernel Gateway app.
For more information, see the following resources:
- Amazon SageMaker Studio lifecycle configuration documentation
- Amazon SageMaker Studio
- Repository of example lifecycle configuration scripts
- Debugging Lifecycle Configurations
About the Authors
Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.
Vicky Zhang is a Software Development Engineer at Amazon SageMaker. She is passionate about problem solving. In her spare time, she enjoys watching detective movies and playing badminton.
Rahul Nabera is a Data Analytics Consultant in AWS Professional Services. His current work focuses on enabling customers build their data and machine learning workloads on AWS. In his spare time, he enjoys playing cricket and volleyball.