Bringing your own R environment to Amazon SageMaker Studio

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). With a single click, data scientists and developers can quickly spin up SageMaker Studio notebooks to explore datasets and build models. On October 27, 2020, Amazon released a custom images feature that allows you to launch SageMaker Studio notebooks with your own images.

SageMaker Studio notebooks provide a set of built-in images for popular data science and ML frameworks and compute options to run notebooks. The built-in SageMaker images contain the Amazon SageMaker Python SDK and the latest version of the backend runtime process, also called kernel. With the custom images feature, you can register custom built images and kernels, and make them available to all users sharing a SageMaker Studio domain. You can start by cloning and extending one of the example Docker files provided by SageMaker, or build your own images from scratch.

This post focuses on adding a custom R image to SageMaker Studio so you can build and train your R models with SageMaker. After attaching the custom R image, you can select the image in Studio and use R to access the SDKs using the RStudio reticulate package. For more information about R on SageMaker, see Coding with R on Amazon SageMaker notebook instances and R User Guide to Amazon SageMaker.

You can create images and image versions and attach image versions to your domain using the S ageMaker Studio Control Panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS CLI)—for more information about CLI commands, see AWS CLI Command Reference. This post explains both AWS CLI and SageMaker console UI methods to attach and detach images to a SageMaker Studio domain.

Prerequisites

Before getting started, you need to meet the following prerequisites:

Install the AWS CLI on your local machine. This post uses AWS CLI version 2. You should make necessary adjustments if you use a different AWS CLI version.
Permissions to access the Amazon Elastic Container Registry (Amazon ECR). For more information, see Amazon ECR Managed Policies.
Install Docker on your local machine. For more information, see Orientation and setup. This is necessary for building Docker images and pushing them to Amazon ECR.
- For instructions on building a container from a Studio development environment, see Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks and the SageMaker Docker Build GitHub repo.
Dockerfile for the R image. A sample Dockerfile is provided in this post that you can customize for your own specific case, but you can also use your own Dockerfile.
- Building the R image from the Dockerfile installs dependencies that may be licensed under copyleft licenses such as GPLv3. You should review the license terms and make sure they are acceptable for your use case before proceeding to build this image.
An AWS Identity and Access Management (IAM) role that has the AmazonSageMakerFullAccess policy attached. If you have onboarded to SageMaker Studio, you can get the role from the Studio Summary section on the SageMaker Studio Control Panel.
A SageMaker Studio domain. For instructions, see Onboard to Amazon SageMaker Studio.

Creating your Dockerfile

Before attaching your image to Studio, you need to build a Docker image using a Dockerfile. You can build a customized Dockerfile using base images or other Docker image repositories, such as Jupyter Docker-stacks repository, and use or revise the ones that fit your specific need.

SageMaker maintains a repository of sample Docker images that you can use for common use cases (including R, Julia, Scala, and TensorFlow). This repository contains examples of Docker images that are valid custom images for Jupyter KernelGateway Apps in SageMaker Studio. These custom images enable you to bring your own packages, files, and kernels for use within SageMaker Studio.

For more information about the specifications that apply to the container image that is represented by a SageMaker image version, see Custom SageMaker image specifications.

For this post, we use the sample R Dockerfile. This Dockerfile takes the base Python 3.6 image and installs R system library prerequisites, conda via Miniconda, and R packages and Python packages that are usable via reticulate. You can create a file named Dockerfile using the following script and copy it to your installation folder. You can customize this Dockerfile for your specific use case and install additional packages.

# This project is licensed under the terms of the Modified BSD License 
# (also known as New or Revised or 3-Clause BSD), as follows:

#    Copyright (c) 2001-2015, IPython Development Team
#    Copyright (c) 2015-, Jupyter Development Team

# All rights reserved.

FROM python:3.6

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

# Setup the "sagemaker-user" user with root privileges.
RUN \
    apt-get update && \
    apt-get install -y sudo && \
    useradd -m -s /bin/bash -N -u $NB_UID $NB_USER && \
    chmod g+w /etc/passwd && \
    echo "${NB_USER}    ALL=(ALL)    NOPASSWD:    ALL" >> /etc/sudoers && \
    # Prevent apt-get cache from being persisted to this layer.
    rm -rf /var/lib/apt/lists/*

USER $NB_UID

# Make the default shell bash (vs "sh") for a better Jupyter terminal UX
ENV SHELL=/bin/bash \
    NB_USER=$NB_USER \
    NB_UID=$NB_UID \
    NB_GID=$NB_GID \
    HOME=/home/$NB_USER \
    MINICONDA_VERSION=4.6.14 \
    CONDA_VERSION=4.6.14 \
    MINICONDA_MD5=718259965f234088d785cad1fbd7de03 \
    CONDA_DIR=/opt/conda \
    PATH=$CONDA_DIR/bin:${PATH}

# Heavily inspired from https://github.com/jupyter/docker-stacks/blob/master/r-notebook/Dockerfile

USER root

# R system library pre-requisites
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    fonts-dejavu \
    unixodbc \
    unixodbc-dev \
    r-cran-rodbc \
    gfortran \
    gcc && \
    rm -rf /var/lib/apt/lists/* && \
    mkdir -p $CONDA_DIR && \
    chown -R $NB_USER:$NB_GID $CONDA_DIR && \
    # Fix for devtools https://github.com/conda-forge/r-devtools-feedstock/issues/4
    ln -s /bin/tar /bin/gtar

USER $NB_UID

ENV PATH=$CONDA_DIR/bin:${PATH}

# Install conda via Miniconda
RUN cd /tmp && \
    curl --silent --show-error --output miniconda-installer.sh https://repo.anaconda.com/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "${MINICONDA_MD5} *miniconda-installer.sh" | md5sum -c - && \
    /bin/bash miniconda-installer.sh -f -b -p $CONDA_DIR && \
    rm miniconda-installer.sh && \
    conda config --system --prepend channels conda-forge && \
    conda config --system --set auto_update_conda false && \
    conda config --system --set show_channel_urls true && \
    conda install --quiet --yes conda="${CONDA_VERSION%.*}.*" && \
    conda update --all --quiet --yes && \
    conda clean --all -f -y && \
    rm -rf /home/$NB_USER/.cache/yarn


# R packages and Python packages that are usable via "reticulate".
RUN conda install --quiet --yes \
    'r-base=4.0.0' \
    'r-caret=6.*' \
    'r-crayon=1.3*' \
    'r-devtools=2.3*' \
    'r-forecast=8.12*' \
    'r-hexbin=1.28*' \
    'r-htmltools=0.4*' \
    'r-htmlwidgets=1.5*' \
    'r-irkernel=1.1*' \
    'r-rmarkdown=2.2*' \
    'r-rodbc=1.3*' \
    'r-rsqlite=2.2*' \
    'r-shiny=1.4*' \
    'r-tidyverse=1.3*' \
    'unixodbc=2.3.*' \
    'r-tidymodels=0.1*' \
    'r-reticulate=1.*' \
    && \
    pip install --quiet --no-cache-dir \
    'boto3>1.0<2.0' \
    'sagemaker>2.0<3.0' && \
    conda clean --all -f -y

WORKDIR $HOME
USER $NB_UID

Setting up your installation folder

You need to create a folder on your local machine and add the following files in that folder:

.
├── Dockerfile
├── app-image-config-input.json
├── create-and-attach-image.sh
├── create-domain-input.json
└── default-user-settings.json

In the following scripts, the Amazon Resource Names (ARNs) should have a format similar to:

arn:partition:service:region:account-id:resource-id
arn:partition:service:region:account-id:resource-type/resource-id
arn:partition:service:region:account-id:resource-type:resource-id

Dockerfile is the Dockerfile that you created in the previous step.

Create a file named app-image-config-input.json with the following content:

{
    "AppImageConfigName": "custom-r-image-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "ir",
                "DisplayName": "R (Custom R Image)"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/home/sagemaker-user",
            "DefaultUid": 1000,
            "DefaultGid": 100
        }
    }

Create a file named default-user-settings.json with the following content. If you’re adding multiple custom images, add to the list of CustomImages.

{
  "DefaultUserSettings": {
    "KernelGatewayAppSettings": {
      "CustomImages": [
          {
                   "ImageName": "custom-r",
                   "AppImageConfigName": "custom-r-image-config"
                }
            ]
        }
    }
}

Create one last file in your installation folder named create-and-attach-image.sh using the following bash script. The script runs the following in order:

Creates a repository named smstudio-custom in Amazon ECR and logs into that repository
Builds an image using the Dockerfile and attaches a tag to the image r
Pushes the image to Amazon ECR
Creates an image for SageMaker Studio and attaches the Amazon ECR image to that image

Creates an AppImageConfigfor this image using app-image-config-input.json

# Replace with your AWS account ID and your Region, e.g. us-east-1, us-west-2
ACCOUNT_ID=<AWS ACCOUNT ID>
REGION=<STUDIO DOMAIN REGION>

# create a repository in ECR, and then login to ECR repository
aws --region ${REGION} ecr create-repository --repository-name smstudio-custom
aws ecr --region ${REGION} get-login-password | docker login --username AWS \
    --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom

# Build the docker image and push to Amazon ECR (modify image tags and name as required)
$(aws ecr get-login --region ${REGION} --no-include-email)
docker build . -t smstudio-r -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:r
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:r

# Using with SageMaker Studio
## Create SageMaker Image with the image in ECR (modify image name as required)
ROLE_ARN="<YOUR EXECUTION ROLE ARN>"

aws sagemaker create-image \
    --region ${REGION} \
    --image-name custom-r \
    --role-arn ${ROLE_ARN}

aws sagemaker create-image-version \
    --region ${REGION} \
    --image-name custom-r \
    --base-image ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:r

## Create AppImageConfig for this image (modify AppImageConfigName and 
## KernelSpecs in app-image-config-input.json as needed)
## note that 'file://' is required in the file path
aws sagemaker create-app-image-config \
    --region ${REGION} \
    --cli-input-json file://app-image-config-input.json

Updating an existing SageMaker Studio domain with a custom image

If you already have a Studio domain, you don’t need to create a new domain, and can easily update your existing domain by attaching the custom image. You can do this either using the AWS CLI for Amazon SageMaker or the SageMaker Studio Control Panel (which we discuss in the following sections). Before going to the next steps, make sure your domain is in Ready status, and get your Studio domain ID from the Studio Control Panel. The domain ID should be in d-xxxxxxxx format.

Using the AWS CLI for SageMaker

In the terminal, navigate to your installation folder and run the following commands. This makes the bash scrip executable:

chmod +x create-and-attach-image.sh

Then execute the following command in terminal:

./create-and-attach-image.sh

After you successfully run the bash script, you need update your existing domain by executing the following command in the terminal. Make sure you provide your domain ID and Region.

aws sagemaker update-domain --domain-id <DOMAIN_ID> \
    --region <REGION_ID> \
    --cli-input-json file://default-user-settings.json

After executing this command, your domain status shows as Updating for a few seconds and then shows as Ready again. You can now open Studio.

When in the Studio environment, you can use the Launcher to launch a new activity, and should see the custom-r (latest) image listed in the dropdown menu under Select a SageMaker image to launch your activity.

Using the SageMaker console

Alternatively, you can update your domain by attaching the image via the SageMaker console. The image that you created is listed on the Images page on the console.

To attach this image to your domain, on the SageMaker Studio Control Panel, under Custom images attached to domain, choose Attach image.
For Image source, choose Existing image.
Choose an existing image from the list.
Choose a version of the image from the list.
Choose Next.
Choose the IAM role. For more information, see Create a custom SageMaker image (Console).
Choose Next.
Under Studio configuration, enter or change the following settings. For information about getting the kernel information from the image, see DEVELOPMENT in the SageMaker Studio Custom Image Samples GitHub repo.
1. For EFS mount path, enter the path within the image to mount the user’s Amazon Elastic File System (Amazon EFS) home directory.
2. For Kernel name, enter the name of an existing kernel in the image.
3. (Optional) For Kernel display name, enter the display name for the kernel.
4. Choose Add kernel.
5. (Optional) For Configuration tags, choose Add new tag and add a configuration tag.

For more information, see the Kernel discovery and User data sections of Custom SageMaker image specifications.

Choose Submit.
Wait for the image version to be attached to the domain.

While attaching, your domain status is in Updating. When attached, the version is displayed in the Custom images list and briefly highlighted, and your domain status shows as Ready.

The SageMaker image store automatically versions your images. You can select a pre-attached image and choose Detach to detach the image and all versions, or choose Attach image to attach a new version. There is no limit to the number of versions per image or the ability to detach images.

Using a custom image to create notebooks

When you’re done updating your Studio domain with the custom image, you can use that image to create new notebooks. To do so, choose your custom image from the list of images in the Launcher. In this example, we use custom-r. This shows the list of kernels that you can use to create notebooks. Create a new notebook with the R kernel.

If this is the first time you’re using this kernel to create a notebook, it may take about a minute to start the kernel, and the Kernel Starting message appears on the lower left corner of your Studio. You can write R scripts while the kernel is starting but can only run your script after your kernel is ready. The notebook is created with a default ml.t3.medium instance attached to it. You can see R (Custom R Image) kernel and the instance type on the upper right corner of the notebook. You can change ML instances on the fly in SageMaker Studio. You can also right-size your instances for different workloads. For more information, see Right-sizing resources and avoiding unnecessary costs in Amazon SageMaker.

To test the kernel, enter the following sample R script in the first cell and run the script. This script tests multiple aspects, including importing libraries, creating a SageMaker session, getting the IAM role, and importing data from public repositories.

The abalone dataset in this post is from Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (http://archive.ics.uci.edu/ml/datasets/Abalone).

# Simple script to test R Kernel in SageMaker Studio

# Import reticulate, readr and sagemaker libraries
library(reticulate)
library(readr)
sagemaker <- import('sagemaker')

# Create a sagemaker session
session <- sagemaker$Session()

# Get execution role
role_arn <- sagemaker$get_execution_role()

# Read a csv file from UCI public repository
# Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].\ 
# Irvine, CA: University of California, School of Information and Computer Science
data_file <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

# Copy data to a dataframe, rename columns, and show dataframe head
abalone <- read_csv(file = data_file, col_names = FALSE, col_types = cols())
names(abalone) <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(abalone)

If the image is set up properly and the kernel is running, the output should look like the following screenshot.

Listing, detaching, and deleting custom images

If you want to see the list of custom images attached to your Studio, you can either use the AWS CLI or go to SageMaker console to view the attached image in the Studio Control Panel.

Using the AWS CLI for SageMaker

To view your list of custom images via the AWS CLI, enter the following command in the terminal (provide the Region in which you created your domain):

aws sagemaker list-images --region <region-id>

The response includes the details for the attached custom images:

{
    "Images": [
        {
            "CreationTime": "xxxxxxxxxxxx",
            "ImageArn": "arn:aws:sagemaker:us-east-2:XXXXXXX:image/custom-r",
            "ImageName": "custom-r",
            "ImageStatus": "CREATED",
            "LastModifiedTime": "xxxxxxxxxxxxxx"
        },
        ....
    ]
}

If you want to detach or delete an attached image, you can do it on the SageMaker Studio Control Panel (see Detach a custom SageMaker image). Alternatively, use the custom image name from your default-user-settings.json file and rerun the following command to update the domain by detaching the image:

aws sagemaker update-domain --domain-id <YOUR DOMAIN ID> \
    --cli-input-json file://default-user-settings.json

Then, delete the app image config:

aws sagemaker delete-app-image-config \
    --app-image-config-name custom-r-image-config

Delete the SageMaker image, which also deletes all image versions. The container images in Amazon ECR that are represented by the image versions are not deleted.

aws sagemaker delete-image \
    --region <region-id> \
    --image-name custom-r

After deleting the image, it will not be listed under custom images in SageMaker Studio. For more information, see Clean up resources.

Using the SageMaker console

You can also detach (and delete) images from your domain via the Studio Control Panel UI. To do so, under Custom images attached to domain, select the image and choose Detach. You have the option to also delete all versions of the image from your domain. This detaches the image from the domain.

Getting logs in Amazon CloudWatch

You can also get access to SageMaker Studio logs in Amazon CloudWatch, which you can use for troubleshooting your environment. The metrics are captured under the /aws/sagemaker/studio namespace.

To access the logs, on the CloudWatch console, choose CloudWatch Logs. On the Log groups page, enter the namespace to see logs associated with the Jupyter server and the kernel gateway.

For more information, see Log Amazon SageMaker Events with Amazon CloudWatch.

Conclusion

This post outlined the process of attaching a custom Docker image to your Studio domain to extend Studio’s built-in images. We discussed how you can update an existing domain with a custom image using either the AWS CLI for SageMaker or the SageMaker console. We also explained how you can use the custom image to create notebooks with custom kernels.

For more information, see the following resources:

About the Authors

Nick Minaie is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solution Architect, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Nick enjoys family time, abstract painting, and exploring nature.

Sam Liu is a product manager at Amazon Web Services (AWS). His current focus is the infrastructure and tooling of machine learning and artificial intelligence. Beyond that, he has 10 years of experience building machine learning applications in various industries. In his spare time, he enjoys making short videos for technical education or animal protection.

Artificial Intelligence

Bringing your own R environment to Amazon SageMaker Studio

Prerequisites

Creating your Dockerfile

Setting up your installation folder

Updating an existing SageMaker Studio domain with a custom image

Using the AWS CLI for SageMaker

Using the SageMaker console

Using a custom image to create notebooks

Listing, detaching, and deleting custom images

Using the AWS CLI for SageMaker

Using the SageMaker console

Getting logs in Amazon CloudWatch

Conclusion

About the Authors

Resources

Blog Topics

Follow

Learn

Resources

Developers

Help