AWS Machine Learning Blog

Using R with Amazon SageMaker

July 2022: This post was reviewed and updated for relevancy and accuracy, with an updated AWS CloudFormation template.

December 2020: Post updated with changes required for Amazon SageMaker SDK v2

This blog post describes how to train, deploy, and retrieve predictions from a machine learning (ML) model using Amazon SageMaker and R. The model predicts abalone age as measured by the number of rings in the shell. We use the reticulate package as an R interface to the Amazon SageMaker Python SDK to make API calls to Amazon SageMaker. The reticulate package translates between R and Python objects, and Amazon SageMaker provides a serverless data science environment to train and deploy ML models at scale.

To follow along with this blog post, you should have a basic understanding of R and be familiar with the following tidyverse packages: dplyr, readr, stringr, and ggplot2. You run the code in RStudio, an integrated development environment (IDE) for working with R. Specifically, we use the fully managed RStudio on Amazon SageMaker.

Launching AWS CloudFormation

Please refer to Get started with RStudio on Amazon SageMaker, which details the steps to create a SageMaker domain with RStudio. Use the provided AWS CloudFormation stack to create a domain with a user profile.

Launching this stack creates the following resources:

  • A SageMaker domain with RStudio
  • A SageMaker RStudio user profile
  • An IAM service role for SageMaker RStudio domain execution
  • An IAM service role for SageMaker RStudio user profile execution

After you launch the stack, follow these steps to configure and connect to RStudio:

  1. On the Select template page, choose Next.
  2. On the Specify stack details page, in the Stack name section, enter a name.
  3. Leave the Execution Role Arn parameter blank unless you already have the required role created.
  4. For the Vpc Id parameter, select your VPC.
  5. For the Subnet Id(s) parameter, select your subnets.
  6. For the App Network Access Type parameter, select either PublicInternetOnly or VpcOnly.
  7. For the Security Group(s) parameter, select your security groups.
  8. Leave the Domain Execution Role Arn parameter blank unless you already have the required role created.
  9. Leave the remaining required parameters as is.
  10. Leave the following optional parameters blank; we won't use them:
    1. Customer managed CMK
    2. RStudio Connect URL
    3. RStudio Package Manager URL
    4. Three RStudio custom images
  11. At the bottom of the Specify stack details page, choose Next.
  12. On the Configure stack options page, choose Next.
  13. On the Review page, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box and choose Create stack.
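If you prefer the command line to the console, you can create the same stack with the AWS CLI. The following is a minimal sketch rather than the post's method: it assumes you've downloaded the template locally as rstudio-sagemaker-domain.yaml (a hypothetical file name) and that you supply the same parameters as in the console steps above. The capabilities flag is required because the stack creates named IAM resources.

# Hypothetical sketch: create the stack with the AWS CLI, invoked from R via system().
# Add --parameters entries for your VPC, subnets, and security groups as needed.
system(paste(
  'aws cloudformation create-stack',
  '--stack-name rstudio-sagemaker-domain',
  '--template-body file://rstudio-sagemaker-domain.yaml',
  '--capabilities CAPABILITY_NAMED_IAM'
))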

Once the stack status is CREATE_COMPLETE, navigate to the Amazon SageMaker Control Panel and launch the RStudio app for rstudio-user.

On the RStudio Workbench launcher page, start a new R session using the RSession Base image.

Reticulating the Amazon SageMaker Python SDK

First, load the reticulate library and import the sagemaker Python module. Once the module is loaded, use R's $ notation in place of Python's . notation to view the available classes.

Start with the Session class, which provides operations for working with the boto3 resources that Amazon SageMaker uses, such as Amazon S3, the SageMaker API, and the SageMaker Runtime API.

To view the objects available to the Session class, use the $ notation.
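As a short sketch, you can perform the same inspection programmatically with reticulate's py_list_attributes() helper, which lists a Python object's attributes from R:

library(reticulate)
sagemaker <- import('sagemaker')

# List the classes and submodules exposed by the sagemaker module
py_list_attributes(sagemaker)

# Create a Session and list the methods and attributes it exposes
session <- sagemaker$Session()
py_list_attributes(session)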

Creating and accessing the data storage

Let’s create an Amazon Simple Storage Service (Amazon S3) bucket for your data. You will need the IAM role that allows Amazon SageMaker to access the bucket.

Specify the Amazon S3 bucket to store the training data, the model’s binary file, and output from the training job:

library(reticulate)
sagemaker <- import('sagemaker')
session <- sagemaker$Session()
bucket <- session$default_bucket()
role_arn <- sagemaker$get_execution_role()

Note:

  • You do not need to install Miniconda. Type n when you are prompted.
  • The default_bucket function creates a unique Amazon S3 bucket with the following name: sagemaker-<aws-region-name>-<aws-account-number> (see the example following this list).
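For example, with a hypothetical Region and account number, the bucket name looks like this:

bucket
# [1] "sagemaker-us-east-2-123456789012"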

Downloading and processing the dataset

The model uses the abalone dataset from the UCI Machine Learning Repository. First, download the data and start the exploratory data analysis. Use tidyverse packages to read the data, plot the data, and transform it into the format that the Amazon SageMaker algorithm expects:

library(readr)
data_file <- 's3://sagemaker-sample-files/datasets/tabular/uci_abalone/abalone.csv'
work_dir <- getwd()
# Copy the dataset from the public SageMaker sample files bucket to the working directory
system(paste('aws s3 cp', data_file, work_dir))
column_names <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
abalone <- read_csv(file = file.path(work_dir, 'abalone.csv'), col_names = column_names)
head(abalone)

The output of head() shows that sex (F is female, M is male, and I is infant) is a categorical variable but is currently stored as a character data type. Change sex to a factor and view the statistical summary of the dataset:

abalone$sex <- as.factor(abalone$sex)
summary(abalone)

The summary shows that the minimum value for height is 0.

Visually explore which abalones have height equal to 0 by plotting the relationship between rings and height for each value of sex:

library(ggplot2)
ggplot(abalone, aes(x = height, y = rings, color = sex)) + geom_point() + geom_jitter()

The plot shows multiple outliers: two infant abalones with a height of 0 and a few female and male abalones with greater heights than the rest. Let’s filter out the two infant abalones with a height of 0.

library(dplyr)
abalone <- abalone %>%
  filter(height != 0)

Preparing the dataset for model training

The model needs three datasets: one each for training, testing, and validation. First, convert sex into a dummy variable and move the target, rings, to the first column. Amazon SageMaker algorithms require the target to be in the first column of the dataset.

abalone <- abalone %>%
  mutate(female = as.integer(ifelse(sex == 'F', 1, 0)),
         male = as.integer(ifelse(sex == 'M', 1, 0)),
         infant = as.integer(ifelse(sex == 'I', 1, 0))) %>%
  select(-sex)
abalone <- abalone %>%
  select(rings:infant, length:shell_weight)
head(abalone)

This code produces a dataset with rings in the first column, followed by the three sex dummy variables and the remaining measurements.

Next, sample 70% of the data for training the ML algorithm. Split the remaining 30% into two halves, one for testing and one for validation:

abalone_train <- abalone %>%
  sample_frac(size = 0.7)
abalone <- anti_join(abalone, abalone_train)
abalone_test <- abalone %>%
  sample_frac(size = 0.5)
abalone_valid <- anti_join(abalone, abalone_test)
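Note that sample_frac() draws rows at random, so the exact split varies between runs. As an optional addition that isn't in the original post, you can call set.seed() before the sampling code above to make the 70/15/15 split reproducible:

set.seed(42)  # any fixed seed; run before sample_frac() so the split is reproducible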

Upload the training and validation data to Amazon S3 so that you can train the model. First, write the training and validation datasets to the local filesystem in .csv format:

write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)
write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)

Second, upload the two datasets to the Amazon S3 bucket under the data key prefix:

s3_train <- session$upload_data(path = 'abalone_train.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')
s3_valid <- session$upload_data(path = 'abalone_valid.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')

Finally, define the Amazon S3 input types for the Amazon SageMaker algorithm:

s3_train_input <- sagemaker$TrainingInput(s3_data = s3_train,
                                          content_type = 'csv')
s3_valid_input <- sagemaker$TrainingInput(s3_data = s3_valid,
                                          content_type = 'csv')

Training the model

Amazon SageMaker algorithms are packaged as Docker containers. To train an XGBoost model, retrieve the URI of the XGBoost training container in Amazon Elastic Container Registry (Amazon ECR) for your AWS Region:

container <- sagemaker$image_uris$retrieve(framework = 'xgboost',
                                           region = session$boto_region_name,
                                           version = 'latest')

Define an Amazon SageMaker Estimator, which can train any supplied algorithm that has been containerized with Docker. When creating the Estimator, use the following arguments:

  • image_uri – The container image to use for training
  • role – The Amazon SageMaker service role that you created
  • instance_count – The number of Amazon EC2 instances to use for training
  • instance_type – The type of Amazon EC2 instance to use for training
  • volume_size – The size in GB of the Amazon Elastic Block Store (Amazon EBS) volume to use for storing input data during training
  • max_run – The timeout in seconds for training
  • input_mode – The input mode that the algorithm supports
  • output_path – The Amazon S3 location for saving the training results (model artifacts and output files)
  • output_kms_key – The AWS Key Management Service (AWS KMS) key for encrypting the training output
  • base_job_name – The prefix for the name of the training job
  • sagemaker_session – The Session object that manages interactions with the Amazon SageMaker API

s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_uri = container,
                                           role = role_arn,
                                           instance_count = 1L,
                                           instance_type = 'ml.m5.large',
                                           volume_size = 30L,
                                           max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = session)

Note

The equivalent to None in Python is NULL in R.

Specify the XGBoost hyperparameters and fit the model. Set the number of rounds for training to 100, which is the default value when using the XGBoost library outside of Amazon SageMaker. Also specify the input data and a job name based on the current timestamp:

estimator$set_hyperparameters(num_round = 100L)
job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)
estimator$fit(inputs = input_data,
              job_name = job_name)

Once training has finished, Amazon SageMaker copies the model binary (a gzip tarball) to the specified Amazon S3 output location. Get the full Amazon S3 path with this command:

estimator$model_data
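With a hypothetical bucket and job name, the returned path looks like this:

# [1] "s3://sagemaker-us-east-2-123456789012/output/sagemaker-train-xgboost-20-38-47/output/model.tar.gz"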

Deploying the model

Amazon SageMaker lets you deploy your model by providing an endpoint that consumers can invoke through a simple, secure HTTPS API call.

Let’s deploy our trained model to an ml.t2.medium instance. For more information, see Amazon SageMaker ML Instance Types.

model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium')

Generating predictions with the model

Use the test data to generate predictions. Specify the CSVSerializer so that requests to the endpoint are serialized as comma-separated text:

model_endpoint$serializer <- sagemaker$serializers$CSVSerializer()

Remove the target column and convert the first 500 observations to a matrix with no column names:

abalone_test <- abalone_test[-1]
num_predict_rows <- 500
test_sample <- as.matrix(abalone_test[1:num_predict_rows, ])
dimnames(test_sample)[[2]] <- NULL

Note

We chose 500 observations so that a single request stays within the endpoint's invocation payload limit.

Generate predictions from the endpoint and convert the returned comma-separated string:

library(stringr)
predictions <- model_endpoint$predict(test_sample)
predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
predictions <- as.numeric(predictions)
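If you later need predictions for more rows than a single request allows, one approach is to invoke the endpoint once per batch. The helper below is a hypothetical sketch (the function name and batching logic are our own) that reuses the endpoint, serializer, and parsing pattern shown above:

# Hypothetical helper: split rows into batches, invoke the endpoint per batch,
# and combine the numeric predictions into one vector.
predict_in_batches <- function(endpoint, data, batch_size = 500) {
  starts <- seq(1, nrow(data), by = batch_size)
  unlist(lapply(starts, function(i) {
    batch <- as.matrix(data[i:min(i + batch_size - 1, nrow(data)), ])
    dimnames(batch)[[2]] <- NULL  # drop column names, as above
    response <- endpoint$predict(batch)
    as.numeric(str_split(response, pattern = ',', simplify = TRUE))
  }))
}

# For example: all_predictions <- predict_in_batches(model_endpoint, abalone_test)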

Column-bind the predicted rings to the test data:

abalone_test <- cbind(predicted_rings = predictions, 
                      abalone_test[1:num_predict_rows, ])
head(abalone_test)    

The predicted_rings column now holds the predicted ages (number of shell rings) for the first 500 test observations.

Deleting the endpoint

When you’re done with the model, delete the endpoint to avoid incurring deployment costs:

model_endpoint$delete_endpoint()

Conclusion

In this blog post, you learned how to build and deploy an ML model by using Amazon SageMaker with R. Typically, you execute this workflow with Python, but we showed how you could also do it with R.


About the Author

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.