AWS Machine Learning Blog

Using R with Amazon SageMaker

This blog post describes how to train, deploy, and retrieve predictions from a machine learning (ML) model using Amazon SageMaker and R. The model predicts abalone age as measured by the number of rings in the shell. You'll use the reticulate package as an R interface to the Amazon SageMaker Python SDK to make API calls to Amazon SageMaker. The reticulate package translates between R and Python objects, and Amazon SageMaker provides a fully managed environment for training and deploying ML models at scale.

To follow along with this blog post, you should have a basic understanding of R and be familiar with the following tidyverse packages: dplyr, readr, stringr, and ggplot2. You'll use RStudio, an integrated development environment (IDE) for working with R, to run the code. RStudio is available under either a commercial license or the AGPLv3.

Launching AWS CloudFormation

Use an AWS CloudFormation stack to install, configure, and connect to RStudio on an Amazon Elastic Compute Cloud (Amazon EC2) instance that works with Amazon SageMaker.

Launching this stack creates the following resources:

  • A public virtual private cloud (VPC)
  • An Amazon EC2 instance (t2.medium)
  • A security group allowing SSH access only
  • An AWS Identity and Access Management (IAM) role for Amazon EC2 with Amazon SageMaker permissions
  • An IAM service role for Amazon SageMaker
  • An installation of RStudio
  • The required R packages
  • An installation of the Amazon SageMaker Python SDK

After you launch the stack, follow these steps to configure and connect to RStudio:

  1. On the Select Template page, choose Next.
  2. On the Specify Details page, choose your key pair for KeyName.
  3. On the Options page, choose Next.
  4. On the Review page, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box and choose Create.
  5. Once the stack has reached CREATE_COMPLETE status, choose the Outputs tab.
  6. Copy the SSH string from the Value field and paste it into a terminal window. (An example of the string's general shape follows.)
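
The exact string comes from your stack's Outputs tab, but it follows this general shape (all values shown here are placeholders, not real values):

ssh -i <your-key-pair>.pem -L 8787:localhost:8787 <user>@<ec2-public-dns>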

The SSH command forwards port 8787 to your computer while connecting to the new instance. Once connected, open a browser window and type localhost:8787 in the address bar.

Note – You might see connect failed messages in your terminal while RStudio and the required R packages are installing. The installation process takes approximately 15 minutes.

Sign in with the following credentials:

  • Username: rstudio
  • Password: rstudio

Reticulating the Amazon SageMaker Python SDK

First, load the reticulate library and import the sagemaker Python module. Once the module is loaded, use the $ notation in R, instead of the . notation in Python, to view the classes that the module makes available.

Start with the Session class. It provides operations for working with the boto3 resources that Amazon SageMaker uses, such as training jobs, endpoints, and input datasets in Amazon S3.

To view the objects available to the Session class, use the same $ notation; in RStudio, tab completion after the $ lists them.
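
If you prefer a programmatic view over tab completion, reticulate's py_list_attributes() function lists everything a Python object exposes. Here's a small self-contained sketch:

library(reticulate)
sagemaker <- import('sagemaker')

# List the classes and functions that the sagemaker module exposes
py_list_attributes(sagemaker)

# Create a Session and inspect the methods available on it
session <- sagemaker$Session()
py_list_attributes(session)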

Creating and accessing the data storage

Let’s create an Amazon Simple Storage Service (Amazon S3) bucket for your data. You will need the IAM role that allows Amazon SageMaker to access the bucket.

Specify the Amazon S3 bucket to store the training data, the model’s binary file, and output from the training job:

library(reticulate)
sagemaker <- import('sagemaker')
session <- sagemaker$Session()
bucket <- session$default_bucket()

Note – The default_bucket function creates a unique Amazon S3 bucket with the following name: sagemaker-<aws-region-name>-<aws account number>.
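
For example, printing bucket returns the generated name (the account number shown here is hypothetical):

bucket
# [1] "sagemaker-us-west-2-123456789012"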

Specify the IAM role’s ARN to allow Amazon SageMaker to access the Amazon S3 bucket:

role_arn <- session$expand_role('sagemaker-service-role')

The AWS CloudFormation stack automatically created an IAM role called sagemaker-service-role with the required policies. Because Amazon SageMaker needs the role's full ARN rather than just its name, the expand_role function retrieves it.
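
Printing role_arn confirms the fully expanded ARN (the account number shown here is hypothetical):

role_arn
# [1] "arn:aws:iam::123456789012:role/sagemaker-service-role"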

Downloading and processing the dataset

The model uses the abalone dataset from the UCI Machine Learning Repository. First, download the data and start the exploratory data analysis. Use tidyverse packages to read and plot the data, and to transform it into the format that the Amazon SageMaker algorithm expects:

library(readr)
data_file <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone <- read_csv(file = data_file, col_names = FALSE)
names(abalone) <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(abalone)

The head() output shows that sex is currently a character data type, although it is really a factor (F is female, M is male, and I is infant). Change sex to a factor and view the statistical summary of the dataset:

abalone$sex <- as.factor(abalone$sex)
summary(abalone)

The summary shows that the minimum value for height is 0.

Visually explore which abalones have height equal to 0 by plotting the relationship between rings and height for each value of sex:

library(ggplot2)
ggplot(abalone, aes(x = height, y = rings, color = sex)) + geom_point() + geom_jitter()


The plot shows multiple outliers: two infant abalones with a height of 0 and a few female and male abalones with greater heights than the rest. Let’s filter out the two infant abalones with a height of 0.

library(dplyr)
abalone <- abalone %>%
  filter(height != 0)

Preparing the dataset for model training

The model needs three datasets: one each for training, testing, and validation. First, convert sex into a dummy variable and move the target, rings, to the first column. Amazon SageMaker algorithms require the target to be in the first column of the dataset.

abalone <- abalone %>%
  mutate(female = as.integer(ifelse(sex == 'F', 1, 0)),
         male = as.integer(ifelse(sex == 'M', 1, 0)),
         infant = as.integer(ifelse(sex == 'I', 1, 0))) %>%
  select(-sex)
abalone <- abalone %>%
  select(rings:infant, length:shell_weight)
head(abalone)

The head() call displays the transformed dataset, with the target rings in the first column, followed by the sex dummy variables and the measurement features.

Next, sample 70% of the data for training the ML algorithm. Split the remaining 30% into two halves, one for testing and one for validation:

abalone_train <- abalone %>%
  sample_frac(size = 0.7)
abalone <- anti_join(abalone, abalone_train)
abalone_test <- abalone %>%
  sample_frac(size = 0.5)
abalone_valid <- anti_join(abalone, abalone_test)
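
Because sample_frac draws rows at random, you can call set.seed() before splitting if you want a reproducible result. As an optional sanity check, confirm that the three datasets cover roughly a 70/15/15 split:

# Row counts should be roughly 70%, 15%, and 15% of the filtered data
nrow(abalone_train)
nrow(abalone_test)
nrow(abalone_valid)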

Upload the training and validation data to Amazon S3 so that you can train the model. First, write the training and validation datasets to the local filesystem in .csv format:

write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)
write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)

Second, upload the two datasets to the Amazon S3 bucket under the data key prefix:

s3_train <- session$upload_data(path = 'abalone_train.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')
s3_valid <- session$upload_data(path = 'abalone_valid.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')
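
The upload_data function returns the S3 URI of each uploaded object; printing the values confirms where the algorithm will read from (the bucket name shown here is hypothetical):

s3_train
# [1] "s3://sagemaker-us-west-2-123456789012/data/abalone_train.csv"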

Finally, define the Amazon S3 input types for the Amazon SageMaker algorithm:

s3_train_input <- sagemaker$s3_input(s3_data = s3_train,
                                     content_type = 'csv')
s3_valid_input <- sagemaker$s3_input(s3_data = s3_valid,
                                     content_type = 'csv')

Training the model

Amazon SageMaker algorithms are packaged as Docker containers. To train an XGBoost model, specify the training container in Amazon Elastic Container Registry (Amazon ECR) for your AWS Region:

containers <- list('us-west-2' = '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
  'us-east-1' = '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
  'us-east-2' = '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
  'eu-west-1' = '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest')
container <- containers[session$boto_region_name][[1]]
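
If the stack runs in a Region that isn't in this list, the lookup returns NULL. A small guard (an optional addition) makes that failure explicit:

# Stop early with a clear message if no container is mapped for this Region
if (is.null(container)) {
  stop(paste('No XGBoost container defined for region:', session$boto_region_name))
}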

Define an Amazon SageMaker Estimator, which can train any supplied algorithm that has been containerized with Docker. When creating the Estimator, use the following arguments:

  • image_name – The container image to use for training
  • role – The Amazon SageMaker service role that you created
  • train_instance_count – The number of Amazon EC2 instances to use for training
  • train_instance_type – The type of Amazon EC2 instance to use for training
  • train_volume_size – The size in GB of the Amazon Elastic Block Store (Amazon EBS) volume to use for storing input data during training
  • train_max_run – The timeout in seconds for training
  • input_mode – The input mode that the algorithm supports
  • output_path – The Amazon S3 location for saving the training results (model artifacts and output files)
  • output_kms_key – The AWS Key Management Service (AWS KMS) key for encrypting the training output
  • base_job_name – The prefix for the name of the training job
  • sagemaker_session – The Session object that manages interactions with the Amazon SageMaker API

s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_name = container,
                                           role = role_arn,
                                           train_instance_count = 1L,
                                           train_instance_type = 'ml.m5.large',
                                           train_volume_size = 30L,
                                           train_max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = NULL)

Note – The equivalent to None in Python is NULL in R.
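
reticulate applies this conversion automatically when it passes arguments to Python; you can see it directly with r_to_py():

r_to_py(NULL)
# None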

Specify the XGBoost hyperparameters and fit the model. Set the number of rounds for training to 100, which is the default value when using the XGBoost library outside of Amazon SageMaker. Also specify the input data and a job name based on the current timestamp:

estimator$set_hyperparameters(num_round = 100L)
job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)
estimator$fit(inputs = input_data,
              job_name = job_name)

Once training has finished, Amazon SageMaker copies the model binary (a gzip tarball) to the specified Amazon S3 output location. Get the full Amazon S3 path with this command:

estimator$model_data
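
The returned value is the S3 URI of the model.tar.gz artifact under the output path you specified; it looks something like the following (bucket and job name are hypothetical):

# [1] "s3://sagemaker-us-west-2-123456789012/output/sagemaker-train-xgboost-10-30-59/output/model.tar.gz"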

Deploying the model

Amazon SageMaker lets you deploy your model through an endpoint that consumers can invoke with a simple, secure HTTPS API call. Let's deploy our trained model to an ml.t2.medium instance. For more information, see Amazon SageMaker ML Instance Types.

model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium')
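
The deploy method returns a RealTimePredictor object. Its endpoint attribute holds the name of the deployed endpoint, which you'll use later when you delete it:

model_endpoint$endpoint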

Generating predictions with the model

Use the test data to generate predictions. The endpoint expects comma-separated text, so set its content type to text/csv and attach the csv_serializer, which converts the test matrix into CSV text for the request:

model_endpoint$content_type <- 'text/csv'
model_endpoint$serializer <- sagemaker$predictor$csv_serializer

Remove the target column and convert the first 500 observations to a matrix with no column names:

abalone_test <- abalone_test[-1]
num_predict_rows <- 500
test_sample <- as.matrix(abalone_test[1:num_predict_rows, ])
dimnames(test_sample)[[2]] <- NULL

Note – A sample of 500 observations was chosen to keep the request payload within the endpoint's invocation size limit.

Generate predictions from the endpoint and convert the returned comma-separated string:

library(stringr)
predictions <- model_endpoint$predict(test_sample)
predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
predictions <- as.numeric(predictions)

Column-bind the predicted rings to the test data:

abalone_test <- cbind(predicted_rings = predictions, 
                      abalone_test[1:num_predict_rows, ])
head(abalone_test)    

The head() output shows the predicted ages (predicted_rings) alongside the test observations.

Deleting the endpoint

When you’re done with the model, delete the endpoint to avoid incurring deployment costs:

session$delete_endpoint(model_endpoint$endpoint)

Conclusion

In this blog post, you learned how to build and deploy an ML model by using Amazon SageMaker with R. Typically, you execute this workflow with Python, but we showed how you could also do it with R.


About the Author

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.