AWS Machine Learning Blog

Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker

Today, we announced RStudio on Amazon SageMaker, the first machine learning (ML) integrated development environment (IDE) in the cloud for data scientists working in R. The open-source language R and its rich ecosystem of more than 18,000 packages have long been a top choice for statisticians, quant analysts, data scientists, and ML engineers. RStudio on SageMaker makes it easy for data scientists to run statistical analysis, build ML models, and create data science content in a centralized environment for the team without worrying about the compute infrastructure.

Alongside RStudio Workbench, the RStudio suite for R developers includes RStudio Connect and RStudio Package Manager. RStudio Connect makes it easy to surface ML and data science insights from data scientists’ complex work and put them in the hands of decision-makers. It is designed to allow data scientists to publish insights, dashboards, and web applications, and it makes hosting and managing that content simple and scalable for wide consumption.

RStudio Package Manager helps organize and centralize R packages across ML teams and organizations. As data scientists develop their ML models, they need various packages with different capabilities for their ML use cases in RStudio. Manually managing the sources and versions of these packages across numerous public repositories is error-prone and time-consuming for enterprise users. RStudio Package Manager mitigates these issues by managing the package repository centrally for your organization, so that data scientists can install packages quickly and securely and keep projects reproducible and repeatable. Security and reproducibility are especially important in regulated industries such as healthcare and finance.

In this post, we first show you how to architect and deploy RStudio Connect and RStudio Package Manager with a well-architected solution in AWS. We then show you how to use RStudio Connect and RStudio Package Manager from RStudio on SageMaker. We use a UCI breast cancer dataset to build several types of ML content in the R language in RStudio on SageMaker. The ML content we demonstrate in the post includes an R Markdown document and an R Shiny application.

Solution overview

The solution architecture is based on professional versions of RStudio Connect and RStudio Package Manager Docker containers. RStudio Connect and RStudio Package Manager are configured across two Availability Zones for high availability. Both RStudio Connect and RStudio Package Manager containers support automatic scaling to handle incoming traffic, based on the number of requests and on memory and CPU usage within the containers.

Container images are stored in and fetched from Amazon Elastic Container Registry (Amazon ECR) with vulnerability scanning enabled. Vulnerabilities that the scan reports should be addressed before deploying the images.

The following diagram illustrates the solution architecture.

The following are the steps in the solution workflow:

  1. R users access RStudio Connect and RStudio Package Manager via Amazon Route 53. Route 53 is a DNS service for incoming requests.
  2. Route 53 resolves incoming requests and forwards those to AWS WAF for security checks.
  3. Valid requests reach an Application Load Balancer (ALB), which forwards these to the Amazon Elastic Container Service (Amazon ECS) cluster. The ALB checks incoming requests for an HTTPS certificate, which is issued and validated by AWS Certificate Manager.
  4. Amazon ECS controls the containers in a cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances (EC2 launch type) in an Auto Scaling group and is responsible for scaling up and down the number of containers as needed using an Amazon ECS capacity provider.
  5. Incoming requests are processed by the RStudio Connect server on any of the available RStudio Connect containers; users are authenticated and applications are rendered on the web browser. RStudio Package Manager requests are routed to the Package Manager container.
  6. Amazon Aurora Serverless PostgreSQL databases are used to provide high availability utilizing multiple containers for both RStudio Connect and RStudio Package Manager. Aurora backs up the serverless cluster databases automatically. Data on Aurora is encrypted at rest using AWS Key Management Service (AWS KMS).
  7. Amazon Elastic File System (Amazon EFS) provides the persistent file system required by RStudio Connect and RStudio Package Manager. Data on Amazon EFS is encrypted at rest using AWS KMS. Amazon EFS is an NFS file system that stores data in multiple Availability Zones in an AWS Region for data durability and high availability. Files created on the RStudio Connect and RStudio Package Manager container Amazon EFS mounts are automatically backed up by Amazon EFS.
  8. If the user session communicates with the public internet, outbound requests are sent to a NAT gateway from the private container subnet.
  9. The NAT gateway sends outbound requests to be processed via an internet gateway. Routes to the internet can also be configured through AWS Transit Gateway.

We use AWS Cloud Development Kit (AWS CDK) for Python to develop the infrastructure code and store the code in an AWS CodeCommit repository, so that AWS CodePipeline can integrate the AWS CDK stacks for automated builds.

The deployment code uses Route 53 public hosted zones to serve RStudio Connect and RStudio Package Manager on publicly accessible URLs. Alternatively, you can use Route 53 private hosted zones for the RStudio Connect and RStudio Package Manager containers with an internal ALB, which provides private endpoints for users coming from RStudio on SageMaker in a VPC-only connectivity mode. With that option, you don’t need a preexisting public domain in your AWS account. However, you need to fetch the public Docker images (RStudio Connect, RStudio Package Manager), store them in a private Amazon ECR repository, and point the deployment code to those images for the infrastructure build.

If all communications between AWS services must stay within AWS, you can use AWS PrivateLink to configure VPC endpoints for AWS services. AWS PrivateLink makes sure that inter-service traffic is not exposed to the internet for AWS service endpoints.

You can also refer to the RStudio Team solution from RStudio to learn how to deploy an RStudio technology stack on Amazon EC2 in AWS as an alternative to the solution discussed in this post.

Prerequisites

To deploy the AWS CDK stacks from the source code, you need to review and perform the prerequisites described in the accompanying GitHub repository to make sure you have the necessary resources to proceed.

Launch the solution

  1. Clone the GitHub repository, check out the rsc-rspm branch, and move into the aws-fargate-with-rstudio-open-source folder.
  2. Create a CodeCommit repository to hold the source code for installation of RStudio Connect/RStudio Package Manager with the following command:
    aws codecommit --profile <profile of AWS account> create-repository --repository-name <name of repository>
  3. Pass the required parameters in cdk.json following Step 3 in the Installation Steps section of the readme file.
  4. Install the package requirements for the AWS CDK application:
    python3 -m pip install -r requirements.txt
  5. Before committing the code into the CodeCommit repository, synthesize the AWS CDK stacks. This ensures all the necessary context values are populated into the cdk.context.json file and avoids the dummy values being mapped.
    cdk synth --profile <AWS CLI profile of the account>
  6. Commit the changes into the CodeCommit repo you created. Follow Step 5 in the Installation Steps of the readme if you need help with the Git commands.
  7. Deploy the AWS CDK stacks to install RStudio Connect/RStudio Package Manager using CodePipeline. This step takes around 30 minutes.
    cdk deploy --profile <AWS CLI profile of the account>
  8. Navigate to the CodePipeline console (the link takes you to the us-west-2 Region). Monitor the pipeline and confirm that the services are built successfully.

The pipeline name is RSC-RSPM-App-Pipeline-<instance>. From this point onwards, the pipeline is triggered on commits to the CodeCommit repository you created. There is no need to run cdk deploy (Step 7) anymore.
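If you prefer to check the pipeline from a script rather than the console, the monitoring step could be sketched as follows. This is a minimal sketch, not part of the solution's code: it assumes boto3 is installed and credentials are configured, the `dev` instance value and the `NotStarted` label are placeholders chosen for illustration, and the helper names are hypothetical. Only `get_pipeline_state` is an actual CodePipeline API call.

```python
from typing import Dict, List


def pipeline_name(instance: str) -> str:
    """Build the pipeline name given in this post for an instance value."""
    return f"RSC-RSPM-App-Pipeline-{instance}"


def summarize_stages(state: Dict) -> List[str]:
    """Turn a get_pipeline_state response into 'stage: status' lines."""
    lines = []
    for stage in state.get("stageStates", []):
        # A stage that has never run has no latestExecution entry.
        # "NotStarted" is our own label, not a CodePipeline status value.
        status = stage.get("latestExecution", {}).get("status", "NotStarted")
        lines.append(f"{stage['stageName']}: {status}")
    return lines


def fetch_pipeline_state(instance: str, region: str = "us-west-2") -> Dict:
    """Query CodePipeline; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("codepipeline", region_name=region)
    return client.get_pipeline_state(name=pipeline_name(instance))


if __name__ == "__main__":
    # "dev" is a placeholder for the instance value you set in cdk.json.
    for line in summarize_stages(fetch_pipeline_state("dev")):
        print(line)
```

Keeping the summary logic separate from the API call lets you test it against a saved response without touching AWS.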

  9. When the pipeline installation is complete, you can access RStudio Connect and RStudio Package Manager using the following URLs, where r53_base_domain and instance are parameters you passed into cdk.json:
    1. https://connect.<instance>.<r53_base_domain>
    2. https://package.<instance>.<r53_base_domain>
  10. You can use Amazon ECS Exec to log in to both RStudio Connect and RStudio Package Manager containers. Follow the readme for instructions.
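The URL pattern above can be sketched in a few lines of Python, which may help when scripting checks against the deployed services. The function name and the `dev` and `example.com` values are hypothetical placeholders; substitute the instance and r53_base_domain values you passed into cdk.json.

```python
def service_urls(instance: str, r53_base_domain: str) -> dict:
    """Build the RStudio Connect and Package Manager URLs from the two parameters."""
    host_suffix = f"{instance}.{r53_base_domain}"
    return {
        "connect": f"https://connect.{host_suffix}",
        "package": f"https://package.{host_suffix}",
    }


if __name__ == "__main__":
    # Placeholder values, not taken from any real deployment.
    for name, url in service_urls("dev", "example.com").items():
        print(f"{name}: {url}")
```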

Manage packages with RStudio Package Manager

RStudio Package Manager helps enforce consistency and standardization of R packages across an organization. In RStudio Package Manager, an IT administrator can add approved packages to a repository, and can create multiple groups with access to different packages or package versions. RStudio Package Manager also handles the updating and versioning of the packages. The administrator can enable automatic updates, or configure RStudio Package Manager so that packages are updated only manually, which provides more isolation between RStudio Package Manager and the CRAN service.

Configure RStudio Package Manager

We can create a repository that pulls packages from the RStudio CRAN service by using the following commands. To run them, we first need to open a shell in the RStudio Package Manager container using Amazon ECS Exec.

# Initiate a sync
rspm sync --wait 
# Create a repository:
rspm create repo --name=dev-cran --description='Access CRAN packages'
# Subscribe the repository to the cran source
rspm subscribe --repo=dev-cran --source=cran 

The commands create a repository and subscribe it to the built-in source named cran. When this is complete, the dev-cran repository is available in the web interface of RStudio Package Manager, as shown in the following screenshot. This web interface is accessible by the administrator as well as the users who have the URL for it.

In addition to serving CRAN packages, repositories can be created to distribute local packages, Git packages, local packages along with CRAN packages, a subset of approved CRAN and local packages, and bleeding edge packages from GitHub. For further details on how to create repositories, see Serving CRAN Packages. In addition, RStudio Package Manager supports Bioconductor. Bioconductor is a commonly used ecosystem of R packages in life sciences. We can combine Bioconductor packages with CRAN as well as local packages in RStudio Package Manager.

RStudio Package Manager package versions

In the web interface of RStudio Package Manager, on the Setup tab, you can choose a repository by date in a calendar view. You can also choose whether to use the latest version of the packages, or freeze the packages to a particular snapshot, as shown in the following screenshot.

On the Setup tab, we can also see what system prerequisites might be needed for the repository’s packages, along with the commands to install them.

Configure an RStudio on SageMaker domain to use RStudio Connect and RStudio Package Manager

When creating a SageMaker domain with RStudio, you have an option to set a default RStudio Connect server and RStudio Package Manager repository for all users in your SageMaker domain. During the SageMaker domain creation process, as detailed in the Create a SageMaker domain with RStudio section in Getting Started with RStudio on Amazon SageMaker, you can configure default RStudio Connect and RStudio Package Manager URLs for all user profiles in Step 3: RStudio settings. For RStudio Connect, enter the RStudio Connect server URL. For RStudio Package Manager, enter a CRAN or a Bioconductor repository.

The default URLs are configured and saved in /etc/rstudio/rsession.conf for all users on RStudio on SageMaker. You can verify the default repository in the R console with options('repos'). You should see a repository pointing to your RStudio Package Manager. As for the default RStudio Connect URL, it’s automatically populated when you one-click publish a piece of R content.

Update a repository from RStudio Package Manager in an R session

If you already have a working RStudio on SageMaker and want to use a different repository, you can configure your R session in RStudio on SageMaker to use a repository from your RStudio Package Manager with the following steps:

  1. In an R Session, on the Tools menu, choose Global Options.
  2. Choose Packages and then choose Change.
  3. In the Custom field, enter the URL for the selected repository (found on the Setup tab of the RStudio Package Manager web interface), and choose OK.
  4. Choose OK again, and we’re done!

Now, the packages that we install in RStudio are sourced from the selected repository on your RStudio Package Manager server. You can verify this with options('repos'), or by installing a package and checking which URL it is downloaded from. For more details, see Checking For Success.
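You can also check from outside R what a repository serves. CRAN-style repositories publish a plain-text index of source packages at src/contrib/PACKAGES under the repository URL. The following is a minimal sketch that downloads and parses that index. The helper names and the example repository URL are hypothetical; use the URL shown on the Setup tab of your RStudio Package Manager web interface.

```python
from urllib.request import urlopen


def packages_index_url(repo_url: str) -> str:
    """Return the location of the source package index of a CRAN-style repository."""
    return repo_url.rstrip("/") + "/src/contrib/PACKAGES"


def parse_packages_index(text: str) -> dict:
    """Parse the index (blank-line-separated 'Field: value' records) into {package: version}."""
    versions = {}
    for record in text.strip().split("\n\n"):
        fields = {}
        for line in record.splitlines():
            # Lines that start with whitespace continue the previous field; skip them.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields:
            versions[fields["Package"]] = fields.get("Version", "")
    return versions


def fetch_index(repo_url: str) -> dict:
    """Download and parse the index; requires network access to the repository."""
    with urlopen(packages_index_url(repo_url)) as response:
        return parse_packages_index(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical URL for the dev-cran repository created earlier.
    repo = "https://package.dev.example.com/dev-cran/latest"
    index = fetch_index(repo)
    print(len(index), "packages; ggplot2 version:", index.get("ggplot2"))
```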

Update RStudio Connect account in an R session

If you already have a working RStudio on SageMaker and want to use a different RStudio Connect server than the default, complete the following steps:

  1. On the Tools menu, choose Global Options.
  2. Choose Publishing.
  3. Choose Connect.
  4. Choose RStudio Connect.
  5. Enter your server public URL, for example, https://xxxx.rstudioconnect.com, and choose Next.

A new page appears to ask you to log in with an account if this is the first time.

  6. Choose Connect to proceed.
  7. Choose Connect Account in the dialog in RStudio.

You should see your RStudio Connect user profile and server URL in the list.

  8. Choose Apply then OK.

For more information, see Connect your RStudio Account, and Connecting: RStudio IDE.

Now the RStudio Connect server is successfully connected to RStudio on Amazon SageMaker. We’re ready to build some great content and publish it.

Build ML content in RStudio on Amazon SageMaker

You can easily create an analysis within RStudio on Amazon SageMaker and push-button publish it to your RStudio Connect server so that your collaborators can consume your analysis. For this post, we use a UCI breast cancer dataset from mlbench to walk through two common kinds of published content: an R Markdown document and a Shiny app.

R Markdown

R Markdown is a great tool for running your analyses in R as part of a markdown file and sharing the result in RStudio Connect. In rsconnect_rmarkdown/breast_cancer_eda.Rmd, we perform two simple analyses with plots on the dataset, alongside narrative text in markdown:

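```{r breastcancer}
# mlbench provides the BreastCancer dataset (loaded here so the chunk runs on its own)
library(mlbench)
data(BreastCancer)
df <- BreastCancer
# convert the nine predictor columns (2 through 10), stored as factors, to numeric
for (i in 2:10) {
  df[, i] <- as.numeric(as.character(df[, i]))
}
summary(df)
```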

```{r cl_thickness, echo=FALSE}
library(ggplot2)
# histogram of clump thickness, one panel per class (benign, malignant)
ggplot(df, aes(x = Cl.thickness)) +
  geom_histogram(color = "black", fill = "white", binwidth = 1) +
  facet_grid(Class ~ .)
```

We can preview the file by choosing Knit and publish it to RStudio Connect by choosing Publish.

Beyond R Markdown, you often build interactive applications or dashboards with Shiny. Let’s look at how we can publish Shiny apps from RStudio on Amazon SageMaker to RStudio Connect.

Shiny application

Shiny is an R package that makes it easy to create interactive web applications programmatically. Data scientists often use Shiny applications to share their analyses and models with stakeholders. In rsconnect_shiny/breast-cancer-app/, we develop an ML model in breast_cancer_modeling.r and create a web application that lets users interact with the data and the ML model.

To publish, open app.R and choose Publish. Select both app.R and breast_cancer_modeling.r to publish.

In the application, you can choose two features to visualize in the plot and select data points in the plot to see the actual data and the model’s predictions of whether they are benign or malignant cancer cases. By sliding the probability threshold, you can interact with the model and get different classification counts. You can see the dashboard in action in the following screenshot.
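To see why moving the threshold changes the counts, consider the following minimal sketch. It is not the app's code: it assumes the model outputs a probability of the malignant class and that a case is labeled malignant when that probability is at or above the threshold. The probabilities are made-up illustrative values.

```python
from collections import Counter
from typing import Iterable


def classification_counts(malignant_probs: Iterable[float], threshold: float) -> Counter:
    """Count predicted classes, labeling a case malignant when its probability >= threshold."""
    counts = Counter({"benign": 0, "malignant": 0})
    for p in malignant_probs:
        counts["malignant" if p >= threshold else "benign"] += 1
    return counts


if __name__ == "__main__":
    # Made-up probabilities for illustration only, not model output.
    probs = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]
    for threshold in (0.3, 0.5, 0.7):
        print(threshold, dict(classification_counts(probs, threshold)))
```

Raising the threshold moves borderline cases from the malignant count to the benign count, which is the effect the slider exposes.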

Conclusion

In this post, we showed you how to deploy RStudio Connect and RStudio Package Manager servers in AWS with an architecture based on AWS Fargate and Amazon ECS, using AWS CDK. With RStudio Connect and RStudio Package Manager running in the cloud, we showed you how to use them from RStudio on Amazon SageMaker. Then we demonstrated how to deploy R-based materials such as R Markdown and Shiny applications to the RStudio Connect instance based on a breast cancer prediction use case.

Having an RStudio Connect instance in the cloud not only enables your ML and data science teams to collaborate more effectively, but also makes sharing ML insights across stakeholders and business units much easier. This in turn promotes the use of ML in your organization for a better business outcome. With RStudio Package Manager, you can quickly and securely manage, serve, and install R packages from trusted sources to ensure project reproducibility.

You can learn more about RStudio on SageMaker from a data scientist’s perspective in the post Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists. You can also learn more about how to set up and administer RStudio on SageMaker in the post Getting started with RStudio on Amazon SageMaker. To learn more about Amazon SageMaker Studio, the first IDE for ML in the cloud, see Amazon SageMaker Studio.


About the Authors

Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of Amazon Machine Learning offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the great mother nature the region has to offer, such as the hiking trails, scenery kayaking in the SLU, and the sunset at the Shilshole Bay.

Chayan Panda is a Cloud Infrastructure Architect. He provides advisory services and thought leadership to AWS customers on robust solution design for cloud migrations, cloud infrastructure (security, network, DevOps), Greenfield platform implementations, big data/AI/ML, and serverless and database solutions. When he is not obsessing about customers, he enjoys a short run, music, a book, or travel with his family.

Farooq Sabir is a Senior AI/ML Specialist Solutions Architect. He helps customers solve their business problems using data science, machine learning, and artificial intelligence.