Field Notes: Accelerating Data Science with RStudio and Shiny Server on AWS Fargate
This post was updated November 18, 2021.
Data scientists continuously look for ways to accelerate time to value for analytics projects. RStudio Server is a popular Integrated Development Environment (IDE) for R, which is used to render analytics visualizations for faster decision making. These visualizations are traditionally hosted on legacy Unix servers along with Shiny Server to support analytics. In a previous blog post, we provided a solution architecture to run data science use cases for medium to large enterprises across industry verticals.
In this post, we describe and deliver the infrastructure code to run a secure, scalable, and highly available RStudio and Shiny Server installation on AWS. We use these services: AWS Fargate, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic File System (Amazon EFS), AWS DataSync, and Amazon Simple Storage Service (Amazon S3). We then demonstrate a data science use case in RStudio and create an application on Shiny. The use case involves pre-processing a dataset and training a machine learning model in RStudio. The goal is to build a Shiny application that surfaces breast cancer prediction insights against a set of parameters to users.
Overview of solution
We show how to deploy an open source RStudio Server and a Shiny Server in a serverless architecture from an automated deployment pipeline built with AWS Developer Tools. This is illustrated in the diagram that follows. The deployment follows best practices for an AWS multi-account strategy using AWS Organizations.
In the preceding architecture, a central development account hosts the development resources. From this account, the deployment pipeline creates AWS services for RStudio and Shiny along with the integrated services into another AWS account. Networking information is delegated from a central networking account and the data feed to RStudio comes in from a central data account.
Public URL Domain and Data Feed
The RStudio/Shiny deployment accounts obtain the networking information for the publicly resolvable domain from a central networking account. The data feed for the containers comes from a central data repository account. Users upload data to the S3 buckets in the central data account or configure an automated service like AWS Transfer Family to programmatically upload files. AWS DataSync transfers the uploaded files from Amazon S3 and stores the files on Amazon EFS mount points on the containers. Amazon EFS provides shared, persistent, and elastic storage for the containers.
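The architecture described later in this post also includes an AWS Lambda function that starts a DataSync transfer on demand. The following is a minimal sketch of what such a handler could look like, not the code shipped in the repository. It assumes the task already exists and that its ARN is passed in an environment variable named DATASYNC_TASK_ARN (a hypothetical name). The client is injectable so the logic can be exercised without AWS access.

```python
import os
from typing import Any, Optional


def start_transfer(datasync_client: Any, task_arn: str) -> str:
    """Start one execution of an existing DataSync task and return its ARN."""
    response = datasync_client.start_task_execution(TaskArn=task_arn)
    return response["TaskExecutionArn"]


def lambda_handler(event: dict, context: Any,
                   datasync_client: Optional[Any] = None) -> dict:
    """Lambda entry point: trigger the S3 to EFS transfer outside the schedule."""
    if datasync_client is None:
        # Imported lazily so the module can be loaded without boto3 installed.
        import boto3
        datasync_client = boto3.client("datasync")
    # DATASYNC_TASK_ARN is a hypothetical variable name chosen for this sketch.
    task_arn = os.environ["DATASYNC_TASK_ARN"]
    return {"task_execution_arn": start_transfer(datasync_client, task_arn)}
```

Passing the client as an argument keeps the handler easy to test with a stub. In Lambda itself the argument is omitted and a real boto3 client is created.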
We recommend that you configure AWS Shield or AWS Shield Advanced for the networking account and enable Amazon GuardDuty in all accounts. You can also use AWS Config and AWS CloudTrail for monitoring and alerting on security events before deploying the infrastructure code. You should use an outbound filter such as AWS Network Firewall for network traffic destined for the internet. AWS Web Application Firewall (AWS WAF) protects the Amazon Elastic Load Balancers (Amazon ELB). You can restrict access to RStudio and Shiny from only allowed IP ranges using the automated pipeline.
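Allowed IP ranges are typically expressed as CIDR blocks. In this solution the restriction is enforced by the AWS services the pipeline configures, not by application code. The short sketch below only illustrates how matching a client address against a list of CIDR ranges works. The ranges shown are documentation-only example addresses, not values from the repository.

```python
import ipaddress
from typing import Iterable

# Example ranges only (reserved documentation address space).
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.7/32"]


def is_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

For example, is_allowed("203.0.113.25", ALLOWED_CIDRS) is true because the address lies inside the first range, while an address such as 192.0.2.1 matches neither range and is rejected.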
You deploy all AWS services in this architecture in one particular AWS Region. The AWS services used are managed services and configured for high availability. Should a service become unavailable, it automatically launches in the same Availability Zone (AZ) or in a different AZ within the same AWS Region. This means if Amazon ECS restarts the container in another AZ, following a failover, the files and data for the container will not be lost as these are stored on Amazon EFS.
The infrastructure code provided in this blog creates all resources described in the preceding architecture. The following numbered items refer to Figure 1.
1. We used AWS Cloud Development Kit (AWS CDK) for Python to develop the infrastructure code and stored the code in an AWS CodeCommit repository.
2. AWS CodePipeline integrates the AWS CDK stacks for automated builds. The stacks are divided into four different stages and are organized by AWS service.
3. AWS CodePipeline fetches the container images from public Docker Hub and stores the images into Amazon Elastic Container Registry (Amazon ECR) repositories for cross-account access. The deployment pipeline accesses these images to create the Amazon ECS container on AWS Fargate in the deployment accounts.
4. The build script uses a key from AWS Key Management Service (AWS KMS) to create secrets for the RStudio front-end password in AWS Secrets Manager.
5. The central networking account Amazon Route 53 has the pre-configured base public domain. This is done outside the automated pipeline and the base domain info is passed on as a parameter to the deployment pipeline.
6. The central networking account delegates the base public domain to the RStudio deployment accounts via AWS Systems Manager (SSM) Parameter Store.
7. An AWS Lambda function retrieves the delegated Route 53 zone for configuring the RStudio and Shiny sub-domains.
8. AWS Certificate Manager configures encryption in transit by applying HTTPS certificates on the RStudio and Shiny sub-domains.
9. The pipeline configures an Amazon ECS cluster to control the RStudio and Shiny containers and to scale up and down the number of containers as needed.
10. The pipeline creates the RStudio container for the instance in a private subnet. The RStudio container is not horizontally scalable for the Open Source version of RStudio.
– You can also create one RStudio container for each data scientist, depending on your compute requirements. To create multiple RStudio containers for data scientists, specify the number of RStudio containers you need in cdk.json. You can also control the container memory/vCPU using cdk.json.
– Further details are provided in the readme. If your compute requirements exceed Fargate container compute limits, consider using EC2 launch type of Amazon ECS which offers a range of Amazon EC2 servers to fit your compute requirement. You can specify your installation type in cdk.json and choose either Fargate or EC2 launch type for your RStudio containers. For the EC2 launch type, the autoscaling group is configured with multiple EC2 servers and an Amazon ECS Capacity Provider.
11. Shiny containers are horizontally scalable and the pipeline creates the Shiny containers in the private subnet using Fargate launch type of Amazon ECS. Shiny containers are configured to scale depending on the number of requests, memory and CPU usage.
12. Application Load Balancers route traffic to the containers and perform health checks. The pipeline registers the RStudio and Shiny load balancers with the respective Amazon ECS services.
13. AWS WAF rules are built to provide additional security to RStudio and Shiny endpoints.
14. Users upload files to be analysed to a central data lake account either with manual S3 upload or programmatically using AWS Transfer for SFTP.
15. AWS DataSync transfers files from Amazon S3 to cross-account Amazon EFS on an hourly interval schedule.
16. An AWS Lambda function initiates a DataSync transfer on demand, outside of the hourly schedule, for files that require urgent analysis. We expect the bulk of the data transfer to happen on the hourly schedule; use the on-demand trigger only when necessary.
17. Amazon EFS file systems provide shared, persistent and elastic storage for the containers. This is to facilitate the deployment of Shiny Apps from RStudio containers using a shared file system. The EFS file systems will live through container recycles.
18. You can create Amazon Athena tables on the central data account S3 buckets for direct interaction using JDBC from the RStudio container. Access keys for cross account operation are not stored in the RStudio container R environment.
Note: It is recommended that you implement short term credential vending for this operation.
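One common way to vend short-term credentials is to assume an IAM role through AWS Security Token Service (AWS STS). The sketch below is an illustration under assumptions, not the repository's implementation. The role ARN and session name are supplied by the caller and are hypothetical, and the STS client is passed in (for example, boto3.client("sts")).

```python
from typing import Any, Dict


def vend_short_term_credentials(sts_client: Any, role_arn: str,
                                session_name: str,
                                duration_seconds: int = 900) -> Dict[str, str]:
    """Assume a role and return temporary credentials as client keyword arguments.

    900 seconds is the minimum session duration that STS accepts.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,
    )
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

The returned dictionary can be unpacked into a boto3 client or session constructor. Because the credentials expire, nothing long-lived needs to be stored in the RStudio container environment.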
The source code for this deployment can be found in the aws-samples GitHub repository.
To deploy the AWS CDK stacks from the source code, you need to review and perform the prerequisites described in the accompanying GitHub repository to make sure you have the necessary resources to proceed.
1. Access to four AWS accounts (minimum three) for a basic multi-account deployment.
2. Permission to deploy all AWS services mentioned in the solution overview.
3. Review RStudio and Shiny Open Source Licensing: AGPL v3 (https://www.gnu.org/licenses/agpl-3.0-standalone.html)
4. Basic knowledge of R, RStudio Server, Shiny Server, Linux, AWS Developer Tools (AWS CDK in Python, AWS CodePipeline, AWS CodeCommit), AWS CLI, and the AWS services mentioned in the solution overview.
5. Ensure you have a Docker Hub login account. Otherwise, you might get the following error while the pipeline pulls the container images from Docker Hub: "You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limits."
6. Review the readmes delivered with the code and ensure you understand how the parameters in cdk.json control the deployment and how to prepare your environment to deploy the cdk stacks via the pipeline detailed below.
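AWS CDK reads deployment parameters from the context section of cdk.json. Before running the pipeline, it can help to confirm that the parameters you rely on are actually set. The following sketch assumes the parameters live under the standard "context" key. The names instance and sns_email are the ones mentioned in this post; check the readme for the full list, which this sketch does not attempt to reproduce.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, List


def load_context(cdk_json_path: str) -> Dict[str, Any]:
    """Return the 'context' mapping from a cdk.json file (empty if absent)."""
    with Path(cdk_json_path).open(encoding="utf-8") as handle:
        return json.load(handle).get("context", {})


def missing_parameters(context: Dict[str, Any],
                       required: Iterable[str]) -> List[str]:
    """List required parameter names that are absent or set to an empty value."""
    return [name for name in required if context.get(name) in (None, "")]
```

For example, missing_parameters(load_context("cdk.json"), ["instance", "sns_email"]) returns an empty list when both values are present.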
Launch the solution
1. Clone the GitHub repository, check out the main branch, and move into the aws-fargate-with-rstudio-open-source folder.
git clone -b main https://github.com/aws-samples/aws-fargate-with-rstudio-open-source.git
2. Create a CodeCommit repository to hold the source code for installation of RStudio Open Source/Shiny with the following command:
aws codecommit --profile <profile of AWS account> create-repository --repository-name <name of repository>
python3 -m pip install -r requirements.txt
5. Verify the email in Amazon SES for the sns_email parameter value in cdk.json. You will get a verification email in the address provided. Select Verify before proceeding with the next steps.
aws ses --profile <AWS CLI profile of the RStudio deployment account> verify-email-identity --email-address <sns_email in cdk.json>
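If you want to confirm that the verification completed before moving on, the Amazon SES API exposes the verification status of an identity. The sketch below assumes an SES client is passed in (for example, one created with boto3.client("ses") using the RStudio deployment account credentials). It is an optional check, not a step from the readme.

```python
from typing import Any


def is_email_verified(ses_client: Any, email_address: str) -> bool:
    """Return True once SES reports the email identity as verified."""
    response = ses_client.get_identity_verification_attributes(
        Identities=[email_address]
    )
    attributes = response["VerificationAttributes"].get(email_address, {})
    # SES reports "Pending" until the link in the verification email is used.
    return attributes.get("VerificationStatus") == "Success"
```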
6. Before committing the code into the CodeCommit repository, synthesize the AWS CDK stacks. This ensures all the necessary context values are populated into the cdk.context.json file and avoids the dummy values being mapped.
cdk synth --profile <AWS CLI profile of the central development account>
cdk synth --profile <AWS CLI profile of the central network account>
cdk synth --profile <AWS CLI profile of the central data account>
cdk synth --profile <AWS CLI profile of the RStudio deployment account>
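The four synth commands differ only in the profile, so they can be driven from a small script. This is a convenience sketch, assuming the cdk CLI is on your PATH and that you run it from the repository folder. The profile names in the usage note are placeholders for your own AWS CLI profiles.

```python
import subprocess
from typing import List, Sequence


def synth_commands(profiles: Sequence[str]) -> List[List[str]]:
    """Build one 'cdk synth' command per AWS CLI profile, in the given order."""
    return [["cdk", "synth", "--profile", profile] for profile in profiles]


def run_all(commands: Sequence[Sequence[str]]) -> None:
    """Run each command in turn and stop at the first failure."""
    for command in commands:
        subprocess.run(list(command), check=True)
```

A call such as run_all(synth_commands(["development", "network", "data", "rstudio"])) runs the same four commands shown above, with the placeholder names replaced by your profiles.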
7. Commit the changes into the CodeCommit repo you created. Follow Step 8 in the Installation Steps of the readme if you need help with the Git commands.
8. Deploy the AWS CDK stacks to install RStudio Open Source/Shiny using CodePipeline. This step takes around 40 minutes.
cdk deploy --profile <AWS CLI profile of the central development account>
9. Navigate to the CodePipeline console (the link takes you to the us-west-2 Region). Monitor the pipeline and confirm that the services are built successfully.
The pipeline name is Rstudio-Shiny-<instance>. From this point onwards, the pipeline is triggered on commits to the CodeCommit repository you created. There is no need to run cdk deploy (Step 8) anymore.
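As an alternative to watching the console, the pipeline state can be read through the CodePipeline API. The following sketch assumes a CodePipeline client is passed in (for example, boto3.client("codepipeline") in the account that hosts the pipeline). The label "NotStarted" is our own placeholder for stages that have no execution yet, not a value returned by the service.

```python
from typing import Any, Dict


def stage_statuses(codepipeline_client: Any, pipeline_name: str) -> Dict[str, str]:
    """Map each pipeline stage name to the status of its latest execution."""
    state = codepipeline_client.get_pipeline_state(name=pipeline_name)
    return {
        stage["stageName"]: stage.get("latestExecution", {}).get(
            "status", "NotStarted"  # placeholder label used by this sketch
        )
        for stage in state["stageStates"]
    }


def all_succeeded(statuses: Dict[str, str]) -> bool:
    """Return True only if there is at least one stage and every stage succeeded."""
    return bool(statuses) and all(s == "Succeeded" for s in statuses.values())
```

Calling stage_statuses with the pipeline name Rstudio-Shiny-<instance>, with <instance> replaced by your own value, gives a quick per-stage summary that you can poll until all_succeeded returns true.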
10. When the pipeline installation is complete, you can access RStudio Open Source and Shiny using the following URLs, where instance are parameters you passed into cdk.json. <number> stands for the container number. If you specified a number greater than one for number_of rstudio_containers in cdk.json, you will receive a corresponding URL for each of those numbers. You will get an email with password and URL details at the address you specified in sns_email in cdk.json.
Data Science use case
Now that the solution is launched, we can demonstrate a typical data science use case:
- Explore and pre-process a dataset, and train a machine learning model in RStudio.
- Build a Shiny application that makes predictions against the trained model to surface insights to dashboard users.
This showcases how to publish a Shiny application from RStudio containers to Shiny containers via a common EFS filesystem.
First, we log on to the RStudio container with the URL from the deployment and clone the accompanying repository using the command line terminal. The ML example is in the ml_example directory. We use the UCI Breast Cancer Wisconsin (Diagnostic) dataset from the mlbench library. Refer to ml_example/breast_cancer_modeling.r.
Let’s open the ml_example/breast_cancer_modeling.r script in the RStudio IDE. The script does the following:
- Install and import the required libraries, mainly caret, a popular machine learning library, and mlbench, a collection of ML datasets;
- Import the UCI breast cancer dataset and create an 80/20 split for training and testing (in the Shiny app) purposes;
- Perform preprocessing to impute the missing values (shown as NA) in the dataframe and standardize the numeric columns;
- Train a stochastic gradient boosting model with cross-validation with the area under the ROC curve (AUC) as the tuning metric;
- Save the testing split, preprocessing object, and trained model into the directory where the Shiny app script is located.
You can execute the whole script with this command in the console.
We can then inspect the model evaluation in the model object:
> gbmFit
Stochastic Gradient Boosting

560 samples
  9 predictor
  2 classes: 'benign', 'malignant'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 504, 505, 503, 504, 504, 504, ...
Resampling results across tuning parameters:

  interaction.depth  n.trees  ROC        Sens       Spec
  1                   50      0.9916391  0.9716967  0.9304474
  1                  100      0.9917702  0.9700676  0.9330789
  1                  150      0.9911656  0.9689790  0.9305000
  2                   50      0.9922102  0.9708859  0.9351316
  2                  100      0.9917640  0.9681682  0.9346053
  2                  150      0.9910501  0.9662613  0.9361842
  3                   50      0.9922109  0.9689865  0.9381316
  3                  100      0.9919198  0.9684384  0.9360789
  3                  150      0.9912103  0.9673348  0.9345263
If the results are as expected, move on to developing a dashboard and publishing the model for business users to consume the machine learning insights.
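The script saves the preprocessing object alongside the model because the Shiny app must transform the testing split with the statistics learned from the training split. The sketch below illustrates that idea in plain Python for the split and standardization steps only. It is an analogue for illustration, not a translation of the R script, and the function names are our own. It leaves out imputation and model training.

```python
import random
import statistics
from typing import List, Sequence, Tuple

Row = Sequence[float]


def split_80_20(rows: Sequence[Row], seed: int = 42) -> Tuple[List[Row], List[Row]]:
    """Shuffle the rows reproducibly and split them 80/20 into train and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def fit_standardizer(train: Sequence[Row]) -> Tuple[List[float], List[float]]:
    """Learn per-column mean and sample standard deviation from training rows."""
    columns = list(zip(*train))
    means = [statistics.fmean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.stdev(column) or 1.0 for column in columns]
    return means, stdevs


def standardize(rows: Sequence[Row], means: Sequence[float],
                stdevs: Sequence[float]) -> List[List[float]]:
    """Apply previously learned statistics to any rows (train or test)."""
    return [[(value - mean) / stdev
             for value, mean, stdev in zip(row, means, stdevs)]
            for row in rows]
```

The important design point is that fit_standardizer sees only the training rows. The same means and standard deviations are then reused for the testing rows, which mirrors why the preprocessing object is saved for the Shiny app.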
In the repository, ml_example/breast-cancer-prediction/app.R has a Shiny application that displays summary statistics and the distribution of the testing data, along with an interactive dashboard. This allows users to select data points on the chart and get the machine learning model inference as needed. Users can also modify the threshold to alter the specificity and sensitivity of the prediction. Thanks to the shared EFS filesystem across the RStudio and Shiny containers, we can publish the Shiny application by copying it to /srv/shiny-server/ with the following shell command:
$ cp -rfv ~/aws-fargate-with-rstudio-open-source/ml_example/breast-cancer-prediction/ /srv/shiny-server/
That’s it. The Shiny application is now on the Shiny containers, accessible from the Shiny URL and load balanced by the Application Load Balancer. You can slide over the Probability Threshold to test how it changes the total count in the prediction, change the variables for the scatter plot, and select data points to test the individual predictions.
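To see why moving the probability threshold trades sensitivity against specificity, consider the small sketch below. It is not taken from the Shiny app. It treats "malignant" as the positive class, which is an assumption of this example, and the probabilities and labels are made-up illustrative values.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(probabilities: Sequence[float],
                            is_malignant: Sequence[bool],
                            threshold: float) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) when predicting malignant at p >= threshold."""
    tp = fn = tn = fp = 0
    for probability, actual in zip(probabilities, is_malignant):
        predicted = probability >= threshold
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up predicted probabilities of malignancy and true labels.
PROBABILITIES = [0.95, 0.80, 0.40, 0.30, 0.10, 0.05]
IS_MALIGNANT = [True, True, True, False, False, False]
```

With these illustrative values, a threshold of 0.5 misses one malignant case but flags no benign ones, whereas lowering the threshold to 0.2 catches every malignant case at the cost of one false alarm.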
Please follow the readme in the repository to delete the stacks created.
In this blog, we demonstrated how to deploy a serverless architecture, walked through a data science use case in RStudio Server, and deployed an interactive dashboard in Shiny Server. The solution creates a scalable, secure, and serverless data science environment for the R community that accelerates the data science process. The infrastructure and data science code is available in the GitHub repository.