Scaling RStudio/Shiny using Serverless Architecture and AWS Fargate
Data scientists use RStudio Server as an Integrated Development Environment (IDE) to develop, publish, and share interactive web dashboards built on Shiny Server. Although it is possible to use virtual server infrastructure in the cloud to run R workloads, containerization offers significant operational benefits.
By migrating R workloads to a serverless model on AWS, customers benefit from managed infrastructure, scalability, and security. If you run RStudio on our proposed architecture, you can focus on building applications instead of managing the infrastructure, and you can scale to meet the needs of users accessing the dashboard visualizations.
This blog post discusses a scalable, secure, and serverless architecture pattern on AWS to host RStudio Server and Shiny App. The pattern follows best practices suggested in the AWS Well-Architected Framework. All the components in this architecture use serverless infrastructure: compute, storage, network, data transfer, orchestration, perimeter security, logging, and monitoring. You are only responsible for managing the containerized application.
This AWS architecture can be implemented on AWS Fargate using RStudio open source. Fargate is a serverless container service that provides compute capacity for Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS).
AWS Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation. It ensures that the infrastructure your containers run on is always up to date with the required patches.
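As a rough illustration of specifying resources per application, the sketch below builds the parameters for an ECS `register_task_definition` call targeting Fargate. The family name, image URI, and CPU/memory sizes are hypothetical placeholders, port 8787 is assumed to be the RStudio Server listening port, and a real task would also need an execution role, which is omitted here.

```python
import json


def fargate_task_definition(family, image, port, cpu="1024", memory="2048"):
    """Build register_task_definition parameters for a one-container Fargate task."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # Fargate tasks must use awsvpc networking
        "cpu": cpu,               # task-level CPU units (1024 = 1 vCPU)
        "memory": memory,         # task-level memory in MiB
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "essential": True,
                "portMappings": [{"containerPort": port, "protocol": "tcp"}],
            }
        ],
    }


# Hypothetical image URI; 8787 is assumed to be the RStudio Server port.
rstudio_task = fargate_task_definition(
    "rstudio", "<account>.dkr.ecr.<region>.amazonaws.com/rstudio:latest", 8787
)
print(json.dumps(rstudio_task, indent=2))
```

The resulting dictionary could be passed as keyword arguments to the boto3 ECS client's `register_task_definition`, or translated into the equivalent infrastructure-as-code construct.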
Numbered items refer to Figure 1.
- R users access RStudio Server and Shiny App via Amazon Route 53. Route 53 is a DNS service for incoming requests.
- Route 53 resolves incoming requests and forwards them to AWS WAF (Web Application Firewall) for security checks.
- Valid requests reach an Amazon Application Load Balancer (ALB), which forwards them to the Amazon Elastic Container Service (ECS) cluster.
- The cluster service controls the containers and is responsible for scaling up and down the number of instances as needed.
- Incoming requests are processed by RStudio server; users are authenticated and R sessions are spawned for valid requests. Shiny users are routed to the Shiny container.
- If the R session communicates with the public internet, outbound requests can be filtered through a proxy server and then sent to a NAT Gateway.
- The NAT Gateway sends outbound requests to the internet through an Internet Gateway. The route to the internet can also be configured through AWS Transit Gateway.
- The R users require data files to be transported onto the container. To facilitate this, files are transferred to Amazon Simple Storage Service (S3) using AWS Transfer Family or S3 upload.
- The uploaded files from S3 are synced to Amazon Elastic File System (EFS) by AWS DataSync.
- Amazon EFS provides the persistent file system required by RStudio Server. Because both containers share this file system, data scientists can easily deploy Shiny apps from their RStudio Server container to the Shiny Server container.
- RStudio can be integrated with S3 and R sessions can query Amazon Athena tables built on S3 data using a JDBC connection. Athena is a serverless interactive query service that analyses data in Amazon S3 using standard SQL.
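The last item describes R sessions querying Athena over JDBC. As a language-neutral illustration of what such a query involves, the sketch below uses the Athena API through boto3 instead of JDBC; this is a swapped-in technique, not the connection method described above. The table name, database name, and results bucket are hypothetical.

```python
def athena_query_params(sql, database, output_location):
    """Build the parameters for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(params, athena_client=None):
    """Submit the query and return its execution ID (needs boto3 and AWS credentials)."""
    if athena_client is None:
        import boto3  # imported lazily so the parameter builder works without boto3
        athena_client = boto3.client("athena")
    return athena_client.start_query_execution(**params)["QueryExecutionId"]


# Hypothetical table, database, and bucket names.
params = athena_query_params(
    "SELECT * FROM sales LIMIT 10", "analytics", "s3://example-athena-results/"
)
```

Athena runs queries asynchronously, so a caller would poll `get_query_execution` with the returned ID before reading results.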
You can use Amazon ECS Exec to log in to both RStudio Open Source and Shiny containers.
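A minimal sketch of opening a shell with ECS Exec: the helper below assembles the `aws ecs execute-command` invocation. The cluster name, task ID, and container name are hypothetical, and the sketch assumes the AWS CLI with the Session Manager plugin is installed and that ECS Exec is enabled on the service.

```python
import shlex


def ecs_exec_command(cluster, task_id, container, command="/bin/bash"):
    """Return the AWS CLI argument list for an interactive ECS Exec session."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task_id,
        "--container", container,
        "--interactive",
        "--command", command,
    ]


# Hypothetical identifiers for illustration only.
cmd = ecs_exec_command("rstudio-cluster", "0123456789abcdef", "rstudio")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd)` from a terminal; the same helper works for the Shiny container by changing the `container` argument.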
High Availability: RStudio Server and Shiny App run across two Availability Zones (AZs). The ALB distributes traffic to the containers and performs health checks on the RStudio services. When a container fails its health checks, ECS destroys the unhealthy target and brings up a fresh container to maintain the desired number of running containers.
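The sketch below shows one way the high-availability setup could be expressed as parameters for the ECS `create_service` call: subnets in two Availability Zones, an ALB target group, and a desired task count that ECS maintains. All names, IDs, and the grace period are hypothetical, and the two-subnet check is an assumption of this sketch rather than an ECS requirement.

```python
def fargate_service_params(cluster, name, task_definition, subnets,
                           security_groups, target_group_arn,
                           container_port, desired_count=1):
    """Build create_service parameters for a Fargate service behind an ALB."""
    if len(subnets) < 2:
        # Sketch-level guard: expect one private subnet per Availability Zone.
        raise ValueError("provide subnets in at least two Availability Zones")
    return {
        "cluster": cluster,
        "serviceName": name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,  # ECS replaces tasks to keep this count
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",  # tasks stay in private subnets
            }
        },
        "loadBalancers": [
            {
                "targetGroupArn": target_group_arn,
                "containerName": name,
                "containerPort": container_port,
            }
        ],
        # Give the container time to start before ALB health checks count.
        "healthCheckGracePeriodSeconds": 120,
    }


# Hypothetical identifiers for illustration only.
service = fargate_service_params(
    "rstudio-cluster", "rstudio", "rstudio:1",
    ["subnet-aaaa1111", "subnet-bbbb2222"], ["sg-0123abcd"],
    "arn:aws:elasticloadbalancing:region:account:targetgroup/rstudio/abc",
    8787,
)
```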
Perimeter Security: The ALB is protected by AWS WAF. The ALB listener accepts incoming requests over HTTPS using a certificate that is issued and validated by AWS Certificate Manager.
Amazon GuardDuty detects threats in the deployment account within AWS Organizations, and AWS Shield provides DDoS protection for the Route 53 resources. You can also subscribe to AWS Shield Advanced for a higher level of protection against attacks targeting your deployment.
Private Communication: If all communications amongst AWS services must stay within AWS, you can use AWS PrivateLink to configure VPC endpoints for AWS services. PrivateLink makes sure that interservice traffic is not exposed to the internet for AWS service endpoints.
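As an illustration, the sketch below builds request parameters for the EC2 `create_vpc_endpoint` call so that image pulls and logging stay on the AWS network. The choice of services (ECR, CloudWatch Logs, and an S3 gateway endpoint) is an assumption about what a Fargate deployment in private subnets typically needs, and all IDs are hypothetical.

```python
def vpc_endpoint_requests(region, vpc_id, subnet_ids, security_group_ids,
                          route_table_ids):
    """Build create_vpc_endpoint parameter sets for a private Fargate setup."""
    interface_services = ["ecr.api", "ecr.dkr", "logs"]
    requests = [
        {
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "SubnetIds": subnet_ids,
            "SecurityGroupIds": security_group_ids,
            "PrivateDnsEnabled": True,
        }
        for service in interface_services
    ]
    # S3 is reached through a gateway endpoint attached to route tables.
    requests.append(
        {
            "VpcEndpointType": "Gateway",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.s3",
            "RouteTableIds": route_table_ids,
        }
    )
    return requests


# Hypothetical identifiers for illustration only.
endpoints = vpc_endpoint_requests(
    "us-east-1", "vpc-0123abcd", ["subnet-aaaa1111", "subnet-bbbb2222"],
    ["sg-0123abcd"], ["rtb-0123abcd"],
)
for request in endpoints:
    print(request["VpcEndpointType"], request["ServiceName"])
```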
Vulnerability Scan: Container images should be stored and fetched from Amazon Elastic Container Registry (ECR) or a container repository with vulnerability scan enabled. Vulnerability issues should be addressed before deploying the images.
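One way to enforce this before deployment is a small gate over the response of ECR's `describe_image_scan_findings` call. The sketch below is an assumption about how such a gate might look: the blocking severities are a hypothetical policy, and the sample response is hand-written to mirror the shape of the API response rather than taken from a real scan.

```python
BLOCKING_SEVERITIES = ("CRITICAL", "HIGH")  # hypothetical policy choice


def blocking_findings(scan_response, blocking=BLOCKING_SEVERITIES):
    """Return the severity counts that should block a deployment."""
    counts = scan_response.get("imageScanFindings", {}).get(
        "findingSeverityCounts", {}
    )
    return {sev: counts[sev] for sev in blocking if counts.get(sev, 0) > 0}


def is_deployable(scan_response):
    """Deploy only when the scan finished and found nothing blocking."""
    status = scan_response.get("imageScanStatus", {}).get("status")
    return status == "COMPLETE" and not blocking_findings(scan_response)


# Hand-written sample shaped like a describe_image_scan_findings response.
sample = {
    "imageScanStatus": {"status": "COMPLETE"},
    "imageScanFindings": {"findingSeverityCounts": {"HIGH": 2, "LOW": 5}},
}
print(blocking_findings(sample))  # {'HIGH': 2}
print(is_deployable(sample))      # False
```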
Persistent Storage: RStudio and Shiny need files presented on a disk in the containers. Amazon EFS provides this persistent storage. Amazon EFS is an NFS file system that stores data in multiple Availability Zones in an AWS Region for data durability and high availability.
Data on EFS is encrypted at rest using AWS Key Management Service (KMS).
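A minimal sketch of how the EFS file system could be attached to a Fargate task: the helpers below build the `volumes` and `mountPoints` fragments of an ECS task definition. The file system ID, access point ID, and container paths are hypothetical. Encryption at rest is configured on the file system itself; the fragment here additionally enables encryption in transit for the mount.

```python
def efs_volume(name, file_system_id, access_point_id=None):
    """Build a task-definition volume entry backed by Amazon EFS."""
    config = {
        "fileSystemId": file_system_id,
        "transitEncryption": "ENABLED",  # encrypt NFS traffic to EFS
    }
    if access_point_id:
        config["authorizationConfig"] = {
            "accessPointId": access_point_id,
            "iam": "ENABLED",  # authorize the mount with the task IAM role
        }
    return {"name": name, "efsVolumeConfiguration": config}


def mount_point(volume_name, container_path, read_only=False):
    """Build a container-definition mountPoints entry for a named volume."""
    return {
        "sourceVolume": volume_name,
        "containerPath": container_path,
        "readOnly": read_only,
    }


# Hypothetical IDs and paths: RStudio writes apps, Shiny reads them.
shared = efs_volume("shared-home", "fs-0123abcd", "fsap-0123abcd")
rstudio_mount = mount_point("shared-home", "/home")
shiny_mount = mount_point("shared-home", "/srv/shiny-server", read_only=True)
```

Mounting the same volume into both containers is what lets an app saved from RStudio appear in the Shiny container without a separate deployment step.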
Serverless Data Transfer: AWS DataSync is an agentless service that syncs EFS with S3. RStudio users can upload data directly to S3 or transfer it with AWS Transfer Family, a serverless SFTP service. Files uploaded to S3 are copied to EFS in an event-driven architecture: an S3 file upload event can trigger an AWS Lambda function that starts the copy, or the copy can be scheduled with DataSync.
The event orchestration can also be synchronized by Amazon EventBridge – a scalable serverless event bus for applications to react in real time.
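The event-driven copy could look roughly like the Lambda handler below, which starts a DataSync task execution when it receives an S3 upload event. This is a sketch under assumptions: the `DATASYNC_TASK_ARN` environment variable is a hypothetical name, the DataSync task (S3 source, EFS destination) is presumed to exist already, and the optional `datasync_client` argument is there only to make the handler testable.

```python
import os


def handler(event, context, datasync_client=None):
    """Start a DataSync task execution when S3 reports uploaded objects."""
    if datasync_client is None:
        import boto3  # available by default in the Lambda Python runtime
        datasync_client = boto3.client("datasync")

    # Collect uploaded object keys from the S3 event records.
    keys = [
        record["s3"]["object"]["key"]
        for record in event.get("Records", [])
        if record.get("eventSource") == "aws:s3"
    ]
    if not keys:
        return {"started": False, "keys": []}

    # DATASYNC_TASK_ARN is a hypothetical environment variable name.
    response = datasync_client.start_task_execution(
        TaskArn=os.environ["DATASYNC_TASK_ARN"]
    )
    return {
        "started": True,
        "keys": keys,
        "taskExecutionArn": response["TaskExecutionArn"],
    }
```

Starting one task execution per upload is simple but can queue many executions during bulk uploads, which is one reason a scheduled DataSync task may be preferable for large batches.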
Scalability: RStudio Server open source can run only one container per Fargate service, so in this architecture you add capacity by provisioning a separate container for each data scientist, each on its own URL. An alternative is to use the EC2 launch type of ECS, which offers many categories of virtual servers so you can scale vertically depending on your use case.
Shiny apps are deployed onto separate containers that support automatic scaling to handle the traffic coming to the interactive dashboard. Customers using RStudio Professional can scale horizontally using Amazon EKS, Cluster Autoscaler, and job launcher. RStudio Professional requires privileged containers, which can run on EC2 launch types, but not on Fargate.
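As a sketch of automatic scaling for the Shiny service, the helper below builds parameters for the Application Auto Scaling `register_scalable_target` and `put_scaling_policy` calls using target tracking on average CPU. The cluster and service names, capacity bounds, CPU target, and cooldowns are hypothetical values chosen for illustration.

```python
def shiny_scaling_params(cluster, service, min_tasks=1, max_tasks=4,
                         target_cpu=60.0):
    """Build Application Auto Scaling parameters for an ECS service."""
    resource_id = f"service/{cluster}/{service}"
    dimension = "ecs:service:DesiredCount"
    target = {
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }
    policy = {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,  # keep average CPU near this percent
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": 60,   # seconds before another scale-out
            "ScaleInCooldown": 300,   # scale in more conservatively
        },
    }
    return target, policy


# Hypothetical cluster and service names.
scalable_target, scaling_policy = shiny_scaling_params("rstudio-cluster", "shiny")
```

Both dictionaries can be passed as keyword arguments to a boto3 `application-autoscaling` client, registering the target first and then attaching the policy.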
Logging and Monitoring: Logging and monitoring infrastructure is provided by AWS CloudTrail and Amazon CloudWatch. All logs can be aggregated into a central audit account in an AWS Control Tower Landing Zone environment.
Backup: Files created on the RStudio container EFS mount are automatically backed up by EFS. Source code files can be checked in and out of Git repositories such as AWS CodeCommit directly from RStudio Server.
User Management: RStudio frontend users are local Linux users. RStudio open source does not provide a federated authentication/authorization mechanism. RStudio Professional provides SAML federated access to RStudio Server and RStudio Connect for authentication of Shiny app using Amazon Cognito.
Following up on this blog, we’ll publish AWS CDK templates automated with AWS CodePipeline. We’ll take you through the technical steps for automated deployment of the entire solution detailed in this architecture. Read Field Notes: Accelerating Data Science with RStudio and Shiny Server on AWS Fargate which shows infrastructure code you can use to run a secure, scalable and highly available RStudio and Shiny Server installation on AWS.
This post discussed a serverless architecture that addresses common challenges of hosting RStudio and Shiny servers. It follows best practices from the AWS Well-Architected Framework. This architecture provides data science teams a secure, scalable, and highly available environment while reducing infrastructure management overhead.