AWS for Industries
Deploying a Statistical Compute Environment using R on Amazon EKS
This blog post describes how to deploy a statistical compute environment (SCE) on Amazon Elastic Kubernetes Service (Amazon EKS) using a single-command deployment package. The solution uses Posit Workbench, Posit Connect and Posit Package Manager to give customers flexibility, and draws on open-source communities for statistical compute use cases, such as pharmaverse.
As clinical trials become more complex, both in terms of therapeutic areas and treatment modalities, pharma customers are looking for new ways to do biostatistics flexibly, using the power of the cloud and innovations from open-source communities.
Many customers currently use third-party vendor offerings for this purpose. In parallel, an open-source community has formed around using R as an alternative to vendor products. Pharma customers have collaborated with technology companies to create a library of open-source modules that cover common, repeatable tasks on clinical trial data. This library is called pharmaverse (https://pharmaverse.org/), which describes itself as:
“A connected network of companies and individuals working to promote collaborative development of curated open-source R packages for clinical reporting usage in pharma, in a space where previously we would only ever have worked in silos on our own closed source and often duplicative solutions. Adopting shared solutions in this post-competitive space should ultimately ease regulatory review, resulting in bringing new treatments to patients fast.” (Pharmaverse Organization)
Use Cases for Statistical Compute in general
A statistical computing environment simplifies access to clinical trial data throughout the drug development lifecycle. It enables the creation of analysis datasets, tabulation datasets, tables, listings, graphs, and submission components, and helps ensure that clinical trial deliverables comply with regulatory requirements (GxP). By providing a language-agnostic workspace, a statistical computing environment lets programmers, statisticians, and data scientists perform modeling and simulations in a variety of programming languages, which streamlines data analysis and supports informed decision-making in clinical research.
What is a Statistical Compute Environment
To implement an environment where statistical compute can be done, customers can choose from a broad range of commercial products and open-source packages. Open-source packages for the R programming language are popular with customers because they provide a high degree of flexibility and agility. Structurally, an SCE provides a way to store your raw data and results, your compute code, and the parameters and settings for your environment. It consists of an Integrated Development Environment (IDE), a dashboarding and visualization tool, and a package manager that helps you add pre-packaged R-based solutions for your industry domain. Biostatistics is the core of the overall clinical trial process: this is where the analysis is performed to see whether a particular trial is meeting its objectives, whether the therapy has the right safety profile, and more.
Posit Package Manager
Posit Package Manager helps organize and centralize R packages across teams and organizations. As data scientists develop their artifacts, they need various packages with different capabilities for their use cases in Posit. Managing the sources and versions of these packages and numerous public repositories manually for enterprise users is prone to errors and is also time-consuming. Posit Package Manager mitigates these issues by managing the package repository centrally for your organization so that data scientists can install packages quickly and securely, and ensure project reproducibility and repeatability.
Overview of solution
The provided solution consists of a complete Statistical Compute Environment (SCE) for R. It uses three products from Posit: Posit Workbench, Posit Connect and Posit Package Manager. It is hosted on managed services that provide automation and scaling to customers. The Posit products are installed on Amazon Elastic Kubernetes Service (Amazon EKS). A shared file system is provided by Amazon Elastic File System (Amazon EFS). The environment stores its configuration and metadata in Amazon Relational Database Service (Amazon RDS), using Amazon RDS for PostgreSQL. The entire setup is fronted by an Application Load Balancer (ALB). Security is applied at all levels: each component uses its own security group, and only the endpoints are exposed to the user. The solution uses the AWS Cloud Development Kit (AWS CDK) to deploy infrastructure and Helm to install and configure the Posit products.
Figure 1: Architectural diagram provides a high-level overview of the SCE infrastructure.
Walkthrough
The provided solution consists of six parts:
1. Users access the Statistical Compute Environment by pointing their browser to an internal domain (domain resolution not shown). The SSL certificate is stored in AWS Certificate Manager.
2. Requests reach the Application Load Balancer (ALB). The ALB serves HTTPS using the SSL certificate held in AWS Certificate Manager, then forwards each request to the matching Posit product based on the URL path (see the Deployment section).
3. Amazon EKS controls the containers in a cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances (EC2 launch type) in an Auto Scaling Group and is responsible for scaling up and down the number of containers as needed.
4. Amazon RDS for PostgreSQL databases hold the state for both Posit Connect and Posit Package Manager, so that each can run as multiple containers for high availability. Data on Amazon RDS is encrypted at rest using AWS Key Management Service (AWS KMS). Database passwords are stored in AWS Secrets Manager.
5. Posit Connect, Posit Package Manager and Posit Workbench have access to private shared volumes hosted by Amazon Elastic File System (Amazon EFS), which provides the persistent file system they require. Data on Amazon EFS is encrypted at rest using AWS KMS. Amazon EFS is an NFS file system that stores data in multiple Availability Zones in an AWS Region for data durability and high availability. Files created on the EFS mounts of the Workbench (RStudio), Connect and Package Manager containers are automatically backed up.
6. If a user session or Package Manager communicates with the public internet, outbound requests are sent from the private container subnet to a NAT gateway. The NAT gateway forwards them to the internet through an internet gateway.
Benefits from this solution
Customers benefit from the features of AWS managed services such as Amazon EKS, Amazon EFS and Amazon RDS. For example, Amazon EKS auto scaling accommodates team sizes that change over time. Overall, the minimal management overhead helps customers focus on creating value for the organization without distractions from infrastructure management.
Prerequisites
Download or clone the GitHub repository associated with this solution to your local file system and unpack it if required. Open a terminal and change to the directory of the downloaded package. Open the README.md file and follow the instructions and commands.
You need the following:
- AWS Command Line Interface (CLI) installed and configured with appropriate permissions.
- An AWS account with sufficient permissions to create and manage EKS clusters, EC2 instances, and related resources.
- Homebrew, the Kubernetes CLI (kubectl), Helm and Node installed.
- [Windows Only] An installed and configured WSL (Windows Subsystem for Linux).
- Valid Posit Licenses for Posit Workbench, Connect and Package Manager.
Deployment
In your directory, create or open the .env file using vi or any other text editor, add your Posit license keys, and save the file.
PWB_LICENSE=xxx
PCO_LICENSE=xxx
PPM_LICENSE=xxx
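Before deploying, it can help to confirm that all three keys are present and non-empty. The following is a minimal sketch rather than part of the deployment package: the function name `missing_license_keys` is hypothetical, and it only checks that each key from the example above has some value. It cannot tell whether a license is valid.

```shell
# Hypothetical helper: print every required key that is absent from,
# or empty in, the given env file. Key names come from the .env example.
missing_license_keys() {
  env_file=$1
  for key in PWB_LICENSE PCO_LICENSE PPM_LICENSE; do
    # A key counts as set only if at least one character follows '='.
    if ! grep -Eq "^${key}=.+" "$env_file" 2>/dev/null; then
      printf '%s\n' "$key"
    fi
  done
}

# Usage (prints nothing when all three keys are set):
#   missing_license_keys .env
```

An unreadable or missing file reports all three keys, because every lookup fails.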
Then install the required utilities and dependencies.
brew install awscli
brew install kubernetes-cli
brew install helm
brew install node
brew install aws-cdk
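After the installs finish, a quick check that each tool is on your PATH can save a failed deployment later. This sketch assumes the usual binary names for these packages (`aws`, `kubectl`, `helm`, `node`, `npm`, `cdk`). The helper `missing_tools` is a hypothetical name, not something shipped with the solution.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Binaries expected from the brew installs above (assumed names).
missing=$(missing_tools aws kubectl helm node npm cdk)
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "All required tools found."
fi
```

The check only confirms that the binaries exist. It does not verify versions or AWS credentials.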
Next, run:
npm install
Once this completes, run the following in the same directory:
bash ./run.sh deploy
This guides you through the installation setup. Here you can choose your environment name, provide your own domain address, and choose whether to enable HTTPS for the installation. To enable HTTPS with a local certificate, run the installer with the options shown in the image below.
Figure 2: Posit console installer
In the first step, the basic infrastructure will be deployed using CDK:
- A dedicated VPC with four subnets (two public, two private)
- An EKS cluster
- An EFS shared file system with endpoints in the subnets
- An Amazon RDS for PostgreSQL database cluster
- Security groups and Auto Scaling groups
Once the EKS cluster is running, the installation switches to Helm and uses Helm charts to deploy and configure the following:
- Posit Workbench IDE
- Posit Connect
- Posit Package Manager
- Additional utilities for traffic management, file system access, and so on
- An Application Load Balancer with the corresponding forwarding rules to reach the three Posit products
- A self-signed HTTPS certificate, if that option was chosen
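While the Helm step runs, you can watch progress from a second terminal. The sketch below lists pods that are not yet in the Running or Completed state. It assumes that your kubectl context points at the new cluster. It looks across all namespaces because this post does not state which namespace the solution uses. The function name `pods_not_running` is hypothetical.

```shell
# Hypothetical helper: list pods that are neither Running nor Completed,
# printed as namespace/name. With --all-namespaces the default table
# columns are NAMESPACE NAME READY STATUS RESTARTS AGE, so STATUS is $4.
pods_not_running() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
}

# Usage (empty output means no pod is pending or failing):
#   pods_not_running
```

A pod in the Running state may still be starting its containers, so also check the READY column before you open the applications.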
After all packages have become active, either pick up the Application Load Balancer DNS name from the terminal output, or navigate to the EC2 view in the AWS Management Console, find the load balancer named posit-sce-alb, and copy its DNS name to your clipboard. The full installation takes about 45 minutes to complete.
Figure 3: Application Load Balancer DNS name in the AWS Management console
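The same DNS name can also be read with the AWS CLI. This is a sketch under one assumption: the load balancer's name is exactly posit-sce-alb, as the console label above suggests. If your deployment names it differently, pass that name as the first argument. The wrapper function `alb_dns_name` is hypothetical.

```shell
# Hypothetical wrapper: print the DNS name of the load balancer.
# $1: load balancer name (assumed default: posit-sce-alb).
alb_dns_name() {
  name=${1:-posit-sce-alb}
  aws elbv2 describe-load-balancers \
    --names "$name" \
    --query 'LoadBalancers[0].DNSName' \
    --output text
}

# Usage:
#   alb_dns_name
```

The call uses the Region and credentials of your configured AWS CLI profile.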
Open a new browser window, paste the ALB DNS name into the address bar, and append one of the following paths:
- /pwb for Posit Workbench
- /pct for Posit Connect
- /ppm for Posit Package Manager
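The three entry points can be assembled from the DNS name as in the sketch below. The function name `sce_urls` and the example host name are hypothetical. The scheme defaults to https, so pass http as the second argument if you deployed without HTTPS.

```shell
# Hypothetical helper: print the Workbench, Connect and Package Manager
# URLs for a given ALB DNS name.
# $1: ALB DNS name; $2: URL scheme (defaults to https).
sce_urls() {
  host=$1
  scheme=${2:-https}
  for path in pwb pct ppm; do
    printf '%s://%s/%s\n' "$scheme" "$host" "$path"
  done
}

# Example with a placeholder host name; substitute your own ALB DNS name.
sce_urls "my-alb.example.com"
```

With a self-signed certificate, your browser will show a warning before it loads these pages.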
Access management is currently local, but can be upgraded to integrate with a wide variety of identity providers. Refer to the Posit documentation for Workbench and Connect for the setup and configuration parameters needed to add SAML or extend local access management.
To get started, we have provisioned a default account for Workbench with the user name rstudio and the password rstudio. For Posit Connect, you can use the sign-up form on the landing page to get direct access to the dashboard.
With your experimental environment now set up, you can begin exploring the capabilities of Posit Package Manager. Start by opening the admin guide within the Posit interface. Here you will find the necessary kubectl commands for loading R packages from CRAN into your Kubernetes cluster. This allows you to easily install and manage R packages from directly within Posit.
Figure 4: Admin guide for Posit Package Manager
When you are working in a regulated GxP Environment, Posit Package Manager allows you to maintain full control and validation of packages. You can seamlessly integrate your organization’s internally validated package repository directly into Posit. This provides traceability and assurance that only approved packages meeting your compliance needs will be available to data scientists and researchers within the Posit interface. By configuring Posit to retrieve packages from your validated repository, you can empower self-service access while maintaining governance of the packages permitted in your validated systems – a key advantage for regulated industries.
Hosting this environment as part of a validated GxP environment is beyond the scope of this blog post. However, the PHUSE community has documented several approaches.
Cleaning up
To uninstall the whole environment, open your terminal again and run:
bash ./run.sh destroy
This guides you through the uninstallation process. You can choose to uninstall only the Posit products and leave the base infrastructure of EKS, RDS and EFS standing, for example, if you made configuration changes to the manifest section of the solution and want to redeploy Posit. Alternatively, you can uninstall the entire solution in one go.
Conclusion
In this blog post, we showed you how to use a pre-configured template to deploy a statistical compute environment centered around open source and R. Customers can use it for clinical as well as financial services use cases, analyzing and transforming data with help from open-source communities that address repeatable tasks in their business domain. We showed how this solution deploys a full-featured Posit Workbench, Posit Connect and Posit Package Manager on an architecture based on Amazon EKS, using AWS CDK and Helm. We also showed where to find the commands for loading open-source R packages from CRAN into Package Manager.
You can download the complete installation package from GitHub and start exploring this setup today. Share your experience and any challenges you encounter in the comments section below. We're excited to see how you will use this environment to drive insights and innovation in your organization!