Deploying Open OnDemand with AWS ParallelCluster
Open OnDemand is an open-source High Performance Computing (HPC) portal that gives system administrators an easy way to offer web-based access to HPC resources. Many of our customers use Open OnDemand to provide access to their HPC resources.
We’ve heard from these customers that they want to be able to leverage the cloud, but they don’t want to change the tooling with which their users are familiar. In addition, researchers and scientists have come to rely on these portals and want to focus on their research instead of cloud infrastructure.
To meet our HPC customers where they are and provide a smooth transition to cloud-based HPC, we’re releasing an HPC workshop that demonstrates how to integrate AWS ParallelCluster with Open OnDemand.
This article describes what is in the workshop, and highlights some of the features of Open OnDemand once the resources are started.
The workshop walks you through creating an AWS-based Open OnDemand environment integrated with AWS ParallelCluster. AWS ParallelCluster is an open-source cluster management tool that makes it easy for you to deploy and manage HPC clusters on AWS. ParallelCluster lets you leverage the elasticity of the AWS cloud by providing an easy way to expand and contract compute queues. In addition, you can leverage the agility of AWS by quickly spinning up clusters to prototype a solution, or draw on the different types of compute resources available on AWS to find the best mix of CPU, memory, and hardware acceleration (such as GPUs) for your jobs.
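As a rough sketch of how quickly a cluster can be spun up, the ParallelCluster v3 CLI drives the whole lifecycle from a YAML configuration file. The cluster name and file name below are illustrative; the workshop supplies its own template files.

```shell
# Write a small driver script around the ParallelCluster v3 CLI.
# "ood-cluster" and "cluster.yaml" are placeholder names.
cat > deploy-cluster.sh <<'EOF'
#!/bin/bash
set -euo pipefail

# Create the cluster from a ParallelCluster v3 YAML configuration
pcluster create-cluster \
  --cluster-name ood-cluster \
  --cluster-configuration cluster.yaml

# Check progress until the cluster reports CREATE_COMPLETE
pcluster describe-cluster --cluster-name ood-cluster

# Tear the cluster down when the prototype is no longer needed
# pcluster delete-cluster --cluster-name ood-cluster
EOF
chmod +x deploy-cluster.sh
```

Deleting the cluster when you’re done is what makes prototyping cheap: the compute, head node, and scheduler state all go away with one command.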
Once the infrastructure is created, you will run two jobs. The first job will leverage Spack — an open-source package manager that makes installing scientific software easy — to compile OpenFOAM. Then, you will run an OpenFOAM Motorbike simulation on your cluster. Both jobs will be submitted through Open OnDemand.
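The two jobs can be sketched as ordinary Slurm batch scripts. The Spack install path under the shared file system is an assumption, and `openfoam` is the name of the Spack package for the ESI/OpenCFD distribution; the workshop provides the exact scripts.

```shell
# Job 1: compile OpenFOAM with Spack (path to Spack is assumed).
cat > compile-openfoam.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=install-openfoam
#SBATCH --nodes=1
#SBATCH --exclusive

# Load Spack from a shared filesystem visible to the compute nodes
. /shared/spack/share/spack/setup-env.sh
spack install openfoam          # builds OpenFOAM and its dependencies
EOF

# Job 2: run the stock motorBike tutorial that ships with OpenFOAM.
cat > motorbike.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=motorbike
#SBATCH --nodes=2
#SBATCH --exclusive

. /shared/spack/share/spack/setup-env.sh
spack load openfoam

# Copy the tutorial case into the working directory and run it
cp -r "$FOAM_TUTORIALS/incompressible/simpleFoam/motorBike" .
cd motorBike && ./Allrun
EOF
```

In the workshop you create and submit these through Open OnDemand’s file editor and job composer rather than from a terminal, but the scripts themselves are plain Slurm.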
The workshop consists of AWS CloudFormation and ParallelCluster template files. CloudFormation is an infrastructure as code (IaC) service that allows you to easily model, provision, and manage AWS and third-party resources. CloudFormation and ParallelCluster are used to build the following architecture (Figure 1):
Users access Open OnDemand by navigating to an AWS Application Load Balancer (ALB). An ALB serves as the single point of contact for clients. The load balancer distributes incoming application traffic across multiple targets, such as EC2 instances, in multiple Availability Zones. This increases the availability of your application. The ALB has an AWS Certificate Manager (ACM) certificate associated with it to enforce encryption in transit. ACM certificates are free to create and use with AWS services such as an ALB.
The Open OnDemand instance is in an Amazon EC2 Auto Scaling group (ASG). Amazon EC2 Auto Scaling helps you maintain application availability and automatically adds or removes EC2 instances using scaling policies that you define. Here, however, the maximum number of instances in the ASG is 1. While the application won’t scale out, a single-instance ASG still improves availability: if the instance fails, the ASG automatically replaces it. When an Open OnDemand instance is created, the cfn-init helper script runs to install and configure the application.
An Amazon Elastic File System (EFS) file system is created for shared user home directories. EFS is a fully managed, multi-Availability Zone Network File System (NFS) service. The file system is mounted on both the Open OnDemand instance and the ParallelCluster cluster, providing consistent user home directories throughout the architecture. For example, users upload job files to their home directories, and the cluster then accesses and executes those files.
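To make the shared home directories concrete, each instance carries an NFS mount entry along these lines. The file system ID and region are placeholders; the workshop’s templates configure this for you.

```shell
# Illustrative /etc/fstab entry for mounting EFS over /home.
# fs-0123456789abcdef0 and us-east-1 are placeholder values.
cat > efs-fstab-example <<'EOF'
fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /home nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 0 0
EOF
```

Because the same file system backs `/home` everywhere, a script saved through Open OnDemand’s file editor is immediately visible on the cluster’s head and compute nodes.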
AWS Managed Microsoft AD is used as an identity store. Managed Microsoft AD is a fully managed, highly available directory powered by Windows Server 2019. A Network Load Balancer (NLB) is placed in front of Managed Microsoft AD to provide a single DNS entry from which to access the directory. The workshop uses Open OnDemand’s integration with the Dex identity service. Dex uses OpenID Connect, which allows you to extend the solution to defer authentication to SAML and other identity providers. If you do not want to use Dex, Open OnDemand supports most authentication modules that work with Apache HTTP Server 2.4.
The cluster has an Amazon FSx for Lustre file system attached as scratch storage for jobs. FSx for Lustre is a fully managed file system that makes it easy and cost-effective to launch and run the popular, high-performance Lustre file system. FSx for Lustre provides low latency, high throughput access to files and direct integration with S3.
There is one compute queue in the workshop, but this can be expanded as needed. The compute queue leverages the elasticity and agility of the cloud. When not in use, the queue contracts to 0 nodes. When in use, the queue expands based on the parameters of the job and the queue settings.
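A queue with this scale-to-zero behavior can be expressed directly in a ParallelCluster v3 configuration. The queue name, instance type, and `MaxCount` below are illustrative:

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: compute-nodes
          InstanceType: c5.2xlarge
          MinCount: 0    # queue contracts to zero nodes when idle
          MaxCount: 16   # upper bound for scale-out under load
```

Adding another queue (for example, a GPU queue) is a matter of appending another entry under `SlurmQueues`.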
After the head node is created, a cluster configuration file is added to an Amazon Simple Storage Service (S3) bucket. S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. When the cluster configuration file is uploaded to S3, an Amazon EventBridge rule triggers an AWS Systems Manager (SSM) document that copies the configuration file to the Open OnDemand instance. Systems Manager is a service that helps you automatically configure EC2 instances. This event-driven cluster registration makes it seamless to add a new ParallelCluster cluster to Open OnDemand.
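The EventBridge side of this registration flow is a rule that matches S3 object-creation events for the configuration bucket. A minimal event pattern might look like the following (the bucket name is a placeholder, and the suffix filter assumes the configuration files are YAML):

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": { "name": ["my-cluster-config-bucket"] },
    "object": { "key": [{ "suffix": ".yaml" }] }
  }
}
```

The rule’s target is the SSM automation that pushes the file to the Open OnDemand instance, so no one has to log in and register new clusters by hand.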
To display job information in Open OnDemand, Slurm accounting is configured. Slurm accounting uses an Amazon Aurora MySQL-Compatible database that the workshop creates. Aurora is a highly performant, highly available, fully managed database service. If you don’t want to use Aurora MySQL, the solution works with any database supported by Slurm accounting, including Amazon RDS for MySQL, a fully managed MySQL database.
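Recent ParallelCluster releases (3.3.0 and later) can wire Slurm accounting to an external database directly from the cluster configuration, with the database password read from AWS Secrets Manager. The endpoint and secret ARN below are placeholders:

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Database:
      # Aurora MySQL cluster endpoint (placeholder) and port
      Uri: slurm-accounting.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com:3306
      UserName: slurm
      # Secret holding the database password (placeholder ARN)
      PasswordSecretArn: arn:aws:secretsmanager:us-east-1:111122223333:secret:slurm-dbd-password
```

With accounting in place, `sacct` data flows through to the job views in the Open OnDemand portal.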
All of the secrets in the application are stored in AWS Secrets Manager. AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, API keys, and other secrets throughout their lifecycles. We use Secrets Manager to securely store and programmatically access secrets, such as the LDAP user that Open OnDemand uses. Access to secrets is recorded in AWS CloudTrail for auditing. AWS CloudTrail monitors and records account activity across your AWS infrastructure, giving you control over storage, analysis, and remediation actions. Secrets are accessed programmatically from the different EC2 instances of the system. This simplifies configuration and ensures that secrets aren’t embedded in source-controlled configuration files.
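Fetching a secret at configuration time instead of hard-coding it is a one-line AWS CLI call. The secret name below is hypothetical, and the instance role must grant `secretsmanager:GetSecretValue`:

```shell
# Write a helper that retrieves the LDAP service-account secret at boot.
# "openondemand/ldap-service-account" is a placeholder secret name.
cat > fetch-ldap-secret.sh <<'EOF'
#!/bin/bash
set -euo pipefail
aws secretsmanager get-secret-value \
  --secret-id openondemand/ldap-service-account \
  --query SecretString \
  --output text
EOF
chmod +x fetch-ldap-secret.sh
```

Because the value is resolved at runtime, nothing sensitive lands in the CloudFormation templates or any source-controlled configuration, and every retrieval leaves a CloudTrail record.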
Once you deploy the Open OnDemand portal with an HPC cluster, you can navigate to the application, where you’ll be greeted by the Open OnDemand login page. A username and password to log in can be found in Secrets Manager.
Once logged in, you can click the Clusters dropdown to interact with the head node of the ParallelCluster cluster you created.
You will use Open OnDemand’s File editing functionality to create scripts to compile OpenFOAM with Spack and execute the motorbike simulation. Finally, you will use Open OnDemand’s job submission functionality to submit both of these jobs to ParallelCluster.
To view your results, you can either download the output from the Files functionality in Open OnDemand or extend the solution to enable interactive desktops and view the output on a compute node.
The workshop demonstrates the art of the possible by integrating the popular Open OnDemand HPC portal with AWS ParallelCluster. You can apply similar approaches to integrate ParallelCluster with existing Open OnDemand installations, letting your existing HPC portal tap the elastic, scalable resources of AWS for additional capacity or capabilities for HPC jobs while giving researchers and your end users the familiar feel of Open OnDemand.