AWS Open Source Blog

Deploying an HPC cluster and remote visualization in a single step using AWS ParallelCluster

Since its initial release in November 2018, AWS ParallelCluster (an AWS-supported open source tool) has made it easier and more cost-effective for users to deploy and manage HPC clusters in the cloud. The team has continued to enhance the product with greater configuration flexibility, built-in support for the Elastic Fabric Adapter network interface, and integration with the Amazon Elastic File System and Amazon FSx for Lustre shared filesystems. With the release of version 2.5.0, ParallelCluster further simplifies cluster deployments by adding native support for NICE DCV on the master node.

NICE DCV is an AWS high-performance remote display protocol that provides customers with a secure way to deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions. Using NICE DCV, customers can run graphics-intensive HPC applications remotely on EC2 instances and stream the results to their local machine at no additional charge, eliminating the need for expensive dedicated workstations.

Prior to the version 2.5.0 release, HPC users requiring remote visualization would typically have to deploy a separate EC2 instance and install and configure a DCV server on it. This entailed installing prerequisites such as a window manager, configuring display drivers, installing the DCV server software, setting up secure authentication, and configuring desktop sessions. With the release of ParallelCluster 2.5.0, all you need to do is enable DCV in the ParallelCluster config file, and all of this is done for you. The remainder of this post walks through the process of provisioning an HPC cluster to run a simulation using NAMD, and then visualizing the results in a NICE DCV session.

For anyone unfamiliar with ParallelCluster or looking to deploy a cluster for the first time, additional background and introductory concepts are detailed in the ParallelCluster launch post. The AWS ParallelCluster User Guide is also useful as a reference.

Configure ParallelCluster

The first step after installing ParallelCluster is to create or update the ParallelCluster configuration file. By default, ParallelCluster obtains its settings from the .parallelcluster/config file located in the user’s home directory. A comprehensive list of all configuration parameters can be found in the configuration section of the ParallelCluster documentation.
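If you keep more than one configuration, note that the pcluster commands also accept an explicit configuration file path through the -c/--config option, so you are not limited to the default location. A minimal sketch (the cluster name and file path below are just examples):

# Point pcluster at an alternate configuration file instead of ~/.parallelcluster/config
pcluster create mycluster -c ~/clusters/dcv-config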

ParallelCluster 2.5.0 greatly simplifies the pcluster configuration process. Previously, when you launched pcluster configure, you would be prompted for information that you would have to track down and copy/paste into the terminal. In version 2.5.0, the configuration program queries your VPC and presents you with a numbered list of options to choose from.

To create an initial configuration file, simply type pcluster configure. The configuration wizard will prompt you for required inputs and provide a list of menu options. You can choose to deploy clusters in an existing VPC, or the program can create one for you. I want to deploy a cluster in an existing VPC, so that is the scenario walked through below.


bk:~ $ pcluster configure
INFO: Configuration file /Users/bkknorr/.parallelcluster/config will be written.
Press CTRL-C to interrupt the procedure.

Allowed values for AWS Region ID:

1. ap-northeast-1
2. ap-northeast-2
3. ap-south-1
4. ap-southeast-1
5. ap-southeast-2
6. ca-central-1
7. eu-central-1
8. eu-north-1
9. eu-west-1
10. eu-west-2
11. eu-west-3
12. sa-east-1
13. us-east-1
14. us-east-2
15. us-west-1
16. us-west-2

AWS Region ID [ap-northeast-1]: us-east-1
Allowed values for EC2 Key Pair Name:

1. hpc-key

EC2 Key Pair Name [hpc-key]: 1
Allowed values for Scheduler:

1. sge
2. torque
3. slurm
4. awsbatch

Scheduler [sge]: 1
Allowed values for Operating System:

1. alinux
2. centos6
3. centos7
4. ubuntu1604
5. ubuntu1804

Operating System [alinux]: 3
Minimum cluster size (instances) [0]: 
Maximum cluster size (instances) [10]: 
Master instance type [t2.micro]: g3s.xlarge
Compute instance type [t2.micro]: p3.2xlarge
Automate VPC creation? (y/n) [n]: n
Allowed values for VPC ID:

1. vpc-01234567 | Default | 7 subnets inside

VPC ID [vpc-01234567]: 1
Automate Subnet creation? (y/n) [y]: n
Allowed values for Master Subnet ID:

1. subnet-ec3fa3c3 | Subnet size: 4096

Master Subnet ID [subnet-a1b2c34d]: 1
Allowed values for Compute Subnet ID:

1. subnet-a1b2c34d | Subnet size: 4096

Compute Subnet ID [subnet-a1b2c34d]: 1
Configuration file written to /Users/bkknorr/.parallelcluster/config

You can edit your configuration file or simply run pcluster create [cluster-name] to create your cluster.

Once the configuration file is written, you'll need to make a few modifications to enable NICE DCV on the master node. This is done in two steps:

  1. Add a line dcv_settings = default to the cluster section of the config file.
  2. Create a dcv section whose label matches the value you chose for dcv_settings, and include the line enable = master, as shown in the snippet below.
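Taken on their own (and omitting the rest of the cluster settings), those two additions look like this; the label default is arbitrary as long as the two references match:

[cluster default]
# ... existing cluster settings ...
dcv_settings = default

[dcv default]
enable = master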

In the example config file below, the DCV-related settings appear at the end of the [cluster default] section and in the [dcv default] section. I've also chosen to deploy the cluster with a GPU-enabled master instance. A master instance usually does not need much memory or CPU because it does not participate in job runs, so a c5.large would typically suffice. In this case, however, the master node hosts NICE DCV remote desktop sessions, so I chose a g4dn.xlarge because it provides a cost-effective option for graphics-intensive applications.

Example pcluster config file


[global]
cluster_template = default
update_check = true
sanity_check = false

[aws]
aws_region_name = us-east-1

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
base_os = centos7
key_name = hpc-key
vpc_settings = public
initial_queue_size = 0
max_queue_size = 10
placement = cluster
placement_group = DYNAMIC
master_instance_type = g4dn.xlarge
compute_instance_type = p3.2xlarge
cluster_type = ondemand
efs_settings = customfs
dcv_settings = default

[dcv default]
enable = master
# port = 8443
# access_from = 11.22.33.0/24

[vpc public]
vpc_id = [your_vpc_id]
master_subnet_id = [your_subnet_id]
ssh_from = 11.22.33.0/24

[efs customfs]
shared_dir = efs
efs_fs_id = fs-abc12345

Create a cluster

To create a cluster, issue the pcluster create command:

bk:~ bkknorr$ pcluster create hpcblog
Beginning cluster creation for cluster: hpcblog
Creating stack named: parallelcluster-hpcblog
Status: parallelcluster-hpcblog - CREATE_COMPLETE
MasterPublicIP: 34.239.191.193
ClusterUser: centos
MasterPrivateIP: 172.31.81.91

Example command to create a cluster with ParallelCluster.

You should have a cluster and remote desktop available within minutes. Once your create command completes, you can connect to your desktop session and run a simulation job.
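If you want to check on the CloudFormation stack while it builds, or re-display the cluster details later, the pcluster CLI also provides list and status subcommands:

# Show clusters in the current region, then the status of this one
pcluster list
pcluster status hpcblog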

Connect to a NICE DCV session

To connect to your NICE DCV session, simply run pcluster dcv connect [cluster] -k [keyname], where [cluster] specifies the name you gave your cluster and -k specifies the location of your private key. Once you are successfully authenticated, your NICE DCV session will launch in a browser and your credentials will be securely passed to the NICE DCV server using an authorization token.

bk:~ bkknorr$ pcluster dcv connect hpcblog -k hpc-key.pem

Example command to launch NICE DCV.
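NICE DCV provides a full remote desktop, but for quick command-line tasks you can also reach the master node with pcluster ssh, which expands the ssh alias defined in the config file and passes any extra arguments straight to ssh:

# Terminal-only access to the master node using the same private key
pcluster ssh hpcblog -i hpc-key.pem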

Job submission using NAMD

The following demonstration uses NAMD and VMD to run a molecular dynamics job and then visualize the output, working through the first part of Unit 3 of the Membrane Protein Tutorial (starting on page 34). The requisite software and tutorial files first need to be uploaded to a shared filesystem, as shown below.
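One way to stage those files is to copy them from your workstation to the EFS share, which is mounted at /efs per the shared_dir setting in the config file above. The local directory name below is only a placeholder for wherever you unpacked the tutorial:

# Copy the tutorial files to the cluster's shared filesystem (local path is illustrative)
scp -i hpc-key.pem -r mem-tutorial/ centos@34.239.191.193:/efs/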

Upon connecting to your NICE DCV session, launch a terminal and navigate to the working directory containing the tutorial files. The job submission example below uses Sun Grid Engine (sge), the default ParallelCluster job scheduler.

[centos@ip-172-31-93-180 03-MINEQ]$ qsub -cwd -pe mpi 4 -N kcsa_popcwimineq-01 -o kcsa_popcwimineq-01.log kcsa_popcwimineq-01.job
Your job 1 ("kcsa_popcwimineq-01") has been submitted
Exiting.

Breaking down the above command:

  • qsub submits a job to Sun Grid Engine (sge).
  • -cwd executes the job from the current working directory.
  • -pe mpi 4 requests four slots from the parallel environment named mpi. The number of slots requested determines how many cores are used to run the job. The p3.2xlarge compute instance I am using has four cores and eight threads; I've disabled hyperthreading, so I'm requesting only four slots rather than eight.
  • -N kcsa_popcwimineq-01 [optional] names the job.
  • -o kcsa_popcwimineq-01.log [optional] names the output logfile.
  • kcsa_popcwimineq-01.job is the name of the input file.
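After submitting, you can watch the job (and see when the compute nodes that ParallelCluster provisions for it come online) with SGE's standard qstat command:

# List queued and running jobs; -f adds per-queue detail for each compute node
qstat
qstat -f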

Visualize output in NICE DCV

Once your simulation job completes, you can visualize your results using VMD. The first step is to download and install VMD.

[centos@ip-172-31-93-180 ~]$ cd /efs/vmd-1.9.3/
[centos@ip-172-31-93-180 vmd-1.9.3]$ ./configure
using configure.options: LINUXAMD64 OPENGL OPENGLPBUFFER FLTK TK ACTC CUDA IMD LIBSBALL XINERAMA XINPUT LIBOPTIX LIBOSPRAY LIBTACHYON VRPN NETCDF COLVARS TCL PYTHON PTHREADS NUMPY SILENT ICC
[centos@ip-172-31-93-180 vmd-1.9.3]$ cd src/
[centos@ip-172-31-93-180 src]$ sudo make install
Make sure /usr/local/bin/vmd is in your path.
VMD installation complete. Enjoy!
[centos@ip-172-31-93-180 src]$

Once VMD is installed, you can launch vmd from your terminal to view the output of your job.
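For example, you can load the structure file along with the trajectory written by the minimization job directly from the command line. The exact file names depend on the tutorial files; the ones below are assumptions based on the job name used earlier:

# Load the PSF structure and the DCD trajectory into the same molecule (-f)
vmd -f kcsa_popcwi.psf kcsa_popcwimineq-01.dcd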

NICE DCV running VMD in a remote session.

Conclusion

As demonstrated in this post, AWS ParallelCluster 2.5.0 further simplifies HPC cluster deployment and management. With the ability to provision compute resources and remote visualization with a single command, HPC practitioners can iterate faster and reduce time to insight.

Learn more about ParallelCluster on GitHub.

Ben Knorr

Ben Knorr is a Principal Solutions Architect at Amazon Web Services. Prior to joining AWS, his background focused on analytics. Since joining AWS, he has been helping customers leverage the breadth of AWS services to achieve successful business outcomes.