AWS Storage Blog

Building an HPC cluster with AWS ParallelCluster and Amazon FSx for Lustre

High performance computing (HPC) customers love the breadth of services offered by AWS and the flexibility the cloud provides to address their computational challenges. AWS gives you the opportunity to innovate quickly and accelerate your workflows thanks to virtually unlimited capacity.

However, building an HPC system can be complex, which is one reason AWS created AWS ParallelCluster. This tool offers a simple method to create an automatically scaling HPC system in the AWS Cloud, using services such as Amazon EC2, AWS Batch, and Amazon FSx for Lustre. AWS ParallelCluster configures the compute and storage resources necessary to run your HPC workloads, and you can tune its settings to match your specific storage, compute, and network requirements.

You can manage an HPC cluster with AWS ParallelCluster through a simple set of commands: create, update, and delete. The compute portion of the cluster automatically scales when jobs have to be processed. As a result, you only pay for what you use. Additionally, to provide shared storage to the cluster, AWS ParallelCluster can use Amazon FSx for Lustre as a high-performance file system that can be accessed by all compute nodes.

In this post, I show how easy it is to spin up a fully functioning compute cluster with a shared, high-performance file system in a matter of minutes using AWS ParallelCluster and Amazon FSx for Lustre. To test file system throughput, a common performance metric for sizing HPC clusters, you use the SGE scheduler (pre-installed with AWS ParallelCluster) to run a quick I/O benchmark job once your cluster is up. The benchmark also verifies that the compute cluster can achieve the expected level of file system performance. If you’re new to AWS ParallelCluster, review this post detailing how to get started before continuing.

Overview

AWS recently announced AWS ParallelCluster support for FSx for Lustre. AWS ParallelCluster is a cluster-management tool for AWS that makes it easy for scientists, researchers, and IT administrators to deploy and manage HPC clusters in the AWS Cloud. This tool also facilitates both quick-start proofs of concept (POCs) and production deployments.

FSx for Lustre provides a high-performance file system that makes it easy to process data stored on Amazon S3 or on premises. With FSx for Lustre, you can launch and run a file system that provides sub-millisecond access to your data and lets you read and write at speeds of up to hundreds of gigabytes per second of throughput and millions of IOPS. By integrating AWS ParallelCluster with FSx for Lustre, HPC customers can now spin up clusters with thousands of compute instances and shared storage that scales performance to meet the most demanding workloads, so compute instances never have to wait for data and you get to answers faster.

To test the performance of your FSx for Lustre file system, you can use IOR, a commonly used HPC benchmarking tool. IOR provides a number of options for approximating the I/O patterns of different workloads. However, I always recommend benchmarking with your own workload to get full visibility into its performance characteristics.

Right-sizing

Use IOR to generate a sequential read/write I/O pattern that tests the high-throughput capabilities of FSx for Lustre. This means you begin sizing at the storage layer, and then determine how many compute nodes are needed to drive your desired I/O performance.

When sizing FSx for Lustre performance, a good rule of thumb is to plan for 200 MB/s of throughput per 1 TiB allocated. FSx for Lustre allocates file systems in 3.6-TiB increments, each providing around 720 MB/s. A 39.6-TiB file system should therefore give you around 7.92 GB/s of total file system throughput.

After you know the storage performance target, you can figure out how many compute nodes you need. Each compute node communicates with FSx for Lustre over the network, making the workload throughput-based, and so network bandwidth is the primary sizing factor for compute nodes.

Amazon EC2 instances provide different levels of network bandwidth, depending on the instance type. In this case, I selected the c4.4xlarge instance type because it is widely available in all Regions and, in testing, provides about 500 MB/s of peak network bandwidth. To reach 7.92 GB/s of aggregate throughput, you need sixteen c4.4xlarge compute nodes in your cluster, as the arithmetic below shows.
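
Putting the two rules of thumb together, the back-of-the-envelope sizing math looks like this (both the 200 MB/s per TiB and the ~500 MB/s per node figures are estimates from above, not guarantees):

# Target file system throughput
39.6 TiB x 200 MB/s per TiB = 7,920 MB/s ≈ 7.92 GB/s

# Compute nodes needed to drive that throughput
7,920 MB/s / ~500 MB/s per c4.4xlarge ≈ 15.84, rounded up to 16 nodes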

Prerequisites

You must have the AWS CLI installed and configured on a system with Python and pip. This could be your laptop, a desktop computer, or an Amazon EC2 bastion host running in AWS. Make sure that you are running the latest version of the AWS CLI by running the following command:

$ pip install --upgrade awscli

If you already have the AWS ParallelCluster CLI installed, run the following command to make sure you are using the latest version:

$ pip install --upgrade aws-parallelcluster
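
Either way, you can confirm which release of the AWS ParallelCluster CLI you ended up with (pcluster version is a standard subcommand; its output varies with the installed release):

$ pcluster version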

Defining the cluster

AWS ParallelCluster uses a configuration file to define how the cluster should be created and initialized. Use the following configuration file to create the cluster:

[global]
cluster_template = default
update_check = true

[aws]
aws_region_name = us-east-2

[cluster default]
key_name = <EC2-KEY-NAME>
initial_queue_size = 0
max_queue_size = 16
placement_group = DYNAMIC
cluster_type = spot
master_instance_type = c4.4xlarge
compute_instance_type = c4.4xlarge
vpc_settings = myvpc
ebs_settings = myebs
fsx_settings = myfsx

[vpc myvpc]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[ebs myebs]
shared_dir = /shared
volume_type = gp2
volume_size = 100

[fsx myfsx]
shared_dir = /fsx
storage_capacity = 39600

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Global configuration

This section reviews some of the parameters from the preceding configuration file. For a complete description of the parameters, see the AWS ParallelCluster post on the AWS Open Source Blog.

  • aws_region_name – Name of the Region where the cluster is created. In this case, I used us-east-2 (Ohio) because it’s one of the Regions that supports FSx for Lustre.
  • key_name – Name of an existing Amazon EC2 key pair, from the specified Region. It is used for Secure Shell (SSH) access to the cluster master node.
  • initial_queue_size – Initial number of compute nodes attached to the cluster when created.
  • max_queue_size – Maximum number of compute nodes created using Amazon EC2 Auto Scaling when jobs are submitted.
  • cluster_type – Defines whether Amazon EC2 On-Demand or Spot Instances are used for the compute portion of the cluster. By default, the Spot price is capped at the On-Demand price, but you can change that setting (see the example after this list).
  • compute_instance_type – EC2 instance type to be used for the compute nodes. In this case, choose c4.4xlarge to get a good balance of CPU performance and network performance. Evaluate different instance types and sizes with your HPC workloads and select the one that makes the most sense.
  • shared_dir – Directory created on the master node and exported via NFS to all other nodes in the cluster. It provides a shared space that can be used to store applications or static files.
  • vpc_id – ID of the VPC, in <aws_region_name>, where the cluster deploys.
  • master_subnet_id – ID of the subnet, in <aws_region_name>, where the cluster deploys.
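
For example, if you would rather cap your maximum Spot price below the On-Demand default, the [cluster] section accepts a spot_price parameter. The value shown here is purely illustrative; choose one based on your own cost requirements:

[cluster default]
cluster_type = spot
spot_price = 0.50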

FSx for Lustre configuration

The new configuration option fsx_settings gives you the following abilities:

  • Create a new FSx for Lustre file system and attach it to the cluster.
  • Attach a pre-existing FSx for Lustre file system.

The [fsx] section in the configuration file specifies the options for the file system. In this case, the system creates a 39.6-TiB file system and mounts it under the /fsx folder on all nodes.

FSx for Lustre can also load datasets from Amazon S3 when it creates a file system, which allows you to store data durably, reliably, and at a low price point in Amazon S3. File data is loaded from S3 into FSx for Lustre the first time a file is read; subsequent reads pull data directly from the FSx for Lustre file system. To initialize your FSx for Lustre file system with data from Amazon S3, use the following settings under the [fsx] configuration section:

  • import_path – This specifies an S3 bucket and path to import data from (for example, s3://mybucket/mydata). When FSx for Lustre creates its file system, it scans the objects in this path and creates corresponding file metadata in Lustre. The actual object data is not loaded into the file system until the first time it reads the file.
  • export_path – To export data from the file system back to Amazon S3, you can specify a path, which must be in the same bucket as the import_path. An example [fsx] section using both settings follows.
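
As a sketch, here is how the [fsx] section from the earlier configuration file might look with both settings added. The bucket and paths (s3://mybucket/mydata and its export prefix) are placeholders, not real locations:

[fsx myfsx]
shared_dir = /fsx
storage_capacity = 39600
import_path = s3://mybucket/mydata
export_path = s3://mybucket/mydata/export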

For more information, see Configuration Options.

Creating the cluster

Now that you’ve defined the configuration file, save it to ~/.parallelcluster/config. Then create the cluster using the following command:

$ pcluster create mycluster
Beginning cluster creation for cluster: mycluster
Creating stack named: parallelcluster-mycluster
Status: parallelcluster-mycluster - CREATE_COMPLETE
MasterPublicIP: 18.221.232.178 
ClusterUser: ec2-user
MasterPrivateIP: 172.31.12.244

Alternatively, place the configuration file anywhere you like and use the following command, where myconfig is the name of the file containing your AWS ParallelCluster configuration:

$ pcluster create mycluster -c myconfig

It takes a few minutes to spin up the cluster and to create and initialize the FSx for Lustre file system. AWS ParallelCluster first verifies that the provided configuration file is correct, then generates an AWS CloudFormation template to build the required infrastructure for an HPC system in AWS with automatic scaling capabilities. For more information about the services that AWS ParallelCluster creates, see How AWS ParallelCluster works.
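
While you wait, you can check on creation progress from another terminal. The pcluster status subcommand reports the state of the underlying AWS CloudFormation stack:

$ pcluster status mycluster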

The cluster creation process creates a single master node. When you connect to the cluster using SSH, you land on this node and manage your jobs from it. You can connect to the master node using the following command:

$ pcluster ssh mycluster -i <path to your EC2 SSH private (.pem) key file>
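
For example, assuming the key pair you referenced as <EC2-KEY-NAME> in the configuration file is stored in your ~/.ssh directory:

$ pcluster ssh mycluster -i ~/.ssh/<EC2-KEY-NAME>.pem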

Installing IOR

For your test, install the IOR benchmark into the shared directory, which makes it available to all of the compute nodes in the cluster, by running the following script on the master node:

#!/bin/bash

# Load the MPI compiler wrappers provided with AWS ParallelCluster
module load openmpi

# Build IOR 3.2.1 and install it under the NFS-shared /shared directory,
# so that every compute node can run the same binary
mkdir -p /shared/hpc/performance/ior
git clone https://github.com/hpc/ior.git
cd ior
git checkout 3.2.1
./bootstrap
./configure --with-mpiio --with-lustre --prefix=/shared/hpc/performance/ior
make
sudo make install
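
When the install completes, you can verify that the binary landed in the shared directory configured with --prefix above:

$ ls /shared/hpc/performance/ior/bin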

Running IOR

Use the SGE command line tools to launch your IOR job across the cluster with the following shell script:

#!/bin/bash
#$ -V
#$ -cwd
#$ -N myjob
#$ -j Y
#$ -o output_$JOB_ID.out
#$ -e output_$JOB_ID.err
#$ -pe mpi 256

module load openmpi

# Write the IOR test files to the FSx for Lustre mount
mkdir -p /fsx/ior
OUTPUT=/fsx/ior/test
export PATH=/shared/hpc/performance/ior/bin:$PATH

# NSLOTS is set by SGE to the slot count requested with -pe (256 here)
mpirun -np ${NSLOTS} ior -w -r -B -C -o $OUTPUT -b 5g -a POSIX -i 1 -F -t 4m

The script option -pe mpi 256 tells SGE to use the mpi parallel environment and to allocate 256 slots across the cluster. Because the cluster scales to up to 16 compute nodes and each c4.4xlarge has 16 vCPUs, SGE places 16 IOR processes on each node. Each process writes and then reads its own 5-GiB file in 4-MiB transfers, an aggregate of 1.25 TiB of data, using the following IOR flags:

  • -B – Use O_DIRECT to bypass system library caching and access the file system directly. This provides raw file system performance metrics.
  • -a – Define the I/O access interface to use. In this case, select POSIX as the method.
  • -F – Create one file per process.
  • -C – Change the task ordering to n+1 ordering for read-back, to avoid read-cache effects on client processes.
  • -b – Define the size of the file to create, per process.
  • -t – Define the I/O size to use for each read/write operation.

Run the following command on the master node to submit your job to the cluster, where myjob.sh is the name of the earlier shell script:

$ qsub myjob.sh
Your job 1 ("myjob") has been submitted

You can monitor the status of the job using the qstat command:

$ qstat
job-ID prior   name  user     state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1      0.55500 myjob ec2-user r     03/15/2019 21:57:27 all.q@ip-172-31-5-228.us-east- 64

After a few minutes, automatic scaling brings compute instances into your cluster and the IOR job starts automatically. For a description of the workflow, see AWS ParallelCluster processes.
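
If you want to watch the compute nodes come online as they are added, qhost (a standard SGE command) lists the execution hosts the scheduler currently knows about, along with their load and memory:

$ qhost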

Example of results

As specified in the shell script, the output of the run goes to a file called output_$JOB_ID.out, where $JOB_ID is the job ID assigned by the job scheduler. From the qstat output shown earlier, your job ID was “1”. The output from the test run follows:

[ec2-user@ip-172-31-23-169 ~]$ cat output_1.out
IOR-3.2.1: MPI Coordinated Test of Parallel I/O
Began               : Tue Jul 16 14:27:19 2019
Command line        : ior -w -r -B -C -o /fsx/ior/test -b 5g -a POSIX -i 1 -F -t 4m
Machine             : Linux ip-172-31-23-3
TestID              : 0
StartTime           : Tue Jul 16 14:27:19 2019
Path                : /fsx/ior
FS                  : 36.3 TiB   Used FS: 0.0%   Inodes: 167.7 Mi   Used Inodes: 0.0%
Options:
api                 : POSIX
apiVersion          :
test filename       : /fsx/ior/test
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 256
clients per node    : 16
repetitions         : 1
xfersize            : 4 MiB
blocksize           : 5 GiB
aggregate filesize  : 1.25 TiB

Results:
access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     9115       5242880    4096       0.039549   143.79     85.61      143.79     0
read      8636       5242880    4096       0.048285   151.77     80.74      151.77     0
remove    -          -          -          -          -          -          0.029827   0
Max Write: 9115.26 MiB/sec (9558.04 MB/sec)
Max Read:  8636.00 MiB/sec (9055.50 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write        9115.26    9115.26    9115.26       0.00    2278.82    2278.82    2278.82       0.00  143.79401     0    256  16    1   1     1        1         0    0      1 5368709120  4194304 1310720.0 POSIX      0
read         8636.00    8636.00    8636.00       0.00    2159.00    2159.00    2159.00       0.00  151.77395     0    256  16    1   1     1        1         0    0      1 5368709120  4194304 1310720.0 POSIX      0
Finished            : Tue Jul 16 14:32:15 2019

To summarize:

  • You launched 256 processes total across the cluster, 16 per node.
  • Each process created a 5-GiB file on the FSx for Lustre file system, reading and writing from the file sequentially in 4-MiB chunks.
  • The aggregate throughput was 9.5 GB/s on writes and 9.0 GB/s on reads.

Cleaning up

When you have your results, you can shut down and delete the cluster using the following command:

$ pcluster delete mycluster

This terminates the cluster nodes and deletes the FSx for Lustre file system.
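
If you want to double-check that nothing is left running, pcluster list shows the clusters that still exist for the configured Region:

$ pcluster list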

Conclusion

In this post, you learned how to use AWS ParallelCluster to create a fully functional HPC cluster. After reviewing the AWS ParallelCluster settings for using FSx for Lustre, you created a file system as part of the overall cluster-creation process. Then, you used IOR to benchmark the file system to show how quick and easy it is to run a job on the cluster.

For more information, see High Performance Computing (HPC). Please let us know what you think in the comments.

Happy building!

December 19, 2019: this post was edited to remove references to “min_queue_size,” as it is no longer a valid parameter. 

Pierre-Yves Aquilanti

Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry’s best cloud-based ML framework solutions. His background is in High Performance Computing, and prior to joining AWS, Pierre-Yves worked in the Oil & Gas industry. Pierre-Yves is originally from France and holds a Ph.D. in Computer Science from the University of Lille.

Jeff Bartley

Jeff is a Principal Product Manager on the AWS DataSync team. He enjoys helping customers tackle their biggest data challenges through cloud-scale architectures. A native of Southern California, Jeff loves to get outdoors whenever he can.