AWS Storage Blog
Building an HPC cluster with AWS ParallelCluster and Amazon FSx for Lustre
High performance computing (HPC) customers love the breadth of services offered by AWS and the flexibility offered by the cloud to address their computational challenges. AWS provides you with the opportunity to innovate quickly and accelerate your workflow thanks to a virtually unlimited capacity. For more information, see the following posts:
- Natural language processing at Clemson University – 1.1 million vCPUs & EC2 Spot Instances
- Creating a 1.3 Million vCPU Grid on AWS using Amazon EC2 Spot Instances and TIBCO GridServer
- Western Digital HDD Simulation at Cloud Scale – 2.5 Million HPC Tasks, 40K EC2 Spot Instances
However, building an HPC system can be complex, which is one reason AWS created AWS ParallelCluster. This tool offers a simple way to create an automatically scaling HPC system in the AWS Cloud, using services such as Amazon EC2, AWS Batch, and Amazon FSx for Lustre. AWS ParallelCluster configures the compute and storage resources necessary to run your HPC workloads, and you can tune its settings to your specific storage, compute, and network requirements.
You can manage an HPC cluster with AWS ParallelCluster through a simple set of commands: create, update, and delete. The compute portion of the cluster automatically scales when jobs have to be processed. As a result, you only pay for what you use. Additionally, to provide shared storage to the cluster, AWS ParallelCluster can use Amazon FSx for Lustre as a high-performance file system that can be accessed by all compute nodes.
In this post, I show how easy it is to spin up a fully functioning compute cluster with a shared, high-performance file system in a matter of minutes using AWS ParallelCluster and Amazon FSx for Lustre. To test file system throughput, which is a common performance metric for sizing HPC clusters, you use the SGE scheduler (pre-installed with AWS ParallelCluster) once your cluster is up to run a quick I/O benchmark job. The I/O benchmark also verifies that the compute cluster can achieve the expected level of file system performance. If you're new to AWS ParallelCluster, review this post detailing how to get started before continuing.
Overview
AWS recently announced AWS ParallelCluster support for FSx for Lustre. AWS ParallelCluster is a cluster-management tool for AWS that makes it easy for scientists, researchers, and IT administrators to deploy and manage HPC clusters in the AWS Cloud. This tool also facilitates both quick-start proofs of concept (POCs) and production deployments.
FSx for Lustre provides a high-performance file system that makes it easy to process data stored on Amazon S3 or on premises. With FSx for Lustre, you can launch and run a file system that provides sub-millisecond access to your data and lets you read and write data at speeds of up to hundreds of gigabytes per second of throughput and millions of IOPS. By integrating AWS ParallelCluster with FSx for Lustre, HPC customers can now spin up clusters with thousands of compute instances and shared storage whose performance scales to the most demanding workloads, so compute instances never have to wait for data and you get to answers faster.
To test the performance of your FSx for Lustre file system, you can use IOR, a commonly used HPC benchmarking tool. IOR provides a number of options that approximate workload performance. However, I always recommend that you benchmark with your own workload to get full visibility into performance characteristics.
Right-sizing
In this post, you use IOR to generate a sequential read/write I/O pattern that exercises the high-throughput capabilities of FSx for Lustre. Because throughput is the target metric, you begin your sizing at the storage layer, and then determine how many compute nodes are needed to drive your desired I/O performance.
When sizing FSx for Lustre performance, a good rule of thumb is to plan for 200 MB/s of throughput per 1 TiB of allocated storage. FSx for Lustre allocates file systems in 3.6-TiB increments, each providing around 720 MB/s. A 39.6-TiB file system should therefore give you around 7.92 GB/s of total file system throughput.
After you know the storage performance target, you can figure out how many compute nodes you need. Each compute node communicates with FSx for Lustre over the network, making the workload throughput-based, and so network bandwidth is the primary sizing factor for compute nodes.
Amazon EC2 instances provide different levels of network bandwidth, depending on the instance type. In this case, I selected the c4.4xlarge instance type because it is widely available in all Regions and, in testing, provides about 500 MB/s of peak network bandwidth. To reach 7.92 GB/s of throughput, you need 16 c4.4xlarge compute nodes in your cluster.
Prerequisites
You must have the AWS CLI installed and configured on a system with Python and pip. This could be your laptop, a desktop computer, or an Amazon EC2 bastion host running in AWS. Make sure that you are running the latest version of the AWS CLI by running the following command:
$ pip install --upgrade awscli
If you already have the AWS ParallelCluster CLI installed, run the following command to make sure you are using the latest version:
$ pip install --upgrade aws-parallelcluster
Defining the cluster
AWS ParallelCluster uses a configuration file to define how the cluster should be created and initialized. Use the following configuration file to create the cluster:
[global]
cluster_template = default
update_check = true

[aws]
aws_region_name = us-east-2

[cluster default]
key_name = <EC2-KEY-NAME>
initial_queue_size = 0
max_queue_size = 16
placement_group = DYNAMIC
cluster_type = spot
master_instance_type = c4.4xlarge
compute_instance_type = c4.4xlarge
vpc_settings = myvpc
ebs_settings = myebs
fsx_settings = myfsx

[vpc myvpc]
vpc_id = vpc-<VPC-ID>
master_subnet_id = subnet-<SUBNET-ID>

[ebs myebs]
shared_dir = /shared
volume_type = gp2
volume_size = 100

[fsx myfsx]
shared_dir = /fsx
storage_capacity = 39600

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
Global configuration
This section reviews some of the previous parameters. For a complete description of the parameters, see the AWS ParallelCluster post on the AWS Open Source Blog.
- aws_region_name – Name of the Region where the cluster is created. In this case, I used us-east-2 (Ohio) because it’s one of the Regions that supports FSx for Lustre.
- key_name – Name of an existing Amazon EC2 key pair, from the specified Region. It is used for Secure Shell (SSH) access to the cluster master node.
- initial_queue_size – Initial number of compute nodes attached to the cluster when created.
- max_queue_size – Maximum number of compute nodes created using Amazon EC2 Auto Scaling when jobs are submitted.
- cluster_type – Defines whether Amazon EC2 On-Demand or Spot Instances are used for the compute portion of the cluster. When using Spot Instances, the maximum Spot price defaults to the On-Demand price, but you can change that setting.
- compute_instance_type – EC2 instance type to be used for the compute nodes. In this case, choose c4.4xlarge to get a good balance of CPU performance and network performance. Evaluate different instance types and sizes with your HPC workloads and select the one that makes the most sense.
- shared_dir – Directory created on the master node and exported via NFS to all other nodes in the cluster. It provides a shared space that can be used to store applications or static files.
- vpc_id – ID of the VPC, in <aws_region_name>, where the cluster deploys.
- master_subnet_id – ID of the subnet, in <aws_region_name>, where the cluster deploys.
FSx for Lustre configuration
The new configuration option fsx_settings gives you the following abilities:
- Create a new FSx for Lustre file system and attach it to the cluster.
- Attach a pre-existing FSx for Lustre file system.
The [fsx] section in the configuration file specifies the options for the file system. In this case, the system creates a 39.6-TiB file system and mounts it under the /fsx folder on all nodes.
FSx for Lustre can also load datasets from Amazon S3 when it creates a file system. This allows you to store data durably, reliably, and at a low price point in Amazon S3. The file data will be loaded from S3 into FSx for Lustre the first time the files are read. Subsequent reads of the file will pull data directly from the FSx for Lustre file system. To initialize your FSx for Lustre file system with data from Amazon S3, use the following settings under the [fsx] configuration section:
- import_path – This specifies an S3 bucket and path to import data from (for example, s3://mybucket/mydata). When FSx for Lustre creates the file system, it scans the objects in this path and creates corresponding file metadata in Lustre. The actual object data is not loaded into the file system until the first time each file is read.
- export_path – To export data from the file system back to Amazon S3, you can specify a path, which must be in the same bucket as the import_path.
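For example, a hypothetical [fsx] section that lazy-loads data from an S3 bucket might look like the following (the bucket and path names are placeholders for illustration):

```ini
[fsx myfsx]
shared_dir = /fsx
storage_capacity = 39600
# Lazy-load data from S3: objects appear as files, loaded on first read.
import_path = s3://mybucket/mydata
# Write results back to S3 (must be in the same bucket as import_path).
export_path = s3://mybucket/mydata/export
```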
For more information, see Configuration Options.
Creating the cluster
Now that you've defined the configuration file, save it to ~/.parallelcluster/config. Then, create the cluster using the following command:
$ pcluster create mycluster
Beginning cluster creation for cluster: mycluster
Creating stack named: parallelcluster-mycluster
Status: parallelcluster-mycluster - CREATE_COMPLETE
MasterPublicIP: 18.221.232.178
ClusterUser: ec2-user
MasterPrivateIP: 172.31.12.244
Alternatively, place the configuration file anywhere you like and use the following command, where myconfig is the name of the file containing your AWS ParallelCluster configuration:
$ pcluster create mycluster -c myconfig
It takes a few minutes to spin up the cluster, and create and initialize the FSx for Lustre file system. AWS ParallelCluster verifies that the provided configuration file is correct. Then, it generates an AWS CloudFormation template to build the required infrastructure for an HPC system in AWS with automatic scaling capabilities. For more information about the services that AWS ParallelCluster creates, see How AWS ParallelCluster works.
The cluster creation process creates a single master node. When you connect to the cluster using SSH, you land on this node and manage your jobs from it. You can connect to the master node using the following command:
$ pcluster ssh mycluster -i <path to your EC2 SSH private (.pem) key file>
Installing IOR
For your test, install the IOR benchmark into the shared NFS directory, so that it is available on all compute nodes in the cluster, by running the following script on the master node:
#!/bin/bash
# load MPI wrappers
module load openmpi
mkdir -p /shared/hpc/performance/ior
git clone https://github.com/hpc/ior.git
cd ior
git checkout 3.2.1
./bootstrap
./configure --with-mpiio --with-lustre --prefix=/shared/hpc/performance/ior
make
sudo make install
Running IOR
Use the SGE command line tools to launch your IOR job across the cluster with the following shell script:
#!/bin/bash
#$ -V
#$ -cwd
#$ -N myjob
#$ -j Y
#$ -o output_$JOB_ID.out
#$ -e output_$JOB_ID.err
#$ -pe mpi 256
module load openmpi
mkdir -p /fsx/ior
OUTPUT=/fsx/ior/test
export PATH=/shared/hpc/performance/ior/bin:$PATH
mpirun -np ${NSLOTS} ior -w -r -B -C -o $OUTPUT -b 5g -a POSIX -i 1 -F -t 4m
The script option -pe mpi 256 tells SGE to use the mpi parallel environment and to allocate 256 slots across the cluster. Because there can be up to 16 compute nodes and each c4.4xlarge has 16 vCPUs, 16 IOR processes are allocated per node. Each process writes and then reads a 5-GiB file (using 4-MiB I/O transfer sizes) using the following IOR flags:
- -B Use O_DIRECT to bypass system library caching and directly access the file system. This provides raw file system performance metrics.
- -a Define the I/O access interface to use. In this case, select POSIX as the method.
- -F Create one file per process.
- -C Change the task ordering to n+1 ordering for read-back. This avoids read cache effects on client processes.
- -b Define the size of the file to create, per process.
- -t Define the I/O size to use for each read/write operation.
Run the following command on the master node to submit your job to the cluster, where myjob.sh is the name of the earlier shell script:
$ qsub myjob.sh
Your job 1 ("myjob") has been submitted
You can monitor the status of the job using the qstat command:
$ qstat
job-ID  prior    name   user      state  submit/start at      queue                           slots  ja-task-ID
----------------------------------------------------------------------------------------------------------------
     1  0.55500  myjob  ec2-user  r      03/15/2019 21:57:27  all.q@ip-172-31-5-228.us-east-     64
After a few minutes, compute instances are available in your cluster through automatic scaling and the IOR job runs automatically. For a description of the workflow, see AWS ParallelCluster processes.
Example of results
Using the shell script, the output of the run goes to a file called output_$JOB_ID.out, where $JOB_ID is the job ID as assigned by the job scheduler. From the qstat output shown earlier, your job ID was “1”. The output from your test run follows:
[ec2-user@ip-172-31-23-169 ~]$ cat output_1.out
IOR-3.2.1: MPI Coordinated Test of Parallel I/O
Began               : Tue Jul 16 14:27:19 2019
Command line        : ior -w -r -B -C -o /fsx/ior/test -b 5g -a POSIX -i 1 -F -t 4m
Machine             : Linux ip-172-31-23-3
TestID              : 0
StartTime           : Tue Jul 16 14:27:19 2019
Path                : /fsx/ior
FS                  : 36.3 TiB   Used FS: 0.0%   Inodes: 167.7 Mi   Used Inodes: 0.0%

Options:
api                 : POSIX
apiVersion          :
test filename       : /fsx/ior/test
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 256
clients per node    : 16
repetitions         : 1
xfersize            : 4 MiB
blocksize           : 5 GiB
aggregate filesize  : 1.25 TiB

Results:
access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----------  ---------  --------  --------  --------  --------  ----
write     9115       5242880     4096       0.039549  143.79    85.61     143.79    0
read      8636       5242880     4096       0.048285  151.77    80.74     151.77    0
remove    -          -           -          -         -         -         0.029827  0

Max Write: 9115.26 MiB/sec (9558.04 MB/sec)
Max Read:  8636.00 MiB/sec (9055.50 MB/sec)

Summary of all tests:
Operation  Max(MiB)  Min(MiB)  Mean(MiB)  StdDev  Max(OPs)  Min(OPs)  Mean(OPs)  StdDev  Mean(s)    Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz     xsize   aggs(MiB) API   RefNum
write      9115.26   9115.26   9115.26    0.00    2278.82   2278.82   2278.82    0.00    143.79401  0     256    16  1    1   1     1        0         0    1      5368709120 4194304 1310720.0 POSIX 0
read       8636.00   8636.00   8636.00    0.00    2159.00   2159.00   2159.00    0.00    151.77395  0     256    16  1    1   1     1        0         0    1      5368709120 4194304 1310720.0 POSIX 0
Finished            : Tue Jul 16 14:32:15 2019
To summarize:
- You launched 256 processes total across the cluster, 16 per node.
- Each process created a 5-GiB file on the FSx for Lustre file system, reading and writing from the file sequentially in 4-MiB chunks.
- The aggregate throughput was 9.5 GB/s on writes and 9.0 GB/s on reads.
Cleaning up
When you have your results, you can shut down and delete the cluster using the following command:
$ pcluster delete mycluster
This terminates both the cluster nodes and the FSx for Lustre file system.
Conclusion
In this post, you learned how to use AWS ParallelCluster to create a fully functional HPC cluster. After reviewing the AWS ParallelCluster settings for using FSx for Lustre, you created a file system as part of the overall cluster-creation process. Then, you used IOR to benchmark the file system to show how quick and easy it is to run a job on the cluster.
For more information, see High Performance Computing (HPC). Please let us know what you think in the comments.
Happy building!
December 19, 2019: this post was edited to remove references to “min_queue_size,” as it is no longer a valid parameter.