AWS Open Source Blog

AWS ParallelCluster

Orchestration software has played a key role in cluster bring-up and management for decades. Dating back to solutions like SunCluster, PSSP, and community solutions such as CFEngine, the need to launch many resources together to enable large parallel applications continues to be a vital part of the High Performance Computing (HPC) environment. AWS has many cloud native approaches to running your clustered workloads on AWS, but the need to recreate or replicate an environment similar or nearly identical to what you are currently running in your data center may be a necessary first step in moving workloads to AWS.

What if you could build a familiar cluster environment using AWS cloud native resources?

Today we announce AWS ParallelCluster, an AWS supported, open source cluster management tool that makes it easy for scientists, researchers, and IT administrators to deploy and manage High Performance Computing (HPC) clusters in the AWS cloud. With AWS ParallelCluster, many AWS cloud native products are used to launch a cluster environment that should be familiar to those running HPC workloads. These include AWS CloudFormation, AWS Identity and Access Management (IAM), Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 Auto Scaling, Amazon Elastic Block Store (Amazon EBS), Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.

AWS ParallelCluster is released via the Python Package Index (PyPI) and can be installed via pip. It is available at no additional cost, and you only pay for the AWS resources needed to run your applications. ParallelCluster leverages CloudFormation to build out your cluster environment. This is the same CloudFormation that you can use to launch just one instance, or a VPC, or an S3 bucket, but now you’re using it to launch an entire HPC cluster environment.

Many of you will be familiar with CfnCluster. ParallelCluster is built on the CfnCluster code base, which we extended with new features, functionality, and (of course) improvements and bug fixes. If you are a previous user of CfnCluster, we encourage you to start using ParallelCluster when you can, and going forward create new clusters only using ParallelCluster. You can use your existing CfnCluster config files with ParallelCluster. (Although you can still use CfnCluster, it will no longer be developed.)

Some key features in the initial release of ParallelCluster that were not in CfnCluster are:

  • AWS Batch integration
  • Multiple EBS volumes
  • Better scaling performance – faster, with AutoScaling updates made all at once
  • Support for custom AMIs (“bring your own AMI”)
  • Private cluster using proxy

And we’re not even close to done! We will continue to iterate ParallelCluster based on customer requests and feedback.

Getting Started

Grab a cup of caffeine, and let’s get to it!

You will need:

Decision time #1. You can use ParallelCluster anywhere you can access the internet, but you will need your AWS API keys, or you will need to set up an IAM Role and assign it to an instance to launch the necessary resources for your cluster. For this post, I’ll assume you are using either a Linux or macOS operating system, you have admin access, and you have access to your API keys. Please reach out to an AWS Solutions Architect if you have questions about using an IAM Role instead.

Before I install ParallelCluster, I’ll make sure I can access my account using the AWS CLI. To install the AWS CLI, you can follow the steps in Installing the AWS Command Line Interface, or to install it in a Python virtual environment, you can follow Install the AWS Command Line Interface in a Virtual Environment. I’ll be using a Python virtual environment for everything.

An optional first step for those wanting to use a Python virtual environment:

[duff]$ virtualenv ~/Envs/pcluster-virtenv
[duff]$ source ~/Envs/pcluster-virtenv/bin/activate
(pcluster-virtenv) [duff]$ 

Now let’s install the AWS CLI and verify functionality by creating a bucket:

(pcluster-virtenv) [duff]$ pip install --upgrade awscli
(pcluster-virtenv) [duff@]$ aws configure
AWS Access Key ID []: <aws_access_key>
AWS Secret Access Key []: <aws_secret_access_key>
Default region name []: us-east-1
Default output format []: json
(pcluster-virtenv) [duff]$ aws s3 mb s3://duff-parallelcluster
make_bucket: duff-parallelcluster 

I’ve installed, set up, and verified functionality of the AWS CLI. Let’s install ParallelCluster now.

Decision time #2. The VPC that ParallelCluster will use must have DNS Resolution = yes and DNS Hostnames = yes. It should also have DHCP options with the correct domain-name for the region, as defined in the docs: VPC DHCP Options. The subnet that will be used needs access to the internet, and there are several ways to enable this. For this blog, I will use a Public subnet (a subnet that has an IGW attached and routes to the internet), but you can use a Private subnet as long as the subnet routes to the internet (e.g. through a NAT Gateway or a proxy server).
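If you would rather check the two DNS settings from a script than from the Console, here is a minimal sketch. It assumes you pass in a boto3 EC2 client and uses its describe_vpc_attribute call; the function name vpc_dns_ready and the StubEc2 class are my own illustrations (the stub only exists so the example runs without credentials), and the DHCP options and internet routing are not checked here.

```python
def vpc_dns_ready(ec2_client, vpc_id):
    """Return True when DNS resolution and DNS hostnames are both enabled."""
    checks = (
        ("enableDnsSupport", "EnableDnsSupport"),      # "DNS Resolution" in the Console
        ("enableDnsHostnames", "EnableDnsHostnames"),  # "DNS Hostnames" in the Console
    )
    results = []
    for attribute, response_key in checks:
        response = ec2_client.describe_vpc_attribute(VpcId=vpc_id, Attribute=attribute)
        results.append(response[response_key]["Value"])
    return all(results)


class StubEc2:
    """Hypothetical offline stand-in for boto3.client("ec2"); demo use only."""

    def __init__(self, dns_support, dns_hostnames):
        self._values = {
            "enableDnsSupport": dns_support,
            "enableDnsHostnames": dns_hostnames,
        }

    def describe_vpc_attribute(self, VpcId, Attribute):
        # Mirror the response shape: the key is the attribute name, capitalized.
        key = Attribute[0].upper() + Attribute[1:]
        return {"VpcId": VpcId, key: {"Value": self._values[Attribute]}}


print(vpc_dns_ready(StubEc2(True, True), "vpc-abcdefghigjlmnopq"))
print(vpc_dns_ready(StubEc2(True, False), "vpc-abcdefghigjlmnopq"))
```

Against a real account, you would replace the stub with boto3.client("ec2") and your own VPC ID.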

You can verify the VPC settings by looking at the VPC configuration in the Console, where DNS resolution and DNS hostnames should both be set to yes.

Now I’ll install ParallelCluster using the virtual environment I set up:

(pcluster-virtenv) [duff]$  pip install aws-parallelcluster
... output snipped...
Successfully installed aws-parallelcluster-2.0.0rc1 ...

Before I can launch a cluster I’ll need to configure ParallelCluster. Note that I leave “AWS Access Key ID” and “AWS Secret Access Key ID” blank, as I already configured this with the AWS CLI setup. Also, because we really want to make this easy on you, we’ll display possible values from your account:

(pcluster-virtenv) [duff@]$ pcluster configure
Cluster Template [default]:
AWS Access Key ID []:  <blank>
AWS Secret Access Key ID []: <blank>
Acceptable Values for AWS Region ID:
    ap-south-1
    eu-west-3
    eu-west-2
    eu-west-1
    ap-northeast-2
    ap-northeast-1
    sa-east-1
    ca-central-1
    ap-southeast-1
    ap-southeast-2
    eu-central-1
    us-east-1
    us-east-2
    us-west-1
    us-west-2
AWS Region ID []: us-east-1
VPC Name [public]:
Acceptable Values for Key Name: <blank>
    duff_key_us-east-1
Key Name []: duff_key_us-east-1
Acceptable Values for VPC ID:
    vpc-12345678901234567
    vpc-abcdefghigjlmnopq
VPC ID []: vpc-abcdefghigjlmnopq
Acceptable Values for Master Subnet ID:
    subnet-abcdefghigjlmnop1
    subnet-abcdefghigjlmnop2
    subnet-abcdefghigjlmnop3
    subnet-abcdefghigjlmnop4
    subnet-abcdefghigjlmnop5
    subnet-abcdefghigjlmnop6
Master Subnet ID []: subnet-abcdefghigjlmnop1

Okay, let’s see what that did. It created the file ~/.parallelcluster/config, so let’s cat that and have a look.

(pcluster-virtenv) [duff]$ cat ~/.parallelcluster/config
[aws]
aws_region_name = us-east-1

[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq

[global]
update_check = true
sanity_check = true
cluster_template = default

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

ParallelCluster uses the file ~/.parallelcluster/config by default for all configuration parameters. You can see an example configuration file, site-packages/aws-parallelcluster/examples/config, in the GitHub repo. The config file has several sections (if you’re a Python programmer, we’re using ConfigParser). Each section has a set of parameters that are used to launch the cluster. If I’m not careful and accidentally put a config parameter in the wrong section, it will be silently ignored and I’ll be stuck wondering what happened. Refer to the ParallelCluster Configuration docs for more info. If a parameter is not specified in the config file, the default value is used.
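Because a misplaced parameter is silently ignored, a quick check of my own can save some head-scratching. The sketch below reads the config with Python's configparser and flags keys that sit in the wrong kind of section. It is not part of ParallelCluster: the function name misplaced_keys is hypothetical, and KNOWN_KEYS only covers the handful of parameters used in this post, not the full list the tool accepts.

```python
import configparser

# Keys that appear in this post, grouped by the kind of section they belong to.
# Assumption: this is deliberately incomplete; extend it from the docs as needed.
KNOWN_KEYS = {
    "cluster": {"vpc_settings", "key_name", "compute_instance_type", "scheduler"},
    "vpc": {"master_subnet_id", "vpc_id", "ssh_from"},
}


def misplaced_keys(config_text):
    """Return (section, key, expected_kind) for keys found in the wrong section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    problems = []
    for section in parser.sections():
        kind = section.split()[0]  # "cluster default" -> "cluster"
        for key in parser[section]:
            for expected_kind, keys in KNOWN_KEYS.items():
                if key in keys and expected_kind != kind:
                    problems.append((section, key, expected_kind))
    return problems


# compute_instance_type has been put in the [vpc public] section by mistake.
BAD_CONFIG = """
[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq
compute_instance_type = c4.large
"""

print(misplaced_keys(BAD_CONFIG))
```

For the sample above, the check reports that compute_instance_type was found in [vpc public] but belongs in a cluster section.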

Currently, ParallelCluster supports three schedulers: sge, torque, and slurm. The default is sge, and that’s what I’ll be using.

For now, the only changes I will make in the config file are to add the SSH source location ssh_from in the VPC section, and to change the compute_instance_type in the cluster section.

By default, we will allow SSH inbound from any source IP (0.0.0.0/0), and I want to restrict this to just my IP address. I recommend that you do something similar by adding your IP address or trusted CIDR block (e.g. 10.10.0.0/16). I updated my [vpc public] section:

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq
ssh_from = 11.22.33.44/32
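To sanity-check a CIDR block before putting it in ssh_from, Python's standard ipaddress module is enough. This is a small sketch of my own (the helper name ssh_allowed is hypothetical); it only tests whether an address falls inside a block and does not touch the security group that ParallelCluster creates.

```python
import ipaddress


def ssh_allowed(source_ip, ssh_from):
    """Return True if source_ip falls inside the ssh_from CIDR block."""
    # strict=True rejects blocks with host bits set, such as 10.10.1.0/16.
    network = ipaddress.ip_network(ssh_from, strict=True)
    return ipaddress.ip_address(source_ip) in network


print(ssh_allowed("11.22.33.44", "11.22.33.44/32"))  # my own address
print(ssh_allowed("11.22.33.45", "11.22.33.44/32"))  # a neighbor's address
print(ssh_allowed("10.10.5.7", "10.10.0.0/16"))      # inside a trusted block
```

A /32 matches exactly one address, which is why the second call is rejected while the default 0.0.0.0/0 would accept every source.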

And I will also update the [cluster default] section, changing the compute instance type to c4.large rather than using the default instance type, t2.micro:

[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1
compute_instance_type = c4.large

Now that we understand a bit about the config file and we know how to add configuration parameters, let’s launch our first cluster with the create command:

(pcluster-virtenv) [duff]$ pcluster create hello-cluster1

When we start the cluster create, we’ll see a status update as the resources are being brought up. And because I’m interested to see how long it takes to launch a cluster, I’ll be using time:

(pcluster-virtenv) [duff]$ time pcluster create hello-cluster1
Beginning cluster creation for cluster: hello-cluster1
Creating stack named: parallelcluster-hello-cluster1
Status: parallelcluster-hello-cluster1 - CREATE_IN_PROGRESS

When the cluster creation has completed, I have both the public and private IP addresses and the username for login. And because I used time, I see that it took 8 mins and 33 seconds to create the cluster:

MasterPublicIP: 35.153.251.20
ClusterUser: ec2-user
MasterPrivateIP: 172.31.0.14

real	8m33.425s
user	0m2.620s
sys	    0m0.353s

Let’s log in with the built-in ssh alias we give you with ParallelCluster, pcluster ssh <cluster_name>, and see what cluster resources are already available.

(pcluster-virtenv) [duff@]$ pcluster list
hello-cluster1

(pcluster-virtenv) [duff@]$ pcluster ssh hello-cluster1
The authenticity of host '35.153.251.20 (35.153.251.20)' can't be established.
ECDSA key fingerprint is SHA256:u9+A0i6Y94JcRGYW8eyi5e4N+iiNtpPTPAwPY5PQcWk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '35.153.251.20' (ECDSA) to the list of known hosts.
Last login: Sun Nov 11 20:12:12 2018

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2018.03-release-notes/

[ec2-user@ip-172-31-0-14 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-10-95         lx-amd64        2    1    1    2  0.02    3.7G  156.2M     0.0     0.0
ip-172-31-13-199        lx-amd64        2    1    1    2  0.02    3.7G  156.8M     0.0     0.0

From the output above, you can see that I already have a cluster of instances running. By default, we’re going to use t2.micro for the compute instance type, but I configured this cluster to use the c4.large, and because hyper-threading is on, we see two CPUs and one core for each instance.
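If you want to total up the CPUs the scheduler can see without counting rows by hand, the qhost output is easy to parse. The sketch below is my own helper (total_cpus is a hypothetical name) and assumes the 11-column layout shown above; it skips the header, the separator, and the global row, and sums the NCPU column.

```python
def total_cpus(qhost_output):
    """Sum the NCPU column over the host rows of qhost output."""
    total = 0
    for line in qhost_output.splitlines():
        fields = line.split()
        # Host rows have 11 columns with a numeric NCPU in the third one;
        # the header, the dashed separator, and the "global" row do not.
        if len(fields) == 11 and fields[2].isdigit():
            total += int(fields[2])
    return total


QHOST_SAMPLE = """\
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-10-95         lx-amd64        2    1    1    2  0.02    3.7G  156.2M     0.0     0.0
ip-172-31-13-199        lx-amd64        2    1    1    2  0.02    3.7G  156.8M     0.0     0.0
"""

print(total_cpus(QHOST_SAMPLE))
```

With the two c4.large instances above, that adds up to four CPUs.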

Let’s use the mpirun command to submit a simple hostname job that shows the AutoScaling feature of ParallelCluster.

[ec2-user@ip-172-31-0-14 ~]$ echo /usr/lib64/openmpi/bin/mpirun hostname | qsub -pe mpi 16
Your job 1 ("STDIN") has been submitted
[ec2-user@ip-172-31-0-14 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.00000 STDIN      ec2-user     qw    11/11/2018 20:25:38                                   16

Now I have a job requesting more instances than I have, which kicks off a scaling action. The job will run once I have enough instances; in this case I’ll need 8 total. A few minutes later, I have the resources and the job has already run to completion:

[ec2-user@ip-172-31-0-14 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-0-72          lx-amd64        2    1    1    2  0.11    3.7G  189.0M     0.0     0.0
ip-172-31-10-65         lx-amd64        2    1    1    2  0.29    3.7G  189.2M     0.0     0.0
ip-172-31-14-49         lx-amd64        2    1    1    2  0.11    3.7G  189.1M     0.0     0.0
ip-172-31-2-78          lx-amd64        2    1    1    2  0.06    3.7G  189.4M     0.0     0.0
ip-172-31-3-226         lx-amd64        2    1    1    2  0.11    3.7G  185.5M     0.0     0.0
ip-172-31-4-248         lx-amd64        2    1    1    2  0.11    3.7G  186.2M     0.0     0.0
ip-172-31-5-112         lx-amd64        2    1    1    2  0.08    3.7G  188.9M     0.0     0.0
ip-172-31-5-50          lx-amd64        2    1    1    2  0.08    3.7G  189.0M     0.0     0.0
[ec2-user@ip-172-31-0-14 ~]$ qstat
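The arithmetic behind the eight instances is worth spelling out: the job asked for 16 slots, and each c4.large shows 2 CPUs. The sketch below assumes one slot per CPU, which matches the numbers in this run; the function name instances_needed is my own, and this is not how ParallelCluster itself computes its scaling.

```python
import math


def instances_needed(requested_slots, slots_per_instance):
    """Return how many instances are needed to cover the requested slots."""
    if slots_per_instance <= 0:
        raise ValueError("slots_per_instance must be positive")
    return math.ceil(requested_slots / slots_per_instance)


needed = instances_needed(16, 2)  # qsub -pe mpi 16 on 2-CPU instances
running = 2                       # instances already up before the job
print(needed, max(0, needed - running))
```

So the cluster had to grow from the two instances it started with to eight, adding six.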

Now that the job has run and I have these instances just sitting there doing nothing, what happens now? If the instances have been running for more than 10 minutes, but are not running a job, we will terminate those instances for you. So after 10 minutes I look at qhost again:

[ec2-user@ip-172-31-0-14 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
[ec2-user@ip-172-31-0-14 ~]$ qstat

The instances have been terminated, and I’m not being charged for idle instances. The scaling features are configurable.
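To make the scale-down rule concrete, here is a toy model of it. It is only an illustration of the rule as described in this post (up for more than 10 minutes and not running a job), not ParallelCluster's actual implementation; the Node class, its fields, and idle_nodes are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    uptime_minutes: float
    running_jobs: int


def idle_nodes(nodes, idle_limit_minutes=10):
    """Return names of nodes up longer than the limit with no running jobs."""
    return [
        node.name
        for node in nodes
        if node.running_jobs == 0 and node.uptime_minutes > idle_limit_minutes
    ]


fleet = [
    Node("ip-172-31-0-72", uptime_minutes=14, running_jobs=0),   # idle, past the limit
    Node("ip-172-31-10-65", uptime_minutes=14, running_jobs=1),  # still busy
    Node("ip-172-31-14-49", uptime_minutes=4, running_jobs=0),   # idle, but too new
]
print(idle_nodes(fleet))
```

Only the first node qualifies: the second is still working and the third has not been up long enough.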

Okay. I have launched what looks and acts like a traditional HPC environment using many AWS Cloud native resources, including an AutoScaling cluster that terminates instances when they are not being used. What about using an environment without the scheduler overhead?

Say hello to AWS Batch.

AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.

With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

So now I’ll launch a Batch environment and let ParallelCluster do all of the work for me. When launching an AWS Batch environment, we’ll leverage even more AWS resources, for example AWS CodeBuild and Amazon Elastic Container Registry (Amazon ECR), and an NFS server will be brought up on the master instance.

I’ll start by editing my config file: ~/.parallelcluster/config, and add this section using some of the same parameters from the [cluster default] section.

[cluster awsbatch]
scheduler = awsbatch
key_name = duff_key_us-east-1
vpc_settings = public

Now that I have a separate cluster template defined, I can launch a separate master instance that will be both the NFS server for my Batch jobs, and will also be the submit host for my batch jobs. I’ll create a cluster now, specifying my awsbatch cluster.

(pcluster-virtenv) [duff@]$ pcluster create awsbatch --cluster-template awsbatch
Beginning cluster creation for cluster: awsbatch
Creating stack named: parallelcluster-awsbatch
Status: parallelcluster-awsbatch - CREATE_COMPLETE
MasterPublicIP: 54.158.75.19
ClusterUser: ec2-user
MasterPrivateIP: 172.31.15.217
ResourcesS3Bucket: parallelcluster-awsbatch-6wjsibr8elx9km0r

From the output above, you can see I’ve successfully created an AWS Batch submit host. I’ll log in and see what’s there:

(pcluster-virtenv) [duff@]$ pcluster ssh awsbatch
The authenticity of host '54.158.75.19 (54.158.75.19)' can't be established.
ECDSA key fingerprint is SHA256:/K8LQYyLliS0+Q7+BZtkhe6ChyM9Oz/RZz0aTCKJ3KQ.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.158.75.19' (ECDSA) to the list of known hosts.
Last login: Tue Nov 13 00:46:30 2018

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2018.03-release-notes/
[ec2-user@ip-172-31-15-217 ~]$ awsbhosts
ec2InstanceId        instanceType    privateIpAddress    publicIpAddress      runningJobs
-------------------  --------------  ------------------  -----------------  -------------
i-05af380e4950366d4  c4.xlarge       172.31.4.66         18.209.11.53                   0

I see that I have a c4.xlarge instance ready to run jobs. I’ll test with hello world.

[ec2-user@ip-172-31-15-217 ~]$ awsbsub echo hello world
Job 2387b7f5-14c7-41c1-bbf8-c5e50017580a (echo) has been submitted.

The job is submitted, and should go from RUNNABLE to STARTING to RUNNING, and then to either SUCCEEDED or FAILED.

[ec2-user@ip-172-31-15-217 ~]$ awsbstat
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       RUNNABLE  -            -            -
[ec2-user@ip-172-31-15-217 ~]$ set -o vi
[ec2-user@ip-172-31-15-217 ~]$ awsbstat
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       STARTING  -            -            -

[ec2-user@ip-172-31-15-217 ~]$ awsbstat
jobId                                 jobName    status    startedAt            stoppedAt    exitCode
------------------------------------  ---------  --------  -------------------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       RUNNING   2018-11-13 00:52:31  -            -

Now I see that my job is running, and I can also check with the awsbout command:

[ec2-user@ip-172-31-15-217 ~]$ awsbout 2387b7f5-14c7-41c1-bbf8-c5e50017580a
2018-11-13 00:52:31: Starting Job 2387b7f5-14c7-41c1-bbf8-c5e50017580a
2018-11-13 00:52:31: hello world

After my job has completed, I can check the status with the awsbstat command:

[ec2-user@ip-172-31-15-217 ~]$ awsbstat -s SUCCEEDED
jobId                                 jobName    status     startedAt            stoppedAt              exitCode
------------------------------------  ---------  ---------  -------------------  -------------------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       SUCCEEDED  2018-11-13 00:52:31  2018-11-13 00:53:02           0
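Rather than rerunning awsbstat by eye, you could pull the status out of its output in a script. The sketch below is my own and assumes the column layout shown above, with a job name that contains no spaces; job_status and is_finished are hypothetical helpers, not part of the awsb* commands. AWS Batch reports the two terminal states as SUCCEEDED and FAILED.

```python
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def job_status(awsbstat_output, job_id):
    """Return the status column for job_id, or None if the job is not listed."""
    for line in awsbstat_output.splitlines():
        fields = line.split()
        # Columns: jobId, jobName, status, ... (assumes no spaces in jobName).
        if len(fields) >= 3 and fields[0] == job_id:
            return fields[2]
    return None


def is_finished(status):
    """Return True once the job has reached a terminal state."""
    return status in TERMINAL_STATES


AWSBSTAT_SAMPLE = """\
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       RUNNABLE  -            -            -
"""

status = job_status(AWSBSTAT_SAMPLE, "2387b7f5-14c7-41c1-bbf8-c5e50017580a")
print(status, is_finished(status))
```

A polling loop around these two helpers would keep checking until is_finished returns True.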

With AWS ParallelCluster you can leverage the benefits of the AWS Cloud, while maintaining a familiar cluster environment. We’re excited about ParallelCluster and we look forward to hearing from you!

Cheers,
Mark

Mark Duffield

Mark Duffield is a Worldwide Tech Leader at Amazon Web Services, focusing on the semiconductor industry. Prior to joining AWS, he was a High Performance Computing SME at IBM, and designed multi-petabyte solutions at DDN Storage. He has deep experience with HPC, cluster computing, enterprise software development, and distributed file systems. He architects solutions in several verticals, including electronic design automation, weather modeling and forecasting, manufacturing, and scientific simulations.